Code sheriffing @ Mozilla: Past, Present, and Future
In a GitHub world, developers have certain baseline expectations about interacting with source code and the tooling around it. These expectations can color their choices about which projects to contribute to. If Mozilla wants to compete with other companies and open source projects for developer mindshare (and code), we need to evolve the way we develop and distribute software. Code sheriffing and its associated tooling are one piece of that puzzle.
I inherited the Mozilla code sheriff team back in April. I didn't initially think anything needed to change with sheriffing at Mozilla. Things had been "fine" for a while, so why rock the boat?
By nature, I dug into the history of my new team when I inherited them. What follows is a brief retrospective of sheriffing at Mozilla, the changes we're undergoing right now, and my vision for how it might change in the future.
I've been at Mozilla long enough now to remember when developers themselves acted as code sheriffs. In the beginning, every developer at Mozilla (myself included) rotated through the position[1]. Some developers were quite conscientious about sheriffing, others never even realized it was their turn. There was no formal training. Not surprisingly, the results were... uneven.
As the number of developers and the volume of code increased, this model became untenable. Code sheriffing as a well-defined role didn't exist at Mozilla until 2012, initially coming as a response to the staffing increase in the lead-up to Firefox 4. At the same time, Mozilla was moving away from a "strict" waterfall development model tied to Tinderbox. Our new buildbot-based approach to CI allowed us to land more code, more quickly. Dedicated sheriffs were needed to make sense of it all. Even then, in true Mozilla fashion, sheriffing was an activity that blurred the lines between community and staff. Some of the most dedicated code sheriffs we have ever had were/are volunteers.
Whether staff or community, code sheriffs became de facto stewards of code quality. They were responsible for daily merges, selecting changesets with the lowest number of intermittent failures that would be suitable for inclusion in Nightly releases. When things broke, the sheriffs were responsible for backing out code, and even closing the development trees if the situation became sufficiently dire.
With the opening of the Mozilla office in Taipei, and the associated re-tasking of two QA resources as code sheriffs in that office, Mozilla almost had around-the-clock (24/7) coverage for code sheriffing, provided no one ever got sick or took a vacation.
We persevered in this model for a few years, and our developers understandably became accustomed to the freedom it provided them. Developers could functionally land their code and not worry about the outcome: the code sheriffs would ping them if any follow-up action was required. Fire-and-forget, if you will.
Sadly, in June 2017 our last Taipei sheriff resigned, leaving us with a glaring hole in our coverage. Even with community assistance, there were 8-10 hours per day with *no* active sheriffing. This led to an increase in tree closing events as sheriffs often needed to determine the root cause for a failure that had many commits on top of it already. Complaints started coming in about delays in landing code, and also about classification errors, e.g. permanent failures wrongly triaged as intermittent due to the time pressures of working in this mode. People were not happy, least of all the sheriffs.
This is when I realized I needed to rethink how sheriffing at Mozilla should work.
The knee-jerk reaction would have been to simply hire another sheriff in Taipei, but that still would have left us vulnerable to illness, vacation, and further employment changes. Luckily, another solution presented itself.
Mozilla has an established history of working with SoftVision. I enlisted their help myself a few years ago when I was working in releng to help address our buildduty problem. It came to my attention that SoftVision was creating a 24/7 support service, and I decided to give it a try. That's where we are now.
The SoftVision sheriffing contractors started in late August. They have spent the last two months learning (and then practicing) how to classify automation failures. The harder piece is learning how to properly select mergeable changesets and perform backouts. Mozilla guards the kind of source control access required to perform these code sheriffing activities pretty closely; it's not something we simply give away. The contractors are slowly building that trust the same as any other contributor would. We're getting there though:
An important milestone today: a SoftVision sheriff backed out their first commit: https://t.co/tcBhp7N7qy #Mozilla
- Chris Cooper (@ccooper) October 19, 2017
Once the SoftVision sheriffs are fully up-to-speed, they will be available 24/7 to assist developers, and to further the Mozilla mission with the usual array of merges, backouts, uplifts, and tree closures.
Right now, we are relying on the magnanimity of the former sheriffs and community sheriffs to help bridge the gap while the contractors are training up. It's true, sheriffing throughput is still not back to the level it was at before we lost our sheriff in Taipei, but I can see the light at the end of the tunnel.
How can I be sure that light isn't a train? Well, that's the trick, isn't it?
In retrospect, it was naïve of me to think that sheriffing could have continued for any length of time the way it was. Sheriffs felt enormous pressure to work longer hours than they should have because the trees needed to stay open, and "if not them, then who?" The human toll on those performing the work, whether staff or volunteer, was simply too high.
Yes, for the near future at least, the SoftVision contractors will continue to perform merges and backouts as required in the model to which we've become accustomed. That work is still very operational, hands-on, and prone to burnout, and that's where I think the biggest opportunity for change will come going forward.
Mozilla currently has two integration branches (mozilla-inbound and autoland) in addition to mozilla-central. This makes life much harder for sheriffs because they need to merge code three ways between the different branches. When bad code gets merged around accidentally, we are almost forced to close the trees while we recover.
The obvious change is to simplify the process and remove one of the integration branches. This might actually be feasible in the near future. With the announcement of Mozilla's adoption of Phabricator, 99.9% of code should eventually be able to land directly in the autoland repo, allowing us to decommission the mozilla-inbound repo. Once we return to a single integration branch, developer workflows can be much more streamlined, and streamlined workflows are ideal targets for automation.
My ideal future developer workflow would be:
Developer compiles patch locally.
Patch is posted to Phabricator, which triggers a try run automatically.
If try run passes, suitable patch reviewers are selected automatically.
After successful review, patch is landed automatically on the autoland branch.
Autoland gets merged to mozilla-central automatically for changesets below the noise threshold for failures.
There are no code sheriffs in that picture at all. That's a good thing.[2]
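To make that workflow concrete, here is a rough sketch of the steps as a gated pipeline. To be clear, this is purely illustrative: the `Patch` fields, the `Stage` names, the `run_pipeline` function, and the `NOISE_THRESHOLD` value are all hypothetical and do not correspond to any real Phabricator, autoland, or Treeherder API. It only shows the shape of the idea: each step gates the next, and a human is pulled in only when a gate fails.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Stage(Enum):
    POSTED = auto()              # patch posted for review, try run triggered
    TRY_PASSED = auto()          # automatic try run came back green
    REVIEWED = auto()            # automatically selected reviewers approved
    LANDED_ON_AUTOLAND = auto()  # patch landed on the integration branch
    MERGED_TO_CENTRAL = auto()   # changeset merged to mozilla-central
    NEEDS_HUMAN = auto()         # a gate failed; someone has to look


@dataclass
class Patch:
    patch_id: str
    try_passed: bool        # outcome of the automatic try run
    review_approved: bool   # outcome of the automatic reviewer assignment
    failure_rate: float     # hypothetical metric: fraction of failing jobs on autoland
    history: list = field(default_factory=list)


# Hypothetical value; the real "noise threshold" would have to be tuned.
NOISE_THRESHOLD = 0.02


def run_pipeline(patch: Patch, noise_threshold: float = NOISE_THRESHOLD) -> Stage:
    """Walk a patch through the gates in order, stopping at the first failure."""
    patch.history = [Stage.POSTED]
    gates = [
        (patch.try_passed, Stage.TRY_PASSED),
        (patch.review_approved, Stage.REVIEWED),
        (True, Stage.LANDED_ON_AUTOLAND),  # landing is automatic after review
        (patch.failure_rate < noise_threshold, Stage.MERGED_TO_CENTRAL),
    ]
    for passed, stage in gates:
        if not passed:
            patch.history.append(Stage.NEEDS_HUMAN)
            return Stage.NEEDS_HUMAN
        patch.history.append(stage)
    return patch.history[-1]


if __name__ == "__main__":
    clean = Patch("example-1", try_passed=True, review_approved=True, failure_rate=0.0)
    noisy = Patch("example-2", try_passed=True, review_approved=True, failure_rate=0.5)
    print(run_pipeline(clean).name)
    print(run_pipeline(noisy).name)
```

In this sketch the only place a person shows up is the `NEEDS_HUMAN` stage, which is roughly where any remaining sheriffing work would live.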
There's a gulf of tooling improvements between where we are and that potential future, but if Mozilla wants to keep increasing the pace of development and attracting the best developers, I think the tooling investment is one we need to make.
1. Hilariously, a version of that sheriffing calendar still exists, projecting sheriff duty off into the future for a bunch of developers who haven't even been at Mozilla for years.
2. I'm not naïve enough to think we won't need *any* sheriffs. Even Facebook's model still needs some.