Build system fails

14 December 2013
Every company has its own share of mess-ups, such as last year when a mis-manufactured cable led me to burn out a stack of hard drives, but this week has highlighted a right SNAFU at my current company. For obvious reasons I will not mention the company's name or the reason for the politics (client's insistence, I suspect) behind what led to this particular incident, but I will say that to me it is a text-book case of what happens when things get too prescriptive.

System overview

At the centre of this particular incident is the company's build & source control system, which ironically ticks every box going when it comes to automated builds. I'm not entirely familiar with how the system is wired, but it is based around Git using Gerrit as a code review system, with the actual builds being done by Jenkins. One of the features of the review process is that each submitted changeset goes through various sanity checks, which include an automated build and a run of the unit tests. The idea is to stop blatantly broken code from getting into the master code-base. If the changeset passes, at least one other developer then needs to look over and approve the source code changes. A bit long-winded, but a fundamentally sound system.

Distributed? Scattered more like

The massive fail is that the project is spread over about a dozen Git repositories, because for some reason each separate module has to be kept in its own repository. Most of the time this is not a problem, but recently a lot of new functionality has been stuff that affects multiple modules, and hence requires pushing changes to multiple repositories. When separate change submissions to different repositories depend on each other, and each has to go through independent build checks against the existing code-base, an otherwise reasonably good setup very quickly falls down like a pack of cards. The whole point of revision control is to keep track of dependencies, but with the code spread over separate repositories, this is simply not happening properly.
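To make the failure mode concrete, here is a toy model of it in Python. It is only a sketch: the Change type, the repository names and the symbol names are all made up for illustration, and it has nothing to do with how Gerrit or Jenkins actually work internally. It just shows two changes in different repositories that need each other, so each fails when checked alone against master, while the pair checked together would pass.

```python
from typing import Dict, List, NamedTuple, Set


class Change(NamedTuple):
    """Hypothetical changeset: what it adds to its repo and what it needs."""
    repo: str
    provides: Set[str]
    requires: Set[str]


def passes_check(master: Dict[str, Set[str]], changes: List[Change]) -> bool:
    """Toy build check: every required symbol must exist either in master
    or in one of the changes being tested together."""
    symbols = set().union(*master.values())
    for change in changes:
        symbols |= change.provides
    return all(change.requires <= symbols for change in changes)


# Master as it stands before either change is merged (made-up contents).
master = {"core": {"open_session"}, "ui": {"draw_window"}}

# Two halves of one feature, submitted to two different repositories.
core_change = Change("core", {"encrypt"}, {"render_lock"})
ui_change = Change("ui", {"render_lock"}, {"encrypt"})

alone_core = passes_check(master, [core_change])             # checked on its own
alone_ui = passes_check(master, [ui_change])                 # checked on its own
together = passes_check(master, [core_change, ui_change])    # checked as a unit

print(alone_core, alone_ui, together)
```

With per-repository checks only the first two cases are ever tried, so neither half can get in without somebody bending the rules.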

Sum of parts greater than the whole

A contributing problem is that the process of synchronising all the repositories with the central master server, and then doing a full rebuild and full install, is very long-winded, and it is clear that significant numbers of developers are not doing this. Pushing & pulling from just the repository that is being directly worked on is in itself not too much of a problem, as most development is limited in scope to the relevant module, but when it does matter it seriously stings. Individual developers may have a fully working build, but the one that really matters has the consistency of a vase once it has hit the marble floor: the cutting-edge, everything-up-to-date build is a no-hoper that needs manual intervention to build properly, and even then may not run properly.
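The synchronising half of that chore is at least easy to script. The sketch below assumes every module is checked out as a sub-directory of one workspace directory, which is a guess about the layout rather than a description of the real one, and the function names are my own. It uses only standard Git options (git -C <dir> pull --ff-only) and says nothing about the rebuild and install steps, which are specific to the project.

```python
import subprocess
from pathlib import Path
from typing import List


def find_repos(root: Path) -> List[Path]:
    """Return every direct sub-directory of root that contains a .git entry."""
    return sorted(p.parent for p in root.glob("*/.git"))


def pull_command(repo: Path) -> List[str]:
    """Build the Git command that fast-forwards one repository."""
    return ["git", "-C", str(repo), "pull", "--ff-only"]


def sync_all(root: Path) -> List[Path]:
    """Pull every repository under root and return those that failed."""
    failed = []
    for repo in find_repos(root):
        result = subprocess.run(pull_command(repo))
        if result.returncode != 0:
            failed.append(repo)
    return failed


if __name__ == "__main__":
    # Assumption: the script is run from the workspace directory.
    for repo in sync_all(Path.cwd()):
        print(f"could not fast-forward {repo}")
```

Using --ff-only means a repository with local commits that have diverged from master is reported rather than silently merged, which seems the safer default when a dozen repositories are involved.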

To make matters worse, third-party packages have to be approved, and once this is done they need to be added to the system that does the actual build. Said system is overseas, and as far as I can tell adding these packages is a bit of an ordeal. Somehow a changeset with an unsatisfied dependency got successfully merged into the master code-base, and then another change that exercised this dependency during automated testing got merged in as well. Getting round this needs a lot of second-guessing of the build system, and that is even before the complication of people who are oblivious to all of this pushing changes.

Un-wedging the system

In the end the entire changeset submission back-log had to be flushed, and a code freeze had to be called, with individual developers going through cycles of integrating their own code and then fixing any breakages. This included the nobbling of a newly added security system, as well as a few temporary fixes elegantly described by one of the contractors to those further up the food chain as a filthy hack. As a parting gift, the day before the end-of-year demonstrations, the network connection went down. An extra 4-week development sprint was scheduled for next year to clean up the mess.

Upshot?

Over the last six weeks, I estimate that around 30% of my time has been wasted because a fully updated code-base from master is not in a usable state, and this week I think it topped 50%. Given that the testers use a build generated automatically each morning, I suspect a lot of them are not getting much work done either. While more diligence on the part of some of the developers would have helped a lot, at the end of the day having a Byzantine setup with loads of nasty edge cases is the ultimate cause of a serious load of lost productivity.