Source code control fails

30 March 2022
Ever since I first read the Joel Test, the one thing I have consistently agreed with is the need for source code version control, but this does rely on proper understanding and buy-in from people on the development teams. Over the years I have seen various projects with repositories set up in such a way that they might as well not be using source control at all, and here I am talking about the train-wreck that is spreading software projects with intricate dependencies over multiple repositories. Doing this means the sort of manual intervention that source code control is meant to avoid is still required, coupled with extra pitfalls that simply not using source code control at all would avoid.
CVS shortcomings

When I used CVS at university it was primarily as a backup and synchronisation mechanism for coursework rather than as a software development aid, but looking back CVS had one massive flaw that could easily have stung me had I made much more sophisticated use of it: CVS considers source files within a project to be independent of each other, and unless tags were used there was little that tied together cross-file changes. Subversion in contrast gave each commit its own revision number, and checking out that revision number made sure all files were updated to the state they were in at the time the commit was made — many projects even used this revision number as part of build numbers, and I am pretty certain I at least tried to get a previous company to do this as part of badly-needed automation. Distributed version control systems formalised the idea of changesets — a changeset being a self-contained group of changes covering multiple files — but the fundamental failings of CVS helped put a lot of people off version control entirely, even after much better systems had come along.
Continuous integration work-flows

Prior to 2013 my use of source code control was pretty unsophisticated, with no real use of branches leading to an entirely linear commit history. It was only after decent exposure to full-on tools-heavy Scrum & Continuous Integration that I properly understood the whole feature-orientated work-flow that Git was designed to support, and why Git at first glance seemed to be a patch management system where “rewriting history” is actually a good thing. The key insight was realising that commits to master should only correspond to complete features, and that day-to-day commits made as part of development — almost always ‘backups’ just in case things get broken and need to be rolled back quickly — are of no long-term value. Having got a proper understanding of this work-flow I tried to replicate it in Mercurial, but in the long run decided it was better to abandon Mercurial.
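This feature-orientated work-flow can be sketched with plain Git commands (the repository location, branch and file names below are invented for illustration): throwaway 'backup' commits live only on a feature branch, and a single squashed commit is all that ever reaches the trunk.

```shell
# Sketch of the feature-branch work-flow; paths and names are invented.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name Dev
trunk=$(git symbolic-ref --short HEAD)   # 'master' or 'main' depending on Git version
git commit -q --allow-empty -m "initial"
git checkout -q -b feature/widget

# Day-to-day 'backup' commits of no long-term value.
echo "half-done" > widget.c
git add widget.c
git commit -q -m "wip: backup before refactor"
echo "done" > widget.c
git add widget.c
git commit -q -m "wip: widget finished"

# Only one complete-feature commit ever reaches the trunk.
git checkout -q "$trunk"
git merge -q --squash feature/widget
git commit -q -m "Add widget support"
git log --oneline        # just 'initial' plus the single feature commit
```

The same effect can be had with an interactive rebase, but `merge --squash` keeps the example deterministic.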
In a multi-contributor project the output of any development is a set of changes to the code, be it an actual commit in a continuous integration system or a patch submitted as a contribution to an open source project, and this is a self-contained thing that makes all the changes needed to implement whatever is intended. This is important because when looking back at how something was done only that single set of changes needs to be consulted, and if it actually breaks stuff it can be reverted as a single step. Being able to write software is one thing, and that is what people either teach themselves or pay a university to teach them, but when it comes to practical commercial software development the ability to properly coordinate parallel development becomes just as important. This is why one person who did initial evaluation of job candidates for his company automatically rejected anyone who had never used Git.
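As a small illustration of the atomicity point above (file names invented), a single commit that touches both an interface and its implementation can be backed out as one step with git revert:

```shell
# Invented file names; one commit touching two files reverts atomically.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name Dev
echo "v1" > api.h
echo "v1" > api.c
git add api.h api.c
git commit -q -m "initial"
echo "v2" > api.h
echo "v2" > api.c
git add api.h api.c
git commit -q -m "change interface and implementation together"
git revert --no-edit HEAD    # both files go back to v1 in one step
```

Under CVS the equivalent roll-back would have meant hunting down the per-file revisions by hand.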
Using multiple repositories

One of the massive fails at a previously mentioned former company of mine was to have each module in its own pair of repositories — one for the interface and the other for the business logic — which from memory was for the dogmatic reason that the interfaces should never change. By the time I left there were around 40 separate repositories, and whenever a cross-module change was needed things ended in tears because it was not possible to atomically submit cross-module changes; having to temporarily nobble unit tests to get such changes through the automated unit tests in stages was an alarmingly regular occurrence. Even when not dealing with cross-module changes, at the time I estimated that just contending with the Italian plumbing job of a continuous integration system took up 20–30% of my time, which is both a needless waste of productivity and pretty demotivating. One of the problems was the number of automated commits inserted by the automated test & build system, as each time one of these appeared the code under development needed rebasing.
The need for the test & build system to insert all these metadata tags was pretty much a repeat of the nasty hack that was CVS tags, and this regressive step was needed because the split over multiple repositories threw away the whole base idea of having changesets in the first place: keeping changes in different places associated with each other. I hit the same problem myself a few years ago when I split my single PCB repository into per-project repositories, which started causing headaches once I began sharing custom symbol and footprint libraries between multiple boards. While having a single repository does cause its own set of problems, I have concluded that these problems are the lesser evil in the longer run.
More recent headaches

The trigger of the original rant that led to the drafting of this article was working on a recent project that uses Git's submodule functionality to try and keep separate Git repositories in sync, but this functionality is really intended for importing third-party code that changes little, rather than for maintaining a parallel set of actively-developed sections of code. Using it for the latter has earned it the nickname “sobmodules” due to the ease with which things can get screwed up, and in practice I have had to manually check out each sub-repository. As before, the problem with splitting a project over several repositories is that it throws away any real possibility of linking together changes that are dependent on each other, which in turn defeats the whole idea of using version control at all. In this case the architecture of the project is partly down to the decision to have a common core of code shared between different versions of the end product, which while understandable in the context in which it came about also seems to me very heavy in technical debt.
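A minimal sketch of that manual checkout dance, using invented repository names: a fresh clone of a super-project leaves each submodule as an empty directory pinned to a recorded commit until it is explicitly initialised. (The `protocol.file.allow` override is only there because newer Git versions restrict file-path submodule fetches; it is harmless on older versions.)

```shell
# Invented repository names; shows the submodule directory staying
# empty in a fresh clone until 'submodule update --init' is run.
set -e
work=$(mktemp -d)
cd "$work"
git init -q common
cd common
git config user.email dev@example.com
git config user.name Dev
echo "shared" > core.c
git add core.c
git commit -q -m "shared core"
cd "$work"
git init -q super
cd super
git config user.email dev@example.com
git config user.name Dev
git -c protocol.file.allow=always submodule --quiet add "$work/common" common
git commit -q -m "pin common at a specific commit"
cd "$work"
git clone -q super super-clone
cd super-clone
ls common        # empty: the submodule is not checked out yet
git -c protocol.file.allow=always submodule --quiet update --init
ls common        # core.c now appears, at whatever commit was pinned
```

Note that the clone only records a commit hash for `common`; nothing pulls the submodule forward automatically, which is exactly the manual intervention complained about above.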
This state of affairs with the most recent project is partly due to all the repositories starting life standalone, with one being for the server code and the other for the client code, but then as the project expanded to include other client devices a section of common code popped out, which for better or worse was put into its own repository. In the longer term, however, they all need to be merged if there is any realistic prospect of more than one person doing the bulk of the development on the client part, either by using a tool like reposurgeon to combine all the revision histories, or more likely by just ditching the old history and creating a new repository from scratch. I am in two minds about which is the better solution: in theory it is better to keep the history, but the code also needs the sort of don't-look-back clean-up where the effort of merging is probably not worthwhile.
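For completeness, one possible middle ground (short of a full reposurgeon job, and with all names below invented) is Git's ability to merge unrelated histories: first move one repository's files into a sub-directory within its own history, then graft that history onto the other repository.

```shell
# Invented repository names; grafts 'client' history into 'server'
# while keeping both sets of commits.
set -e
work=$(mktemp -d)
cd "$work"
for r in server client; do
  git init -q "$r"
  cd "$r"
  git config user.email dev@example.com
  git config user.name Dev
  echo "$r" > "$r.c"
  git add "$r.c"
  git commit -q -m "$r: initial"
  cd "$work"
done
# Move the client's files under client/ within its own history first,
# so the two trees cannot collide when merged.
cd client
mkdir client
git mv client.c client/
git commit -q -m "client: move everything under client/"
branch=$(git symbolic-ref --short HEAD)
cd ../server
git remote add client ../client
git fetch -q client "$branch"
git merge -q --allow-unrelated-histories -m "graft client history onto server" FETCH_HEAD
git log --oneline      # commits from both repositories, now in one
```

This keeps the full history without any external tooling, though it does nothing to tidy it, so it only makes sense if the history is judged worth keeping at all.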