Why source control matters

13 November 2011
Source control, also known as version control and revision control, is the automatic tracking of changes to files. Here I will focus on program source code as an example, but in principle the same ideas can be applied quite readily to non-programming files as well. In a nutshell, source control automates three important tasks:

In-progress snapshots
Code-base synchronisation
Disaster recovery

When I started using source control, I tended to spread program code over multiple files, and I tended to work on multiple systems, so it was really code synchronisation that mattered. However, experience has taught me how much the automatic handling of all three tasks matters.

Ye Olde Manual Approach

Everyone knows this one. Make a copy of your file(s) every now and then, and if needed copy them to another system. If multiple files are involved, then you zip or tarball them up, and if you are really sophisticated you also date-stamp all these archives. The fundamental problem is that this is a manual process that requires effort, and under pressure it all goes out of the window. To make matters worse, unless you have an automatically backed-up network drive, making off-system backups is not even on the radar.
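For anyone who has never done it, the date-stamped archive routine looks roughly like the sketch below. The function name snapshot, the archive naming scheme, and the directory names are my own illustrative choices, not any standard tool.

```python
import tarfile
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional


def snapshot(source_dir: Path, backup_dir: Path, now: Optional[datetime] = None) -> Path:
    """Write a date-stamped .tar.gz of source_dir into backup_dir (hypothetical helper)."""
    now = now or datetime.now()
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = now.strftime("%Y%m%d-%H%M%S")
    archive = backup_dir / f"{source_dir.name}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps the paths inside the archive relative to the project
        tar.add(source_dir, arcname=source_dir.name)
    return archive


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "project"
        project.mkdir()
        (project / "main.c").write_text("int main(void) { return 0; }\n")
        print(snapshot(project, Path(tmp) / "backups").name)
```

Note that nothing here runs unless you remember to run it, which is exactly the weakness of the manual approach.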

What really stings is that keeping everyone's code-bases in sync becomes a nightmare. It is all well and good taking the modular approach, with developers making new versions of modules available as-is, but that does not work in practice when you have 6 weeks to get a product out of the door and it does not have neat modular boundaries. You just get bigger integration-time crunches, and all sorts of things slip through the net.

My CVS days

I cannot remember whether I had messed around with CVS beforehand, but my first large-scale use of it came around the third year of my undergraduate degree. The main drivers were my tendency to spread source code over multiple files (I think my record was a 41-file submission) and my working patterns, which were somewhat biased away from always using the same computer. My eventual setup was to run a CVS server on what was then my gateway server, mainly as this was the only machine I knew would be visible from everywhere I worked. My work-flow became CVS checkout/update, do work, and then CVS commit. Working out which files had changed and needed shipping around was done for me, and I got disaster-recovery grade backups for free.

What comes later is the realisation that making commits becomes an almost subconscious process. Your repository is separate from your work directory so you do not have loads of files such as table2.c or cpu.bak littering the place, and CVS does all the date-stamping for you so it is easy to work out which was the most recent snapshot. You soon also realise that doing a CVS diff against the commit you near-thoughtlessly made 20 minutes ago is a much nicer process than spending 2 hours working out what change stopped your program working.
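To illustrate why a diff against a recent commit is so useful, here is a minimal sketch using Python's standard difflib module rather than CVS itself. The function name and the two file contents are invented for the example.

```python
import difflib


def diff_against_last_commit(committed: str, working: str, name: str) -> str:
    """Return a unified diff between the committed text and the working copy."""
    lines = difflib.unified_diff(
        committed.splitlines(keepends=True),
        working.splitlines(keepends=True),
        fromfile=f"{name} (last commit)",
        tofile=f"{name} (working copy)",
    )
    return "".join(lines)


# Invented example: the working copy has picked up a one-character bug.
COMMITTED = "int add(int a, int b) {\n    return a + b;\n}\n"
WORKING = "int add(int a, int b) {\n    return a - b;\n}\n"

if __name__ == "__main__":
    print(diff_against_last_commit(COMMITTED, WORKING, "table.c"), end="")
```

The output singles out the one changed line, which is the whole point: the smaller the gap between commits, the less there is to search through.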

CVS's shortcomings

CVS started out as a bunch of shell scripts that extended RCS (a single-file versioning tool) to multiple files, so it treats files individually. In fact, unless you explicitly made a CVS tag that linked the specific revisions of individual files together, there was no easy way to get a full snapshot of all files at a given point in time. CVS also had several technical flaws: it did not adhere to ACID principles, file renaming was a crock (if you wanted to keep revision history across the rename, you had to manually fiddle with the repository file layout), and binary files were handled badly.
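The per-file model is easier to see in a toy form. The sketch below is my own simplification, not how CVS stores anything: the class PerFileRepo is hypothetical, and revisions are plain integers rather than CVS's 1.1, 1.2 style numbers.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PerFileRepo:
    """Toy model: every file carries its own independent revision history."""

    histories: Dict[str, List[str]] = field(default_factory=dict)
    tags: Dict[str, Dict[str, int]] = field(default_factory=dict)

    def commit(self, name: str, content: str) -> int:
        """Store a new revision of one file and return its per-file number."""
        revisions = self.histories.setdefault(name, [])
        revisions.append(content)
        return len(revisions)

    def tag(self, label: str) -> None:
        """Record the current revision number of every file under a label."""
        self.tags[label] = {n: len(revs) for n, revs in self.histories.items()}

    def checkout(self, label: str) -> Dict[str, str]:
        """Rebuild the set of files that the tag ties together."""
        return {n: self.histories[n][rev - 1] for n, rev in self.tags[label].items()}


if __name__ == "__main__":
    repo = PerFileRepo()
    repo.commit("main.c", "v1")
    repo.commit("util.c", "a")
    repo.commit("main.c", "v2")
    repo.tag("release-1")
    repo.commit("util.c", "b")
    print(repo.tags["release-1"])
    print(repo.checkout("release-1"))
```

Here main.c is at revision 2 while util.c is still at revision 1 when the tag is made, so only the tag records that those two revisions belong together.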

CVS nevertheless stood the test of time, and since it was pretty much the only open source system in common use for a very long time, there are still a few CVS-based systems out there. One notable example is FreeBSD, which uses update tools such as cvsup, so it still provides CVS exports even though its core repository is now Subversion.

On to Subversion

Subversion started with the goal of doing CVS right, keeping CVS's interface yet fixing problems such as its lack of ACID guarantees, but it also had a less than easy start. Subversion uses a database as its storage backend, and the one originally used was prone to corruption. As a result Subversion ended up being worse than CVS, as at least with CVS you could salvage some of your files from a trashed repository. I do not recall the exact history, but I think this problem predated Subversion's switch to the FSFS backend.

Subversion basically implements a versioning file-system, in which copies are made very cheap, and every checkin results in a new global revision number. This revision number alone is enough to indicate which version of every file is wanted, and from a build-release perspective this is very useful as the revision number can be used as a build number. This also leads on to Subversion's most annoying feature: you cannot assign labels to revisions.

Since copies (even of entire directory hierarchies) are cheap in Subversion, it models its branching on the manual copy-to-different-location model of forking. The idea is that your repository root contains sub-directories corresponding to your branches. However, this model is also used for tags, whereas many people (like myself) who used CVS previously wanted a way to attach a tag to revision numbers. Subversion allows you to add properties to revision numbers, but provides no way to search or list them. Many moons ago I wrote a load of hackish scripts to add this feature to Subversion, and even though it was basically a bit of a crock, I still received a steady stream of update requests for it many years later.
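A toy model makes the global revision number and the tags-as-copies idea concrete. This is only a sketch under my own assumptions: the class GlobalRevisionRepo is hypothetical, it keeps full trees in memory, and it makes no attempt to model what makes real Subversion copies cheap.

```python
from typing import Dict, List


class GlobalRevisionRepo:
    """Toy model: each commit snapshots the whole tree under one number."""

    def __init__(self) -> None:
        # Revision 0 is the empty tree of a freshly created repository.
        self._trees: List[Dict[str, str]] = [{}]

    @property
    def head(self) -> int:
        return len(self._trees) - 1

    def commit(self, changes: Dict[str, str]) -> int:
        """Apply changes on top of the latest tree; return the new revision."""
        tree = dict(self._trees[-1])
        tree.update(changes)
        self._trees.append(tree)
        return self.head

    def copy(self, src: str, dst: str) -> int:
        """Copy everything under src to dst, as done for branches and tags."""
        tree = self._trees[-1]
        copied = {
            dst + path[len(src):]: content
            for path, content in tree.items()
            if path.startswith(src + "/")
        }
        return self.commit(copied)

    def checkout(self, revision: int) -> Dict[str, str]:
        """One number is enough to pick the version of every file."""
        return dict(self._trees[revision])


if __name__ == "__main__":
    repo = GlobalRevisionRepo()
    repo.commit({"trunk/main.c": "v1", "trunk/util.c": "a"})
    repo.commit({"trunk/main.c": "v2"})
    repo.copy("trunk", "tags/release-1")
    repo.commit({"trunk/util.c": "b"})
    print(repo.head, sorted(repo.checkout(repo.head)))
```

The tag ends up as just another directory in the tree, created by its own revision, rather than a label attached to an existing revision number, which is the behaviour I found annoying.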

The high water mark of my use of Subversion was using it for my PhD dissertation. At the time I tended to work on my dissertation at locations that did not have internet access, so I concluded the best way to manage my dissertation was to have it all on a Subversion repository on a USB stick. This was partly the result of a loss-of-data scare, which contributed to the whole write-up experience being something I do not look back at fondly.

Centralised versus distributed

A main plank of centralised revision control is continuous access to a central repository, and this works well for projects that are too big for everyone to check out, or where many of the developers are new to revision control. For all its faults, Subversion is at least intuitive (centralised master copy) for people who don't (and likely don't want to) understand stuff like forking and branching. And having a single revision number that is incremented is easier to explain than changeset hashes. Subversion is the high water mark of centralised revision control systems, and I expect it to be around for quite a long time. The ability to check out partial repositories and much better server support seem to be things working in Subversion's favour.

Distributed version control is where everyone has their own /local/ repository, and people then push/pull changesets (revisions have less meaning in a distributed system) to each other and/or a central repository. I got into DVCS when I was going for significant periods without access to my central fileserver, and the nice thing is that even for highly experimental changes you are likely to throw away (and hence do not want to check into any master repository) you still get the benefits of version tracking. The only real downside to distributed systems is that they require a decent knowledge of revision control just to get started, and I would not want to go through the pain of forcing DVCS down the throats of people who do not even know what centralised VCS is.
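Why changesets get hashes rather than incrementing numbers is easiest to see in a toy model, since two independent repositories cannot agree on who gets the next number. The sketch below is my own simplification and not how Git or Mercurial actually compute identifiers: changeset_id and LocalRepo are hypothetical names, real systems hash far more metadata, and pull here only transfers history without merging anything.

```python
import hashlib
from typing import Dict, Optional, Tuple

# (parent changeset id, description of the change)
Changeset = Tuple[Optional[str], str]


def changeset_id(parent: Optional[str], change: str) -> str:
    """Derive an identifier from the parent id and the change itself."""
    payload = f"{parent or ''}\n{change}".encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


class LocalRepo:
    """Toy model of one person's local repository."""

    def __init__(self) -> None:
        self.changesets: Dict[str, Changeset] = {}
        self.tip: Optional[str] = None

    def commit(self, change: str) -> str:
        """Record a change on top of the current tip; no server involved."""
        cid = changeset_id(self.tip, change)
        self.changesets[cid] = (self.tip, change)
        self.tip = cid
        return cid

    def pull(self, other: "LocalRepo") -> int:
        """Copy changesets we lack from other; return how many arrived."""
        missing = {
            cid: cs for cid, cs in other.changesets.items()
            if cid not in self.changesets
        }
        self.changesets.update(missing)
        return len(missing)


if __name__ == "__main__":
    alice, bob = LocalRepo(), LocalRepo()
    alice.commit("add parser")
    alice.commit("fix parser bug")
    print(bob.pull(alice), bob.pull(alice))
```

Because the identifier depends only on content and ancestry, both repositories refer to the same changeset by the same name without ever consulting a central counter.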

Mercurial vs. Git

When I first paid proper attention to distributed version control systems (DVCS), there were three that appeared on my radar: Git, Monotone, and Mercurial. I cannot remember why I discounted Monotone quite quickly (and to be fair, possibly others such as DARCS), but I did put a lot of thought into both Git and Mercurial. I opted for Mercurial as at the time it seemed to favour simplicity, whereas Git focused more on its role as a patch handling system for the Linux kernel. In particular Git had many fancy features for cooking revisions (e.g. collapsing several changesets into a single commit) which, for someone new to DVCS, were simply over my head. One particular feature of Git that I initially found to be a headache but appreciate in hindsight is the ability to select which changes in the working copy (uncommitted changes in the checked-out code) go into a commit.

The clinching factor was that Git seemed optimised for applying patches rather than hands-on coding, but these days both Git and Mercurial seem to be adopting each other's features. I still have a leaning towards Mercurial as its Windows support is not the self-flagellation I have found Git's Windows client to be, and it does not have the Linux fan-boy baggage.

Conclusion

Revision control has come a long way in the last 5 or so years, so many more work-flow patterns are catered for than when I first started with these systems. For individual projects revision control is a great help, but when it comes to commercial projects involving multiple people it is basically essential.