Revision Control Systems

28 March 2010
Although I currently use Subversion, I recently decided to see what other revision control systems are out there. In particular I wanted to find out about distributed revision control as done by Git (and Monotone, Darcs, etc). Although the centralised systems work well for me as I like having a single master copy, I had in the last year run into situations where distributed source control might actually have been of practical use.

Background

Although people stereotypically use revision control systems in order to keep a long-term track of changes made to programs, I started using CVS for more mundane reasons. Before then I used to use a combination of tar/zip and FTP, but that was a complete pain even at the best of times.

Taking it's-working snapshots: It is rare for any program to always be in a working state while it is being developed. In fact it is not uncommon for a code-base to go through stages of being completely broken, and on occasion things go horribly wrong and you do not even remember what you have changed. The traditional guard against this is to make regular copies of the source files (the old prog.c, prog2.c, prog2.c.bak, etc).
Off-system & off-site backups: Hard drives die, people do rm -rf ~ dir rather than rm -rf ~/dir, laptops get stolen, motherboards overheat & melt, lightning strikes fry PSUs, etc.
Migrating between systems: As an undergrad I frequently did work on multiple systems, partly due to my then-preference for working in the department during the day. Although this does (unintentionally) solve the above problem of backups, it also introduces the problem of consistency (i.e. are you really working with the latest code, or are you redoing work you did last night).

It was the last point that stung the most, particularly as leaving university labs in the evenings can be quite a rush. However CVS solved all the problems in one go: with a single cvs commit I had my snapshot, it was on a system other than the one I had just been working on, and I did not have to worry about which files needed copying around. A regular cvs update from departmental systems also meant that I had a usable off-site backup, just in case my flat burned down.

But what about versioning?

All the automation meant that if I thought a snapshot/backup was necessary, I just took it without thinking. An unnecessary revision in the repository was no big deal, but being able to undo 20 minutes of work by reverting a file rather than doing an hour of remedial bug-hunting is what shows the benefits of version control.

..and branching?

To me branching only makes sense when you have multiple people working on a project, and some of them are making changes that will leave the code-base in a broken state for some time. If you are the only person working on a project, then it is easier to develop new/experimental features in-situ as new modules using conditional compilation.

To be fair how CVS handled branches did look like a bit of a hack, and at the time it seemed a complicated alternative to forking off a separate project. I suspect this is one of the reasons why Subversion does not distinguish between projects and branches.

Why CVS?

These days the typical reaction to CVS is a variant of "Yuk", and there are good reasons why these days many people would not touch it with a barge pole. However back in 2001, when I first seriously used revision control, CVS was the only realistic thing going. Subversion was still off the radar, and that still had its own issues with stability years later.

CVS started out as RCS (an earlier system for versioning individual files) and a few shell scripts, and although it was later reimplemented as a single binary it did not have any hard guarantees regarding file integrity. It also has a poor security model, and earlier versions handled non-text files particularly badly. However these disadvantages are not an issue if it is only being used over reasonably secure & reliable connections (e.g. private LANs).

Spin forward a few years..

I stopped using CVS several years ago because it was getting a bit too dated for my liking, and since about early 2007 I have used Subversion, as the problems that plagued Subversion circa 2004 had been rectified by then. Subversion does have a few annoyances (not being able to tag revisions being the major one, as tags would otherwise provide an easy way to distinguish between "milestone" and "backup" commits), but in the overall scale of things they are minor.

I have yet to go to the extreme of keeping my websites in Subversion, but I can foresee doing so in the future. Any programming project that goes beyond throwaway experimentation using a single source file I now store in a Subversion repository. In some cases where I have modified a small part of an otherwise vast code package (e.g. Qualnet or Radiance), moving the repository between machines is less of a pain than moving the source code itself.

The centralisation starts to bite

The problem with Subversion's centralised approach is that it assumes you always store your repositories in one location, which is also (almost) always accessible from anywhere you do programming. When I was writing up (and LaTeX-ifying) my PhD dissertation I was in and around London and hence did not have access to my LAN. For this reason I decided to keep the repository on a memory stick, although on the whole this is not a good way to handle master copies (it is a lot easier to lose a memory stick than a server). I certainly did not want to relocate repositories that were already kept on my fileserver, as then I would have problems keeping track of repositories. Having a week or so of uncommitted changes, although a pain, was just about bearable.

What is distributed revision control?

Rather than having a central repository that everyone checks out from and submits changes to, everyone keeps a local repository on their own systems. Changes are then exchanged peer-to-peer between each person's repository. This allows an individual developer to have the benefits of being able to commit/branch/merge/rollback at will, even when they do not have an available network connection to other repositories. Only when an actual feature is in a stable state do they communicate the (overall) changes to other people's repositories.

In practice most projects will have a (nominated) master repository somewhere, rather than development being entirely distributed. In this case the master copy only gets stable milestone commits, whereas all the random commits (and abandoned experimental branches) which are not of long-term importance are kept local.
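As a rough sketch of that workflow using Git, the commands below take a snapshot, branch, and merge entirely locally, and only push once the result is stable. The directory names, file contents and commit messages are made up for illustration, and a local bare repository stands in for the nominated master, which in practice would live on another machine.

```shell
set -e

# Scratch area; master.git stands in for the nominated master repository.
work=$(mktemp -d)
git init -q --bare "$work/master.git"

# Local repository plus working copy; no network access is needed from here on.
git init -q "$work/proj"
cd "$work/proj"
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo 'int main(void) { return 0; }' > prog.c
git add prog.c
git commit -q -m "Working snapshot"

# Experiment on a branch, then merge it back once it works.
git checkout -q -b experiment
echo '/* experimental feature */' >> prog.c
git commit -q -a -m "Try an experimental feature"
git checkout -q -                          # back to the previous branch
git merge -q experiment

# Only now are the changes communicated to the master copy.
git push -q "$work/master.git" HEAD
```

Everything before the final push happens against the local repository, so it would behave the same with no network connection at all.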

And why do this for single-person projects?

The scenario that made me actually try distributed version control was wanting revision control from the very start of a project without being sure where I wanted to keep the repository. My personal preference is not to put projects into my master repository until I am sure they are of sufficient size and future interest.

One solution I considered was always starting with a local Subversion repository and then migrating it to my central repository store, but one problem with this is stale repository paths in checked out working copies. It also has to be done using svnadmin via a shell login.

Conclusion on Git

On the whole I felt that Git did not provide what I was looking for. Git is no doubt much more complex than Subversion, but this extra complexity did not bring any real benefits as it does not target my use case.

The good bits of Git

Separation of commits & master repository synchronisation: Offline versioning is one of the things that made me try Git in the first place, even though being offline while writing code is very rare for me.
Combined import & checkout: One thing I dislike about Subversion is that svn import never turns the imported directory into a working copy. I can understand why doing so is not default behaviour, but git init fits my modus operandi a lot better.
Proper tags: Subversion's approach to tags is to create a branch, and then possibly use server-side hooks to stop this "tag" being messed with. I wrote SVN-Label partly in response to this, and embarrassingly more people asked about this proof-of-concept hack than any other program.
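A minimal sketch of the last two points, assuming a directory that already contains unversioned source (the file and the tag name here are invented for the example): git init turns the directory into a repository and working copy in place, and an annotated tag marks a milestone commit.

```shell
set -e

# An existing directory of unversioned source (contents are illustrative).
dir=$(mktemp -d)
cd "$dir"
echo 'int main(void) { return 0; }' > prog.c

# git init works in place: the directory becomes repository and working copy.
git init -q .
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
git add .
git commit -q -m "Import existing sources"

# An annotated tag marks this commit as a milestone rather than a backup.
git tag -a milestone-1 -m "First milestone"
git tag -l
```

Further commits made in the same directory go straight into the repository, with no separate checkout step after the import.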

..and the annoying bits

No remote command-line browsing: I tend to keep several projects in the same repository so it is possible to browse all of my projects without downloading them. For instance:
svn ls svn://fileserver/LinuxProj
svn ls svn://fileserver/WinProj/NotepadCE
As an aside I know it would be better to have an 'index' repository that uses svn:externals to link to the others, but that would break if I accessed it via svn+ssh rather than svnserve. The problem is that the nearest thing Git provides is listing the branches:
git ls-remote git://fileserver/myProj
This is irritating as a main reason why I have a central repository is to make it easy for me to find past projects, perhaps only grabbing 1 or 2 specific files/directories.
Copying repositories to server is an admin process: This is the killer. Although it is possible to use branches as project identifiers, it is not the intended use of branches:
git push git://fileserver/test1 master:test1
The documented way of adding a repository to a central server is to make a local bare clone of it, then upload this clone to the server. This means I have not gained anything compared to using svnadmin to merge separate repositories.
Authentication system: Whereas svnserve has a basic authentication mechanism, the Git protocol itself does not do authentication at all (it assumes that is dealt with by SSH or mod_dav). Even on my (secure) private LAN I prefer to have at least some form of authentication, although this is intended as a safeguard against screwed-up configurations rather than a serious security measure.
One project per repository: One thing I miss from CVS is the idea of separate projects all in the same repository, as it allows new projects to be created on a central server without having to use admin programs. However since Git (and, with the exception of Monotone, every other distributed revision control system I've seen) ties together local repositories & working copies, I do not really see how it would fit into Git's model.
To be fair Subversion's model of presenting everything as a version-controlled filesystem is also one-project-per-repository. However unless you are picky about projects drawing from the same pool of revision numbers, this is a non-issue.
Windows support: There is not yet a production non-Cygwin Windows client, although I suspect msysgit will solve that problem before long. Bit of a killer as I happened to be working on a Windows Mobile project at the time, and there are no Linux-based Windows Mobile 6.x emulators.
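The bare-clone route mentioned under "Copying repositories to server" looks roughly like the following. This is only a sketch: a local directory stands in for the server, and the upload step (a plain cp here) would really be an scp or similar to wherever the server keeps its repositories, which is exactly the admin-style step complained about above.

```shell
set -e

tmp=$(mktemp -d)
cd "$tmp"

# A small local project with one commit (contents are illustrative).
git init -q test1
cd test1
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
echo "test project" > README
git add README
git commit -q -m "Initial commit"
cd ..

# Step 1: make a bare clone (repository data only, no working copy).
git clone -q --bare test1 test1.git

# Step 2: upload the bare clone. The "server" directory is a stand-in;
# on a real setup this would be a copy over SSH to the server's storage.
mkdir server
cp -r test1.git server/

# Step 3: point the original working copy at the uploaded repository.
cd test1
git remote add origin "$tmp/server/test1.git"
echo "more" >> README
git commit -q -a -m "Second commit"
git push -q origin HEAD
```

Note that step 2 needs file-level access to the server rather than going through the Git protocol, which is why it does not improve on the svnadmin approach.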

What about other revision control systems?

I think distributed source control is not really the solution to my problem, as my setup is fundamentally still centralised, and distribution tries to solve a different problem. I did give Darcs & Monotone a look as well, but their approach did not seem fundamentally different from Git. What I really want is off-line access via local caching of commits, and for this SVK and git-svn (Git's Subversion frontend) are possible fits.