Fileserver panic & why RAID is a good idea

01 August 2009
One fine morning (well, in reality it was 11am and raining) I was doing some computer work and for some reason decided to look at the website for one of my programming projects. As it happened, Firefox pulled up the address for the local copy on my fileserver. Unable to connect. Hmm, maybe Apache has died. Time to try SSH. Connection refused. The fans in the fileserver still seemed to be spinning, so I hooked up a spare monitor. Some odd message along the lines of "IDE not found" was onscreen, and since there was no keyboard attached there was little I could do other than hit reset.

Ext3-fs INFO: recovery required on readonly filesystem
Ext3-fs: write access will be enabled during recovery

This journal recovery notification isn't of any major concern, as it is a fairly routine message after an unclean shutdown. However, after a few minutes I got a message about "CPU context corrupted" followed by a kernel panic. Skip ahead past some faffing to me transplanting the SATA drive into a different system...

S.M.A.R.T: BAD - Backup and Replace.

It was gonna be a long day, as I had stuck a lot of stuff onto the fileserver since I last mirrored it. The only good news was that the /boot partition mounted successfully, which together with the absence of any unusual drive sounds (I have heard the clicks of death before) suggested that at least I was not dealing with a physical head crash.

By this point it was clear the drive was living on borrowed time and it was time to ghost the partitions. Mercifully, trying to remind myself of the dd command options made me find dd_restore, and a trip to PC World later (£65 is a bit excessive for a 500GB drive in mid-2009, but at times like this cost is not really a concern) I was busy pulling data off the drive. Seeing it zip along at circa 40MB/sec was at least some initial comfort, but the points where it stalled and threw out read errors were not.
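For anyone wondering what this kind of rescue copy boils down to, here is a minimal Python sketch of the general idea: read the source block by block, and when a block cannot be read, write zeros in its place and carry on rather than aborting. This is not how the tool I used is implemented; the names `rescue_copy` and `read_block` and the 64 KiB block size are my own illustrative assumptions, and real rescue tools do a good deal more (retrying with smaller blocks, keeping a log so a copy can be resumed).

```python
import os


def read_block(src, offset, length):
    """Read `length` bytes at `offset`; raises OSError on a media error."""
    src.seek(offset)
    return src.read(length)


def rescue_copy(src_path, dst_path, block_size=64 * 1024, read=read_block):
    """Copy src to dst block by block, zero-filling blocks that fail to read.

    Returns a list of (offset, length) tuples for the blocks that were lost.
    `read` is injectable so that failures can be simulated in a test.
    """
    bad_blocks = []
    # buffering=0 so each read maps onto one request for exactly this block.
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        # Seeking to the end gives the size for block devices as well as files.
        size = src.seek(0, os.SEEK_END)
        offset = 0
        while offset < size:
            length = min(block_size, size - offset)
            try:
                data = read(src, offset, length)
            except OSError:
                data = b""
            if len(data) != length:
                # Simplification: treat a failed or short read as a lost block
                # and keep the image aligned by writing zeros instead.
                bad_blocks.append((offset, length))
                data = b"\x00" * length
            dst.write(data)
            offset += length
    return bad_blocks
```

Something like `rescue_copy("/dev/sdb1", "data.img")` (hypothetical paths) would then return the list of ranges that had to be zero-filled. Copying to an image first means any later filesystem repair works on the copy rather than on the dying drive.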

As it turned out, my main data partition only had around 32 unreadable clusters, and no actual data loss. When I mounted the copied image the journal played back fine and I don't think any errors showed up. I suspect this was down to the low system load: the journal for the data partition was more or less empty and the filesystem was very close to a consistent state. In contrast, /var seemed to be completely hosed and I did not bother trying to recover it; not too surprising, as Linux flushes the system logs to /var constantly. It seems that deliberately putting /var on its own partition when I built the fileserver was a very good move.

Needless to say, when I rebuilt the fileserver I opted for RAID mirroring.