Botchups

28 August 2012
Over the last 2 days I was doing QA on some hardware due for shipping to a company client, and the whole scenario is the best example of what is technically known as a fuckup. The hardware was custom-built servers, and this story centres around the 4-bay Icy Dock drive cages they include, which are intended to take 1 terabyte (read: not cheap) 2.5" drives. I was just getting onto unit 4 around lunchtime, and it was about 24 hours later that I finally worked out what was wrong. For obvious reasons, the account here is written for conciseness rather than strict chronological accuracy.

Nothing of real concern

At first I suspected a dodgy SATA cable or two, which was not unreasonable as some of the cables had been twisted around a lot in the making of the prototype units. This was no big issue as the company had about 200 new ones in stock, but having replaced one of them, the associated cage bay was still acting as before. Next up was the motherboard, as at least one other motherboard had been flagged as faulty following failed Windows test-installs. However, unplugging the SATA cable from the cage and plugging it into a hard drive directly showed no problem there, so it was clearly neither a cable nor a motherboard fault.

Now getting worried

As a sanity check, I put the server to one side and tested some of the drives in a previously passed server build, and they seemed to be OK. I then tried the reverse, but drives from the passed system did not work in the first (faulty) system, so I suspected a faulty drive cage. Since this was mounted on a removable assembly, it was easily swapped out for a different one, but a new cage did not solve the problems.

By this point I had tried a new drive cage, new SATA leads, and several fresh hard drives. I also knew it was not the motherboard, as the individual ports had been tested by plugging a drive in directly. I was wondering whether I should simply condemn the whole system as cursed, but since a Windows licence had already been activated on it, I was reluctant to waste the key.

My next guess was a short-circuit somewhere. The laptop-sized drives are bolted into metal caddies, so I thought that the clearance between the base of the caddy and the drive electronics was possibly a bit marginal. It was a long shot as previous drives of the same model had been fine, but the caddies were not the most rigid of designs, occasionally needing a bit of work with pliers. I put in some acrylic washers to give an extra half-millimetre of clearance, but the same problems as before remained. Given the impending delivery date and other tasks piling up, stress was on the up.

The penny drops

A fresh drive straight out of the anti-static bag placed into a caddy was fine when connected directly to the motherboard (the caddy design allowed this), but as soon as the caddy went into the drive cage, nothing. To make matters worse, reconnecting the drive directly to the motherboard also gave nothing. It was obviously related to the drive cage, but I had already replaced that once. Then it dawned on me that there was one component common to all the tests:

Yep, the pre-made power lead had the +5V and +12V wires swapped, and to complicate matters this only affected 2 of the 4 cage bays. However, any drive plugged into a bay affected by the miswiring was flash-fried, basically killed immediately. Doing an audit-check of all the drives in the shipment, I worked out that 10 drives had been FUBAR'd.