A faulty CPU

09 July 2022
Earlier this year I built myself a new “personal” workstation powered by an AMD Ryzen 5 5600X CPU, but over the last two months I noticed Linux emitting the occasional hardware error message. Searching for details of these messages gave mixed signals about whether they were an actual problem, with everything from them being a sign of a failing part to them being spurious warnings. Linux is particularly noisy in reporting such errors, echoing them to every single open XTerm window, which I eventually decided should be treated as a bad sign, so I sent the CPU back under warranty for a replacement. I did not know for sure whether it was actually faulty, but in the worst case testing by my vendor as part of the returns process should show conclusively whether the part is actually stable.

Cause for concern

The errors were occurring about once every 2–3 weeks, so it was clear these were not random one-off glitches. Being mindful that reports of these errors on the internet were associated with heavy loads, it seemed plausible they would be more frequent if I actually stressed the system. I forget which software I used, but at some earlier point I did try out some benchmarking software, which resulted in an apparent hard lock, although I was unsure whether this was just the X Window server being starved of CPU time or an actual hardware crash.

Message from syslogd@ at Sat May 7 10:47:23 2022 ...
 : [Hardware Error]: Corrected error, no action required.
 : [Hardware Error]: CPU:1 (19:21:0) MC20_STATUS[-|CE|-|-|-|-|-|-|-]: 0x9000002b51e95d5b
 : [Hardware Error]: Coherent Slave Ext. Error Code: 41
 : [Hardware Error]: IPID: 0x0000000000000000
 : [Hardware Error]: cache level: L3/GEN, tx: GEN

Message from syslogd@ at Mon May 30 03:22:33 2022 ...
 : [Hardware Error]: Deferred error, no action required.
 : [Hardware Error]: CPU:1 (19:21:0) MC22_STATUS[Over|-|-|-|PCC|-|UECC|Deferred|Poison|Scrub]: 0xd349f9890824448b
 : [Hardware Error]: Northbridge IO Unit Ext. Error Code: 36
 : [Hardware Error]: IPID: 0x0000000000000000
 : [Hardware Error]: cache level: L3/GEN, tx: GEN

Message from syslogd@ at Sat Jun 4 23:38:27 2022 ...
 : [Hardware Error]: Corrected error, no action required.
 : [Hardware Error]: CPU:1 (19:21:0) MC14_STATUS[-|CE|MiscV|-|-|-|-|Poison|-]: 0x894c08c483481d75
 : [Hardware Error]: IPID: 0x0000000000000000
 : [Hardware Error]: L3 Cache Ext. Error Code: 8
 : [Hardware Error]: cache level: L1, tx: DATA

Message from syslogd@ at Sun Jun 26 22:28:47 2022 ...
 : [Hardware Error]: Deferred error, no action required.
 : [Hardware Error]: CPU:1 (19:21:0) MC19_STATUS[-|-|-|AddrV|-|-|UECC|Deferred|Poison|Scrub]: 0x8548fffffdd2e8ca
 : [Hardware Error]: Error Addr: 0x0000000000000000
 : [Hardware Error]: IPID: 0x0000000000000000
 : [Hardware Error]: Coherent Slave Ext. Error Code: 18
 : [Hardware Error]: cache level: L2, tx: GEN

One thing that caught my attention with the most recent error message was that the errors were always associated with CPU 1 even though the errors themselves were always different. If it were a motherboard fault I would not expect CPU 1 to be the only one reporting errors, and checking the memory with memtest86 showed the memory modules themselves to be fine. I had read about power supply and cooling issues being a common cause, but the power supply unit I got was not an overloaded cheap one, and with the errors being reported at times when the system was under minimal load it was unlikely to be a temperature issue either.
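The pattern described above (same CPU every time, a different error source each time) can be checked mechanically rather than by eye. What follows is only a rough sketch in Python and not something I ran at the time: it assumes the messages have been saved as text, and the names `LOG`, `HEADER_RE`, `STATUS_RE` and `summarise` are made up for illustration. It pulls the CPU number and the machine-check bank number (the NN in MCNN_STATUS) out of each report, along with the number of calendar days between reports.

```python
import re
from collections import Counter
from datetime import datetime

# Header and status lines copied from the four messages quoted above;
# the remaining detail lines are left out because they are not parsed here.
LOG = """\
Message from syslogd@ at Sat May 7 10:47:23 2022 ...
 : [Hardware Error]: CPU:1 (19:21:0) MC20_STATUS[-|CE|-|-|-|-|-|-|-]: 0x9000002b51e95d5b
Message from syslogd@ at Mon May 30 03:22:33 2022 ...
 : [Hardware Error]: CPU:1 (19:21:0) MC22_STATUS[Over|-|-|-|PCC|-|UECC|Deferred|Poison|Scrub]: 0xd349f9890824448b
Message from syslogd@ at Sat Jun 4 23:38:27 2022 ...
 : [Hardware Error]: CPU:1 (19:21:0) MC14_STATUS[-|CE|MiscV|-|-|-|-|Poison|-]: 0x894c08c483481d75
Message from syslogd@ at Sun Jun 26 22:28:47 2022 ...
 : [Hardware Error]: CPU:1 (19:21:0) MC19_STATUS[-|-|-|AddrV|-|-|UECC|Deferred|Poison|Scrub]: 0x8548fffffdd2e8ca
"""

# Hypothetical patterns written for this excerpt only; other kernels or
# syslog daemons may format these lines differently.
HEADER_RE = re.compile(r"Message from syslogd@\S* at \w{3} (\w{3} +\d+ [\d:]+ \d{4})")
STATUS_RE = re.compile(r"CPU:(\d+) \([^)]*\) MC(\d+)_STATUS")


def summarise(text):
    """Return which CPUs and banks reported errors, and the gaps between reports."""
    # The weekday name is skipped by the regex; only the date part is kept.
    dates = [
        datetime.strptime(stamp, "%b %d %H:%M:%S %Y").date()
        for stamp in HEADER_RE.findall(text)
    ]
    reports = [(int(cpu), int(bank)) for cpu, bank in STATUS_RE.findall(text)]
    return {
        "cpus": Counter(cpu for cpu, _ in reports),
        "banks": [bank for _, bank in reports],
        "gaps_days": [(b - a).days for a, b in zip(dates, dates[1:])],
    }


summary = summarise(LOG)
print("reports per CPU:", dict(summary["cpus"]))
print("banks, in order:", summary["banks"])
print("days between reports:", summary["gaps_days"])
```

Four reports are far too few for statistics, but a summary like this makes it easier to see at a glance whether the reports cluster on one core or are spread across the whole package.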

Too long to wait & see

Ultimately I questioned why I should be getting any errors at all on a brand-new CPU I had splashed out about £300 on, let alone at what now looked like at least semi-regular intervals, especially since my other AMD-based system never had any such messages after a year of almost continuous uptime. My previous plan was to wait and see if I got any actual system issues, but there are limits on how long after purchase items can be returned under fitness-for-purpose legislation, and it had already been close to six months since the CPU was ordered. Waiting for an actual crash could well take too long, and since this is a system I needed to have full confidence in, the part had to go back.

Removing the CPU

The vendor's verdict

It was barely lunchtime on the day the vendor's returns department received the CPU and their ‘live’ status portal was already listing it as “fault found”. Digging a little deeper using the invoice number for the replacement part, I am guessing testing was over even before the morning coffee break, which made me wonder how on earth I managed to use the CPU for so long without any real trouble. I have no idea what sort of test bench they have, but setting up a CPU with thermal paste and a heat sink in itself takes a while. The vendor was even good enough to give me Saturday delivery of the replacement, although it would be over a week before I got the chance to install it.

Replacement

The test report was very terse at only seven words, but it amounted to the following: Prime95 was showing errors; messages were appearing in the Windows system logs; and “crashed intermittently”. Undoubtedly there would have been trouble had I done anything seriously CPU-intensive with this system. Pretty much the first thing I did after receiving and installing the replacement unit was to run MPrime, which is the Linux version of the Prime95 utility.

Remarks

Linux is unusually tolerant of errors, as I have known it to cope with bad memory that Windows immediately failed with, but as it turned out the correct thing here was to treat this sort of error reporting from Linux as something that should not be happening at all with new hardware. If anything I should have returned it earlier, but due to the light loading of the workstation it was only now that I had seen enough errors to conclude they were more than random glitches. Since I had to almost entirely dismantle the motherboard to get the CPU off, I had thought about transplanting it into my old workstation case, but in the end decided not to because my workspace area had already been rearranged to accommodate the taller tower case.

Working system