Wikipedia:Reference desk/Archives/Computing/2023 June 25

Computing desk
Welcome to the Wikipedia Computing Reference Desk Archives
The page you are currently viewing is a transcluded archive page. While you can leave answers for any questions shown below, please ask new questions on one of the current reference desk pages.


June 25

Unusual disk failure mode

This morning I had one of the disks in a RAID5 set take itself down and trigger the building of the hot spare into the raidset. The failing disk was only brought online on 7/1/23. It's a Seagate Ironwolf ST2000VN004-2E41. The disks app on my AlmaLinux system gives an assessment of "Disk is OK, one bad sector", which is also reported by the SMART Data & Self-tests. SMART also claims "Threshold not exceeded". However, here things get bad. SMART claims Updated as "53 years, 5 months and 23 days ago" and the Self-test Result as "Unknown()". Only a dash is shown against both temperature and Powered On. Trying to start a SMART self-test returns "sk_disk_smart_self_test: Input/output error (udisks-error-quark, 0)".

Going back to the disks display, the size is given as "4.1 GB (4,142,054,400 bytes)", yet as is clear from the type, it is a <s>2GB</s> 2TB disk! Attempting a format (long or short) fails at once with "Error formatting disk / Error wiping device: Failed to probe the device '/dev/sdg/ (udisks-error-quark, 0)". Trying to look at it by hand is not much more use; fdisk fails at once with "fdisk: cannot open /dev/sdg: Input/output error".

Looking at /var/log/messages, the disk is found during the boot and assembled into the raidset. The device mapper builds the mappings and permits the logical volumes to be mounted, but then the error messages start:

  • irq_stat 0x48000008, interface fatal error
  • failed command: WRITE FPDMA QUEUED
  • failed command: READ FPDMA QUEUED

All the above repeated endlessly until eventually the MD system declared it dead and started the rebuild.

I've never seen a disk fail in quite this way, certainly not doubling its capacity and then claiming to have run SMART 53 years ago! I'm guessing that this is now a paperweight, but any comments? Martin of Sheffield (talk) 20:18, 25 June 2023 (UTC)[reply]

By 7/1/23 you do mean 2023-01-07, right? Also note that "53 years, 5 months and 23 days ago" appears to be the UNIX epoch. --142.112.221.43 (talk) 22:31, 25 June 2023 (UTC)[reply]
Some folks around here get upset with ISO dates, better to use the standard forms. Good point about the Epoch, I should have spotted that one. I've probably spent too much time recently with OpenVMS where the epoch is 17 November 1858 (or 1858-11-17 if you prefer). Martin of Sheffield (talk) 07:19, 26 June 2023 (UTC)[reply]
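The epoch observation above is easy to sanity-check. A minimal Python sketch, taking the post date (25 June 2023) from the thread; the one-day difference from the reported "23 days" is plausibly down to time-of-day or timezone in whatever the tool displayed:

```python
from datetime import date
import calendar

def ymd_diff(earlier, later):
    """Calendar difference between two dates as (years, months, days)."""
    years = later.year - earlier.year
    months = later.month - earlier.month
    days = later.day - earlier.day
    if days < 0:
        # Borrow the length of the month preceding `later`.
        if later.month > 1:
            prev_year, prev_month = later.year, later.month - 1
        else:
            prev_year, prev_month = later.year - 1, 12
        days += calendar.monthrange(prev_year, prev_month)[1]
        months -= 1
    if months < 0:
        months += 12
        years -= 1
    return years, months, days

epoch = date(1970, 1, 1)    # the Unix epoch
report = date(2023, 6, 25)  # date of the original post
print(ymd_diff(epoch, report))  # (53, 5, 24)
```

So an all-zero SMART timestamp rendered as an age on the day of the post comes out within a day of the quoted "53 years, 5 months and 23 days".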
The most sensible order for dates is day-month-year in head-initial languages or year-month-day in head-final languages, but I know some use month-day-year. I don't expect month-year-day here, and both day-year-month and year-day-month can be ruled out as there are only 12 months in a year. But to avoid ambiguity, it helps to spell out the name of the month and write all digits of the year. PiusImpavidus (talk) 08:33, 26 June 2023 (UTC)[reply]
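The ambiguity under discussion is easy to demonstrate: the same string "7/1/23" from the original post parses to two different dates depending on the assumed field order. A small Python sketch:

```python
from datetime import datetime

s = "7/1/23"  # the date string from the original post

us = datetime.strptime(s, "%m/%d/%y")    # month-first (US) reading
intl = datetime.strptime(s, "%d/%m/%y")  # day-first reading

print(us.date())    # 2023-07-01
print(intl.date())  # 2023-01-07
```

Spelling out the month name, as suggested above, makes either reading unambiguous.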
2GB sounds a bit small. Google tells me that drive is 2TB and guarantees 3,907,029,168 logical sectors of half a kibibyte each (although physical sectors are 4 kibibytes). Might your tool be confusing sectors and bytes? fdisk -l doesn't lie to me. Chances are that no SMART test was ever run. PiusImpavidus (talk) 08:33, 26 June 2023 (UTC)[reply]
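A quick check of the arithmetic above, using the sector count quoted from the drive's spec and the bogus byte count from the original post:

```python
SECTORS = 3_907_029_168   # logical sectors guaranteed by the spec, as quoted above
LOGICAL_SECTOR = 512      # bytes per logical sector (half a kibibyte)

capacity = SECTORS * LOGICAL_SECTOR
print(capacity)           # 2000398934016 bytes, i.e. ~2.0 TB

reported = 4_142_054_400  # the bogus "4.1 GB" figure the tool reported
print(capacity // reported)  # 482
```

The true capacity is about 482 times the reported figure; close to, but not exactly, the 512 that a plain sector-for-byte confusion would give, so the misreport may be something stranger than a simple unit mix-up.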
Well spotted, I've corrected the 2GB -> 2TB, but the odd 4GB reported was a cut-and-paste. It's really weird though, before my initial post I'd rebooted in order to see if that cleared the problem, and it didn't. This morning though the disk has come up normally, though no longer part of the raidset, perhaps the overnight power-off did it. mdadm --examine /dev/sdh1 shows it as it was yesterday prior to going off line. fdisk -l also shows the partition table and finally the disks program sees it correctly and returns SMART data. I run smartctl -t long routinely every month, just after the level 0 backups have run. Also monthly (though the exact timing varies slightly due to anacron) I run /usr/sbin/raid-check. I've just run a basic benchmark and get 131.3 MB/s average read, 82.0 MB/s average write and 15.46ms for average access time. The latter is maybe a little high, but without the history nothing that would stand out. BTW, I'd forgotten about the Yanks' odd way of doing dates while concentrating on the problem, and saw IP:142.112.221.43 as some sort of metric crusader and I've seen enough edit wars between ISO versus other dates in references sections in the past. Martin of Sheffield (talk) 09:40, 26 June 2023 (UTC)[reply]
Intermittent errors sometimes point to a loose cable. BTW, I'm firmly in the metric camp, as my country has been metric for over two centuries. And I never use anything other than day-month-year order. PiusImpavidus (talk) 18:33, 26 June 2023 (UTC)[reply]
I've already reseated the cables, but still get errors although no access is being attempted. New disk on order. :( Personally I always use ISO when programming or scripting and d/m/y in real life. In Wiki articles I'd never use ISO though because it will only trigger off complaints from some users and bots. Martin of Sheffield (talk) 19:02, 26 June 2023 (UTC)[reply]
What was the temperature of the disk when it failed? Ruslik_Zero 20:04, 26 June 2023 (UTC)[reply]
Sorry, not recorded. Martin of Sheffield (talk) 20:35, 26 June 2023 (UTC)[reply]