
- 2 months ago
It was a dark and stormy night outside the PCPP Benchmarking lab. Storage devices were being happily benchmarked until the power went out shortly after 1 AM.
Some benchmark machines failed to boot after the power outage
When we came in the next morning, the power was back on. We restarted the benchmarks for the storage devices, but noticed that 3 of the benchmark machines would no longer boot. Curiously, all 3 machines behaved identically: on every boot, they displayed garbled video output.
The keyboard still functioned. Numlock could be turned on and off, and CTRL+ALT+DEL would cause the machine to reboot again. But every boot ended in the same garbled display before we had a chance to enter the BIOS.
Storage devices were causing the boot failures
On all 3 problematic machines, removing the storage device being benchmarked restored normal boot behavior. Reinstalling the storage device would cause boot to fail again. Further, moving a problematic storage device to a new benchmark machine caused the new machine to stop booting.
Other clues and details
Debug Code LED pointed to storage device issues too
When the machines failed to boot, in addition to the garbled display, the Debug Code LED indicated the machine was in the "IDE Detect" stage.
All 3 problematic storage devices were 2TB SMR HDDs
The fact that all 3 problematic storage devices were 2TB SMR Hard Disk Drives was head-scratching. A 1TB SMR HDD had also been running benchmarks during the power outage, but it did not result in any boot issues.
Replacements ordered and problematic disks put aside for later investigation
We ordered replacements for the apparently failed disks, and successfully benchmarked the replacements. Perhaps unstable power had somehow corrupted or damaged the 3 failed devices? The common size and SMR property seemed a bit much to be a coincidence, so we put the failed devices aside to investigate more later.
A SATA SSD helped unlock the mystery
Two weeks later, a new SATA SSD being benchmarked showed some odd S.M.A.R.T. data behavior. After investigating, we performed an ATA SECURITY ERASE to reset the drive to a fresh out-of-box (FOB) state. The command ran for a while and eventually returned an error. The drive was no longer responsive, and a reboot didn't help. So we turned the machine off and back on, and saw something familiar.
Garbled video on boot again
The benchmark machine failed to boot just like the ones that failed weeks prior after the power outage. But this time, there was no power outage involved, and the storage device involved was a SATA SSD, not an SMR HDD.
Garbled video, plus the answer
We installed the SSD on the machine we use to develop and test benchmarking changes, and saw similar garbled output. But this time we also saw a prompt asking us to enter the password for the SSD.
All of our machines in our benchmarking lab run the same BIOS version, but we sometimes run different BIOS versions on machines used for development. This development machine happened to have a newer BIOS version. Despite still showing garbled video output, a portion of the screen was clear enough to see that the system was prompting for a password for the connected storage device.
Connecting a "failed" device on a completely different machine
We then connected the SSD to a system with a motherboard from a different manufacturer, and saw a nice clean image conveying a similar message that the "HDD" was locked and a password was needed.
What had happened to the "failed" devices?
To answer the question of what had happened, we first have to cover how we reset a stateful SATA device (like an SSD or SMR HDD) to a fresh out-of-box (FOB) state.
Running ATA SECURITY ERASE
To reset to FOB state, we run the ATA SECURITY ERASE command on the storage device. Running this command requires first setting a password on the device. Successful completion of the ATA SECURITY ERASE command clears the password.
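One common way to issue this sequence on Linux is the hdparm utility. The sketch below assumes hdparm; the device path and password are placeholders, and this is not necessarily the tooling used in the lab. The helper only builds the two command lines and does not run them.

```python
from typing import List


def security_erase_commands(device: str, password: str) -> List[List[str]]:
    """Build (but do not run) hdparm command lines for an ATA SECURITY ERASE.

    Assumption: hdparm on Linux; device and password are placeholders.
    """
    base = ["hdparm", "--user-master", "u"]
    return [
        # Step 1: set a user password, which enables ATA security on the drive.
        base + ["--security-set-pass", password, device],
        # Step 2: erase. A successful erase clears the password again.
        base + ["--security-erase", password, device],
    ]


if __name__ == "__main__":
    for argv in security_erase_commands("/dev/sdX", "temp-pass"):
        print(" ".join(argv))
```

The gap between step 1 and the end of step 2 is the window in which the drive is password protected.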
ATA SECURITY ERASE had been interrupted on the SMR HDDs
For most SSDs, ATA SECURITY ERASE takes a short time (no more than a few minutes). However, for SMR HDDs, the command involves writing data to the entire disk, and takes hours. Reviewing logs, we confirmed that all 3 "failed" SMR HDDs were in the middle of running ATA SECURITY ERASE when the power went out.
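A back-of-the-envelope estimate shows why a full-disk write takes hours. The sketch below assumes a sustained write rate of 150 MB/s, which is an illustrative figure and not a measurement from these drives.

```python
def erase_hours(capacity_bytes: float, write_bytes_per_sec: float) -> float:
    """Hours needed to overwrite the whole disk once at a constant rate."""
    return capacity_bytes / write_bytes_per_sec / 3600


# Assumed figures: 2 TB (decimal) capacity, 150 MB/s sustained writes.
hours = erase_hours(2e12, 150e6)
print(f"{hours:.1f} hours")  # about 3.7 hours
```

Under those assumptions, an outage at any point in a window of several hours would interrupt the command.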
ATA SECURITY ERASE failed on the SATA SSD
The problematic SSD did not lose power, but instead hit an error while running ATA SECURITY ERASE.
Failing to successfully complete ATA SECURITY ERASE leaves the device password protected
The failure of the storage devices to successfully complete ATA SECURITY ERASE, whether due to power outage or an error, meant that the password set prior to running the command was not cleared.
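The logic can be captured in a toy model. This is a simplified illustration, not real drive firmware behavior in full; the `Drive` class and its method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Drive:
    """Toy model of ATA security state (hypothetical, simplified)."""
    password: Optional[str] = None  # None means no password is set

    def set_password(self, pwd: str) -> None:
        self.password = pwd

    def security_erase(self, pwd: str, completes: bool = True) -> bool:
        if self.password is None or pwd != self.password:
            raise PermissionError("erase requires the current password")
        if completes:
            # Only a successful erase clears the password.
            self.password = None
        return completes

    def prompts_at_boot(self) -> bool:
        return self.password is not None


ok, interrupted = Drive(), Drive()
for d in (ok, interrupted):
    d.set_password("temp-pass")
ok.security_erase("temp-pass")                           # runs to completion
interrupted.security_erase("temp-pass", completes=False)  # power loss or error
print(ok.prompts_at_boot(), interrupted.prompts_at_boot())  # False True
```

Whether the interruption is a power outage or a drive error makes no difference in this model: the password set in the first step is still there.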
Fixing the problematic devices
Even at the original garbled screen, the PC is actually still responsive, and boot will continue once either the correct password is entered or a handful of attempts with a blank password have failed. The machine will then boot into its OS, where the password can be cleared to eliminate the issue on future boots.
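On Linux, clearing the password from the OS could look like the following. This is a sketch that assumes hdparm, a known password, and a drive that still accepts security commands once the OS is up; the device path and password are placeholders. The helper only builds the command lines and does not run them.

```python
from typing import List


def clear_password_commands(device: str, password: str) -> List[List[str]]:
    """Build (but do not run) hdparm command lines to remove an ATA password.

    Assumption: hdparm on Linux; device and password are placeholders.
    """
    base = ["hdparm", "--user-master", "u"]
    return [
        # A password-protected drive comes up locked after a power cycle.
        base + ["--security-unlock", password, device],
        # Disabling security removes the password from the drive.
        base + ["--security-disable", password, device],
    ]


if __name__ == "__main__":
    for argv in clear_password_commands("/dev/sdX", "temp-pass"):
        print(" ".join(argv))
```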
More details on the password prompt at boot
Most consumer motherboard BIOSes will pause boot and prompt for a password if a SATA password-protected storage device is present. This behavior makes sense if one is trying to boot from such a device, but can get in the way otherwise. It appears few BIOSes have the option to disable this behavior, which means every boot on a system with a SATA password-protected storage device requires interaction before BIOS options or an OS can be loaded.
For us, the garbled video bug during the password prompt turned a minor annoyance into confusing apparent boot failures. We were relieved to discover the issue was actually a straightforward one disguised as something more serious.