
Storage devices that stop PCs from booting

aggieNick02
  • 2 months ago

It was a dark and stormy night outside the PCPP Benchmarking lab. Storage devices were being happily benchmarked until the power went out shortly after 1 AM.

Some benchmark machines failed to boot after the power outage

When we came in the next morning, the power was back on. We restarted the benchmarks for the storage devices, but noticed three of the benchmark machines would no longer boot. Curiously, all 3 machines behaved identically. When booted, they would display garbled video output as shown:

Image

The keyboard still functioned. Num Lock could be toggled on and off, and CTRL+ALT+DEL would cause the machine to reboot again. But every boot ended in the same garbled display before there was any chance to enter the BIOS.

Storage devices were causing the boot failures

On all 3 problematic machines, removing the storage device being benchmarked restored normal boot behavior. Reinstalling the storage device would cause boot to fail again. Further, moving a problematic storage device to a new benchmark machine caused the new machine to stop booting.

Other clues and details

Debug Code LED pointed to storage device issues too

When the machines failed to boot, in addition to the garbled display, the Debug Code LED indicated the machine was in the "IDE Detect" stage.

All 3 problematic storage devices were 2TB SMR HDDs

The fact that all 3 problematic storage devices were 2TB SMR Hard Disk Drives was head-scratching. A 1TB SMR HDD had also been running benchmarks during the power outage, but it did not result in any boot issues.

Replacements ordered and problematic disks put aside for later investigation

We ordered replacements for the apparently failed disks, and successfully benchmarked the replacements. Perhaps unstable power had somehow corrupted or damaged the 3 failed devices? The common size and SMR property seemed a bit much to be a coincidence, so we put the failed devices aside to investigate more later.

A SATA SSD helped unlock the mystery

Two weeks later, a new SATA SSD being benchmarked showed some odd S.M.A.R.T. data behavior. After investigating, we performed an ATA SECURITY ERASE to reset the drive to fresh out-of-box (FOB) state. This ran for a while and eventually returned an error. The drive was no longer responsive, and a reboot didn't help. So we turned the machine off and back on, and saw something familiar.

Garbled video on boot again

The benchmark machine failed to boot just like the ones that failed weeks prior after the power outage. But this time, there was no power outage involved, and the storage device involved was a SATA SSD, not an SMR HDD.

Garbled video, plus the answer

We installed the SSD on the machine we use to develop and test benchmarking changes, and saw similar garbled output. But this time we also saw a prompt asking us to enter the password for the SSD.

Image

All of our machines in our benchmarking lab run the same BIOS version, but we sometimes run different BIOS versions on machines used for development. This development machine happened to have a newer BIOS version. Despite still showing garbled video output, a portion of the screen was clear enough to see that the system was prompting for a password for the connected storage device.

Connecting a "failed" device on a completely different machine

We then connected the SSD to a system with a motherboard from a different manufacturer, and saw a nice clean image conveying a similar message that the "HDD" was locked and a password was needed.

Image

What had happened to the "failed" devices?

To answer the question of what had happened, we first have to cover how we reset a stateful SATA device (like an SSD or SMR HDD) to fresh out-of-box (FOB) state.

Running ATA SECURITY ERASE

To reset to FOB state, we run the ATA SECURITY ERASE command on the storage device. Running this command requires first setting a password on the device. Successful completion of the ATA SECURITY ERASE command clears the password.
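The two-step sequence described above (set a password, then erase) can be sketched as follows. This is a minimal illustration that assumes a Linux host and the `hdparm` utility; the post does not say which tool the lab actually uses. The function name `secure_erase_commands`, the placeholder password, and the device path `/dev/sdX` are all hypothetical. The sketch only builds and prints the commands and does not run them.

```python
import shlex


def secure_erase_commands(device, password="temp-pass"):
    """Return the hdparm invocations for a password-then-erase reset, in order.

    Assumption: hdparm is the tool in use. The password is a throwaway value
    that a successful erase is expected to clear again.
    """
    if not device.startswith("/dev/"):
        raise ValueError(f"expected a /dev/ path, got {device!r}")
    base = ["hdparm", "--user-master", "u"]
    return [
        # Step 1: set a user password, which enables ATA security on the drive.
        base + ["--security-set-pass", password, device],
        # Step 2: issue the erase. If this step is interrupted, the password
        # from step 1 remains set on the drive.
        base + ["--security-erase", password, device],
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in secure_erase_commands("/dev/sdX"):
        print(shlex.join(argv))
```

Keeping the password step separate from the erase step makes the failure window visible: anything that stops the second command before it finishes leaves the drive in the state created by the first.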

ATA SECURITY ERASE had been interrupted on the SMR HDDs

For most SSDs, ATA SECURITY ERASE takes a short time (no more than a few minutes). However, for SMR HDDs, the command involves writing data to the entire disk, and takes hours. Reviewing logs, we confirmed that all 3 "failed" SMR HDDs were in the middle of running ATA SECURITY ERASE when the power went out.

ATA SECURITY ERASE failed on the SATA SSD

The problematic SSD did not lose power, but instead had an error occur when running ATA SECURITY ERASE.

Failing to successfully complete ATA SECURITY ERASE leaves the device password protected

The failure of the storage devices to successfully complete ATA SECURITY ERASE, whether due to power outage or an error, meant that the password set prior to running the command was not cleared.
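One way to catch this state before the next reboot is to inspect the drive's security flags after an erase attempt. The sketch below assumes the text comes from `hdparm -I`, whose "Security:" section lists flags such as "enabled" and "locked", each optionally prefixed with "not". The sample text is illustrative, modeled on that layout, and was not captured from the drives in this post. The function names are hypothetical.

```python
import re

# Illustrative sample of a drive whose erase did not complete (an assumption,
# not real output from the drives discussed here).
SAMPLE = """\
Security:
\tMaster password revision code = 65534
\t\tsupported
\t\tenabled
\t\tlocked
\tnot\tfrozen
\tnot\texpired: security count
\t\tsupported: enhanced erase
Logical Unit WWN Device Identifier: 0000000000000000
"""


def parse_security_flags(identify_text):
    """Extract the enabled/locked/frozen flags from hdparm -I style text."""
    flags = {}
    in_security = False
    for line in identify_text.splitlines():
        if line.strip().startswith("Security:"):
            in_security = True
            continue
        if not in_security:
            continue
        # A non-indented, non-empty line marks the end of the Security section.
        if line and not line[0].isspace():
            break
        match = re.match(r"^\s*(not\s+)?(enabled|locked|frozen)\b", line)
        if match:
            # The flag is set when there is no leading "not".
            flags[match.group(2)] = match.group(1) is None
    return flags


def left_password_protected(flags):
    """True when ATA security is still enabled, i.e. a password remains set."""
    return flags.get("enabled", False)


if __name__ == "__main__":
    flags = parse_security_flags(SAMPLE)
    print(flags, left_password_protected(flags))
```

A check like this could run after every erase attempt, so that a drive with a leftover password is flagged in the logs rather than discovered at the next boot.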

Fixing the problematic devices

Even at the original garbled screen, the PC is actually still responsive, and boot will continue after either entering the correct password, or a handful of failed attempts with a blank password. The machine will then boot into its OS where the password can be cleared to eliminate the issue on future boots.
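Clearing the password from the OS might look like the sketch below, again assuming Linux and `hdparm`, which the post does not confirm. The assumption is that a drive skipped past with failed attempts is still locked and needs an unlock before security can be disabled, while a drive unlocked at the prompt only needs the disable step. The names `clear_password_commands` and `locked`, and the path `/dev/sdX`, are hypothetical, and nothing is executed.

```python
import shlex


def clear_password_commands(device, password, locked=True):
    """Return hdparm invocations that remove a leftover ATA user password.

    Assumption: `password` is the value set before the interrupted erase.
    """
    if not device.startswith("/dev/"):
        raise ValueError(f"expected a /dev/ path, got {device!r}")
    base = ["hdparm", "--user-master", "u"]
    commands = []
    if locked:
        # A locked drive must be unlocked before security can be disabled.
        commands.append(base + ["--security-unlock", password, device])
    # Disabling security removes the password, so later boots are not prompted.
    commands.append(base + ["--security-disable", password, device])
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in clear_password_commands("/dev/sdX", "temp-pass"):
        print(shlex.join(argv))
```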

More details on the password prompt at boot

Most consumer motherboard BIOSes will pause boot and prompt for a password if a password-protected SATA storage device is present. This behavior makes sense if one is trying to boot from such a device, but can get in the way otherwise. It appears few BIOSes have an option to disable this behavior, which means every boot on a system with a password-protected SATA storage device requires interaction before BIOS options or an OS can be loaded.

For us, the garbled video bug during the password prompt turned a minor annoyance into confusing apparent boot failures. We were relieved to discover the issue was actually a straightforward one disguised as something more serious.

Comments

ThePCNerd-RedSide
  • 2 months ago

Wow, that's weird! But also very logical... Why do you need to set a password to reset a drive? I'm not very technical in storage benchmarking, but I've had some headaches with storage before so I've learned some stuff along the way...

aggieNick02
  • 2 months ago

That's a very reasonable question. One theory on superuser is that the password locks the drive so that even if the ATA SECURITY ERASE command fails, the drive will be inaccessible on next boot unless you know the password. But all I could find by looking online were theories/guesses.

The reset process has other wrinkles too. We have to suspend and resume the PC prior to reset (and before the password step) to get around some other edge cases. And the process is different for NVMe devices, but still filled with peculiarities.

ThePCNerd-RedSide
  • 2 months ago

Well, makes sense. I see the intended audience for that command being those who want to reset their drive to then dispose of it or sell it: it would absolutely make sense to be 100% sure that there's nothing left on your SSD/HDD. The password would prevent the readability of any content even in case the reset process fails. This is a very likely explanation.

Still, very good job, looking forward to seeing more storage news on here!

PS: by the way, are you guys planning on bringing a wider range of benchmarks (i.e. for whatever other component) or will you stick to storage for now?

aggieNick02
  • 2 months ago

Glad you enjoyed it. And yes, we plan on benchmarks for other components in the future!

ThePCNerd-RedSide
  • 2 months ago

"And yes, we plan on benchmarks for other components in the future!"

Niceee!!

ThePCNerd-RedSide
  • 2 months ago

One more thing: I haven't really understood why you would get such graphical "artifacts" on video-out when trying to use those password-protected drives. Were you trying to boot from them or were they just connected to the PC? Because, if the former, then it could be understandable in case the BIOS wasn't very well made to handle such conditions, but if the latter, then it would be pretty weird, since nothing was being loaded from that drive (unless all of this was just an issue during POST where the BIOS wouldn't be able to properly identify or confirm the status of the drive).

aggieNick02
  • 2 months ago

This happens just when connected to the PC. SATA boot devices are even disabled in the BIOS boot sequence. We saw this behavior (prompt for password before BIOS splash screen even when booting from SATA is disabled) on motherboards from multiple manufacturers.

I agree it is a weird behavior. From what I could find searching online, the only motherboards that appeared to have the ability to disable the password prompt were some found in Dell systems.

ThePCNerd-RedSide
  • 2 months ago

Oh wow... Ok, seems like some form of protection (a very very annoying form of protection). That still doesn't explain the artifacts during POST though. My theory is that some BIOSes just freak out when a SATA drive is not responding like in this case and they start messing up video handling.

The surprising thing to me is that y’all don’t have a backup power setup. The Texas grid sucks and you’ve got a business relying on it. I assume the servers for pcpp are elsewhere though. During times nobody’s at the office, is moderation paused?

philip
  • 2 months ago

We have backups for the critical stuff and our dev machines, but not the bench machines. Right now we're running 32 systems for SSD/HDD bench. But we have the space and power already wired to scale up to ~128 when the rest are CPU/GPU. It's just not worth the expense to put those on a UPS. If there's a power outage mid-run, in theory we should be able to just reimage and rerun the test. So all we lose is a bit of time. (Unless of course you have to debug hardware retaining boot-blocking state like discussed in Nick's post.)

The power at the office has been decent. It's not as robust as at my house, but it's an order of magnitude better than the Forest Creek neighborhood in Round Rock (a few miles from the office).

We can moderate from pretty much anywhere - it's not constrained to the office. (And we host the site on AWS rather than on-prem.)

Makes sense. I live kind of close, and the power flickers are irritating. My UPS logs multiple flickers every month.