I woke up in the morning, got to the desk in my home office, checked my email, Discord, and the news. Then I switched from my desktop to my laptop and... there's no internet.
That's weird. I had just browsed the net on my PC, so what's up with the laptop? Both are connected to the same network, so the network as a whole clearly has connectivity. As such, the problem lies somewhere between my ISP's modem and the laptop (inclusive).
I started with disconnecting and reconnecting the ethernet network cable (it's a pretty stationary laptop, so I keep it wired). That didn't fix anything, apart from displaying a short spinning animation indicating it's trying to get an IP address assigned (a DHCP issue then?). Just to be sure it's nothing on the laptop side I did a reboot, and then power-cycled the nearest network switch for good measure as well. No luck.
Following up on the DHCP lead I logged into my home server, which runs the DHCP daemon... wait... what is this?
ssh: connect to host home server port 22: No route to host
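As a side note, the same check can be scripted. The snippet below is only a minimal sketch: `probe_tcp` is a made-up helper name, and the mapping of errors to labels is my own simplification. On a local network, "No route to host" usually means the machine never answered at all (e.g. it's powered off), whereas "refused" means the machine is up but nothing listens on that port.

```python
import errno
import socket


def probe_tcp(host: str, port: int = 22, timeout: float = 3.0) -> str:
    """Try one TCP connection and describe the outcome with a short label.

    Hypothetical helper for illustration; the labels are my own naming.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"          # something accepted the connection
    except socket.timeout:
        return "timeout"           # no answer within the time limit
    except ConnectionRefusedError:
        return "refused"           # host is up, but the port is closed
    except OSError as exc:
        # EHOSTUNREACH is what ssh reports as "No route to host".
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            return "no-route"
        return "error"             # anything else, e.g. a DNS failure
```

Calling `probe_tcp("192.0.2.10")` (a placeholder address, not my real server) would return one of the labels above.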
So I moved the chair a bit to check my server rack, and found the home server dark. That's unusual. On closer inspection, though, the LEDs on the motherboard next to the power/reboot buttons were actually lit. A minor explanation here: I use customized Open Benchtable mounts, so the mobo is easily accessible; at the same time it means there are no power/reboot buttons on the case – as there is no case – so I rely on mobos having power/reboot buttons directly on them (or, failing that, small buttons-on-PCBs that you hook into the normal case button connector on the mobo).
I clicked the power button, and... even those last two LEDs went dark. Not great. They did light back up a few seconds later though, so I retried a couple of times, with the same result. The closest I got to a "fully functional and running server" was the CPU fan spinning up for 0.5 seconds.
At this point I had good news and bad news:
- Good news: I found the problem! DHCP server is down because...
- Bad news: ...the server is dead.
The next step was to turn on some DHCP server in the network so that the Internet actually works in the household, and to let everyone using the server know that there are problems.
By the way...
On 22nd Nov'24 we're running a webinar called "CVEs of SSH" – it's free, but requires sign up: https://hexarcana.ch/workshops/cves-of-ssh (Dan from HexArcana is the speaker).
Of course it's rare for the whole computer to die – usually it's just one component. As such, the next step was to figure out which component(s) were defective.
The usual algorithm for this is:
1. Disconnect power and let it chill for 10-20 seconds (i.e. wait for all/most capacitors to discharge).
2. Disconnect all unnecessary peripherals like: all storage devices (HDDs, SSDs), all PCIe cards, all USB devices, etc. Hint: take a couple of photos of what is connected where – even if you keep detailed documentation on the setup (you do, right?), it can save some time.
3. Remove all RAM modules apart from one. You basically want to be left only with mobo, CPU, PSU, and one RAM stick (and PC case connectors – these are usually fine).
4. Connect power, attempt to turn on the computer.
5. If nothing boots at this moment, go to point 1 and try a different RAM module or try putting it in a different RAM slot. Repeat this until you run out of options.
6. If you get the computer to boot in this "minimized" state...
7. Power everything down (see point 1).
8. Add one random device or RAM module from the batch you've disconnected earlier (usually starting with the GPU makes most sense, as that way you get a display later on).
9. Connect power, attempt to turn on the computer.
10. If things boot, go to point 7 (IMPORTANT: once any storage device is connected, don't do hard power-offs after POST).
11. If things don't boot, you found the culprit (though it might be either the slot/connector, cable, or actual device; pretty easy to figure out at this point though using a similar approach to the one described below).
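The steps above are really just a loop, which can be sketched in a few lines of Python. This is an illustration only: `find_culprit` and the `boots` callback are made-up names, `boots` stands in for the manual "connect power and press the button" part, and trying the same RAM module in different slots is left out to keep things short.

```python
from typing import Callable, FrozenSet, Optional, Sequence


def find_culprit(
    ram_modules: Sequence[str],
    peripherals: Sequence[str],
    boots: Callable[[FrozenSet[str]], bool],
) -> Optional[str]:
    """Return the first part whose addition stops the machine from booting.

    Returns "core" if no single-RAM-module setup boots (so the fault is in
    the mobo, CPU, PSU, or all of the RAM), and None if everything boots
    with all parts attached. `boots` is a hypothetical stand-in for the
    manual power-on attempt with the given set of attached parts.
    """
    # Points 1-5: bare minimum plus one RAM module at a time.
    base = None
    for ram in ram_modules:
        if boots(frozenset([ram])):
            base = ram
            break
    if base is None:
        return "core"

    # Points 6-11: re-add the remaining parts one by one, peripherals
    # first (so the GPU can go first), then the other RAM modules.
    attached = {base}
    remaining = list(peripherals) + [r for r in ram_modules if r != base]
    for part in remaining:
        attached.add(part)
        if not boots(frozenset(attached)):
            return part
    return None


# Simulated machine where a (made-up) PCIe network card is the broken part.
def simulated_boots(attached: FrozenSet[str]) -> bool:
    return "pcie_nic" not in attached


print(find_culprit(["ram_a", "ram_b"], ["gpu", "pcie_nic", "hdd_1"], simulated_boots))
# prints "pcie_nic"
```

If `boots` never returns True, the function returns "core", which is the case where you run out of options at point 5.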
In my case I basically ran out of options at point 5, which translates to: it's what's left, i.e. the problem lies either in the CPU, the motherboard, the PSU, or all the RAM sticks (unlikely, but at the end of the day anything can break). And to figure out which one it is, you have to start taking each of these components and testing them on a different setup AND/OR replacing the component in the debugged setup with a working one (worst case scenario is if doing this causes the good component(s) to also break). This requires one to have at least another (ideally similar) computer – thankfully I have some old hardware lying around.
I started with hooking up a different PSU, since that's obviously the easiest to swap out, but also the most probable culprit. And the "minimized" server actually started normally, with no issues whatsoever! So at that point I was pretty sure it was the PSU, but to double check things I added all the PCIe peripherals, and... it booted again with no issue. Cool.
Unfortunately it turned out I don't have a PSU I could use as a replacement. While I have some modular PSUs lying around, they either were from a different manufacturer (which would require me to order new modular power cables to hook up all the HDDs), or were from the same company but didn't have all the connectors I needed (to be more exact: the PSU I had was missing one custom "SATA/Molex" PSU connector). So I had to order a new PSU from the same company.
Thankfully I did this debugging in the early morning, so the replacement PSU arrived by post by early evening. After connecting it all back together the home server booted without an issue. So problem solved. All that was left was to disable the temporary DHCP and... write a blog post about it I guess?
While things breaking can be frustrating at times, I do have to say I did enjoy this bit of relatively simple technical work – it was a nice distraction from the paperwork that awaited me for the rest of that day ;)