this post was submitted on 03 Jun 2024
submitted 5 months ago* (last edited 5 months ago) by Ninguem to c/debian@lemmy.ml
 

My system seems to crash from time to time. I still don't know what causes it.

If I leave it untouched for a few hours, sometimes, it crashes.

To resume, I have to force a reboot by unplugging the power cable (not even pressing the power button for N seconds seems to work).

Then, it seems to work just fine (after displaying some error messages about lost or orphaned inodes at boot). Until, one day, it happens again. When? I never know. It seems to follow some strange and unpredictable pattern.

Where should I start investigating?

top 19 comments
[–] Shdwdrgn@mander.xyz 6 points 5 months ago (1 children)

Certainly seems like a hardware issue, but there's no easy answer to this one. It could be your power supply, motherboard, CPU, memory, even the video card. The power-button issue makes me lean towards power supply or motherboard though (assuming you've verified the power button works after a fresh boot).

If you have other parts on hand (even from another running system) you could swap components until you identify the culprit. If you find it's your power supply, make sure you replace it with a decent quality one, NOT one of those $25 units you find everywhere, or you'll have even more problems followed by a rapid failure in another year.

[–] Ninguem 3 points 5 months ago (1 children)

Well... The power button does work to boot the system back up. So...?

It's a small desktop computer Asus VivoMini VM65, not easy to swap parts.

Asus doesn't even have a picture of it anymore. Here's a bunch.

[–] Shdwdrgn@mander.xyz 3 points 5 months ago (1 children)

Then it could still be the power supply or motherboard. It takes a lot to override the hardware power switch, and the power supply itself is usually one of the biggest culprits for random lockups. Beyond that I can't offer much; I don't know of any way to test the components unless you could afford thousands of dollars for specialized equipment. I've always had multiple machines available to swap parts, so I don't have a different strategy for troubleshooting.

[–] Ninguem 2 points 5 months ago (1 children)

Thanks.

I thought there could be some place in logs the system could write error messages to. Even hardware derived.

I've been going through /var/log/* and found thousands of scary messages, but can't really make sense of any of them.

[–] Shdwdrgn@mander.xyz 1 points 5 months ago

Something like a hard drive could generate recognizable errors in the logs, but once you get out into the motherboard or power supply, there's simply not enough monitoring available to detect problems. You might be able to record the voltage output from the power supply rails, depending on your hardware, but that still might not tell you anything.

[–] XTL@sopuli.xyz 5 points 5 months ago (1 children)

Only when it's idle for a while? That would point to power saving or suspend. You could try changing those to see if it affects the situation.

But yes, hardware. And I would suspect the display controller in particular. I've seen many that glitch when in power saving.

Also, have you checked for bios updates?

[–] Ninguem 2 points 5 months ago (1 children)

> Only when it’s idle for a while? That would point to power saving or suspend. You could try changing those to see if it affects the situation.

That's my feeling as well. It's set to suspend after 20 min and the power button is set to suspend, so maybe if I try to suspend and wake back up several times, one of the times it will fail... God! I am really not in the mood for that!... Let's see...
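The repeated suspend/wake test could be automated instead of done by hand. A minimal sketch, assuming `rtcwake` (from util-linux) is installed and the hardware supports waking from the RTC alarm. The function name `suspend_cycles` and the default timings are made up for illustration:

```shell
# Hypothetical helper: suspend to RAM and auto-wake, several times in a row.
# Usage: suspend_cycles [count] [seconds_asleep]
suspend_cycles() {
    count="${1:-5}"
    secs="${2:-60}"
    i=1
    while [ "$i" -le "$count" ]; do
        echo "suspend cycle $i of $count"
        # rtcwake sets an RTC alarm $secs seconds ahead, then suspends to RAM.
        sudo rtcwake -m mem -s "$secs" || return 1
        # Give the system a moment to settle after resume.
        sleep 20
        i=$((i + 1))
    done
    echo "completed $count cycles without a hang"
}
```

Running `suspend_cycles 10` from a terminal would attempt ten cycles. If the machine hangs partway through, the last "suspend cycle" line printed shows how far it got.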

> But yes, hardware. And I would suspect the display controller in particular. I’ve seen many that glitch when in power saving.

Hum!...

$ lspci
 00:02.0 VGA compatible controller: Intel Corporation HD Graphics 620 (rev 02)

> Also, have you checked for bios updates?

I updated the bios on an asus PN40 before and bricked it, so... didn't try in this one. It's my workhorse. :-|

[–] KISSmyOSFeddit@lemmy.world 3 points 5 months ago (1 children)

How old is that PC? Anything newer than 10 years (and not a bargain bin prebuilt) shouldn't have this issue cause it should have a backup BIOS.

[–] Ninguem 2 points 5 months ago

I can't remember when I bought it, it must be newer than that, but concerning the "bargain bin prebuilt" status... It certainly was cheap. That's why I bought it.

[–] JohnnySledge@lemmy.world 4 points 5 months ago (1 children)

Not to be that guy but have you checked the system logs when this happens?

[–] Ninguem 3 points 5 months ago (1 children)

What system logs do you think I should look first, and what should I be looking for?

Any expression I could grep?

[–] JohnnySledge@lemmy.world 3 points 5 months ago (1 children)

I would check system level logs to start like dmesg and syslog.

Someone else might have better tips of what to grep for in the logs. One thing you could do is try to ssh into the system when it’s locked up and check the logs to see what’s being reported during or immediately before the lockup.

Once you get a sense for what’s going on at a system level you can start to look in more specific logs.
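As a starting point for what to grep, here is a minimal sketch. The function name `scan_log` and the pattern list are assumptions for illustration, not a complete or authoritative set, so adjust them to what your logs actually contain:

```shell
# Hypothetical helper: print lines from a log file that often accompany hangs.
# The pattern list is only a starting point.
scan_log() {
    grep -iE 'kernel panic|oops|machine check|hardware error|i/o error|call trace|blocked for more than|watchdog|gpu hang|ext4-fs error' "$1"
}

# Example use (which files exist depends on the installed logging setup):
#   scan_log /var/log/syslog
#   sudo dmesg | scan_log /dev/stdin
```

The timestamps of any hits just before a forced reboot are usually more interesting than the messages that appear on every boot.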

[–] Ninguem 2 points 5 months ago

> One thing you could do is try to ssh into the system when it’s locked up and

Good idea. I'll do that next time.

[–] lemming741@lemmy.world 3 points 5 months ago

Others have you on the right track, I'll just mention the magic SysRq key

https://en.m.wikipedia.org/wiki/Magic_SysRq_key
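Whether the magic SysRq key does anything depends on the `kernel.sysrq` setting. A small sketch for checking it; the function name `describe_sysrq` is made up for illustration:

```shell
# Describe a kernel.sysrq value: 0 = disabled, 1 = all functions enabled,
# anything else is a bitmask of allowed functions.
describe_sysrq() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "all functions enabled" ;;
        *) echo "bitmask $1 (only some functions enabled)" ;;
    esac
}

# Report the current setting if the kernel exposes it.
if [ -r /proc/sys/kernel/sysrq ]; then
    describe_sysrq "$(cat /proc/sys/kernel/sysrq)"
fi
```

`echo 1 | sudo tee /proc/sys/kernel/sysrq` enables all functions until the next reboot. If the kernel is still responding during a lockup, holding Alt+SysRq and pressing s, u, b in turn asks it to sync disks, remount filesystems read-only, and reboot, which may avoid the orphaned-inode messages you get after pulling the plug. It will not help if the hang is deep enough that the kernel no longer handles the keyboard.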

[–] Ninguem 2 points 5 months ago* (last edited 5 months ago)

Thank you all who contributed.

I was hoping for some log that could maybe save some info when in a hurry, sensing an imminent crash or an emergency, but that doesn't really make sense, does it? - If it crashes, it crashes. There's probably no time to dump anything to disk. Otherwise the crash would have been prevented.

Anyway, some paths have been opened for when I have time to dig in a little bit deeper.

I'll quickly find something else to annoy you ASAP. ;-)

[–] Sammirr@aussie.zone 1 points 5 months ago (1 children)

It would be worth looking through your journal. There is usually some amount of logging even in the event of a hardware issue. See journalctl.

As others suggest, it does sound related to sleep / suspend. If that is the case, you could have a problem like I did, where my motherboard BIOS incorrectly reports USB capabilities, which results in an immediate resume from sleep. That abrupt change causes the amdgpu driver to crash, and the system hangs soon after. My workaround is to disable features on the offending USB host controller.
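One possible way to look at which devices are allowed to wake the machine is `/proc/acpi/wakeup`. This is a sketch under the assumption that your kernel exposes that file; it is not necessarily the same workaround described above, and the function name `enabled_wake_sources` is made up for illustration:

```shell
# Hypothetical helper: list ACPI devices currently allowed to wake the system.
# Reads /proc/acpi/wakeup by default; a different file can be passed for testing.
enabled_wake_sources() {
    # The third column is "*enabled" or "*disabled"; the header line matches neither.
    awk '$3 == "*enabled" { print $1 }' "${1:-/proc/acpi/wakeup}"
}

# Example use:
#   enabled_wake_sources
```

Writing a device name to the file toggles it, for example `echo XHC | sudo tee /proc/acpi/wakeup`, where `XHC` is only an example name and yours may differ. The change does not survive a reboot.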

[–] Ninguem 1 points 5 months ago (1 children)

I'll have to take a look at that. But OMG!

lines 1-46/1743603 0%

How many pages are 1743603 lines? An entire bookshelf?

Besides, I have no idea what all that means!... I guess I'll just have to sit, one day, or month, and go through it all...

[–] tux7350@lemmy.world 3 points 5 months ago

Journalctl has a bunch of filter options that you should take advantage of when troubleshooting. For example, journalctl -b -1 will show you only the messages from the previous boot (not the current one). I used this article for a quick couple of need-to-know commands. Enjoy! :D

https://www.loggly.com/ultimate-guide/using-journalctl/
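A couple of those filters combined, as a minimal sketch. The wrapper names are made up for illustration, and this assumes the journal is stored persistently so that the previous boot is still available:

```shell
# Last N lines (default 200) logged during the previous boot,
# i.e. the final messages written before the hang or power cut.
last_lines_before_crash() {
    journalctl -b -1 -n "${1:-200}" --no-pager
}

# Only messages of priority "err" or worse from the previous boot.
prev_boot_errors() {
    journalctl -b -1 -p err --no-pager
}
```

Run right after a forced reboot, `last_lines_before_crash 50` cuts 1.7 million lines down to the 50 that were written just before the machine went down. `journalctl --list-boots` shows which boots the journal still holds.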

[–] KISSmyOSFeddit@lemmy.world 0 points 5 months ago* (last edited 5 months ago)

If pressing the power button for a long time doesn't work, it's 100% a hardware issue.
Likely related to the power supply.