Linux


[Weekly thread] GNU+Linux help: ask anything! (programming.dev)

submitted 3 months ago by [email protected] to c/[email protected]


[–] [email protected] 4 points 3 months ago* (last edited 3 months ago) (7 children)

I'm getting random reboots, tied to nothing I can pin down. Micro computer, AMD Ryzen 7 5800H. New (<6mo) computer; no re-used old components. 36GB RAM, which has passed a few runs of memtest. I have regularly seen the k10temp reading spike to the low 90s without a reboot, and when the reboots happen I haven't noticed temps higher than 60. The only thing I've been able to correlate it with at all is composing email; I'm a fairly fast typist, and markdown-oxide goes berserk and eats well over 100% CPU (~165%) while I'm typing. I made the correlation because several of the reboots have happened while I was composing emails (which I then lost).

There is nothing in the boot -1 (previous boot) logs. Just normal logging and then the reboot. Nothing at all suspicious, no weird errors. I struggle to use more than 50% of memory, so memory contention is not an issue. It's like a sudden power cycle.

The system is on a UPS; my next avenue of investigation is the UPS itself, but power surges in the house shouldn't be a possibility; there are a half dozen other computers in the house, some on UPS, some not, and none of those are having issues.

I saw an article a few days ago about a tool to help track down mysterious reboots like this, but can't find it now. I don't know how software could help; it is literally: everything is working, the screens go blank, and in a second or so the BIOS posts.

I am suspicious of the CPU core temp readings, which I can't seem to get at. I get the GPU temp, which is never stressed (stays around 45C); k10temp_tctl, which from what I can find is an edge temp and not the core temp; and all of the NVMe temps, which stay in the 40s. Not knowing whether I'm seeing what's really going on temp-wise in the CPU worries me. On the other hand, I don't think I've had it crash during a software update, which often includes compiling a bunch of Rust, C, Go, and whatever other packages, and I can see those pegging multiple cores.
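One way to check whether a hidden temperature spike precedes a reboot is to log every sensor the kernel exposes and force each sample to disk before the next one is taken, so the last line in the file is the last state before the power cycle. The following is a minimal sketch, not a tested diagnostic tool. It assumes the standard Linux hwmon sysfs layout (`/sys/class/hwmon/hwmon*/temp*_input`, values in millidegrees C); the log file name `temps.log`, the one-second interval, and all function names are made up for illustration.

```python
import os
import time
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"chip/label": degrees_C} for every temp*_input file under root."""
    temps = {}
    for chip_dir in sorted(Path(root).glob("hwmon*")):
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else chip_dir.name
        for input_file in sorted(chip_dir.glob("temp*_input")):
            # temp1_input may have a matching temp1_label (e.g. "Tctl").
            base = input_file.name[: -len("_input")]
            label_file = chip_dir / f"{base}_label"
            label = label_file.read_text().strip() if label_file.exists() else base
            try:
                # sysfs reports millidegrees Celsius.
                temps[f"{chip}/{label}"] = int(input_file.read_text()) / 1000.0
            except (OSError, ValueError):
                continue  # sensor unreadable right now; skip it
    return temps


def format_sample(temps, now=None):
    """Build one log line: local timestamp followed by sorted readings."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime(now))
    readings = " ".join(f"{key}={value:.1f}" for key, value in sorted(temps.items()))
    return f"{stamp} {readings}"


def append_durably(log_path, line):
    """Append a line and fsync so it survives a sudden power loss."""
    with open(log_path, "a") as handle:
        handle.write(line + "\n")
        handle.flush()
        os.fsync(handle.fileno())


if __name__ == "__main__":
    # Hypothetical log name and interval; adjust to taste.
    while True:
        append_durably("temps.log", format_sample(read_hwmon_temps()))
        time.sleep(1)
```

After a reboot, the tail of the log gives the time of the last sample and whatever the sensors reported at that moment. If the final readings are unremarkable, that is another point against a thermal cause, at least for the sensors the kernel can see.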

I'm at a loss. I've looked at everything I can think of, but still haven't gotten a hint about what is triggering this. I may just do a bunch of markdown editing with markdown-oxide enabled and see if I can reliably force it to happen, but that still wouldn't tell me why. I am certain it's not memory, and have mostly convinced myself it isn't temperature, unless it's something hidden I can't get a reading on.

Help?

Edit: it just occurred to me: how do I check for UPS issues when the nut monitor is running on the computer connected to the UPS? If the UPS is stuttering, it's not going to get logged by nut. I suppose I could connect a laptop and use it as the monitor, but that sounds like a lot of work to set up. What else should I try first?
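A cheaper first step than a second monitoring machine might be to poll the UPS from the same computer and fsync each sample, on the chance that the UPS reports a status change or a voltage sag in the seconds before the output drops. This is only a sketch under assumptions: it uses NUT's `upsc` client, the UPS name `myups@localhost` and the log file `ups.log` are placeholders, and not every UPS reports every variable (`input.voltage` in particular). A clean log right up to the reboot would not clear the UPS, since an instantaneous dropout leaves nothing to record.

```python
import os
import subprocess
import time


def parse_upsc(output):
    """Parse `upsc` output (one "key: value" pair per line) into a dict."""
    values = {}
    for line in output.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            values[key.strip()] = value.strip()
    return values


def summarize(values, keys=("ups.status", "input.voltage", "battery.charge", "ups.load")):
    """Pick out a few variables; '?' marks ones this UPS does not report."""
    return " ".join(f"{key}={values.get(key, '?')}" for key in keys)


def poll_ups(ups="myups@localhost"):
    """Query the UPS once. The UPS name is a placeholder, not a real default."""
    result = subprocess.run(
        ["upsc", ups], capture_output=True, text=True, timeout=10
    )
    return parse_upsc(result.stdout)


if __name__ == "__main__":
    # Hypothetical log file and polling interval.
    with open("ups.log", "a") as log:
        while True:
            stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
            log.write(f"{stamp} {summarize(poll_ups())}\n")
            log.flush()
            os.fsync(log.fileno())  # make sure the line reaches the disk
            time.sleep(2)
```

If the last lines before a reboot show `ups.status` leaving `OL` (online) or the input voltage dipping, the UPS becomes a much stronger suspect.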

Edit 2: I've now run stress with 16 cores for multiple minutes a couple of times: once with -c (busy-work threads), and once with -m (busy-work using malloc/free). Both times, gotop showed all 16 cores gratifyingly pegged at 99-100%. Interestingly, k10temp never hit 90C, which I've seen it do before, but today is cool, so that's probably helping. With the memory thrashing, I got a bunch of cached memory and finally saw free memory drop to 28%, which I rarely see on this machine because, when I set it up, I was tired of always fretting about memory use and decided to make it a non-issue by maxing the memory at 64GB. Anyway, that's the lowest I've ever noticed free memory drop to. Neither test crashed the machine. I may try longer runs, a half-hour maybe? But I'm now less suspicious that it's related to thermal load.

[–] [email protected] 3 points 3 months ago (1 children)

Tried updating your bios?

[–] [email protected] 0 points 3 months ago

Started to. There's a small learning curve as I only recently switched from grub to EFI, and am still figuring out how to manage stuff like this.
