How do I find the process that leads to OOM?
from illusionist@lemmy.zip to selfhosted@lemmy.world on 01 Oct 23:16
https://lemmy.zip/post/50070016

For more than a week now, my mini server has hit an OOM every day and all containers get killed by the OOM killer. How do I find the reason?

Bonus: how can the containers restart automatically afterwards? Right now I have to do it manually.

Edit: I did not change anything on the system; it had been running for a while without problems.

#selfhosted


tal@olio.cafe on 01 Oct 23:21 next collapse

OOMs happen because your system is out of memory.

You asked how to know which process is responsible. There is no correct answer to which process is “wrong” in using more memory — all one can say is that processes are in aggregate asking for too much memory. The kernel tries to “blame” a process and will kill it, as you’ve seen, to let your system continue to function, but ultimately, you may know better than it which is acting in a way you don’t want.

It should log something to the kernel log when it OOM kills something.
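
For example, on a systemd-based distro, something like this should surface the OOM kill messages in the kernel log (the exact wording of the messages depends on your kernel version):

    journalctl -k --since "yesterday" | grep -iE "out of memory|oom-kill|killed process"
    # or, without journald:
    dmesg -T | grep -iE "out of memory|oom"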

It may be that you simply don't have enough memory to do what you want to do. You could take a glance at top (sort by memory usage with shift-M). You might be able to get by with more paging (swap) space; you can add it as a paging file if creating a paging partition is problematic.
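
If you go the paging-file route, a sketch of the usual steps looks like this (the 4G size is just an assumption; on filesystems where fallocate can't be used for swap, create the file with dd instead):

    sudo fallocate -l 4G /swapfile        # or: sudo dd if=/dev/zero of=/swapfile bs=1M count=4096
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile
    # make it persistent across reboots:
    echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab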

EDIT: I don't know if there's a way to get a dump of processes that are using memory at exactly the instant of the OOM, but if you want to get an idea of what memory usage looks like around that time, you can certainly leave a top -o %MEM -b >log.txt process running to get a snapshot of per-process memory use every few seconds. top prints a timestamp at the top of each entry, so between the timestamped OOM entry in the kernel log and the timestamped dump, you should be able to see what was using memory.
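
To keep that log from growing too quickly, you can stretch the interval with -d, something like this (the 60-second interval and log path are assumptions, adjust to taste):

    # -b batch mode, -o %MEM sort by memory, -d 60 one snapshot per minute, -c full command lines
    nohup top -o %MEM -b -d 60 -c > /var/tmp/memlog.txt 2>&1 &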

There are also various other packages for logging resource usage that provide less information, but also don’t use so much space, if you want to view historical resource usage. sysstat is what I usually use, with the sar command to view logged data, though that’s very elderly. Things like that won’t dump a list of all processes, but they will let you know if, over a given period of time, a server is running low on available memory.
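
On a Debian/Ubuntu-style system, setting that up looks roughly like this (package name, collection interval, and data-file paths vary by distro and sysstat version):

    sudo apt install sysstat
    sudo systemctl enable --now sysstat     # older Debian releases may also need ENABLED="true" in /etc/default/sysstat
    sar -r                                  # memory usage for today, sampled every 10 minutes by default
    sar -r -f /var/log/sysstat/sa05         # a previous day ("05" is the day of the month; RHEL-likes use /var/log/sa/)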

illusionist@lemmy.zip on 02 Oct 13:24 collapse

Thank you, I'll try to set up a systemd timer with that.
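
Something like this is what I have in mind, with made-up unit names (memlog.service / memlog.timer) and /var/log/memlog.txt as the target file:

    # /etc/systemd/system/memlog.service
    [Unit]
    Description=Log the top memory consumers

    [Service]
    Type=oneshot
    # %% is needed because % is a specifier character in unit files
    ExecStart=/bin/sh -c 'date >> /var/log/memlog.txt; ps aux --sort=-%%mem | head -n 15 >> /var/log/memlog.txt'

    # /etc/systemd/system/memlog.timer
    [Unit]
    Description=Run memlog every minute

    [Timer]
    OnCalendar=minutely
    Persistent=true

    [Install]
    WantedBy=timers.target

    # then: sudo systemctl daemon-reload && sudo systemctl enable --now memlog.timer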

probable_possum@leminal.space on 02 Oct 00:17 collapse

  • journalctl shows the OOM event
  • Container limits: you can cap how much memory each container is allowed to use
  • You can also look at container stats to identify the culprit, or cyclically write the output of ps to a file to find the memory-hungry process (see the sketch after this list).
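
For the container side, assuming Docker (podman has equivalent subcommands), a quick sketch:

    # one-shot snapshot of per-container memory use
    docker stats --no-stream

    # cap a container's memory in docker-compose (512m is just an example value);
    # depending on your compose version the key may instead be deploy.resources.limits.memory
    #   services:
    #     myapp:
    #       mem_limit: 512m
    #       restart: unless-stopped   # may also help with the bonus question about automatic restarts
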
illusionist@lemmy.zip on 02 Oct 13:26 collapse

That’s awesome. I wonder why I haven’t seen this so far. Thank you!