> I don't think either the overcommit design or the naive "all allocated address...

wizeman · on Nov 22, 2022

> Fortunately, this is tunable. You can also turn off overcommit entirely.

I know, but as soon as I tried doing that, my systems started to experience unavoidable failures. I don't remember exactly why, but I certainly don't think it was simply due to lack of swap space.

Again, I don't remember exactly, but I suspect I experienced these failures because there are applications that expect Linux to be doing overcommit, or at least, they couldn't work without overcommit unless Linux added new features.

I could be wrong (I'm not exactly a Linux memory management expert), but I think the root issue is that Linux is handling committed memory badly, because when you disable overcommit, Linux tries to reserve swap space for the entire allocated address space of all processes. But actually, this is not needed -- it's way too pessimistic and is extremely likely to lead to unnecessary memory allocation failures.
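To make the accounting concrete: on Linux, /proc/meminfo reports a `CommitLimit` and a `Committed_AS` field, and in strict mode (`vm.overcommit_memory=2`) new allocations are refused once the second would exceed the first. As far as I know, the limit is swap plus a configurable share of RAM rather than swap alone. The sketch below only reads those two fields; the function names are made up for illustration.

```c
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>

/* Look up a "Key:   12345 kB" field in /proc/meminfo-style text.
 * Returns the value in kB, or -1 if the key is not present.
 * (Hypothetical helper, not part of any library.) */
long meminfo_field_kb(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long value;
            if (sscanf(line + klen + 1, "%ld", &value) == 1)
                return value;
            return -1;
        }
        line = strchr(line, '\n');
        if (line)
            line++;                 /* step past the newline */
    }
    return -1;
}

/* Read /proc/meminfo and print how much commit headroom is left.
 * Returns 0 on success, -1 if the file or the fields are unavailable. */
int print_commit_headroom(void)
{
    char buf[8192];
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f)
        return -1;
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';

    long limit = meminfo_field_kb(buf, "CommitLimit");
    long used = meminfo_field_kb(buf, "Committed_AS");
    if (limit < 0 || used < 0)
        return -1;

    printf("CommitLimit=%ld kB  Committed_AS=%ld kB  headroom=%ld kB\n",
           limit, used, limit - used);
    return 0;
}
```

Calling `print_commit_headroom()` before and after a large `mmap()` should show `Committed_AS` growing by the size of the mapping even if none of it is touched, which is the pessimism described above.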

What is needed is a way for applications to communicate with the kernel about which parts of the address space they might actually use (or not use) at any point in time, which may be significantly less (by orders of magnitude) than the amount of address space that they have allocated.

Then you wouldn't need to reserve swap space for all the address space that processes have allocated, you'd only need to reserve swap space for the amount that the processes have declared to be (possibly) using.

This could have a performance cost, but there are ways to reduce it: for example, by allowing this information to be declared as a flag to mmap() (which avoids doing 2 separate system calls in the typical case), by batching several syscalls into just one (similar to readv()/writev()), by submitting them through io_uring, etc.

I also think that, even if you want to use memory overcommit, it should be enabled with (at least) process granularity, not for the whole system at once. The all-or-nothing switch is, again, something that greatly limits the usefulness of disabling overcommit.

trelane · on Nov 22, 2022

You may find mmap and mlock helpful.

wizeman · on Nov 22, 2022

> You may find mmap and mlock helpful.

I know about mmap and mlock. Can you be more specific as to how they are useful in the scenario I mentioned above?

Specifically, when memory overcommit is disabled, I can use mmap() to allocate address space but this causes Linux to unnecessarily reserve swap space for the entire amount of the address space allocation.

This means that if the address space allocation is big enough, it would almost certainly fail even though I would only need to use a very tiny fraction of it.

Do you understand what I mean? Just because I allocate address space, it doesn't mean I will use all of it.

As far as I know, Linux can't handle this problem, because there's no way to communicate to the kernel which chunks of the address space allocation I wish to use.

Which means that disabling overcommit in Linux can be completely useless, because many mmap() calls start failing unnecessarily.
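A small probe makes the scenario testable. This is a sketch with a made-up function name: it asks for private, writable, anonymous memory, never touches it, and reports whether the request itself was accepted.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Try to map `len` bytes of private, writable, anonymous memory without
 * touching any of it. Returns 0 on success, or the errno from mmap().
 * (Hypothetical helper for illustration.) */
int probe_writable_mapping(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return errno;

    munmap(p, len);     /* nothing was ever written to the range */
    return 0;
}
```

With `vm.overcommit_memory=2`, I would expect a call with a length larger than the remaining commit headroom to return `ENOMEM`, even though the program never intended to use more than a sliver of the range.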

I don't think mlock() has anything to do with this problem.

trelane · on Nov 23, 2022

They ensure that regions of memory are backed by actual memory. This sounds like the way to tell the kernel that you actually need and will use this memory.

wizeman · on Nov 23, 2022

mlock() forces pages which have already been allocated and are already being used within an address range to stay resident in memory, so that they are not swapped out to disk. These pages were already being accounted for in swap reservations.
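For reference, the usage pattern described here looks roughly like the sketch below: the page is written first, and only then pinned. The function name is invented for illustration, and the lock step can fail on systems with a low `RLIMIT_MEMLOCK`, so its errno is reported rather than treated as fatal.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one page, write to it so it is in use, then ask mlock() to keep it
 * resident. Stores 0 in *lock_errno if the lock succeeded, otherwise the
 * errno from mlock(). Returns the first byte read back from the page, or
 * -1 if the mapping itself failed. (Hypothetical helper.) */
int touch_then_lock(int *lock_errno)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0x5a, page);          /* the page is in use from here on */

    *lock_errno = (mlock(p, page) == 0) ? 0 : errno;

    int first = p[0];
    munmap(p, page);                /* unmapping also releases the lock */
    return first;
}
```

Nothing in this sequence changes how much the mapping counts against the commit limit; the `mmap()` call had already been accounted for before `mlock()` ran.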

When disabling overcommit, what you'd need to tell Linux is actually quite different: it's that you are about to start to use a new address range, so please make sure that there is enough free swap space for this address range (and reserve it!).

Only after the latter is completed would the kernel be allowed to allocate new pages for this range and the program would be allowed to use them. The kernel would also be free to swap them out to disk like normal, unless mlock() is called (but only after those pages are already being used, not before!).

So as you can see, mlock() accomplishes something very different and is orthogonal to the functionality I'm discussing, which means it can be used (or not) independently of this new feature to reserve swap space.

This new functionality (to notify Linux that you are about to use a new address range) would also implicitly allow Linux not to reserve swap space for any address range for which it hasn't been notified. That would allow Linux to use swap space much more efficiently and would allow users to disable memory overcommit on Linux without causing a bunch of unnecessary program failures / crashes.

mmap(), on the other hand, normally does two things:

1. It allocates a range of address space.

2. It reserves swap space for this address range (when memory overcommit is disabled).

Notably (and people get confused by this a lot), mmap() doesn't actually allocate memory (i.e. memory pages); it only assigns a range of address space for the program to use, exactly as large as the program requested, and then reserves swap space for it, if required.
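One way to observe the difference between address space and pages is `mincore()`, which reports which pages of a range are resident. The sketch below (helper names are mine) counts resident pages right after `mmap()` and again after a single byte is written. On a stock Linux kernel I would expect the first count to be zero, but sandboxes and huge-page settings can change the exact numbers, so none are promised here.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count how many pages in [addr, addr+len) mincore() reports as resident.
 * `addr` must be page-aligned. Returns -1 on error. (Hypothetical helper.) */
long resident_pages(void *addr, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t n = (len + page - 1) / page;
    unsigned char *vec = malloc(n);
    if (!vec)
        return -1;

    long count = -1;
    if (mincore(addr, len, vec) == 0) {
        count = 0;
        for (size_t i = 0; i < n; i++)
            count += vec[i] & 1;    /* low bit set means resident */
    }
    free(vec);
    return count;
}

/* Map `len` bytes and record residency before and after writing one byte.
 * Returns 0 on success, -1 if the mapping failed. */
int residency_before_after(size_t len, long *before, long *after)
{
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    *before = resident_pages(p, len);
    p[0] = 1;                       /* first touch of the first page */
    *after = resident_pages(p, len);

    munmap(p, len);
    return 0;
}
```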

What I'm proposing is to separate those two things, i.e. allow programs to allocate address ranges separately from doing the swap space reservation.

If you don't separate these two things then programs allocating a huge range of address space will simply fail due to lack of available swap space (again, when memory overcommit is disabled, which is the goal here).
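A minimal sketch of what the two separated steps could look like from user space. It rests on an assumption I have not verified on every kernel: that a private `PROT_NONE` mapping is not charged against the commit limit, and that the charge happens when `mprotect()` later makes part of it writable. If that holds, existing calls approximate the proposal; if it does not, treat the three function names below (all invented here) as a hypothetical interface.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Step 1: claim address space only. With PROT_NONE nothing in the range
 * can be read or written yet. Returns NULL on failure. */
void *reserve_range(size_t len)
{
    void *p = mmap(NULL, len, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Step 2: declare that [addr, addr+len) is about to be used.
 * Returns 0 on success, -1 with errno set on failure (assumed to be
 * ENOMEM when the kernel cannot account for the range). */
int commit_range(void *addr, size_t len)
{
    return mprotect(addr, len, PROT_READ | PROT_WRITE);
}

/* Give a committed sub-range back: replace it with a fresh PROT_NONE
 * mapping at the same address, which also discards its pages.
 * `addr` must lie inside a range obtained from reserve_range(). */
int decommit_range(void *addr, size_t len)
{
    void *p = mmap(addr, len, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    return p == MAP_FAILED ? -1 : 0;
}
```

A program would call `reserve_range()` once for a large region, then `commit_range()` on each page-aligned chunk shortly before writing to it, and check the return value instead of finding out about exhaustion later.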