I'm not convinced that the fork()/exec() issue is the correct justification, at least not entirely. As an example, Solaris supports fork()/exec() yet it doesn't do overcommit (at least not by default, I think) and has no OOM killer. And Solaris has a long history of being a very robust OS.
Also, in most cases I don't see what the problem is with reserving more disk space for swap (disk space which would almost never be used anyway), especially if the system becomes more robust as a result.
I think I recall Linus justifying memory overcommit by observing that the vast majority of applications don't handle out-of-memory conditions gracefully anyway, so it's better for the OS to kill a misbehaving process (one that is allocating too much memory) and let the others continue working normally than to force every single process in the system to handle out-of-memory failures itself. But I'm not sure if I'm recalling this correctly.
I don't think either the overcommit design or the naive "all allocated address space must be reserved" design is the correct approach. I know almost nothing about Windows memory management, but it sounds to me like having "commit/uncommit memory" system calls separate from the "allocate/deallocate address space" system calls would be a better approach.
But I think the OS should try to reserve more space for the swap file before it starts returning ENOMEM from the "commit memory" system calls, which is what Windows seems to be doing (judging from the blog post)...
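For reference, here is roughly what that split looks like on Windows, as far as I understand it (a minimal sketch using VirtualAlloc; I'm not a Windows expert, so treat the details as approximate):

    #include <windows.h>

    int main(void) {
        SIZE_T size = (SIZE_T)1 << 30;  /* 1 GiB of address space */

        /* Step 1: reserve address space only; no commit charge yet. */
        char *base = VirtualAlloc(NULL, size, MEM_RESERVE, PAGE_NOACCESS);
        if (!base) return 1;

        /* Step 2: commit only the 64 KiB we actually intend to use.
           This is the call that fails if the commit limit is reached. */
        if (!VirtualAlloc(base, 64 * 1024, MEM_COMMIT, PAGE_READWRITE))
            return 1;  /* the analogue of ENOMEM, handled gracefully */

        base[0] = 1;  /* committed pages are usable */

        VirtualFree(base, 64 * 1024, MEM_DECOMMIT);  /* uncommit, keep reservation */
        VirtualFree(base, 0, MEM_RELEASE);           /* release the address space */
        return 0;
    }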
> Fortunately, this is tunable. You can also turn off overcommit entirely.
I know, but as soon as I tried doing that, my systems started to experience unavoidable failures. I don't remember exactly why, but I certainly don't think it was simply due to lack of swap space.
Again, I don't remember exactly, but I suspect I experienced these failures because there are applications that expect Linux to be doing overcommit, or at least, they couldn't work without overcommit unless Linux added new features.
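(For concreteness, the setting I'm talking about is the overcommit sysctl. Disabling overcommit means something like the following in /etc/sysctl.conf, where the ratio shown is just the kernel's default:)

    # "never overcommit": commit limit = swap + overcommit_ratio% of RAM
    vm.overcommit_memory = 2
    vm.overcommit_ratio = 50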
I could be wrong (I'm not exactly a Linux memory management expert), but I think the root issue is that Linux is handling committed memory badly, because when you disable overcommit, Linux tries to reserve swap space for the entire allocated address space of all processes. But actually, this is not needed -- it's way too pessimistic and is extremely likely to lead to unnecessary memory allocation failures.
What is needed is a way for applications to communicate with the kernel about which parts of the address space they might actually use (or not use) at any point in time, which may be significantly less (by orders of magnitude) than the amount of address space that they have allocated.
Then you wouldn't need to reserve swap space for all the address space that processes have allocated; you'd only need to reserve it for the ranges that processes have declared they may (possibly) be using.
This could have a performance cost, but there are ways to reduce it: for example, by allowing this information to be declared in mmap() as a flag (which avoids doing 2 separate system calls in the typical case), by batching several syscalls into just one (similar to readv()/writev()), by using io_uring, etc.
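To make this concrete, here's a sketch of what such an API could look like. To be clear, MAP_UNCOMMITTED, mcommit() and muncommit() don't exist anywhere; they're purely hypothetical names for the feature I'm describing:

    /* Purely hypothetical API: none of these flags/syscalls exist. */

    /* Allocate 1 TiB of address space with no swap reservation at all. */
    void *base = mmap(NULL, (size_t)1 << 40, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_UNCOMMITTED, -1, 0);

    /* Before using a 2 MiB chunk, ask the kernel to reserve swap space
       for just that range; this would be the only call that can fail
       with ENOMEM, and the program can handle it gracefully. */
    if (mcommit(base, 2 * 1024 * 1024) != 0)
        handle_allocation_failure();  /* hypothetical error handler */

    /* When done with the chunk, release its swap reservation. */
    muncommit(base, 2 * 1024 * 1024);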
I also think that, even if you want to use memory overcommit, enabling it should be possible with (at least) per-process granularity, not only for the whole system at once; the current all-or-nothing setting, again, greatly limits the usefulness of disabling overcommit.
I know about mmap and mlock. Can you be more specific as to how they are useful in the scenario I mentioned above?
Specifically, when memory overcommit is disabled, I can use mmap() to allocate address space but this causes Linux to unnecessarily reserve swap space for the entire amount of the address space allocation.
This means that if the address space allocation is big enough, it would almost certainly fail even though I would only need to use a very tiny fraction of it.
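A minimal example of what I mean (with vm.overcommit_memory=2 this should fail up front on any machine with far less than 1 TiB of RAM+swap, if my understanding is correct):

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        /* 1 TiB of address space; we'd only ever touch a tiny fraction. */
        size_t size = (size_t)1 << 40;
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            /* With overcommit disabled this fails immediately, because
               swap is reserved for the whole range at mmap() time. */
            printf("mmap failed: %s\n", strerror(errno));
            return 1;
        }
        printf("mmap succeeded at %p\n", p);
        return 0;
    }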
Do you understand what I mean? Just because I allocate address space, it doesn't mean I will use all of it.
As far as I know, Linux can't handle this problem, because there's no way to communicate to the kernel which chunks of the address space allocation I wish to use.
Which means that disabling overcommit in Linux can be completely useless, because many mmap() calls start failing unnecessarily.
I don't think mlock() has anything to do with this problem.
They ensure that regions of memory are backed by actual memory. This sounds like a way to tell the kernel that you actually need and will use this memory.
mlock() forces pages which have already been allocated and are already being used within an address range to stay resident in memory, so that they are not swapped out to disk. These pages were already being accounted for in swap reservations.
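In other words, typical mlock() usage looks something like this (a minimal sketch):

    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 16 * 4096;
        char *buf = malloc(len);
        if (!buf) return 1;
        memset(buf, 0, len);  /* the pages now exist and are in use */
        mlock(buf, len);      /* pin them in RAM: no swapping out */
        /* ... use buf for something latency- or security-sensitive ... */
        munlock(buf, len);
        free(buf);
        return 0;
    }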
When disabling overcommit, what you'd need to tell Linux is actually quite different: it's that you are about to start to use a new address range, so please make sure that there is enough free swap space for this address range (and reserve it!).
Only after the latter is completed would the kernel be allowed to allocate new pages for this range, and the program would be allowed to use them. The kernel would also be free to swap them out to disk as normal, unless mlock() were called (but only after those pages are already in use, not before!).
So as you can see, mlock() accomplishes something very different and is orthogonal to the functionality I'm discussing, which means it can be used (or not) independently of this new feature to reserve swap space.
This new functionality (notifying Linux that you are about to use a new address range) would also implicitly allow Linux not to reserve swap space for any address range for which it hasn't been notified. That would let Linux use swap space much more efficiently, and would allow users to disable memory overcommit on Linux without causing a bunch of unnecessary program failures / crashes.
mmap(), on the other hand, normally does two things:
1. It allocates a range of address space.
2. It reserves swap space for this address range (when memory overcommit is disabled).
Notably (and people get confused by this a lot), mmap() doesn't actually allocate memory (i.e. memory pages); it only assigns a range of address space for the program to use, exactly as large as the program requested, and then reserves swap space for it, if required.
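You can observe this directly: resident memory barely changes at mmap() time and only grows as pages are touched. A quick sketch (shelling out to grep /proc is crude, but it illustrates the point):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>

    static void show_rss(void) {
        char cmd[64];
        snprintf(cmd, sizeof cmd, "grep VmRSS /proc/%d/status", (int)getpid());
        system(cmd);
    }

    int main(void) {
        size_t size = (size_t)1 << 30;  /* 1 GiB of address space */
        char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) return 1;

        show_rss();  /* RSS is still tiny: no pages allocated yet */

        for (size_t i = 0; i < size; i += 4096)
            p[i] = 1;  /* touching a page is what actually allocates it */

        show_rss();  /* now RSS has grown by roughly 1 GiB */
        return 0;
    }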
What I'm proposing is to separate those two things, i.e. allow programs to allocate address ranges separately from doing the swap space reservation.
If you don't separate these two things then programs allocating a huge range of address space will simply fail due to lack of available swap space (again, when memory overcommit is disabled, which is the goal here).
> But I think the OS should try to reserve more space for the swap file before it starts returning ENOMEM from the "commit memory" system calls, which is what Windows seems to be doing (judging from the blog post)...
One could just as easily argue similarly with respect to Unix and needing to loop over write watching for EINTR et al.
> One could just as easily argue similarly with respect to Unix and needing to loop over write watching for EINTR et al.
I'm sorry, but I fail to see how that is related.
EINTR is useful so that you can handle a signal in case it is received in the middle of a very long system call. For example, if an application is doing a write() on an NFS filesystem and the NFS server is not reachable (due to some network outage), the write() syscall could take minutes or hours before it completes.
So it's good, for example, that you can Ctrl-C the process or send it a SIGTERM signal and abort the syscall in the middle of it, letting the application handle the signal gracefully.
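(For context, the pattern being referred to is the classic retry loop, something like this:)

    #include <errno.h>
    #include <unistd.h>

    /* Write all of buf, retrying when a signal interrupts the syscall. */
    ssize_t write_all(int fd, const char *buf, size_t len) {
        size_t done = 0;
        while (done < len) {
            ssize_t n = write(fd, buf + done, len - done);
            if (n < 0) {
                if (errno == EINTR)
                    continue;  /* interrupted by a signal: just retry */
                return -1;     /* a real error */
            }
            done += (size_t)n;
        }
        return (ssize_t)done;
    }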
What I'm talking about is not related because allocating disk space on local disk (where the swap file would be located) is generally a very quick process. Mind you, only the disk space allocation is needed for reserving swap space -- it's not necessary to write anything to the swap file. In fact, it's not even necessary to actually allocate disk space, but I digress.
And even if reserving swap space would take a long time, allowing the syscall to fail with EINTR would also work fine.
What is not fine is letting the application believe that the system has completely run out of memory when in fact a lot of disk space can still be used for swap reservation.
That Stack Exchange answer is kind of weird in the sense that it conflates different things. You can have demand paging without overcommitting (as NT does); the kernel simply needs to ensure that there is somewhere to commit a particular page, even if that page isn't committed yet.
The answer may not be the best-argued one, but it links to other useful discussions, and it is correct that the forking mechanism is an important factor.
Committing is limited by RAM plus swap. You’d have to reserve much more swap than is typically ever actually used by processes, at any given time.
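(On Linux with overcommit disabled, the same accounting is visible in /proc/meminfo, roughly:)

    CommitLimit  = swap + (overcommit_ratio / 100) * RAM    # the hard cap
    Committed_AS = total commit charge currently outstanding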
> You’d have to reserve much more swap than is typically ever actually used by processes, at any given time.
But if the swap typically wouldn't actually be used, then what's the problem with that?
Especially considering that the amount of disk space required for that is cheap.
And why not let users decide for themselves if they prefer to guarantee that their applications never get killed by the OS, at the cost of reserving some disk space that they probably wouldn't ever use (as most filesystems' performance nosedives after >90% disk usage anyway)?
>most filesystems' performance nosedives after >90% disk usage anyway
With modern filesystems using delayed allocation to reduce fragmentation, and SSDs reducing the cost of fragmentation, you can often get good performance at higher occupancy nowadays.
Sure, I know, but if you're talking about really modern filesystems (which do copy-on-write/CoW), then the fragmentation caused by CoW is even worse than that avoided by delayed allocation.
SSDs certainly alleviate this problem, but even in SSDs, sequential I/O can be much faster than random I/O.
Anyway, I guess my point is that the vast majority of systems don't run with >90% disk space usage, so reserving up to 10% of the filesystem for swap space is not unreasonable.
Note that this would just be a space reservation. You wouldn't need to actually allocate specific disk blocks or write anything to the swap file, unless the system starts running out of memory.
In reality, you'd need much less than 10% (in the vast majority of cases), especially if you have a Windows-like API where you can allocate address space separately from committing memory (which means uncommitted address space doesn't need swap space reservation).