RFR: JDK-8293313: NMT: Rework MallocLimit [v4]
Gerard Ziemski
gziemski at openjdk.org
Thu Jan 26 15:43:21 UTC 2023
On Thu, 26 Jan 2023 07:31:01 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:
>> The entire point of this effort is to catch elusive OOM errors. You even moved the memory-limit check to before we actually allocate, instead of the current approach where we check the limit only after we acquire new memory.
>>
>> Isn't assuming the worst-case scenario, where realloc() might allocate new memory first, in the same vein as that? We are just being cautious, trying to catch those hard-to-reproduce bugs.
>>
>> Basically we are saying that if the OS had to allocate new memory on top of the old one, and doing so would exceed the limit, we would like to know about it. Will it happen often? No. Will it happen during the lifetime of a memory-intensive process? Most likely yes.
>>
>> These cases might not occur very often, granted, but why not catch them if we can?
>>
>> Aren't there any platforms where we run Java where realloc() is implemented in terms of malloc() and memcpy()?
>
> I think it's just wrong. False positives are not helpful.
>
> 1) It's semantically wrong. The malloc limit should trigger when our global or category-specific malloc load exceeds a given threshold, be it either the malloc that just happened (old version) or the malloc we are about to do (new version). But I don't want the mechanism to second-guess what I'm telling it and proactively trigger alerts. Example: I set 10 MB as the limit and have a 5 MB buffer. Expanding this buffer to 5.5 MB should not trigger any alert. I don't want false positives; they are not helpful.
>
> 2) You are trying to predict an OOM kill caused by a malloc event, right? For that to work, the prediction must be at least somewhat correct. But you only have very limited knowledge. Some examples:
> - The allocation request is fulfilled from cached memory. No new memory needed at all. The smaller the allocation the more likely this is, and if the libc uses sbrk exclusively, this is very likely.
> - The libc always has per-allocation overhead. E.g. glibc adds, at the very least, roughly 40 bytes. Should I account for this? With many fine-grained allocations this can have an enormous impact. But the overhead depends on a number of factors, and every libc is different.
> - The allocation request is large and the libc decides to mmap it (not all do). So we allocate in page granularity. Should I now adjust my prediction for page size? What page size? The libc may still reuse already mapped memory though. Or the underlying kernel vm may do that.
> - The realloc is done by a new thread, and the libc uses thread-specific memory pools. A new memory pool is created. Is this memory committed immediately (e.g. older glibcs) or committed on demand? Depending on the answer, we may see a large bump in RSS or none at all. Depends on libc and libc version.
> - The realloc is large and cannot be enlarged in place. But I could think of several ways of enlarging such an allocation out-of-place without needing memory * 2. If even I can think of those, libc devs probably do too.
>
> So, these examples show that an allocation may not move the needle at all or may move the needle much farther than you think. Trying to guess this just makes the code complex and causes false positives that are difficult to argue about. All without having any real benefit.
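To make sure I understand the two semantics being contrasted, here is a minimal sketch using your 10 MB limit / 5 MB buffer example (the function names and the assumed 6 MB current load are illustrative, not the actual NMT code):

```cpp
#include <cassert>
#include <cstddef>

const std::size_t limit = 10 * 1024 * 1024;  // example limit: 10 MB

// Net-delta semantics: only the growth of the reallocated block counts
// against the current malloc load. Growing a 5 MB buffer to 5.5 MB under
// an assumed 6 MB load stays below the limit, so no alert fires.
bool would_exceed_net(std::size_t current_load,
                      std::size_t old_size, std::size_t new_size) {
  return current_load - old_size + new_size > limit;
}

// Worst-case semantics: assume realloc() allocates the new block before
// freeing the old one, so both sizes are live at once. The very same
// request now trips the limit: a false positive under net-delta semantics.
bool would_exceed_worst_case(std::size_t current_load,
                             std::size_t /*old_size*/, std::size_t new_size) {
  return current_load + new_size > limit;
}
```

With a 6 MB load, growing 5 MB to 5.5 MB passes the net-delta check but fails the worst-case check, which is exactly the false positive described above.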
Thank you for taking the time to explain.
I'm worried about those hard-to-pin-down mysterious issues, of which we have plenty, that could possibly be related to memory pressure. But I understand your point of view and will defer to you (even if 5% of me continues to be bothered by this).
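For reference, the worst case I had in mind is the one realloc() is always allowed to take: moving the block, which behaves exactly like malloc() + memcpy() + free(). A minimal standalone sketch (not HotSpot code) checking the one guarantee realloc() does give, that the old contents survive whether or not the block moves:

```cpp
#include <cstdlib>
#include <cstring>

// Whether a grow is in-place or a move is entirely up to the libc. This
// only verifies that the old contents survive the realloc() either way;
// the returned address may or may not equal the original one.
bool realloc_preserves_contents() {
  char* p = static_cast<char*>(std::malloc(16));
  if (p == nullptr) return false;
  std::strcpy(p, "hello");
  // Grow well past the original size so an in-place extension is unlikely.
  char* q = static_cast<char*>(std::realloc(p, 1024 * 1024));
  if (q == nullptr) { std::free(p); return false; }
  bool ok = (std::strcmp(q, "hello") == 0);
  std::free(q);
  return ok;
}
```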
-------------
PR: https://git.openjdk.org/jdk/pull/11371
More information about the hotspot-dev mailing list