RFR: JDK-8291878: NMT: Diagnostic malloc limits [v2]
Thomas Stuefe
stuefe at openjdk.org
Mon Aug 8 08:17:54 UTC 2022
On Sat, 6 Aug 2022 12:14:45 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:
>> This PR introduces malloc limits, similar to what `MallocMaxTestWords` was intending. `MallocMaxTestWords` is broken, but this one works fine since it is based on NMT. If this one is in, I'd like to remove `MallocMaxTestWords` or, if people really care, redirect it to the new switch.
>>
>> ----
>>
>> Why this is useful:
>>
>> We recently analyzed [JDK-8291919](https://bugs.openjdk.org/browse/JDK-8291919), a jdk11u-specific regression that caused a compiler arena to explode. We used to have such problems in the past a lot, when our CPU ports were young. They are rarer nowadays but still happen.
>>
>> A switch to limit compiler-related mallocs would have been nice: something to cause the VM to stop with a fatal error in the compiler allocation path when the compiler arena size reached a certain point. I first tried `MallocMaxTestWords`, but that turned out to be broken since it does not de-account memory allocations.
>>
>> We finally managed to get a retry file by reproducing the bug locally and ulimit-ing the virtual process size, but it was cumbersome. A simple switch like `MallocMaxTestWords` would have been much better.
>>
>> In addition to customer scenarios like these, such a switch could be used to add sanity checks to compiler jtreg tests. Maybe we could have caught [JDK-8291919](https://bugs.openjdk.org/browse/JDK-8291919) before shipment.
>>
>> -----
>>
>> How it works:
>>
>> Patch introduces a new diagnostic switch `-XX:MallocLimit`. That switch can be used in two ways:
>>
>> 1 impose a total global limit to the size hotspot is allowed to malloc:
>>
>> -XX:MallocLimit=<size>
>>
>> 2 impose a limit to a selected NMT category, or to multiple NMT categories:
>>
>> -XX:MallocLimit=<category>:<size>[,<category>:<size>...]
>>
>>
>> If the switch is set, and the VM mallocs more in total (1) or for the given category (2), it will now stop with a fatal error. That way we can e.g. limit compiler arenas to a certain maximum in situations where the compiler runs amok, and get a compiler retry file. See here, with an artificial compiler bug introduced:
>>
>>
>> thomas at starfish$ ./images/jdk/bin/java -XX:NativeMemoryTracking=summary -XX:+UnlockDiagnosticVMOptions -XX:MallocLimit=compiler:1g -jar /shared/projects/spring-petclinic/target/spring-petclinic-2.5.0-SNAPSHOT.jar
>> #
>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> # Internal Error (mallocTracker.cpp:146), pid=519822, tid=519836
>> # guarantee(false) failed: MallocLimit: category "Compiler" reached limit (size: 1073765608, limit: 1073741824)
>> #
>> ...
>> # An error report file with more information is saved as:
>> # /shared/projects/openjdk/jdk-jdk/output-release/hs_err_pid519822.log
>> #
>> # Compiler replay data is saved as:
>> # /shared/projects/openjdk/jdk-jdk/output-release/replay_pid519822.log
>> #
>>
>> -----
>>
>> The patch:
>> - adds the option and its handling to NMT
>> - adds regression tests.
>
> Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision:
>
> Fix Test
Hi David,
> I can see there is some benefit for artificially forcing an OOM condition, but at what cost? What is the overhead of doing this?
I ran an artificial scenario - 100 threads doing 64 million allocations in a tight loop - with NMT in summary mode, three runs of five cycles each. Note that this is a very exaggerated scenario that we will never encounter in the wild.
                                                                  average seconds   overhead
Stock JVM, NMT=summary:                                           17.42
modified JVM, NMT=summary, malloc limit off:                      17.19             not measurable
modified JVM, NMT=summary, malloc limit per category (internal):  17.16             not measurable
modified JVM, NMT=summary, malloc limit global:                   28.34             65%
The overhead if NMT is disabled, or if NMT is enabled but MallocLimit is unused, is zero. If used in its per-category form (`MallocLimit=<category>:<limit>`), the overhead is so small it's not measurable.
It gets much more expensive when used in its global form (`MallocLimit=<limit>`).
Explanation:
NMT already keeps malloc counters, two per category (of which there are 26), and updates one or two of them atomically on each malloc. This happens in summary mode. So we already pay for the cost of counter updates. MallocLimit just piggybacks on that, comparing the counter values to the (runtime constant) per-category limit.
If used in its global form (`MallocLimit=<limit>`), the limit needs to be compared against the total malloc'd size. NMT does not keep a total malloc counter, so that total has to be calculated on each allocation by adding up all category counters. That is expensive.
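To illustrate the difference, here is a minimal sketch with hypothetical names (the real NMT accounting code in mallocTracker differs; this only models the cost structure described above):

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical, simplified model of NMT's per-category accounting.
constexpr int num_categories = 26;
std::atomic<size_t> g_malloc_size[num_categories] = {}; // NMT already maintains these
size_t g_category_limit[num_categories] = {0, 50};      // 0 == no limit; category 1
                                                        // limited to 50 bytes for the demo

// Per-category check: one atomic add (a cost NMT summary mode already pays)
// plus a single compare against a runtime-constant limit.
bool account_and_check_category(int cat, size_t bytes) {
  size_t now = g_malloc_size[cat].fetch_add(bytes, std::memory_order_relaxed) + bytes;
  return g_category_limit[cat] == 0 || now <= g_category_limit[cat];
}

// Global check: with no running total kept, all category counters must be
// summed on every allocation - the expensive part measured above.
bool check_global(size_t global_limit) {
  size_t total = 0;
  for (int i = 0; i < num_categories; i++) {
    total += g_malloc_size[i].load(std::memory_order_relaxed);
  }
  return global_limit == 0 || total <= global_limit;
}
```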
I can solve that. I am working on a follow-up patch that reduces the cost for the global limit to almost zero. But I was hesitant to include it in this PR since it would make the patch more difficult to review. I was planning it as a follow-up, reasoning that the cost for a new feature does not matter so much since it is not used by default. What do you think? Should I expand this patch to reduce the costs? 65% sounds bad :-(
If I do a bigger patch, I may just as well remove MallocMaxTestWords as part of the same change.
> It mentions this needs NMT but I don't see anything that rejects the option if NMT is not present. Also not clear if it needs NMT to be enabled to a specific level of detail, or simply present.
You are right, I need to clarify this.
NMT needs to be enabled, but summary mode suffices. It is needed because we must know the allocation size when free is called, and only NMT keeps that size in its malloc headers. See also https://bugs.openjdk.org/browse/JDK-8058897, where Dan Daugherty mentions this in the comments.
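As a rough sketch of why the header matters (hypothetical struct and function names; the real NMT header layout is different), de-accounting on free only works because the size was recorded next to the block at malloc time:

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical sketch: a small header prepended to each malloc'd block.
// Without the recorded size, the free path could not know how many bytes
// to subtract from the category counter.
struct MallocHeader {
  size_t size;       // requested allocation size
  uint32_t category; // category the block was accounted to
};

void* tracked_malloc(size_t size, uint32_t category) {
  MallocHeader* h = (MallocHeader*)malloc(sizeof(MallocHeader) + size);
  if (h == nullptr) return nullptr;
  h->size = size;
  h->category = category;
  // ...account +size to the category counter here...
  return h + 1; // caller sees only the payload
}

size_t tracked_free(void* p) {
  MallocHeader* h = (MallocHeader*)p - 1;
  size_t size = h->size; // recovered size enables exact de-accounting
  // ...account -size here...
  free(h);
  return size;
}

// Allocate-then-free round trip; returns true if the recorded size survives.
bool roundtrip(size_t n) {
  void* p = tracked_malloc(n, 3);
  if (p == nullptr) return false;
  return tracked_free(p) == n;
}
```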
--
Side note, about NMT costs in general. I recently [looked into that](https://stackoverflow.com/questions/73126185/what-is-overhead-of-java-native-memory-tracking-in-summary-mode/73175766#73175766) since our customers like to keep NMT on in production scenarios. The atomic counters NMT keeps can hurt performance, but only in scenarios that are pathological anyway, or at least are rare corner cases. Should we ever decide that this needs to be improved, we could reduce these costs by buffering the counter updates thread-locally. But so far the way hotspot uses malloc seems not to justify the added complexity. Under normal conditions, malloc is not that hot.
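The thread-local buffering idea mentioned above could look roughly like this (a sketch of the general technique, not HotSpot code; names and the flush threshold are invented):

```cpp
#include <atomic>

// Sketch: each thread accumulates counter deltas locally and only touches
// the shared atomic once the pending delta crosses a threshold. This trades
// momentary counter accuracy for far fewer contended atomic operations.
std::atomic<long> g_shared_counter{0};
constexpr long flush_threshold = 4096;

thread_local long tl_pending = 0;

void buffered_add(long bytes) {
  tl_pending += bytes;
  if (tl_pending >= flush_threshold || tl_pending <= -flush_threshold) {
    g_shared_counter.fetch_add(tl_pending, std::memory_order_relaxed);
    tl_pending = 0;
  }
}

// Demo: a small update stays buffered; crossing the threshold flushes.
bool demo() {
  buffered_add(100);
  if (g_shared_counter.load() != 0) return false; // still buffered locally
  buffered_add(4000);                             // 4100 >= threshold: flush
  return g_shared_counter.load() == 4100;
}
```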
-------------
PR: https://git.openjdk.org/jdk/pull/9778
More information about the hotspot-runtime-dev mailing list