RFR: JDK-8293313: NMT: Rework MallocLimit [v2]

Mon Jan 16 14:34:18 UTC 2023

On Tue, 6 Dec 2022 07:07:33 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:

>> Rework NMT `MallocLimit` to make it more useful for native OOM analysis. Also remove `MallocMaxTestWords`, which had been mostly redundant with NMT (for background, see JDK-8293292).
>> 
>> 
>> ### Background
>> 
>> Some months ago we had [JDK-8291919](https://bugs.openjdk.org/browse/JDK-8291919), a compiler regression that caused compiler arenas to grow boundlessly. Process was usually OOM-killed before we could take a look, so this was difficult to analyze.
>> 
>> To improve analysis options, we added NMT *malloc limits* with [JDK-8291878](https://bugs.openjdk.org/browse/JDK-8291878). Malloc limits let us set one or multiple limits to malloc load. Surpassing a limit causes the VM to stop with a fatal error in the allocation function, giving us a hs-err file and a core right at the point that (probably) causes the leak. This makes error analysis a lot simpler, and is also valuable for regression-testing footprint usage.
>> 
>> Some people wished for a way to not end with a fatal error but to mimic a native OOM (causing os::malloc to return NULL). This would allow us to test resilience in the face of native OOMs, among other things.
>> 
>> ### Patch
>> 
>> - Expands the `MallocLimit` option syntax, allowing for an "oom mode" that mimics an oom:
>> 
>> 
>> 
>> Global form:
>> -XX:MallocLimit=<size>[:<mode>]
>> Category-specific form:
>> -XX:MallocLimit=<category>:<size>[:<mode>][,<category>:<size>[:<mode>] ...]
>> Examples:
>> -XX:MallocLimit=3g
>> -XX:MallocLimit=3g:oom
>> -XX:MallocLimit=compiler:200m:oom
>> 
>> 
>> - moves parsing of `-XX:MallocLimit` out of arguments.cpp into `mallocLimit.hpp/cpp`, and rewrites it.
>> 
>> - changes when limits are checked. Before, the limits were checked *after* the allocation that caused the threshold overflow. Now we check beforehand.
>> 
>> - removes `MallocMaxTestWords`, replacing its uses with `MallocLimit`
>> 
>> - adds new gtests and new jtreg tests
>> 
>> - removes a bunch of jtreg tests which are now better served by the new gtests.
>> 
>> - gives us more precise error messages upon reaching a limit. For example:
>> 
>> before:
>> 
>> #  fatal error: MallocLimit: category "Compiler" reached limit (size: 20997680, limit: 20971520)
>> 
>> 
>> now:
>> 
>> #  fatal error: MallocLimit: reached category "Compiler" limit (triggering allocation size: 1920B, allocated so far: 20505K, limit: 20480K)
>
> Thomas Stuefe has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision:
> 
>  - Merge branch 'master' into JDK-8293313-NMT-fake-oom
>  - MallocLimit

This looks good. I've added some nits for style and a couple of comments concerning style which aren't nits. Logic wise, easy to follow and straight forward code, I could find nothing to complain about.

src/hotspot/share/services/mallocLimit.cpp line 43:

> 41:   const char* const _s;
> 42:   const char* const _end;
> 43:   const char* _p;

Please add a comment for what these are! Or a general idea of what occurs (I assume `_p == _end` means that parsing has succeeded?)

src/hotspot/share/services/mallocLimit.cpp line 53:

> 51:   // Advance position on match.
> 52:   bool match_mode_flag(malloclimit_mode_t* out) {
> 53:     if (!eof()) {

Nit:
Swapping the branch condition avoids indenting the main body of code:

if (eof()) return false;
// Go on without else branch

src/hotspot/share/services/mallocLimit.cpp line 71:

> 69:   bool match_category(MEMFLAGS* out) {
> 70:     if (!eof()) {
> 71:       const char* end = strchr(_p, ':');

Nit: This is `strchrnul`

src/hotspot/share/services/mallocLimit.hpp line 33:

> 31: #include "utilities/globalDefinitions.hpp"
> 32: 
> 33: enum class malloclimit_mode_t {

Can we use Hotspot style (`MallocLimitMode`) for this and `malloclimit_t`?

src/hotspot/share/services/mallocLimit.hpp line 40:

> 38: struct malloclimit_t {
> 39:   size_t v;               // Limit size
> 40:   malloclimit_mode_t f;   // Behavior flags

Nit: `sz` and `lim` are short names that still gives more info than `v`, which is nice when reading the code.

test/hotspot/gtest/nmt/test_nmt_malloclimit.cpp line 57:

> 55: 
> 56: static void test(const char* s, const MallocLimitSet& expected) {
> 57: //  LOG_HERE("%s", s);

Seems like a left over from development.

test/hotspot/gtest/nmt/test_nmt_malloclimit.cpp line 60:

> 58:   MallocLimitSet set;
> 59:   const char* err;
> 60:   ASSERT_TRUE(set.parse_malloclimit_option(s, &err)) << err;

Using EXPECT would let the test proceed, is it necessary to use ASSERT here?

test/hotspot/gtest/nmt/test_nmt_malloclimit.cpp line 82:

> 80: }
> 81: 
> 82: TEST(NMT, malloclimit_per_category) {

>malloclimit_per_category

Can we use CamelCase for test case names instead? This is a convention that we *should* use for gtest: http://google.github.io/googletest/faq.html

Unfortunately, we don't consistently do this, but I do want to avoid introducing new instances of this.

test/hotspot/jtreg/runtime/Unsafe/Reallocate.java line 65:

> 63:         // Make sure we can throw an OOME when we fail to reallocate due to OOM
> 64:         try {
> 65:             unsafe.reallocateMemory(address, 100 * 1024 * 1024);

Why are we changing the test here? Just because 800m is a lot larger than the required 100m?

-------------

PR: https://git.openjdk.org/jdk/pull/11371