RFR: 8315923: pretouch_memory by atomic-add-0 fragments huge pages unexpectedly [v17]

Tue Jan 9 10:29:31 UTC 2024

On Thu, 28 Dec 2023 20:17:26 GMT, Kim Barrett <kbarrett at openjdk.org> wrote:

>> test/hotspot/gtest/runtime/test_os_linux.cpp line 377:
>> 
>>> 375:   EXPECT_TRUE(os::release_memory(heap, 1 * G));
>>> 376:   UseTransparentHugePages = useThp;
>>> 377: }
>> 
>> This seems like it's concurrently running `madvise(..., MADV_POPULATE_WRITE)`, correct? This is not what I meant.
>> 
>> What I meant was having at least 2 threads, where one thread is running `os::pretouch_memory` and another using the memory for something. For example, 1 thread pretouching, the other thread filling out the memory with an incrementing integer array `[0,1,2,3,4,...]`. I think this is what Kim meant also, or am I the one misunderstanding him?
>
> [Sorry, I lost track of this and didn't respond to the earlier comment from
> @jdksjolen.]
> 
> Yes, that's correct.  The reason for adding the safe for concurrent use
> pretouch mechanism was https://bugs.openjdk.org/browse/JDK-8260332.
> 
> The idea is that presently, when a thread needs to expand the oldgen, it
> pretouches while holding the expansion lock.  Any other threads that also need
> need the oldgen to be expanded have to wait until the holder of that lock
> completes.  Most of the work involved in expansion is quick and short, but not
> so much for pretouching.  So it was found that we're sometimes blocking a
> bunch of threads for a long-ish time.
> 
> The original proposal there was to allow the otherwise waiting threads to
> cooperate in the pretouch.  But the protocol involved was complicated and
> messy.  A simpler approach was suggested; allow other threads to use the newly
> expanded memory concurrently with the expanding thread doing the pretouch.
> There's obviously some racing there, with the using threads possibly touching
> pages before the pretouching reaches them, but the thinking is that the
> pretouched wave-front will likely surge ahead of the using threads.  And if
> not, then the using threads are effectively cooperating in the "pretouch".
> 
> That approach needed https://bugs.openjdk.org/browse/JDK-8272807 as a building
> block.
> 
> But I discovered there were a bunch of places with similar problems,
> suggesting the need for some more general mechanism. I did a bit of
> prototyping in that direction, but got distracted by other work and haven't
> gotten back to it. (The idea is to record needed pretouching, deferring it up
> the call chain, to a point where other threads are not being blocked waiting
> for the expansion operation. A complicating factor is that some of those
> places may have multiple distinct memory ranges being allocated and needing
> pretouch, all within the same expansion operation.)
> 
> But that approach may interact poorly with the madvise approach. It might be
> that the madvise _should_ be done down inside the expansion operation where
> the pretouches currently happen, rather than being deferred up the call chain
> and permitting the madvise to be concurrent with using threads that might
> introduce the same "shredding" problem the madvise is attempting to fix. That
> would be yet another complicating factor that my prototyping didn't address at
> all.

@limingliu-ampere 's original test was with JVM flags like: `-Xmx24g -Xms24g -Xmn22g -XX:+UseParallelGC -XX:+AlwaysPreTouch -XX:+TransparentHugePages -XX:-UseAdaptiveSizePolicy` etc. Having `-XX:-UseAdaptiveSizePolicy` ensures that `heap->resize_old_gen(size_policy->calculated_old_free_size_in_bytes());` inside `PSParallelCompact::invoke_no_policy` will not be called in this test (see [psParallelCompact.cpp#L1855](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/gc/parallel/psParallelCompact.cpp#L1855)), then it would **not** run into the concern of allocate/expand/pretouch cooperating case on the oldgen, mentioned above by @kimbarrett. 

With regards to `that approach may interact poorly with the madvise approach`, all pretouching triggered by `PSOldGen::expand(size_t bytes)` are currently wrapped by `MutexLocker x(PSOldGenExpand_lock)`. From this viewpoint, the _madvise_ approach does the pretouching work at the same situation as the original _atomic-add-0_ approach. The proposed patch does not make the potential "shredding" problem on the expansion of oldgen worse. Furthermore, back to the table @limingliu-ampere attached at the initial part of this PR, on Kernel 6.1, with `-XX:+TransparentHugePages`, the _madvise_ approach speeds up the pretouching operation from  _atomic-add-0_'s **3.54s** to **0.33s**, which can be an obvious optimization in a manner.

The initial purpose of this patch was to solve an outstanding performance issue on some commercial benchmarks, especially when running with huge heaps, for example, `-Xms200g`, or `-Xms400g`, together with `-XX:+UseParallelGC -XX:+AlwaysPreTouch -XX:+TransparentHugePages -XX:-UseAdaptiveSizePolicy`. The performance regression got well resolved by the patch, and the improvement vs baseline was up to 30%.

All in all, I think this 4-month-old PR is a positive change, and solved a practical performance problem effectively. 

@limingliu-ampere will soon have a update to answer @jdksjolen 's question: `2 threads, where one thread is running os::pretouch_memory and another using the memory for something`.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/15781#discussion_r1445905192