RFR: 8315923: pretouch_memory by atomic-add-0 fragments huge pages unexpectedly [v17]

Liming Liu duke at openjdk.org
Wed Jan 10 02:10:33 UTC 2024


On Tue, 9 Jan 2024 10:26:19 GMT, Patrick Zhang <qpzhang at openjdk.org> wrote:

>> [Sorry, I lost track of this and didn't respond to the earlier comment from
>> @jdksjolen.]
>> 
>> Yes, that's correct.  The reason for adding the safe for concurrent use
>> pretouch mechanism was https://bugs.openjdk.org/browse/JDK-8260332.
>> 
>> The idea is that presently, when a thread needs to expand the oldgen, it
>> pretouches while holding the expansion lock.  Any other threads that also
>> need the oldgen to be expanded have to wait until the holder of that lock
>> completes.  Most of the work involved in expansion is quick and short, but not
>> so much for pretouching.  So it was found that we're sometimes blocking a
>> bunch of threads for a long-ish time.
>> 
>> The original proposal there was to allow the otherwise waiting threads to
>> cooperate in the pretouch.  But the protocol involved was complicated and
>> messy.  A simpler approach was suggested: allow other threads to use the newly
>> expanded memory concurrently with the expanding thread doing the pretouch.
>> There's obviously some racing there, with the using threads possibly touching
>> pages before the pretouching reaches them, but the thinking is that the
>> pretouched wave-front will likely surge ahead of the using threads.  And if
>> not, then the using threads are effectively cooperating in the "pretouch".
>> 
>> That approach needed https://bugs.openjdk.org/browse/JDK-8272807 as a building
>> block.
>> 
>> But I discovered there were a bunch of places with similar problems,
>> suggesting the need for some more general mechanism. I did a bit of
>> prototyping in that direction, but got distracted by other work and haven't
>> gotten back to it. (The idea is to record needed pretouching, deferring it up
>> the call chain, to a point where other threads are not being blocked waiting
>> for the expansion operation. A complicating factor is that some of those
>> places may have multiple distinct memory ranges being allocated and needing
>> pretouch, all within the same expansion operation.)
>> 
>> But that approach may interact poorly with the madvise approach. It might be
>> that the madvise _should_ be done down inside the expansion operation where
>> the pretouches currently happen, rather than being deferred up the call chain
>> and permitting the madvise to be concurrent with using threads that might
>> introduce the same "shredding" problem the madvise is attempting to fix. That
>> would be yet another complicating factor that my prototyping didn't address at
>> all.
>
> @limingliu-ampere 's original test was run with JVM flags like `-Xmx24g -Xms24g -Xmn22g -XX:+UseParallelGC -XX:+AlwaysPreTouch -XX:+TransparentHugePages -XX:-UseAdaptiveSizePolicy`. Having `-XX:-UseAdaptiveSizePolicy` ensures that `heap->resize_old_gen(size_policy->calculated_old_free_size_in_bytes());` inside `PSParallelCompact::invoke_no_policy` is not called in this test (see [psParallelCompact.cpp#L1855](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/gc/parallel/psParallelCompact.cpp#L1855)), so the test does **not** run into the allocate/expand/pretouch cooperation concern on the oldgen that @kimbarrett mentioned above.
> 
> With regard to `that approach may interact poorly with the madvise approach`: all pretouching triggered by `PSOldGen::expand(size_t bytes)` is currently wrapped by `MutexLocker x(PSOldGenExpand_lock)`. From this viewpoint, the _madvise_ approach does the pretouching work under the same conditions as the original _atomic-add-0_ approach, so the proposed patch does not make the potential "shredding" problem on oldgen expansion any worse. Furthermore, going back to the table @limingliu-ampere attached at the start of this PR: on kernel 6.1, with `-XX:+TransparentHugePages`, the _madvise_ approach speeds up the pretouching operation from _atomic-add-0_'s **3.54s** to **0.33s**, which is a substantial optimization.
> 
> The initial purpose of this patch was to solve an outstanding performance issue in some commercial benchmarks, especially when running with huge heaps, for example `-Xms200g` or `-Xms400g`, together with `-XX:+UseParallelGC -XX:+AlwaysPreTouch -XX:+TransparentHugePages -XX:-UseAdaptiveSizePolicy`. The performance regression was fully resolved by the patch, with an improvement over baseline of up to 30%.
> 
> All in all, I think this 4-month-old PR is a positive change that solves a practical performance problem effectively. 
> 
> @limingliu-ampere will soon have an update to answer @jdksjolen 's question: `2 threads, where one thread is running os::pretouch_memory and another using the memory for something`.

The testcase has been changed to use one thread writing integers to memory while seven threads pretouch the same memory, and it checks that the contents are preserved after the concurrent use and pretouch.

The testcase passed on both linux-x64 and linux-x86. Please check it, thanks.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/15781#discussion_r1446806236


More information about the hotspot-runtime-dev mailing list