RFR: 8272807: Permit use of memory concurrent with pretouch

Kim Barrett kbarrett at openjdk.java.net
Sun Aug 29 12:07:41 UTC 2021


On Tue, 24 Aug 2021 10:58:24 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:

> Yeah, the overhead is measurable. See for example Epsilon with 100G heap (several runs, most typical result is shown):
> 
> ```
> $ time ~/trunks/jdk/build/baseline/bin/java -Xms100g -Xmx100g -XX:+AlwaysPreTouch -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC Hello
> Hello!
> 
> real	0m23.075s
> user	0m1.880s
> sys	0m21.108s
> 
> $ time ~/trunks/jdk/build/patched/bin/java -Xms100g -Xmx100g -XX:+AlwaysPreTouch -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC Hello
> Hello!
> 
> real	0m23.568s  ; + 500ms
> user	0m2.306s    ; + 420ms
> sys	0m21.189s  ; + 80ms (noise?)
> ```
> 
> This correlates with 100G / 4K = 25M pages to touch with atomics, which gives us roughly an additional 500ms / 25M = 20 ns per atomic per page (most likely a cache-missing atomic costing extra). In the test above, this adds up to ~2% overhead. I do believe this overhead is inconsequential (since the user already kind of gives up startup performance "privileges" with `-XX:+AlwaysPreTouch` anyway), especially if we are able to leverage this feature to pre-touch the heap in the background in future RFEs.
> 
> And this is x86_64. On AArch64, I see that the call to the helper seems to always use `memory_order_conservative`.

That's a good idea, using the heap pretouch from a simple collector like that.
I went a little further and patched os::pretouch_memory to report the time
when the range size is large.  I also had to pin the application to a single
NUMA node to get anything like reproducible and comparable results.
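
The instrumentation itself is nothing fancy. A standalone sketch of the same
idea (plain mmap plus std::chrono timing rather than the HotSpot-internal
reporting I actually used; the 1 GiB range and 4K page size are just
illustrative) looks like:

```cpp
// Standalone sketch of timing a store-0 pretouch pass over an mmap'ed range.
// Not the actual HotSpot patch; range size and page size are illustrative.
#include <sys/mman.h>
#include <chrono>
#include <cstddef>
#include <cstdio>

int main() {
  const size_t page = 4096;
  const size_t size = size_t(1) << 30;   // 1 GiB is enough to see the trend
  char* base = static_cast<char*>(mmap(nullptr, size, PROT_READ | PROT_WRITE,
                                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
  if (base == MAP_FAILED) { return 1; }

  auto t0 = std::chrono::steady_clock::now();
  for (volatile char* p = base; p < base + size; p += page) {
    *p = 0;                              // store-0 touch: one write per page
  }
  auto t1 = std::chrono::steady_clock::now();

  auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
  std::printf("pretouched %zu MiB in %lld ms\n", size >> 20, (long long)ms);
  munmap(base, size);
  return 0;
}
```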

I was not able to reproduce the difference you are seeing on x86_64. So far as
I can tell, the performance of the original store-0 pretouch and the proposed
atomic-add0 pretouch is the same (no statistically significant difference)
across three different machines.
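
For concreteness, the two loop bodies I'm comparing look roughly like this,
expressed with std::atomic rather than HotSpot's Atomic:: wrappers (a sketch,
not the code from the PR):

```cpp
#include <atomic>
#include <cstddef>

// Original behavior: a plain store of zero to the first byte of each page.
// Cheap, but it clobbers anything another thread has already written there.
static void touch_store0(char* start, char* end, size_t page) {
  for (volatile char* p = start; p < end; p += page) {
    *p = 0;
  }
}

// Proposed behavior: an atomic add of zero to the first word of each page.
// It still forces the page to be backed, but leaves any concurrently
// written value unchanged.
static void touch_atomic_add0(char* start, char* end, size_t page) {
  for (char* p = start; p < end; p += page) {
    reinterpret_cast<std::atomic<int>*>(p)->fetch_add(0, std::memory_order_relaxed);
  }
}
```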

I also did the same measurements on an aarch64 machine. I was surprised that
its baseline pretouch performance was more than 10x faster (per unit of
touched memory) than on the x86_64 machines I had tested. I don't know where
that difference comes from. There is some difference because the default page
size is larger on aarch64, so half as many pages (each twice the size) are
involved. But I don't see how that could account for such a large difference.

Unfortunately, the good news stops there. The atomic-add0 pretouch was about
1/3 slower than the store-0 pretouch on that machine. That machine supports
LSE, so I also tried forcing that off (falling back to a conservative atomic
add), as well as variants that used a relaxed cmpxchg (both with and without
LSE).  All were similar to one another, and all about 1/3 slower than store-0.
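
The relaxed-cmpxchg variant, in the same std::atomic sketch form (again not
the exact HotSpot code), is just:

```cpp
#include <atomic>
#include <cstddef>

// Relaxed cmpxchg variant: try to replace a zero word with zero.  If the
// compare fails the word was already nonzero, which means the page is
// already backed, so nothing more is needed.
static void touch_cmpxchg0(char* start, char* end, size_t page) {
  for (char* p = start; p < end; p += page) {
    std::atomic<int>* word = reinterpret_cast<std::atomic<int>*>(p);
    int expected = 0;
    word->compare_exchange_strong(expected, 0, std::memory_order_relaxed);
  }
}
```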

The touch times on aarch64 are good enough that it makes me wonder whether
pretouching is actually worth doing there at all, assuming that performance
isn't model- or manufacturer-specific or something like that.

-------------

PR: https://git.openjdk.java.net/jdk/pull/5215
