RFR: 8310160: Make GC APIs for handling archive heap objects agnostic of GC policy [v2]
Ioi Lam
iklam at openjdk.org
Wed Sep 13 23:17:44 UTC 2023
On Wed, 26 Jul 2023 04:28:25 GMT, Ioi Lam <iklam at openjdk.org> wrote:
>> @iklam I agree this is a much better approach and makes the whole process truly collector agnostic. Great work, especially the optimization to re-order the objects.
>>
>> Given that this has minimal impact on performance, are we good to go ahead with this approach now?
>>
>> One issue I noticed while doing some testing with the Shenandoah collector is probably worth pointing out here:
>> When using `-XX:+NahlRawAlloc` with a very small heap size like -Xmx4m or -Xmx8m, the java process freezes. This happens because the allocations for archive objects cause pacing to kick in, and the main thread waits on `ShenandoahPacer::_wait_monitor` [0] to be notified by `ShenandoahPeriodicPacerNotify` [1]. But the WatcherThread, which is responsible for executing the `ShenandoahPeriodicPacerNotify` task, does not run periodic tasks until VM init is done [2][3]. So the main thread is stuck.
>>
>> I guess if we do the wait in `ShenandoahPacer::pace_for_alloc` only after VM init is completed, it would resolve this issue.
>>
>> I haven't noticed this with `-XX:-NahlRawAlloc`; I'm not sure why that should make any difference.
>>
>> Here are the stack traces:
>>
>> main thread:
>>
>> #5 0x00007f5a1fafbafc in PlatformMonitor::wait (this=this at entry=0x7f5a180f6c78, millis=<optimized out>, millis at entry=10) at src/hotspot/os/posix/mutex_posix.hpp:124
>> #6 0x00007f5a1faa3f9c in Monitor::wait (this=0x7f5a180f6c70, timeout=timeout at entry=10) at src/hotspot/share/runtime/mutex.cpp:254
>> #7 0x00007f5a1fc2d3bd in ShenandoahPacer::wait (time_ms=10, this=0x7f5a180f6a20) at src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp:286
>> #8 ShenandoahPacer::pace_for_alloc (this=0x7f5a180f6a20, words=<optimized out>) at src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp:263
>> #9 0x00007f5a1fbfc7e1 in ShenandoahHeap::allocate_memory (this=0x7f5a180ca590, req=...) at src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp:855
>> #10 0x00007f5a1fbfcb5c in ShenandoahHeap::mem_allocate (this=<optimized out>, size=<optimized out>, gc_overhead_limit_was_exceeded=<optimized out>) at src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp:931
>> #11 0x00007f5a1f2402c2 in NewQuickLoader::mem_allocate_raw (size=6) at src/hotspot/share/cds/archiveHeapLoader.cpp:493
>> #12 NewQuickLoaderImpl<true, true>::allocate (this=<optimized out>, __the_thread__=<optimized out>, size=<synthetic pointer>: 6, stream=0x7f5a1d228850) at src/hotspot/share/cds/archiveHeapLoader.cpp:712
>> #13 NewQuickLoaderImpl<true, true>::load_archive_heap_inner<fals...
>
> Hi @ashu-mehra thanks for testing the patch. I think we all agree that the minor performance impact is acceptable because the code is simpler and more portable. I'll try to clean up my patch and start a PR.
>
> BTW, I have implemented a simpler relocation algorithm with similar performance. It uses less memory and hopefully will be easier to understand as well. The algorithm is described in comments inside archiveHeapLoader.cpp
>
> https://github.com/openjdk/jdk/compare/master...iklam:jdk:8310823-materialize-cds-heap-with-regular-alloc.alternative-relocation?expand=1
>
> As a prerequisite, I'll start a PR for [JDK-8251330: Reorder CDS archived heap to speed up relocation](https://bugs.openjdk.org/browse/JDK-8251330)
>
> Regarding raw allocation, it doesn't seem to be too much faster, so maybe we should disable it, at least for the initial integration.
>
>
> $ (for i in {1..6}; do perf stat -r 100 java -XX:+NewArchiveHeapLoading -XX:-NahlRawAlloc --version > /dev/null; done) 2>&1 | grep elapsed
> 0.0162332 +- 0.0000914 seconds time elapsed ( +- 0.56% )
> 0.0161316 +- 0.0000228 seconds time elapsed ( +- 0.14% )
> 0.0161171 +- 0.0000250 seconds time elapsed ( +- 0.16% )
> 0.0161311 +- 0.0000231 seconds time elapsed ( +- 0.14% )
> 0.0161433 +- 0.0000244 seconds time elapsed ( +- 0.15% )
> 0.0161121 +- 0.0000271 seconds time elapsed ( +- 0.17% )
> $ (for i in {1..6}; do perf stat -r 100 java -XX:+NewArchiveHeapLoading -XX:+NahlRawAlloc --version > /dev/null; done) 2>&1 | grep elapsed
> 0.0160640 +- 0.0000973 seconds time elapsed ( +- 0.61% )
> 0.0159320 +- 0.0000221 seconds time elapsed ( +- 0.14% )
> 0.0159910 +- 0.0000272 seconds time elapsed ( +- 0.17% )
> 0.0159406 +- 0.0000230 seconds time elapsed ( +- 0.14% )
> 0.0159930 +- 0.0000252 seconds time elapsed ( +- 0.16% )
> 0.0159670 +- 0.0000296 seconds time elapsed ( +- 0.19% )
> $ (for i in {1..6}; do perf stat -r 100 java -XX:-NewArchiveHeapLoading -XX:+NahlRawAlloc --version > /dev/null; done) 2>&1 | grep elapsed
> 0.0149069 +- 0.0000932 seconds time elapsed ( +- 0.63% )
> 0.0148363 +- 0.0000259 seconds time elapsed ( +- 0.17% )
> 0.0148077 +- 0.0000218 seconds time elapsed ( +- 0.15% )
> 0.0148377 +- 0.0000212 seconds time elapsed ( +- 0.14% )
> 0.0148411 +- 0.0000245 seconds time elapsed ( +- 0.17% )
> 0.0148504 +- 0.0000258 seconds time elapsed ( +- 0.17% )
> > @iklam I agree this is a much better approach and makes the whole process truly collector agnostic. Great work, especially the optimization to re-order the objects.
> > Given that this has minimal impact on performance, are we good to go ahead with this approach now?
> > One issue I noticed while doing some testing with the Shenandoah collector is probably worth pointing out here: When using `-XX:+NahlRawAlloc` with a very small heap size like -Xmx4m or -Xmx8m, the java process freezes. This happens because the allocations for archive objects cause pacing to kick in, and the main thread waits on `ShenandoahPacer::_wait_monitor` [0] to be notified by `ShenandoahPeriodicPacerNotify` [1]. But the WatcherThread, which is responsible for executing the `ShenandoahPeriodicPacerNotify` task, does not run periodic tasks until VM init is done [2][3]. So the main thread is stuck.
> > I guess if we do the wait in `ShenandoahPacer::pace_for_alloc` only after VM init is completed, it would resolve this issue.
>
> Yes, Shenandoah pacing should not run before VM init is complete. This seems like a separate issue.
>
> Great work in this PR, though!
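Regarding the pacing hang quoted above, here is a minimal sketch of the kind of guard being suggested for `ShenandoahPacer::pace_for_alloc` (an illustration, not the actual fix; it assumes `is_init_completed()` from `runtime/init.hpp` is usable in shenandoahPacer.cpp and elides the existing pacing logic):

```c++
#include "runtime/init.hpp"   // is_init_completed()

void ShenandoahPacer::pace_for_alloc(size_t words) {
  // Before VM init is complete, the WatcherThread does not run periodic
  // tasks, so ShenandoahPeriodicPacerNotify would never notify
  // _wait_monitor and the initial thread could block here indefinitely.
  // Skip pacing until init is done.
  if (!is_init_completed()) {
    return;
  }

  // ... existing logic: try to claim pacing budget, otherwise wait on
  // _wait_monitor for the periodic notifier ...
}
```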
I have done more testing with my prototype for JDK-8310823 (Materialize CDS archived objects with regular GC allocation) [1] [2] and found a few issues:
- The minimum heap requirement has increased significantly. We allocate many objects during VM start-up. This may fill up the young generation for some collectors, causing the VM to exit because a GC is not yet possible at this stage. Since the generation sizing policy is different in each GC (and not really controllable via `-XX:NewSize`), the failure mode with a small heap size becomes unpredictable. I think this is a functional regression, which is more serious than the performance regression in start-up time.
- Although it's possible to enable archive heap objects for ZGC with my JDK-8310823 patch, there's only a very marginal improvement in start-up time (probably because ZGC is doing many other tasks concurrently during start-up).
As we expect more heap objects to be archived in Project Leyden, we need a solution that scales well without blowing up the minimum heap size. For example, with the mainline, an 18MB archived heap can be mapped with -Xmx20M, but with my patch, I needed to use -Xmx40M.
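For concreteness, that comparison corresponds roughly to command lines like the following (an approximation; `-XX:+NewArchiveHeapLoading` is the flag from the perf runs above, and the exact failure behavior at -Xmx20M depends on the collector's generation sizing, as noted earlier):

```
# Mainline: the ~18MB archived heap is mapped directly, so a small heap is enough.
java -Xmx20M --version

# With the JDK-8310823 patch, the archived objects are materialized with
# regular GC allocations and need more headroom:
java -XX:+NewArchiveHeapLoading -Xmx20M --version   # may fail during start-up
java -XX:+NewArchiveHeapLoading -Xmx40M --version   # works
```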
To reduce the minimum heap requirement, @fisk has developed a prototype that materializes the archived objects incrementally. However, it's not clear whether future Leyden developments would support such incremental loading.
As we don't see any immediate benefits from JDK-8310823, I would suggest putting it on hold for now, until we get a better understanding of the requirements from Leyden.
------
[1] https://github.com/openjdk/jdk/pull/15730
[2] https://bugs.openjdk.org/browse/JDK-8310823
-------------
PR Comment: https://git.openjdk.org/jdk/pull/14520#issuecomment-1718431204