RFR: 8310160: Make GC APIs for handling archive heap objects agnostic of GC policy [v2]
Ioi Lam
iklam at openjdk.org
Wed Sep 13 23:17:44 UTC 2023
On Wed, 26 Jul 2023 04:28:25 GMT, Ioi Lam <iklam at openjdk.org> wrote:
>> @iklam I agree this is a much better approach and makes the whole process truly collector agnostic. Great work, especially the optimization to re-order the objects.
>>
>> Given that this has minimal impact on performance, are we good to go ahead with this approach now?
>>
>> One issue I noticed while doing some testing with the Shenandoah collector is probably worth pointing out here:
>> When using `-XX:+NahlRawAlloc` with a very small heap size like -Xmx4m or -Xmx8m, the java process freezes. This happens because the allocations for archive objects cause pacing to kick in, and the main thread waits on `ShenandoahPacer::_wait_monitor` [0] to be notified by `ShenandoahPeriodicPacerNotify` [1]. But the WatcherThread, which is responsible for executing the `ShenandoahPeriodicPacerNotify` task, does not run periodic tasks until VM init is done [2][3]. So the main thread is stuck.
>>
>> I guess if we do the wait in `ShenandoahPacer::pace_for_alloc` only after VM init is completed, it would resolve this issue.
>>
>> I haven't noticed this with `-XX:-NahlRawAlloc`; I'm not sure why that should make any difference.
>>
>> Here are the stack traces:
>>
>> main thread:
>>
>> #5 0x00007f5a1fafbafc in PlatformMonitor::wait (this=this at entry=0x7f5a180f6c78, millis=<optimized out>, millis at entry=10) at src/hotspot/os/posix/mutex_posix.hpp:124
>> #6 0x00007f5a1faa3f9c in Monitor::wait (this=0x7f5a180f6c70, timeout=timeout at entry=10) at src/hotspot/share/runtime/mutex.cpp:254
>> #7 0x00007f5a1fc2d3bd in ShenandoahPacer::wait (time_ms=10, this=0x7f5a180f6a20) at src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp:286
>> #8 ShenandoahPacer::pace_for_alloc (this=0x7f5a180f6a20, words=<optimized out>) at src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp:263
>> #9 0x00007f5a1fbfc7e1 in ShenandoahHeap::allocate_memory (this=0x7f5a180ca590, req=...) at src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp:855
>> #10 0x00007f5a1fbfcb5c in ShenandoahHeap::mem_allocate (this=<optimized out>, size=<optimized out>, gc_overhead_limit_was_exceeded=<optimized out>) at src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp:931
>> #11 0x00007f5a1f2402c2 in NewQuickLoader::mem_allocate_raw (size=6) at src/hotspot/share/cds/archiveHeapLoader.cpp:493
>> #12 NewQuickLoaderImpl<true, true>::allocate (this=<optimized out>, __the_thread__=<optimized out>, size=<synthetic pointer>: 6, stream=0x7f5a1d228850) at src/hotspot/share/cds/archiveHeapLoader.cpp:712
>> #13 NewQuickLoaderImpl<true, true>::load_archive_heap_inner<fals...
>
> Hi @ashu-mehra thanks for testing the patch. I think we all agree that the minor performance impact is acceptable because the code is simpler and more portable. I'll try to clean up my patch and start a PR.
>
> BTW, I have implemented a simpler relocation algorithm with similar performance. It uses less memory and hopefully will be easier to understand as well. The algorithm is described in comments inside archiveHeapLoader.cpp
>
> https://github.com/openjdk/jdk/compare/master...iklam:jdk:8310823-materialize-cds-heap-with-regular-alloc.alternative-relocation?expand=1
>
> As a prerequisite, I'll start a PR for [JDK-8251330: Reorder CDS archived heap to speed up relocation](https://bugs.openjdk.org/browse/JDK-8251330)
>
> Regarding raw allocation, it doesn't seem to be too much faster, so maybe we should disable it, at least for the initial integration.
>
>
> $ (for i in {1..6}; do perf stat -r 100 java -XX:+NewArchiveHeapLoading -XX:-NahlRawAlloc --version > /dev/null; done) 2>&1 | grep elapsed
> 0.0162332 +- 0.0000914 seconds time elapsed ( +- 0.56% )
> 0.0161316 +- 0.0000228 seconds time elapsed ( +- 0.14% )
> 0.0161171 +- 0.0000250 seconds time elapsed ( +- 0.16% )
> 0.0161311 +- 0.0000231 seconds time elapsed ( +- 0.14% )
> 0.0161433 +- 0.0000244 seconds time elapsed ( +- 0.15% )
> 0.0161121 +- 0.0000271 seconds time elapsed ( +- 0.17% )
> $ (for i in {1..6}; do perf stat -r 100 java -XX:+NewArchiveHeapLoading -XX:+NahlRawAlloc --version > /dev/null; done) 2>&1 | grep elapsed
> 0.0160640 +- 0.0000973 seconds time elapsed ( +- 0.61% )
> 0.0159320 +- 0.0000221 seconds time elapsed ( +- 0.14% )
> 0.0159910 +- 0.0000272 seconds time elapsed ( +- 0.17% )
> 0.0159406 +- 0.0000230 seconds time elapsed ( +- 0.14% )
> 0.0159930 +- 0.0000252 seconds time elapsed ( +- 0.16% )
> 0.0159670 +- 0.0000296 seconds time elapsed ( +- 0.19% )
> $ (for i in {1..6}; do perf stat -r 100 java -XX:-NewArchiveHeapLoading -XX:+NahlRawAlloc --version > /dev/null; done) 2>&1 | grep elapsed
> 0.0149069 +- 0.0000932 seconds time elapsed ( +- 0.63% )
> 0.0148363 +- 0.0000259 seconds time elapsed ( +- 0.17% )
> 0.0148077 +- 0.0000218 seconds time elapsed ( +- 0.15% )
> 0.0148377 +- 0.0000212 seconds time elapsed ( +- 0.14% )
> 0.0148411 +- 0.0000245 seconds time elapsed ( +- 0.17% )
> 0.0148504 +- 0.0000258 seconds time elapsed ( +- 0.17% )
> > @iklam I agree this is a much better approach and makes the whole process truly collector agnostic. Great work, especially the optimization to re-order the objects.
> > Given that this has minimal impact on performance, are we good to go ahead with this approach now?
> > One issue I noticed while doing some testing with the Shenandoah collector is probably worth pointing out here: When using `-XX:+NahlRawAlloc` with a very small heap size like -Xmx4m or -Xmx8m, the java process freezes. This happens because the allocations for archive objects cause pacing to kick in, and the main thread waits on `ShenandoahPacer::_wait_monitor` [0] to be notified by `ShenandoahPeriodicPacerNotify` [1]. But the WatcherThread, which is responsible for executing the `ShenandoahPeriodicPacerNotify` task, does not run periodic tasks until VM init is done [2][3]. So the main thread is stuck.
> > I guess if we do the wait in `ShenandoahPacer::pace_for_alloc` only after VM init is completed, it would resolve this issue.
>
> Yes, Shenandoah pacing should not run before VM init is complete. This seems like a separate issue.
>
> Great work in this PR, though!
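Regarding the pacing hang quoted above, here is a minimal sketch of the kind of guard being suggested for `ShenandoahPacer::pace_for_alloc` (an illustration, not the actual fix; it assumes `is_init_completed()` from `runtime/init.hpp` is usable in shenandoahPacer.cpp and elides the existing pacing logic):

```c++
#include "runtime/init.hpp"   // is_init_completed()

void ShenandoahPacer::pace_for_alloc(size_t words) {
  // Before VM init is complete, the WatcherThread does not run periodic
  // tasks, so ShenandoahPeriodicPacerNotify would never notify
  // _wait_monitor and the initial thread could block here indefinitely.
  // Skip pacing until init is done.
  if (!is_init_completed()) {
    return;
  }

  // ... existing logic: try to claim pacing budget, otherwise wait on
  // _wait_monitor for the periodic notifier ...
}
```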
I have done more testing with my prototype for JDK-8310823 (Materialize CDS archived objects with regular GC allocation) [1] [2] and found a few issues:
- The minimum heap requirement has increased significantly. We allocate many objects during VM start-up. This may fill up the young generation for some collectors, causing the VM to exit because a GC is not yet possible at this stage. Since the generation sizing policy is different in each GC (and not really controllable via `-XX:NewSize`), the failure mode with a small heap size becomes unpredictable. I think this is a functional regression, which is more serious than the performance regression in start-up time.
- Although it's possible to enable archive heap objects for ZGC with my JDK-8310823 patch, there's only a very marginal improvement in start-up time (probably because ZGC is doing many other tasks concurrently during start-up).
As we expect more heap objects to be archived in Project Leyden, we need a solution that scales well without blowing up the minimum heap size. For example, with the mainline, an 18MB archived heap can be mapped with -Xmx20M, but with my patch, I needed to use -Xmx40M.
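For concreteness, that comparison corresponds roughly to command lines like the following (an approximation; `-XX:+NewArchiveHeapLoading` is the flag from the perf runs above, and the exact failure behavior at -Xmx20M depends on the collector's generation sizing, as noted earlier):

```
# Mainline: the ~18MB archived heap is mapped directly, so a small heap is enough.
java -Xmx20M --version

# With the JDK-8310823 patch, the archived objects are materialized with
# regular GC allocations and need more headroom:
java -XX:+NewArchiveHeapLoading -Xmx20M --version   # may fail during start-up
java -XX:+NewArchiveHeapLoading -Xmx40M --version   # works
```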
To reduce the minimum heap requirement, @fisk has developed a prototype that materializes the archived objects incrementally. However, it's not clear whether future Leyden developments would support such incremental loading.
As we don't see any immediate benefits from JDK-8310823, I would suggest putting it on hold for now, until we get a better understanding of the requirements from Leyden.
------
[1] https://github.com/openjdk/jdk/pull/15730
[2] https://bugs.openjdk.org/browse/JDK-8310823
-------------
PR Comment: https://git.openjdk.org/jdk/pull/14520#issuecomment-1718431204