RFR: 8310160: Make GC APIs for handling archive heap objects agnostic of GC policy [v2]

Ioi Lam iklam at openjdk.org
Wed Jul 26 04:31:53 UTC 2023


On Wed, 26 Jul 2023 02:53:25 GMT, Ashutosh Mehra <duke at openjdk.org> wrote:

>>> @iklam can you please elaborate a bit on the relocation optimizations being done by the patch? Without any background on the idea, it is difficult to infer it from the code.
>> 
>> The algorithm materializes all objects and relocates their oop fields in a single pass. Each object has a "stream address" (its location in the input stream) and a "materialized address" (its location in the runtime heap).
>> 
>> - Materialize one object from the input stream
>> - Enter the materialized address of this object into `reloc_table`. Since the input stream is contiguous, we can index `reloc_table` by the offset of this object's stream address from the bottom of the input stream.
>> - For each non-null oop pointer in the materialized object:
>>     -  If the pointee's stream address is lower than that of the current object, update the pointer with the pointee's materialized address, which is already in `reloc_table`
>>     - Otherwise, enter the location of this pointer into `reloc_table`, as a linked list of the `Dst` type, at the `reloc_table` entry of the pointee. When the pointee is materialized, it walks its own `Dst` list and relocates all pointers to itself.
>> 
>> My branch contains a separate patch for [JDK-8251330](https://bugs.openjdk.org/browse/JDK-8251330) -- the input stream is ordered such that:
>> - the first 50% of the input stream contains no pointers, so relocation can be skipped altogether
>> - in the remaining input stream, about 90% of the 43225 pointers point below the current object, so they can be relocated quickly. Fewer than 640 `Dst` entries are needed for the "delayed relocation".
>
> @iklam I agree this is a much better approach and makes the whole process truly collector agnostic. Great work, especially the optimization to re-order the objects.
> 
> Given that this has minimal impact on performance, are we good to go ahead with this approach now?
> 
> One issue I noticed while doing some testing with Shenandoah collector is probably worth pointing out here:
> When using `-XX:+NahlRawAlloc` with a very small heap size like `-Xmx4m` or `-Xmx8m`, the java process freezes. This happens because the allocations for archive objects cause pacing to kick in, and the main thread waits on `ShenandoahPacer::_wait_monitor` [0] to be notified by `ShenandoahPeriodicPacerNotify` [1]. But the WatcherThread, which is responsible for executing the `ShenandoahPeriodicPacerNotify` task, does not run periodic tasks until VM init is done [2][3]. So the main thread is stuck.
> 
> I guess if we do the wait in `ShenandoahPacer::pace_for_alloc` only after VM init is completed, it would resolve this issue.
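> 
> Something like this sketch (just a guess at the fix; `is_init_completed()` is the existing check declared in runtime/java.hpp, and the exact placement inside `pace_for_alloc` is only illustrative):
> 
> void ShenandoahPacer::pace_for_alloc(size_t words) {
>   // Before VM init completes, the WatcherThread does not run its
>   // periodic tasks, so nothing would ever notify _wait_monitor and
>   // the waiter would block forever; skip pacing until init is done.
>   if (!is_init_completed()) {
>     return;
>   }
>   // ... existing pacing logic ...
> }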
> 
> I haven't noticed this with `-XX:-NahlRawAlloc`; I'm not sure why that should make any difference.
> 
> Here are the stack traces:
> 
> main thread:
> 
> #5  0x00007f5a1fafbafc in PlatformMonitor::wait (this=this@entry=0x7f5a180f6c78, millis=<optimized out>, millis@entry=10) at src/hotspot/os/posix/mutex_posix.hpp:124
> #6  0x00007f5a1faa3f9c in Monitor::wait (this=0x7f5a180f6c70, timeout=timeout@entry=10) at src/hotspot/share/runtime/mutex.cpp:254
> #7  0x00007f5a1fc2d3bd in ShenandoahPacer::wait (time_ms=10, this=0x7f5a180f6a20) at src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp:286
> #8  ShenandoahPacer::pace_for_alloc (this=0x7f5a180f6a20, words=<optimized out>) at src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp:263
> #9  0x00007f5a1fbfc7e1 in ShenandoahHeap::allocate_memory (this=0x7f5a180ca590, req=...) at src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp:855
> #10 0x00007f5a1fbfcb5c in ShenandoahHeap::mem_allocate (this=<optimized out>, size=<optimized out>, gc_overhead_limit_was_exceeded=<optimized out>) at src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp:931
> #11 0x00007f5a1f2402c2 in NewQuickLoader::mem_allocate_raw (size=6) at src/hotspot/share/cds/archiveHeapLoader.cpp:493
> #12 NewQuickLoaderImpl<true, true>::allocate (this=<optimized out>, __the_thread__=<optimized out>, size=<synthetic pointer>: 6, stream=0x7f5a1d228850) at src/hotspot/share/cds/archiveHeapLoader.cpp:712
> #13 NewQuickLoaderImpl<true, true>::load_archive_heap_inner<false, false, 3> (__the_thread__=0x7f5a180b8810, str...

Hi @ashu-mehra, thanks for testing the patch. I think we all agree that the minor performance impact is acceptable because the code is simpler and more portable. I'll try to clean up my patch and start a PR.
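For readers following along, the scheme quoted above boils down to something like this toy C++ sketch (`Dst` and `reloc_table` are from the description above; the function names, plain pointers instead of oops, and the offset-as-index scheme are only illustrative, not the actual code in my branch):

#include <cstddef>
#include <vector>

struct Dst {                       // one delayed pointer to patch
  void** location;                 // field in a materialized object
  Dst*   next;                     // next pointer waiting on the same pointee
};

struct RelocEntry {                // indexed by (stream address - stream bottom)
  void* materialized = nullptr;    // runtime address, once materialized
  Dst*  delayed = nullptr;         // pointers seen before materialization
};

// The driver (not shown) walks the stream in order; for each object it
// calls enter_object() right after materializing it, then relocate_field()
// for each non-null pointer field of that object.

void enter_object(std::vector<RelocEntry>& reloc_table,
                  size_t cur_off, void* materialized) {
  RelocEntry& e = reloc_table[cur_off];
  e.materialized = materialized;
  while (e.delayed != nullptr) {   // resolve pointers queued by earlier objects
    Dst* d = e.delayed;
    *d->location = materialized;
    e.delayed = d->next;
    delete d;
  }
}

void relocate_field(std::vector<RelocEntry>& reloc_table,
                    size_t cur_off, size_t pointee_off, void** field) {
  if (pointee_off <= cur_off) {    // pointee already materialized
    *field = reloc_table[pointee_off].materialized;
  } else {                         // pointee comes later: queue for patching
    RelocEntry& p = reloc_table[pointee_off];
    p.delayed = new Dst{field, p.delayed};
  }
}

With the JDK-8251330 ordering mentioned above, the vast majority of fields take the first branch, which is why so few `Dst` nodes are ever allocated.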

BTW, I have implemented a simpler relocation algorithm with similar performance. It uses less memory and hopefully will be easier to understand as well. The algorithm is described in comments inside archiveHeapLoader.cpp.

https://github.com/openjdk/jdk/compare/master...iklam:jdk:8310823-materialize-cds-heap-with-regular-alloc.alternative-relocation?expand=1

As a prerequisite, I'll start a PR for [JDK-8251330: Reorder CDS archived heap to speed up relocation](https://bugs.openjdk.org/browse/JDK-8251330).

Regarding raw allocation, it doesn't seem to be much faster (only about 1% in the runs below), so maybe we should disable it, at least for the initial integration.


$ (for i in {1..6}; do perf stat -r 100 java -XX:+NewArchiveHeapLoading -XX:-NahlRawAlloc --version > /dev/null; done) 2>&1 | grep elapsed
         0.0162332 +- 0.0000914 seconds time elapsed  ( +-  0.56% )
         0.0161316 +- 0.0000228 seconds time elapsed  ( +-  0.14% )
         0.0161171 +- 0.0000250 seconds time elapsed  ( +-  0.16% )
         0.0161311 +- 0.0000231 seconds time elapsed  ( +-  0.14% )
         0.0161433 +- 0.0000244 seconds time elapsed  ( +-  0.15% )
         0.0161121 +- 0.0000271 seconds time elapsed  ( +-  0.17% )
$ (for i in {1..6}; do perf stat -r 100 java -XX:+NewArchiveHeapLoading -XX:+NahlRawAlloc --version > /dev/null; done) 2>&1 | grep elapsed
         0.0160640 +- 0.0000973 seconds time elapsed  ( +-  0.61% )
         0.0159320 +- 0.0000221 seconds time elapsed  ( +-  0.14% )
         0.0159910 +- 0.0000272 seconds time elapsed  ( +-  0.17% )
         0.0159406 +- 0.0000230 seconds time elapsed  ( +-  0.14% )
         0.0159930 +- 0.0000252 seconds time elapsed  ( +-  0.16% )
         0.0159670 +- 0.0000296 seconds time elapsed  ( +-  0.19% )
$ (for i in {1..6}; do perf stat -r 100 java -XX:-NewArchiveHeapLoading -XX:+NahlRawAlloc --version > /dev/null; done) 2>&1 | grep elapsed
         0.0149069 +- 0.0000932 seconds time elapsed  ( +-  0.63% )
         0.0148363 +- 0.0000259 seconds time elapsed  ( +-  0.17% )
         0.0148077 +- 0.0000218 seconds time elapsed  ( +-  0.15% )
         0.0148377 +- 0.0000212 seconds time elapsed  ( +-  0.14% )
         0.0148411 +- 0.0000245 seconds time elapsed  ( +-  0.17% )
         0.0148504 +- 0.0000258 seconds time elapsed  ( +-  0.17% )

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14520#issuecomment-1650961800

