RFR: 8310160: Make GC APIs for handling archive heap objects agnostic of GC policy [v2]

Mon Jul 10 16:15:14 UTC 2023

On Mon, 10 Jul 2023 05:35:53 GMT, Ioi Lam <iklam at openjdk.org> wrote:

>>> > I first ran java -Xshare:dump so all the subsequent java --version runs use the same heap size as dump time. As a result, my "before" runs had a heap relocation delta of zero, which should correspond to the best start-up time.
>>> 
>>> Okay, thanks for clarifying. I thought `java --version` runs were using the default archive.
>> 
>> I haven't done any optimizations yet, but I fixed a few problems in the slow-path code. 
>> 
>> https://github.com/openjdk/jdk/compare/master...iklam:jdk:8310823-materialize-cds-heap-with-regular-alloc?expand=1
>> 
>> 
>> # Before: no relocation
>> $ perf stat -r 40 java --version > /dev/null
>>           0.015872 +- 0.000238 seconds time elapsed  ( +-  1.50% )
>> 
>> # Before: force relocation (quick)
>> $ perf stat -r 40 java -Xmx4g --version > /dev/null
>>           0.016691 +- 0.000385 seconds time elapsed  ( +-  2.31% )
>> 
>> # Before: force relocation ("quick relocation not possible")
>> $ perf stat -r 40 java -Xmx2g --version > /dev/null
>>           0.017385 +- 0.000230 seconds time elapsed  ( +-  1.32% )
>> 
>> # After
>> $ perf stat -r 40 java -XX:+NewArchiveHeapLoading --version > /dev/null
>>           0.018780 +- 0.000225 seconds time elapsed  ( +-  1.20% )
>> 
>> 
>> So the slow path is just about 3ms slower than the fastest "before" case.
>> 
>> Looking at the detailed timing break down (`os::thread_cpu_time()` = ns):
>> 
>> 
>> $ java -XX:+NewArchiveHeapLoading -Xlog:cds+gc --version
>> [0.006s][info][cds,gc] Num objs                    :                24184
>> [0.006s][info][cds,gc] Num bytes                   :              1074640
>> [0.006s][info][cds,gc] Per obj bytes               :                   44
>> [0.006s][info][cds,gc] Num references (incl nulls) :                87109
>> [0.006s][info][cds,gc] Num references relocated    :                43225
>> [0.006s][info][cds,gc] Allocation Time             :              1605084 <<<< A
>> [0.006s][info][cds,gc] Relocation Time             :              1246894
>> [0.006s][info][cds,gc] Table(s) dispose Time       :                 1306
>> 
>> $ java -XX:+NewArchiveHeapLoading -XX:NewArchiveHeapNumAllocs=2 -Xlog:cds+gc --version
>> [0.006s][info][cds,gc] Allocation Time             :              2203781 <<<< B
>> 
>> $ java -XX:+NewArchiveHeapLoading -XX:NewArchiveHeapNumAllocs=-1 -Xlog:cds+gc --version
>> [0.003s][info][cds,gc] Allocation Time             :               282125 <<<< C
>> 
>> $ java -XX:+NewArchiveHeapLoading -XX:NewArchiveHeapNumAllocs=0 -Xlog:cds+gc --version
>> [0.004s][inf...
>
>> I hope to implement a fast path for relocation that avoids using the hash tables at all. If we can get the total alloc + reloc time to be about 1.5ms, then it would be just as fast as before when relocation is enabled.
> 
> I've implemented a fast relocation lookup. It currently uses a table of the same size as the archived heap objects, but I can reduce that to 1/2 the size.
> 
> See https://github.com/openjdk/jdk/compare/master...iklam:jdk:8310823-materialize-cds-heap-with-regular-alloc?expand=1
> 
> This is implemented by about 330 lines of code in archiveHeapLoader.cpp. The code is templatized to try out different approaches (like `-XX:+NahlRawAlloc` and `-XX:+NahlUseAccessAPI`), so it can be further simplified.
> 
> There's only one thing that's not yet implemented -- the equivalent of `ArchiveHeapLoader::patch_native_pointers()`. I'll do that next.
> 
> 
> $ java  -XX:+NewArchiveHeapLoading -Xmx128m -Xlog:cds+gc --version
> [0.004s][info][cds,gc] Delayed allocation records alloced: 640
> [0.004s][info][cds,gc] Load Time: 1388458
> 
> 
> The whole allocation + reloc takes about 1.4ms. It's about 1.25ms slower in the worst case (when the "old" code doesn't have to relocate -- see the `(**)` in the table below). It's 0.8ms slower when the "old" code has to relocate.
> 
> 
> All times are in ms, for "java --version"
> 
> ====================================
> Dump: java -Xshare:dump -Xmx128m
> 
> G1         old        new       diff
>  128m   14.476     15.754     +1.277 (**)
> 8192m   15.359     16.085     +0.726
> 
> 
> Serial     old        new       diff
>  128m   13.442     14.241     +0.798
> 8192m   13.740     14.532     +0.791
> 
> ====================================
> Dump: java -Xshare:dump -Xmx8192m
> 
> G1         old        new       diff
>  128m   14.975     15.787     +0.812
> 2048m   16.239     17.035     +0.796
> 8192m   14.821     16.042     +1.221 (**)
> 
> 
> Serial     old        new       diff
>  128m   13.444     14.167     +0.723
> 8192m   13.717     14.502     +0.785
> 
> 
> While the code is slower than before, it's a lot simpler. It works on all collectors. I tested on ZGC, but I think Shenandoah should work as well.
> 
> The cost is about 1.3 ms per MB of archived heap objects. This may be acceptable as it's a small fraction of JVM bootstrap. We have about 1MB of archived objects now, and we don't expect this size to drastically increase in the near future.
> 
> The extra memory cost is:
> 
> - a temporary in-memory copy of the archived heap objects
> - a temporary table of 1/2 the size of the archived heap objects
> 
> The former can be reduced by readi...

> @iklam can you please elaborate a bit on relocation optimizations being done by the patch. Without any background on the idea, it is difficult to infer it from the code.

The algorithm tries to materialize all objects and relocate their oop fields in a single pass. Each object has a "stream address" (its location in the input stream) and a "materialized address" (its location in the runtime heap).

- Materialize one object from the input stream
- Enter the materialized address of this object in the `reloc_table`. Since the input stream is contiguous, we can index `reloc_table` by the offset of this object's stream address from the bottom of the input stream.
- For each non-null oop pointer in the materialized object:
    - If the pointee's stream address is lower than that of the current object, update the pointer with the pointee's materialized address, which is already in `reloc_table`.
    - Otherwise, enter the location of this pointer into the pointee's `reloc_table` entry, as a node in a linked list of the `Dst` type. When the pointee is materialized, it walks its own `Dst` list and relocates all pointers to itself.
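
The steps above can be sketched roughly as follows. This is a toy model, not the code in my branch: `Obj`, `RelocEntry`, `Loaded`, `kNull` and `materialize()` are hypothetical names, the "stream" is a flat array of words where each object is a field count followed by that many fields, and each field holds the pointee's stream offset (or `kNull`). Only `reloc_table` and `Dst` correspond to names used above. A self-pointer is treated like a backward pointer here, since the object is already in the table by the time its fields are scanned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <memory>
#include <vector>

// Toy model (assumption): an object is just an array of oop-like fields.
struct Obj { std::vector<Obj*> fields; };

// A delayed relocation: a field that must be patched once its pointee exists.
struct Dst { Obj** slot; Dst* next; };

// One reloc_table entry per stream word, indexed by offset from stream bottom.
struct RelocEntry { Obj* materialized = nullptr; Dst* pending = nullptr; };

struct Loaded {
  std::vector<std::unique_ptr<Obj>> heap;  // stands in for the runtime heap
  std::size_t delayed = 0;                 // number of Dst records created
};

constexpr std::uint64_t kNull = ~std::uint64_t{0};  // encodes a null field

// Stream layout (assumption): [nfields][field 0]...[field n-1], repeated.
Loaded materialize(const std::vector<std::uint64_t>& stream) {
  Loaded out;
  std::vector<RelocEntry> reloc_table(stream.size());
  std::deque<Dst> dst_pool;  // deque keeps element addresses stable on push_back

  std::size_t pos = 0;
  while (pos < stream.size()) {
    const std::uint64_t nfields = stream[pos];
    assert(nfields <= stream.size() - pos - 1);

    // 1. Materialize one object from the input stream.
    out.heap.push_back(std::make_unique<Obj>());
    Obj* m = out.heap.back().get();
    m->fields.resize(static_cast<std::size_t>(nfields));  // all null initially

    // 2. Enter its materialized address in reloc_table, first patching every
    //    earlier field that was waiting for this object.
    RelocEntry& self = reloc_table[pos];
    for (Dst* d = self.pending; d != nullptr; d = d->next) *d->slot = m;
    self.pending = nullptr;
    self.materialized = m;

    // 3. Relocate each non-null field of the new object.
    for (std::size_t i = 0; i < m->fields.size(); i++) {
      const std::uint64_t target = stream[pos + 1 + i];
      if (target == kNull) continue;
      assert(target < stream.size());
      if (target <= pos) {
        // Pointee already materialized: a single table lookup.
        m->fields[i] = reloc_table[target].materialized;
      } else {
        // Pointee not seen yet: link this field into the pointee's Dst list.
        dst_pool.push_back(Dst{&m->fields[i], reloc_table[target].pending});
        reloc_table[target].pending = &dst_pool.back();
        out.delayed++;
      }
    }
    pos += 1 + static_cast<std::size_t>(nfields);
  }
  return out;
}
```

For example, with the stream `{2, 5, kNull, 1, 0, 2, 5, 3}` (three objects at offsets 0, 3 and 5), only the first object's pointer to offset 5 goes through a `Dst`; the other three non-null pointers are resolved by direct lookup. The fewer forward pointers the stream contains, the closer the whole pass gets to a plain copy plus one table read per pointer.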

My branch contains a separate patch for [JDK-8251330](https://bugs.openjdk.org/browse/JDK-8251330) -- the input stream is ordered such that:
- the first 50% of the input stream contains no pointers, so relocation can be skipped altogether
- in the remaining input stream, about 90% of the 43225 pointers point below the current object (to a lower stream address), so they can be relocated quickly. Fewer than 640 `Dst` entries are needed for the "delayed relocation".

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14520#issuecomment-1629268586