RFR: 8310160: Make GC APIs for handling archive heap objects agnostic of GC policy [v2]

Tue Jul 25 07:27:46 UTC 2023

On Mon, 10 Jul 2023 16:12:13 GMT, Ioi Lam <iklam at openjdk.org> wrote:

>>> I hope to implement a fast path for relocation that avoids using the hash tables at all. If we can get the total alloc + reloc time to be about 1.5ms, then it would be just as fast as before when relocation is enabled.
>> 
>> I've implemented a fast relocation lookup. It currently uses a table of the same size as the archived heap objects, but I can reduce that to 1/2 the size.
>> 
>> See https://github.com/openjdk/jdk/compare/master...iklam:jdk:8310823-materialize-cds-heap-with-regular-alloc?expand=1
>> 
>> This is implemented by about 330 lines of code in archiveHeapLoader.cpp. The code is templatized to try out different approaches (like `-XX:+NahlRawAlloc` and `-XX:+NahlUseAccessAPI`), so it can be further simplified.
>> 
>> There's only one thing that's not yet implemented -- the equivalent of `ArchiveHeapLoader::patch_native_pointers()`. I'll do that next.
>> 
>> 
>> $ java  -XX:+NewArchiveHeapLoading -Xmx128m -Xlog:cds+gc --version
>> [0.004s][info][cds,gc] Delayed allocation records alloced: 640
>> [0.004s][info][cds,gc] Load Time: 1388458
>> 
>> 
>> The whole allocation + reloc takes about 1.4ms. It's about 1.25ms slower in the worst case (when the "old" code doesn't have to relocate -- see the `(**)` in the table below). It's 0.8ms slower when the "old" code has to relocate.
>> 
>> 
>> All times are in ms, for "java --version"
>> 
>> ====================================
>> Dump: java -Xshare:dump -Xmx128m
>> 
>> G1         old        new       diff
>>  128m   14.476     15.754     +1.277 (**)
>> 8192m   15.359     16.085     +0.726
>> 
>> 
>> Serial     old        new       diff
>>  128m   13.442     14.241     +0.798
>> 8192m   13.740     14.532     +0.791
>> 
>> ====================================
>> Dump: java -Xshare:dump -Xmx8192m
>> 
>> G1         old        new       diff
>>  128m   14.975     15.787     +0.812
>> 2048m   16.239     17.035     +0.796
>> 8192m   14.821     16.042     +1.221 (**)
>> 
>> 
>> Serial     old        new       diff
>>  128m   13.444     14.167     +0.723
>> 8192m   13.717     14.502     +0.785
>> 
>> 
>> While the code is slower than before, it's a lot simpler. It works on all collectors. I tested on ZGC, but I think Shenandoah should work as well.
>> 
>> The cost is about 1.3 ms per MB of archived heap objects. This may be acceptable as it's a small fraction of JVM bootstrap. We have about 1MB of archived objects now, and we don't expect this size to drastically increase in the near future.
>> 
>> The extra memory cost is:
>> 
>> - a temporary in-memory copy of the archived heap o...
>
>> @iklam can you please elaborate a bit on relocation optimizations being done by the patch. Without any background on the idea, it is difficult to infer it from the code.
> 
> The algorithm tries to materialize all objects and relocate their oop fields in a single pass. (Each object has a "stream address" (its location in the input stream) and a "materialized address" (its location in the runtime heap))
> 
> - Materialize one object from the input stream
> - Enter the materialized address of this object in the `reloc_table`. Since the input stream is contiguous, we can index `reloc_table` by computing the offset of the `stream` address of this object to the bottom of the input stream.
> - For each non-null oop pointer in the materialized object:
>     -  If the pointee's stream address is lower than that of the current object, update the pointer with the pointee's materialized address, which is already in `reloc_table`
>     - Otherwise, enter the location of this pointer into `reloc_table`, as a linked list of the `Dst` type, at the `reloc_table` entry of the pointee. When the pointee is materialized, it walks its own `Dst` list and relocates all pointers to itself.
> 
> My branch contains a separate patch for [JDK-8251330](https://bugs.openjdk.org/browse/JDK-8251330) -- the input stream is ordered such that:
> - the first 50% of the input stream contains no pointers, so relocation can be skipped altogether
> - in the remaining input stream, about 90% of the 43225 pointers are pointing below the current object, so they can be relocated quickly. Less than 640 `Dst` are needed for the "delayed relocation".
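
For readers following along, the single-pass scheme quoted above can be sketched roughly as below. This is only an illustration under stated assumptions, not the code in the branch: `Obj`, `StreamObj`, `RelocEntry`, `Result` and `materialize` are hypothetical names, objects are identified by their index in the stream rather than by a byte offset from the bottom of the stream, and the "heap" is an ordinary `std::vector`. Only `reloc_table` and `Dst` are names taken from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-ins; none of these types come from archiveHeapLoader.cpp.
struct Obj {                      // a materialized object
    std::vector<Obj*> fields;     // oop fields, nullptr means null
};

struct StreamObj {                // an object as stored in the input stream
    std::vector<int> fields;      // pointee's stream index, -1 means null
};

struct Dst {                      // one pointer slot waiting for its pointee
    Obj** slot;
    int next;                     // index of the next Dst, -1 ends the list
};

struct RelocEntry {
    Obj* materialized = nullptr;  // set once the object has been materialized
    int pending = -1;             // head of the Dst list until then
};

struct Result {
    std::vector<std::unique_ptr<Obj>> heap;
    std::size_t delayed = 0;      // number of Dst records that were needed
};

Result materialize(const std::vector<StreamObj>& stream) {
    Result r;
    std::vector<RelocEntry> reloc_table(stream.size());
    std::vector<Dst> dsts;

    for (std::size_t i = 0; i < stream.size(); ++i) {
        // Materialize one object from the input stream.
        r.heap.push_back(std::make_unique<Obj>());
        Obj* obj = r.heap.back().get();
        obj->fields.assign(stream[i].fields.size(), nullptr);

        // Enter its materialized address, then patch every earlier pointer
        // that was waiting for this object.
        RelocEntry& self = reloc_table[i];
        self.materialized = obj;
        for (int d = self.pending; d != -1; d = dsts[d].next) {
            *dsts[d].slot = obj;
        }
        self.pending = -1;

        // Relocate each non-null field of the new object.
        for (std::size_t f = 0; f < obj->fields.size(); ++f) {
            int target = stream[i].fields[f];
            if (target < 0) continue;  // null pointer
            assert(static_cast<std::size_t>(target) < stream.size());
            RelocEntry& t = reloc_table[target];
            if (static_cast<std::size_t>(target) <= i) {
                // Pointee is already materialized (a self-reference counts).
                obj->fields[f] = t.materialized;
            } else {
                // Forward reference: queue this slot on the pointee's list.
                dsts.push_back({&obj->fields[f], t.pending});
                t.pending = static_cast<int>(dsts.size()) - 1;
            }
        }
    }
    r.delayed = dsts.size();
    return r;
}
```

In this sketch a backward pointer costs one table lookup, while a forward pointer costs one `Dst` record plus a later patch, which is why ordering the stream so that most pointers point backward keeps the number of delayed relocations small.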

Excellent work, @iklam! This work looks very promising to me. Seems like the price we have to pay for the generality is very small indeed, and well worth it. Thank you for doing this.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14520#issuecomment-1649269538