RFR: 8310160: Make GC APIs for handling archive heap objects agnostic of GC policy [v2]

Thomas Stuefe stuefe at openjdk.org
Thu Aug 10 08:17:03 UTC 2023


On Mon, 31 Jul 2023 20:29:46 GMT, Dan Heidinga <heidinga at openjdk.org> wrote:

>>> I hope to implement a fast path for relocation that avoids using the hash tables at all. If we can get the total alloc + reloc time to be about 1.5ms, then it would be just as fast as before when relocation is enabled.
>> 
>> I've implemented a fast relocation lookup. It currently uses a table of the same size as the archived heap objects, but I can reduce that to 1/2 the size.
>> 
>> See https://github.com/openjdk/jdk/compare/master...iklam:jdk:8310823-materialize-cds-heap-with-regular-alloc?expand=1
>> 
>> This is implemented by about 330 lines of code in archiveHeapLoader.cpp. The code is templatized to try out different approaches (like `-XX:+NahlRawAlloc` and `-XX:+NahlUseAccessAPI`), so it can be further simplified.
>> 
>> There's only one thing that's not yet implemented -- the equivalent of `ArchiveHeapLoader::patch_native_pointers()`. I'll do that next.
>> 
>> 
>> $ java  -XX:+NewArchiveHeapLoading -Xmx128m -Xlog:cds+gc --version
>> [0.004s][info][cds,gc] Delayed allocation records alloced: 640
>> [0.004s][info][cds,gc] Load Time: 1388458
>> 
>> 
>> The whole allocation + reloc takes about 1.4ms. It's about 1.25ms slower in the worst case (when the "old" code doesn't have to relocate -- see the `(**)` in the table below). It's 0.8ms slower when the "old" code has to relocate.
>> 
>> 
>> All times are in ms, for "java --version"
>> 
>> ====================================
>> Dump: java -Xshare:dump -Xmx128m
>> 
>> G1         old        new       diff
>>  128m   14.476     15.754     +1.277 (**)
>> 8192m   15.359     16.085     +0.726
>> 
>> 
>> Serial     old        new       diff
>>  128m   13.442     14.241     +0.798
>> 8192m   13.740     14.532     +0.791
>> 
>> ====================================
>> Dump: java -Xshare:dump -Xmx8192m
>> 
>> G1         old        new       diff
>>  128m   14.975     15.787     +0.812
>> 2048m   16.239     17.035     +0.796
>> 8192m   14.821     16.042     +1.221 (**)
>> 
>> 
>> Serial     old        new       diff
>>  128m   13.444     14.167     +0.723
>> 8192m   13.717     14.502     +0.785
>> 
>> 
>> While the code is slower than before, it's a lot simpler. It works on all collectors. I tested on ZGC, but I think Shenandoah should work as well.
>> 
>> The cost is about 1.3 ms per MB of archived heap objects. This may be acceptable as it's a small fraction of JVM bootstrap. We have about 1MB of archived objects now,  and we don't expect this size to drastically increase in the near future.
>> 
>> The extra memory cost is:
>> 
>> - a temporary in-memory copy of the archived heap o...
>
>> The cost is about 1.3 ms per MB of archived heap objects. This may be acceptable as it's a small fraction of JVM bootstrap. We have about 1MB of archived objects now, and we don't expect this size to drastically increase in the near future.
> 
> Looking ahead to Project Leyden, I wouldn't be surprised if the current 1MB of archived heap became much larger.  Demo apps on Graal Native Image are often ~4MB of image heap, and while it's not an apples-to-apples comparison, it suggests that somewhere between 5MB & 10MB isn't unreasonable for Leyden.
> 
> Using 10MB as a baseline for easy math,  1.3ms/MB * 10MB = 13 ms for the new code?  And (1.3ms-0.8ms) = 0.5ms/MB * 10MB = 5ms for the old code?  Assuming I've interpreted the numbers correctly, and importantly that they scale linearly, it seems worth preserving the mmap approach for collectors that can support it.
> 
> Does that seem reasonable?  And justify preserving the mmap approach?

I share @DanHeidinga's concern about scaling. Is it just me, or is 1.3ms/MB rather high?

Even without mmap, do we have to allocate memory for each object individually? Could we not allocate one large block of memory for all objects and copy them over linearly? Or allow the heap to over-provision, so that collectors can still observe internal limits (e.g. on humongous allocations), e.g. Universe::heap()->mem_allocate_maybe_more(minsize, maxsize, &result_size)?
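Very roughly, something like this (just a sketch; mem_allocate_maybe_more() is the imagined API from above, and the other helpers and sizes are placeholders):

    // Sketch: materialize the archived objects into a few large blocks instead
    // of allocating each object individually. mem_allocate_maybe_more() is
    // hypothetical: it returns between min_size and max_size words and reports
    // what was actually granted, so a collector can cap a block at an internal
    // limit (e.g. a humongous-object threshold).
    size_t remaining = total_archived_words;      // placeholder
    const HeapWord* src = archive_buffer;         // placeholder
    while (remaining > 0) {
      size_t granted = 0;
      size_t min_size = next_object_size(src);    // never split a single object
      HeapWord* block = Universe::heap()->mem_allocate_maybe_more(
          min_size, remaining, &granted);
      if (block == nullptr) {
        vm_exit_during_initialization("unable to materialize archived heap objects");
      }
      // Copy whole objects until the granted block is full, then fix up their
      // oop fields (and narrow Klass IDs, see below) in place.
      size_t copied = copy_whole_objects(block, src, granted);  // placeholder
      src += copied;
      remaining -= copied;
    }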

One nice side effect of Ioi's approach is that we could finally eliminate the requirement that "if you map archived heap objects, the narrow Klass encoding base has to be *exactly* the start of the shared metaspace region": when we copy the objects, we can recompute the narrow Klass IDs too, at little additional cost. That could allow us to use unscaled nKlass encoding at CDS runtime, which we could not do before.
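For illustration, the fix-up during the copy could be as simple as this (sketch only; the helper name is made up, set_klass() is the existing oopDesc API):

    // Sketch: after copying an archived object into its runtime location,
    // re-set its Klass. set_klass() re-encodes the narrow Klass ID with
    // whatever base/shift the runtime chose, so the encoding base no longer
    // has to be the start of the shared metaspace region -- which would also
    // permit an unscaled encoding.
    static void fix_narrow_klass(oop runtime_obj, Klass* k) {
      runtime_obj->set_klass(k);
    }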

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14520#issuecomment-1672747936

