RFR: 8310160: Make GC APIs for handling archive heap objects agnostic of GC policy [v2]
Ashutosh Mehra
duke at openjdk.org
Wed Jul 26 02:56:50 UTC 2023
On Mon, 10 Jul 2023 16:12:13 GMT, Ioi Lam <iklam at openjdk.org> wrote:
>>> I hope to implement a fast path for relocation that avoids using the hash tables at all. If we can get the total alloc + reloc time to be about 1.5ms, then it would be just as fast as before when relocation is enabled.
>>
>> I've implemented a fast relocation lookup. It currently uses a table of the same size as the archived heap objects, but I can reduce that to 1/2 the size.
>>
>> See https://github.com/openjdk/jdk/compare/master...iklam:jdk:8310823-materialize-cds-heap-with-regular-alloc?expand=1
>>
>> This is implemented in about 330 lines of code in archiveHeapLoader.cpp. The code is templatized to try out different approaches (like `-XX:+NahlRawAlloc` and `-XX:+NahlUseAccessAPI`), so it can be further simplified.
>>
>> There's only one thing that's not yet implemented -- the equivalent of `ArchiveHeapLoader::patch_native_pointers()`. I'll do that next.
>>
>>
>> $ java -XX:+NewArchiveHeapLoading -Xmx128m -Xlog:cds+gc --version
>> [0.004s][info][cds,gc] Delayed allocation records alloced: 640
>> [0.004s][info][cds,gc] Load Time: 1388458
>>
>>
>> The whole allocation + reloc takes about 1.4ms. It's about 1.25ms slower in the worst case (when the "old" code doesn't have to relocate -- see the `(**)` in the table below). It's 0.8ms slower when the "old" code has to relocate.
>>
>>
>> All times are in ms, for "java --version"
>>
>> ====================================
>> Dump: java -Xshare:dump -Xmx128m
>>
>> G1 old new diff
>> 128m 14.476 15.754 +1.277 (**)
>> 8192m 15.359 16.085 +0.726
>>
>>
>> Serial old new diff
>> 128m 13.442 14.241 +0.798
>> 8192m 13.740 14.532 +0.791
>>
>> ====================================
>> Dump: java -Xshare:dump -Xmx8192m
>>
>> G1 old new diff
>> 128m 14.975 15.787 +0.812
>> 2048m 16.239 17.035 +0.796
>> 8192m 14.821 16.042 +1.221 (**)
>>
>>
>> Serial old new diff
>> 128m 13.444 14.167 +0.723
>> 8192m 13.717 14.502 +0.785
>>
>>
>> While the code is slower than before, it's a lot simpler, and it works on all collectors. I tested with ZGC, and I expect Shenandoah to work as well.
>>
>> The cost is about 1.3 ms per MB of archived heap objects. This may be acceptable as it's a small fraction of JVM bootstrap. We have about 1MB of archived objects now, and we don't expect this size to drastically increase in the near future.
>>
>> The extra memory cost is:
>>
>> - a temporary in-memory copy of the archived heap o...
>
>> @iklam can you please elaborate a bit on the relocation optimizations done by the patch? Without any background on the idea, it is difficult to infer them from the code.
>
> The algorithm tries to materialize all objects and relocate their oop fields in a single pass. Each object has a "stream address" (its location in the input stream) and a "materialized address" (its location in the runtime heap).
>
> - Materialize one object from the input stream
> - Enter the materialized address of this object in the `reloc_table`. Since the input stream is contiguous, we can index `reloc_table` by computing the offset of this object's stream address from the bottom of the input stream.
> - For each non-null oop pointer in the materialized object:
>   - If the pointee's stream address is lower than that of the current object, update the pointer with the pointee's materialized address, which is already in `reloc_table`.
>   - Otherwise, enter the location of this pointer into a linked list of `Dst` nodes at the pointee's `reloc_table` entry. When the pointee is materialized, it walks its own `Dst` list and relocates all pointers to itself (see the sketch below).
>
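> To make this concrete, here is a minimal standalone sketch of the single pass (only `reloc_table` and `Dst` are named above; the object-walking helpers below are hypothetical stand-ins, not the actual code in archiveHeapLoader.cpp):
>
> #include <cstddef>
> #include <vector>
>
> struct Dst {                   // a pointer waiting for its pointee to materialize
>   void** location;             // address of the field to patch
>   Dst*   next;                 // linked through the pointee's reloc_table entry
> };
>
> struct RelocEntry {
>   void* materialized = nullptr;  // runtime address, set once the object is copied
>   Dst*  waiters = nullptr;       // pointers to this object seen before it
> };
>
> // Hypothetical helpers standing in for the real object-walking code:
> const char* next_object(const char* obj);        // start of the following object
> void*       allocate_and_copy(const char* obj);  // heap-allocate and memcpy one object
> std::vector<void**> oop_fields_of(void* copy);   // locations of the non-null oop fields
>
> void load_archive_heap(const char* stream_bottom, const char* stream_top) {
>   // The input stream is contiguous, so a stream address maps to a table
>   // index by its word offset from the bottom of the stream.
>   std::vector<RelocEntry> reloc_table((stream_top - stream_bottom) / sizeof(void*));
>   auto entry = [&](const void* stream_addr) -> RelocEntry& {
>     return reloc_table[((const char*)stream_addr - stream_bottom) / sizeof(void*)];
>   };
>
>   for (const char* obj = stream_bottom; obj < stream_top; obj = next_object(obj)) {
>     void* copy = allocate_and_copy(obj);         // materialize
>     RelocEntry& e = entry(obj);
>     e.materialized = copy;
>     for (Dst* d = e.waiters; d != nullptr; d = d->next) {
>       *d->location = copy;                       // delayed relocation of earlier pointers
>     }
>     for (void** field : oop_fields_of(copy)) {   // fields still hold stream addresses
>       const void* pointee = *field;
>       if (pointee <= (const void*)obj) {         // already materialized (incl. self-references)
>         *field = entry(pointee).materialized;
>       } else {                                   // pointee comes later in the stream
>         RelocEntry& p = entry(pointee);
>         p.waiters = new Dst{field, p.waiters};   // deliberately leaked in this sketch
>       }
>     }
>   }
> }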
> My branch contains a separate patch for [JDK-8251330](https://bugs.openjdk.org/browse/JDK-8251330) -- the input stream is ordered such that:
> - the first 50% of the input stream contains no pointers, so relocation can be skipped altogether
> - in the remaining input stream, about 90% of the 43225 pointers point below the current object, so they can be relocated quickly. Fewer than 640 `Dst` entries are needed for the "delayed relocation".
@iklam I agree this is a much better approach and makes the whole process truly collector-agnostic. Great work, especially the optimization to re-order the objects.
Given that this has minimal impact on performance, are we good to go ahead with this approach now?
One issue I noticed while doing some testing with the Shenandoah collector is probably worth pointing out here:
When using `-XX:+NahlRawAlloc` with a very small heap size like `-Xmx4m` or `-Xmx8m`, the java process freezes. This happens because the allocations for archive objects cause pacing to kick in, and the main thread waits on `ShenandoahPacer::_wait_monitor` [0] to be notified by `ShenandoahPeriodicPacerNotify` [1]. But the WatcherThread, which is responsible for executing the `ShenandoahPeriodicPacerNotify` task, does not run periodic tasks until VM init is done [2][3]. So the main thread is stuck.
I guess if we did the wait in `ShenandoahPacer::pace_for_alloc` only after VM init has completed, it would resolve this issue (rough sketch below).
I haven't noticed this with `-XX:-NahlRawAlloc`; I'm not sure why that should make any difference.
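Something along these lines is what I had in mind (just a sketch, untested; the early-return placement inside `ShenandoahPacer::pace_for_alloc` is my assumption, and the rest of the function body is elided):

// Sketch: bail out of pacing until VM init has completed, because the
// WatcherThread does not run ShenandoahPeriodicPacerNotify before then,
// so nothing would ever notify _wait_monitor.
void ShenandoahPacer::pace_for_alloc(size_t words) {
  if (!is_init_completed()) {   // from runtime/init.hpp
    return;                     // no pacer notifications yet; waiting could hang
  }
  // ... existing pacing logic: try to claim budget, otherwise wait on _wait_monitor ...
}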
Here are the stack traces:
Main thread:
#5 0x00007f5a1fafbafc in PlatformMonitor::wait (this=this at entry=0x7f5a180f6c78, millis=<optimized out>, millis at entry=10) at src/hotspot/os/posix/mutex_posix.hpp:124
#6 0x00007f5a1faa3f9c in Monitor::wait (this=0x7f5a180f6c70, timeout=timeout at entry=10) at src/hotspot/share/runtime/mutex.cpp:254
#7 0x00007f5a1fc2d3bd in ShenandoahPacer::wait (time_ms=10, this=0x7f5a180f6a20) at src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp:286
#8 ShenandoahPacer::pace_for_alloc (this=0x7f5a180f6a20, words=<optimized out>) at src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp:263
#9 0x00007f5a1fbfc7e1 in ShenandoahHeap::allocate_memory (this=0x7f5a180ca590, req=...) at src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp:855
#10 0x00007f5a1fbfcb5c in ShenandoahHeap::mem_allocate (this=<optimized out>, size=<optimized out>, gc_overhead_limit_was_exceeded=<optimized out>) at src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp:931
#11 0x00007f5a1f2402c2 in NewQuickLoader::mem_allocate_raw (size=6) at src/hotspot/share/cds/archiveHeapLoader.cpp:493
#12 NewQuickLoaderImpl<true, true>::allocate (this=<optimized out>, __the_thread__=<optimized out>, size=<synthetic pointer>: 6, stream=0x7f5a1d228850) at src/hotspot/share/cds/archiveHeapLoader.cpp:712
#13 NewQuickLoaderImpl<true, true>::load_archive_heap_inner<false, false, 3> (__the_thread__=0x7f5a180b8810, stream_top=0x7f5a1d28e168, stream=0x7f5a1d228850, stream_bottom=0x7f5a1d204000, this=0x7f5a1edfe9c0)
at src/hotspot/share/cds/archiveHeapLoader.cpp:634
#14 NewQuickLoaderImpl<true, true>::load_archive_heap (this=this at entry=0x7f5a1edfe9c0, __the_thread__=__the_thread__ at entry=0x7f5a180b8810) at src/hotspot/share/cds/archiveHeapLoader.cpp:603
#15 0x00007f5a1f22da2a in ArchiveHeapLoader::new_fixup_region (__the_thread__=__the_thread__ at entry=0x7f5a180b8810) at src/hotspot/share/cds/archiveHeapLoader.cpp:806
#16 0x00007f5a1f22f00b in ArchiveHeapLoader::fixup_region () at src/hotspot/share/cds/archiveHeapLoader.cpp:90
#17 0x00007f5a1fde5bd5 in vmClasses::resolve_all (__the_thread__=__the_thread__ at entry=0x7f5a180b8810) at src/hotspot/share/classfile/vmClasses.cpp:145
#18 0x00007f5a1fd186a9 in SystemDictionary::initialize (__the_thread__=__the_thread__ at entry=0x7f5a180b8810) at src/hotspot/share/classfile/systemDictionary.cpp:1616
#19 0x00007f5a1fd8b960 in Universe::genesis (__the_thread__=__the_thread__ at entry=0x7f5a180b8810) at src/hotspot/share/memory/universe.cpp:356
#20 0x00007f5a1fd8d093 in universe2_init () at src/hotspot/share/memory/universe.cpp:977
#21 0x00007f5a1f6e10f9 in init_globals2 () at src/hotspot/share/runtime/init.cpp:150
#22 0x00007f5a1fd6c729 in Threads::create_vm (args=<optimized out>, canTryAgain=canTryAgain at entry=0x7f5a1edfedaf) at src/hotspot/share/runtime/threads.cpp:568
#23 0x00007f5a1f7af69e in JNI_CreateJavaVM_inner (args=<optimized out>, penv=0x7f5a1edfee58, vm=0x7f5a1edfee50) at src/hotspot/share/prims/jni.cpp:3577
#24 JNI_CreateJavaVM (vm=0x7f5a1edfee50, penv=0x7f5a1edfee58, args=<optimized out>) at src/hotspot/share/prims/jni.cpp:3668
#25 0x00007f5a206d81ef in InitializeJVM (ifn=<synthetic pointer>, penv=0x7f5a1edfee58, pvm=0x7f5a1edfee50) at src/java.base/share/native/libjli/java.c:1506
#26 JavaMain (_args=<optimized out>) at src/java.base/share/native/libjli/java.c:415
#27 0x00007f5a206dbe29 in ThreadJavaMain (args=<optimized out>) at src/java.base/unix/native/libjli/java_md.c:650
#28 0x00007f5a20582907 in start_thread (arg=<optimized out>) at pthread_create.c:444
#29 0x00007f5a20608870 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Watcher thread:
#4 ___pthread_cond_timedwait64 (cond=0x7f5a202f0e30 <mutex_init()::PeriodicTask_lock_storage+48>, mutex=<optimized out>, abstime=0x7f5a1d126d20) at pthread_cond_wait.c:643
#5 0x00007f5a1fafbafc in PlatformMonitor::wait (this=this at entry=0x7f5a202f0e08 <mutex_init()::PeriodicTask_lock_storage+8>, millis=<optimized out>, millis at entry=100) at src/hotspot/os/posix/mutex_posix.hpp:124
#6 0x00007f5a1faa3f19 in Monitor::wait_without_safepoint_check (this=this at entry=0x7f5a202f0e00 <mutex_init()::PeriodicTask_lock_storage>, timeout=timeout at entry=100) at src/hotspot/share/runtime/mutex.cpp:226
#7 0x00007f5a1fabcded in MonitorLocker::wait (this=<synthetic pointer>, timeout=100) at src/hotspot/share/runtime/mutexLocker.hpp:254
#8 WatcherThread::sleep (this=this at entry=0x7f5a1810a290) at src/hotspot/share/runtime/nonJavaThread.cpp:189
#9 0x00007f5a1fabce61 in WatcherThread::run (this=0x7f5a1810a290) at src/hotspot/share/runtime/nonJavaThread.cpp:249
#10 0x00007f5a1fd5f49f in Thread::call_run (this=0x7f5a1810a290) at src/hotspot/share/runtime/thread.cpp:217
#11 0x00007f5a1faf0a9a in thread_native_entry (thread=0x7f5a1810a290) at src/hotspot/os/linux/os_linux.cpp:779
#12 0x00007f5a20582907 in start_thread (arg=<optimized out>) at pthread_create.c:444
#13 0x00007f5a20608870 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
[0] https://github.com/openjdk/jdk/blob/2d05d3545c8fe4d9e5ad3cee673fc938f84d1901/src/hotspot/share/gc/shenandoah/shenandoahPacer.cpp#L263
[1] https://github.com/openjdk/jdk/blob/2d05d3545c8fe4d9e5ad3cee673fc938f84d1901/src/hotspot/share/gc/shenandoah/shenandoahControlThread.cpp#L75-L78
[2] https://github.com/openjdk/jdk/blob/2d05d3545c8fe4d9e5ad3cee673fc938f84d1901/src/hotspot/share/runtime/threads.cpp#L804-L808
[3] https://github.com/openjdk/jdk/blob/2d05d3545c8fe4d9e5ad3cee673fc938f84d1901/src/hotspot/share/runtime/nonJavaThread.cpp#L288
-------------
PR Comment: https://git.openjdk.org/jdk/pull/14520#issuecomment-1650896237