RFR(S) 8245925 G1 allocates EDEN region after CDS has executed GC
Thomas Schatzl
thomas.schatzl at oracle.com
Mon Jun 8 14:12:37 UTC 2020
Hi,
On 06.06.20 07:51, Ioi Lam wrote:
>
>
> On 6/5/20 2:38 PM, Jiangli Zhou wrote:
>> Hi Ioi,
>>
>> Thanks for the updated webrev.
>>
>> To avoid the assert, could you call GCLocker::stall_until_clear in
>> MetaspaceShared::preload_and_dump() before grabbing the 'Heap_lock'
>> and VMThread::execute(&op), like the following. It would be best to
>> check with Thomas and other GC experts about the interaction between
>> 'Heap_lock' and 'GCLocker'.
>>
>>   if (HeapShared::is_heap_object_archiving_allowed()) {
>>     GCLocker::stall_until_clear();
>>   }
> Hi Jiangli,
>
> Thanks for the review.
>
> I don't think this is sufficient. Immediately after
> GCLocker::stall_until_clear() has returned, another thread can invoke a
> JNI function that will activate the GCLocker. That's stated by the
> gcLocker.hpp comments:
>
> http://hg.openjdk.java.net/jdk/jdk/file/882b61be2c19/src/hotspot/share/gc/shared/gcLocker.hpp#l109
The GCLocker::stall_until_clear()/GCLocker::is_active() loop seems to
try to achieve something different than what the CR tries to fix.
The change without that loop/GCLocker::stall_until_clear() already fully
achieves that there is no allocation between the compaction and the CDS
dump (at least for STW collectors, or collectors that can do a "full"
collection during an STW pause).
The additional code that waits for the GCLocker to become inactive, in
order to always get a contiguous range of memory, is a different matter:
Using GCLocker::stall_until_clear() does not work: it only waits for
the completion of all JNI critical sections if a gc has been requested
while these JNI critical sections were active. Otherwise it does
nothing and lets the caller continue, with some threads still in a
critical section and the GCLocker held, which means that the following
full gc may still "fail".
(That gc would set needs_gc, and then it would work. But since you do
not know whether the first full gc was successful, you would need
another one. And since at that point the return from the last JNI
critical section would trigger a young gc, you get two extra gcs in the
common case...)
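For reference, a slightly abridged paraphrase of what
GCLocker::stall_until_clear() does today (see gcLocker.cpp for the
real code):

    // Paraphrased from gcLocker.cpp, abridged:
    void GCLocker::stall_until_clear() {
      assert(!JavaThread::current()->in_critical(), "Would deadlock");
      MonitorLocker ml(JNICritical_lock);
      // Only waits while a gc has been requested (needs_gc() is set)
      // during an active critical section; otherwise it returns
      // immediately, even though threads may still be inside JNI
      // critical sections.
      while (needs_gc()) {
        ml.wait();
      }
    }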
Another serious problem is that the stall_until_clear() code is written
to be called only in a Java thread. I do not think it works in the VM
thread without a lot more thought (and at least changes to the
asserts). As Jiangli mentions, there is likely also an issue with the
lock ordering of Heap_lock and JNICritical_lock.
The GCLocker::is_active() loop is a bit better: it makes sure that all
threads have left their JNI critical section, and since we are in a
safepoint, these threads will block when exiting the VM call (and not
be able to go into another critical section).
So the full gc will likely be able to move all objects out of eden
regions.
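Concretely, my understanding of the discussed pattern is roughly the
following (a sketch from my side, not a copy of the patch;
os::naked_short_sleep() is just a stand-in for however the loop
yields):

    // Sketch of the discussed approach; we are in the CDS dump
    // safepoint, in the VM thread.
    while (GCLocker::is_active()) {
      // Threads leaving their JNI critical section now will block when
      // exiting the VM call (we are at a safepoint), so they cannot
      // enter another critical section.
      os::naked_short_sleep(1);
    }
    // By now eden should be evacuatable: do the full collection before
    // dumping.
    Universe::heap()->collect_as_vm_thread(GCCause::_archive_time_gc);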
However:
- this part of the code is bound to break with a change of G1 to pin
(eden) regions instead of using the GCLocker (I am aware that it is
very G1-specific already).
One option to decrease the direct reliance on the GCLocker a bit could
be to have CollectedHeap::collect_as_vm_thread return a bool indicating
whether the collection could be executed (or whether there is a pinned
eden region?). Since for all STW gcs the only reason to abort at that
point is the GCLocker, this boils down to the same loop, just without
using the GCLocker directly here (see the first sketch after this
list).
- another issue I see with this piece of code: if a thread
safepoints/blocks for another reason while being in a critical section,
we have a deadlock.
This is prohibited by the JNI spec ("In a code segment enclosed by
Get/Release*Critical the native code must not ... or cause the current
thread to block"), but it would be very hard to analyze, so I recommend
to at least error out (or, in this case, continue with a suboptimal
heap) after too many retries, or at least log a warning that you are
stalled because the GCLocker is held (see the second sketch after this
list).
Otherwise this error state will be too hard to understand, and likely
very, very hard to reproduce.
- this seems to be only an optimization, if I understand the situation
correctly.
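For the first point, a sketch of the direction I mean (the bool return
value is a hypothetical API change, it does not exist today):

    // Hypothetical: let the collection report whether it actually ran,
    //   bool CollectedHeap::collect_as_vm_thread(GCCause::Cause cause);
    bool collected = false;
    while (!collected) {
      collected =
        Universe::heap()->collect_as_vm_thread(GCCause::_archive_time_gc);
      // For current STW collectors the only reason to get false here
      // would be the GCLocker being held (later: a pinned eden region),
      // so this is the same wait loop, minus the direct GCLocker
      // dependency.
    }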
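For the second point, something along these lines (the retry bound,
sleep interval, and log message are invented for illustration):

    // Sketch: bound the wait and make the stall visible instead of
    // hanging silently if a native thread blocks inside a critical
    // section.
    int retries = 0;
    const int max_retries = 1000;  // invented bound
    while (GCLocker::is_active()) {
      if (++retries > max_retries) {
        log_warning(cds)("GCLocker still held after %d retries; "
                         "continuing with a possibly suboptimal archive "
                         "heap layout", max_retries);
        break;
      }
      os::naked_short_sleep(1);
    }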
So overall, for this change I would tend to suggest to only fix the
bug, i.e. remove the retries of any kind and not try to optimize the
memory layout here, and think about that in a separate CR.
Thanks,
Thomas