RFR(S) 8245925 G1 allocates EDEN region after CDS has executed GC

Thomas Schatzl thomas.schatzl at oracle.com
Mon Jun 8 14:12:37 UTC 2020


Hi,

On 06.06.20 07:51, Ioi Lam wrote:
> 
> 
> On 6/5/20 2:38 PM, Jiangli Zhou wrote:
>> Hi Ioi,
>>
>> Thanks for the updated webrev.
>>
>> To avoid the assert, could you call GCLocker::stall_until_clear in
>> MetaspaceShared::preload_and_dump() before grabbing the 'Heap_lock'
>> and VMThread::execute(&op), like following. It is the best to check
>> with Thomas and other GC experts about the interaction between
>> 'Heap_lock' and 'GCLocker'.
>>
>>          if (HeapShared::is_heap_object_archiving_allowed()) {
>>            GCLocker::stall_until_clear();
>>          }
> Hi Jiangli,
> 
> Thanks for the review.
> 
> I don't think this is sufficient. Immediately after 
> GCLocker::stall_until_clear() has returned, another thread can invoke a 
> JNI function that will activate the GCLocker. That's stated by the 
> gcLocker.hpp comments:
> 
> http://hg.openjdk.java.net/jdk/jdk/file/882b61be2c19/src/hotspot/share/gc/shared/gcLocker.hpp#l109 

The GCLocker::stall_until_clear()/GCLocker::is_active() loop seems to 
try to achieve something different from what the CR tries to fix.

The change without that loop/GCLocker::stall_until_clear() already 
guarantees that there is no allocation between compaction and the CDS 
dump (at least for STW collectors, or collectors that can do a "full" 
collection during an STW pause).
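
Just to illustrate what I mean (a rough sketch only, using the names 
already mentioned in this thread; the VM operation name is my 
assumption, not necessarily what the webrev uses):

   if (HeapShared::is_heap_object_archiving_allowed()) {
     MutexLocker ml(Heap_lock);       // as in the snippet quoted above
     VM_PopulateDumpSharedSpace op;   // (assumed op name) does the full gc and the dump
     VMThread::execute(&op);          // both run inside one STW safepoint, so no mutator
   }                                  // can allocate between compaction and the dump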

The additional code that waits for the GCLocker to become inactive, in 
order to always get a contiguous range of memory, is a different matter:

Using GCLocker::stall_until_clear() does not work: it only waits for 
all JNI critical sections to complete if a gc had been requested while 
these critical sections were active. Otherwise it does nothing and lets 
the caller continue, with some threads still inside a critical section 
and the GCLocker held, which means that the following full gc may still 
"fail".

(That gc would set needs_gc, and then the stall would work. But since 
you would not know whether the first full gc had been successful, you 
would need another one. And since at that point the return from the 
last JNI critical section would trigger a young gc, you get two extra 
gcs in the common case...)
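
For reference, this is roughly what stall_until_clear() does (quoted 
from memory, so details may be off):

   void GCLocker::stall_until_clear() {
     assert(!JavaThread::current()->in_critical(), "would deadlock");
     MonitorLocker ml(JNICritical_lock);
     while (needs_gc()) {   // only set if a gc was requested while the
       ml.wait();           // gclocker was active; otherwise we return
     }                      // immediately, with critical sections still open
   }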

Another serious problem is that the stall_until_clear() code is written 
to be called only from a Java thread. I do not think it works in the VM 
thread without a lot more thought (and, at the very least, changes to 
asserts). As Jiangli mentions, there is likely also an issue with the 
ordering of Heap_lock and JNI_critical_lock.

The GCLocker::is_active() loop is a bit better: it makes sure that all 
threads have left their JNI critical sections, and since we are at a 
safepoint, these threads will block when exiting the VM call (and will 
not be able to enter another critical section).

So the full gc will likely be able to move all objects out of eden 
regions.
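
I.e., as I understand the change, the loop is roughly this (sketch 
only; the exact code in the webrev may differ):

   // inside the dump VM operation, i.e. at a safepoint
   while (GCLocker::is_active()) {
     os::naked_short_sleep(10);   // wait for threads to leave their critical
   }                              // sections; they block when exiting the VM call
   Universe::heap()->collect_as_vm_thread(GCCause::_archive_time_gc);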

However:
- this part of the code is bound to break if G1 is changed to pin 
(eden) regions instead of using the GCLocker (I am aware that it is 
already very G1-specific).

One option to decrease the direct reliance on the GCLocker a bit could 
be to have CollectedHeap::collect_as_vm_thread return a bool indicating 
whether the collection could be executed (or whether there is a pinned 
eden region?). Since at that point the GCLocker is the only reason to 
abort for all STW gcs, this boils down to the same loop without using 
the GCLocker here directly.
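
Something like this (hypothetical signature; today collect_as_vm_thread 
returns void):

   // bool CollectedHeap::collect_as_vm_thread(GCCause::Cause cause)
   //   returns false if the collection could not be executed (gclocker
   //   active, or later: a pinned eden region)
   while (!Universe::heap()->collect_as_vm_thread(GCCause::_archive_time_gc)) {
     os::naked_short_sleep(10);
   }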

- one other issue I see with this piece of code is that a thread might 
safepoint/block for some other reason while being inside a critical 
section. Then we have a deadlock.

This is prohibited by the JNI spec ("In a code segment enclosed by 
Get/Release*Critical the native code must not ... or cause the current 
thread to block"), but it would be very hard to analyze, so I recommend 
at least erroring out (or, in this case, continuing with a suboptimal 
heap) after too many retries, or at least logging some warning that you 
are stalled because the GCLocker is being held.
Otherwise this error state will be too hard to understand and likely 
very, very hard to reproduce.
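
E.g. something along these lines (the retry limit and the exact log 
messages are just placeholders):

   int retries = 0;
   while (GCLocker::is_active() && retries < MaxDumpGCRetries) {   // placeholder limit
     log_warning(cds)("Full gc for CDS dump stalled: GCLocker is held "
                      "by a JNI critical section, retrying");
     os::naked_short_sleep(10);
     retries++;
   }
   if (GCLocker::is_active()) {
     log_warning(cds)("Giving up waiting for GCLocker, archived heap "
                      "layout may be suboptimal");
   }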

- this seems to be only an optimization, if I understand the situation 
correctly.

So overall, for this change I would tend to suggest only fixing the 
bug, i.e. removing retries of any kind and not trying to optimize the 
memory layout here, and thinking about that in a separate CR.

Thanks,
   Thomas


