RFR(S) 8245925 G1 allocates EDEN region after CDS has executed GC

Ioi Lam ioi.lam at oracle.com
Mon Jun 8 22:21:17 UTC 2020



On 6/8/20 7:12 AM, Thomas Schatzl wrote:
> Hi,
>
> On 06.06.20 07:51, Ioi Lam wrote:
>>
>>
>> On 6/5/20 2:38 PM, Jiangli Zhou wrote:
>>> Hi Ioi,
>>>
>>> Thanks for the updated webrev.
>>>
>>> To avoid the assert, could you call GCLocker::stall_until_clear in
>>> MetaspaceShared::preload_and_dump() before grabbing the 'Heap_lock'
>>> and VMThread::execute(&op), like following. It is the best to check
>>> with Thomas and other GC experts about the interaction between
>>> 'Heap_lock' and 'GCLocker'.
>>>
>>>          if (HeapShared::is_heap_object_archiving_allowed()) {
>>>            GCLocker::stall_until_clear();
>>>          }
>> Hi Jiangli,
>>
>> Thanks for the review.
>>
>> I don't think this is sufficient. Immediately after 
>> GCLocker::stall_until_clear() has returned, another thread can invoke 
>> a JNI function that will activate the GCLocker. That's stated in the 
>> gcLocker.hpp comments:
>>
>> http://hg.openjdk.java.net/jdk/jdk/file/882b61be2c19/src/hotspot/share/gc/shared/gcLocker.hpp#l109 
>
>
> The GCLocker::stall_until_clear()/GCLocker::is_active() loop seems to 
> be trying to achieve something different from what the CR tries to fix.
>
> The change without that loop/GCLocker::stall_until_clear() already 
> ensures that there is no allocation between compaction and the 
> CDS dump (at least for STW collectors, or collectors that can do a 
> "full" collection during an STW pause).
>
> This additional code, which waits for the GCLocker to become inactive 
> so that we always get a contiguous range of memory, is a different 
> matter:
>
> Using GCLocker::stall_until_clear() does not work: it would only wait 
> for completion of all JNI critical sections if a gc had been requested 
> while these JNI critical sections were active. Otherwise it would do 
> nothing and let the caller continue, with some threads still in a 
> critical section and the GCLocker held, which means that the following 
> full gc may still "fail".
>
> (That gc would set needs_gc and then it would work. But since you would 
> not know whether the first full gc had been successful, you would need 
> another one. Since at that point the return from the last JNI CS would 
> trigger a young gc, you get two extra gcs in the common case...).
>
> Another serious problem is that the stall_until_clear() code is 
> written to be called only from a Java thread. I do not think it works 
> in the VM thread without a lot more thought (and at least the necessary 
> changes to asserts). As Jiangli mentions, there is likely also an 
> issue with the ordering of Heap_lock and JNI_critical_lock.
>
> The GCLocker::is_active() loop is a bit better, as it makes sure that 
> all threads have left their JNI critical sections, and since we are at 
> a safepoint, these threads will block when exiting the VM call (and not 
> be able to enter another critical section).
>
> So the full gc will likely be able to move all objects out of eden 
> regions.
>
> However:
> - this part of the code is bound to break if G1 is changed to pin 
> (eden) regions instead of using the GCLocker (I am aware that it is 
> very G1-specific already).
>
> One option to decrease the direct reliance on the GCLocker a bit could 
> be to have CollectedHeap::collect_as_vm_thread return a bool indicating 
> whether the collection could be executed (or whether there is a pinned 
> eden region?). Since for all STW gcs the only reason to abort at that 
> point is the GCLocker, this boils down to the same loop, just without 
> using the GCLocker directly here.
>
> - one other issue I see with this piece of code is that a thread might 
> safepoint/block for another reason while being in a critical section. 
> Then we have a deadlock.
>
> This is prohibited by the JNI spec ("In a code segment enclosed by 
> Get/Release*Critical the native code must not ... or cause the current 
> thread to block"), but it would be very hard to analyze, so I recommend 
> at least erroring out (or in this case continuing with a suboptimal 
> heap) after too many retries, or at least logging a warning that you 
> are stalled because the GCLocker is held.
> Otherwise this error state will be too hard to understand and likely 
> very, very hard to reproduce.
>
> - this seems to be only an optimization, if I understand the situation 
> correctly.
>
> So overall, for this change I would tend to suggest only fixing the 
> bug, i.e. removing retries of any kind, not trying to optimize the 
> memory layout here, and thinking about that in a separate CR.
>
> Thanks,
>   Thomas
Hi Thomas & Jiangli,

Thanks for the comments. I have created an updated webrev:

http://cr.openjdk.java.net/~iklam/jdk15/8245925-g1-eden-after-cds-gc.v04/

Now the code works the same way as jcmd GC.heap_dump, which also checks 
the GCLocker before calling collect_as_vm_thread():

http://hg.openjdk.java.net/jdk/jdk/file/523a504bb062/src/hotspot/share/gc/shared/gcVMOperations.cpp#l124
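
For reference, the relevant check there is roughly the following 
(paraphrased from memory; see the link above for the exact code):

// Rough paraphrase of the jcmd GC.heap_dump path (not the exact source):
// the full GC is skipped when a JNI critical section holds the GCLocker.
bool VM_GC_HeapInspection::collect() {
  if (GCLocker::is_active()) {
    return false;   // cannot run a full GC while the GCLocker is held
  }
  Universe::heap()->collect_as_vm_thread(GCCause::_heap_inspection);
  return true;
}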

Question to Thomas -- is it necessary to call 
Universe::heap()->ensure_parsability(false)?

I have removed the loop and now just print a warning message that the 
archived heap may be sub-optimal.
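
In other words, the dump-time logic is now roughly this (simplified 
sketch, not the exact webrev code):

// Simplified sketch of the v04 behavior: instead of looping until the
// GCLocker is released, warn and continue with the dump, accepting a
// possibly sub-optimal heap layout. (The exact message is in the webrev.)
if (GCLocker::is_active()) {
  log_warning(cds)("GCLocker is held by a JNI critical section; "
                   "the archived heap may be sub-optimal");
}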

Jiangli> Could you please describe the specific issue that you want to 
address with GCLocker?

With the current code base, we execute very simple Java code during 
-Xshare:dump, so after all classes are loaded, the GCLocker should not be 
active. As noted in the code comments, the GCLocker will be active only in 
the unlikely scenario where the Java code in the core libraries has been 
modified to do some sort of cleanup that involves JNI code.
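
For illustration only (a hypothetical native method, not anything in the 
JDK sources), this is the kind of JNI code that would keep the GCLocker 
active:

#include <jni.h>
#include <cstring>

// Hypothetical example: between Get/ReleasePrimitiveArrayCritical the
// GCLocker is active, so a full GC requested at that moment cannot run.
extern "C" JNIEXPORT void JNICALL
Java_Example_clearBytes(JNIEnv* env, jclass, jbyteArray arr) {
  jsize len = env->GetArrayLength(arr);
  void* p = env->GetPrimitiveArrayCritical(arr, nullptr);  // GCLocker active
  if (p != nullptr) {
    std::memset(p, 0, (size_t)len);
    // Per the JNI spec, the code here must not block or call back into
    // the VM.
    env->ReleasePrimitiveArrayCritical(arr, p, 0);         // GCLocker released
  }
}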

What do you think?

Thanks
- Ioi


======
[PS] I am adding an assert like this and running tier1-4 in mach5 to see 
if this can actually happen.

+  assert(!GCLocker::is_active(), "huh");
    if (GCLocker::is_active()) {
