RFR(S) 8245925 G1 allocates EDEN region after CDS has executed GC

Mon Jun 8 22:53:09 UTC 2020

Hi Ioi,

The latest webrev looks ok. Nit: for cleaner API, could you please
change VM_PopulateDumpSharedSpace::run_gc() to static
HepShared::run_gc()?

On Mon, Jun 8, 2020 at 3:22 PM Ioi Lam <ioi.lam at oracle.com> wrote:
>
>
>
> On 6/8/20 7:12 AM, Thomas Schatzl wrote:
> > Hi,
> >
> > On 06.06.20 07:51, Ioi Lam wrote:
> >>
> >>
> >> On 6/5/20 2:38 PM, Jiangli Zhou wrote:
> >>> Hi Ioi,
> >>>
> >>> Thanks for the updated webrev.
> >>>
> >>> To avoid the assert, could you call GCLocker::stall_until_clear in
> >>> MetaspaceShared::preload_and_dump() before grabbing the 'Heap_lock'
> >>> and VMThread::execute(&op), like following. It is the best to check
> >>> with Thomas and other GC experts about the interaction between
> >>> 'Heap_lock' and 'GCLocker'.
> >>>
> >>>          if (HeapShared::is_heap_object_archiving_allowed()) {
> >>>            GCLocker::stall_until_clear();
> >>>          }
> >> Hi Jiangli,
> >>
> >> Thanks for the review.
> >>
> >> I don't think this is sufficient. Immediately after
> >> GCLocker::stall_until_clear() has returned, another thread can invoke
> >> a JNI function that will activate the GCLocker. That's stated by the
> >> gcLocker.hpp comments:
> >>
> >> http://hg.openjdk.java.net/jdk/jdk/file/882b61be2c19/src/hotspot/share/gc/shared/gcLocker.hpp#l109
> >
> >
> > The GCLocker::stall_until_clear()/GClocker::is_active() loop seems to
> > try to achieve something different than what the CR tries to fix.
> >
> > The change without that loop/GClocker::stall_until_clear() already
> > fully achieves that there is no allocation between compaction and the
> > CDS dump (at least for STW collectors or collectors that can do a
> > "full" collection during an STW pause.)
> >
> > This additional code to wait for GCLocker being inactive to always get
> > a contiguous range of memory is a different matter:
> >
> > Using GCLocker::stall_until_clear() does not work: it would only wait
> > for completion of all JNI critical sections if a gc had been requested
> > while these JNI critical sections were active. Otherwise it would do
> > nothing and let the caller continue, some threads still being in a
> > critical section and the gclocker held. Which means that the following
> > full gc may still "fail".
> >
> > (That gc would set needs_gc and then it would work. But since you did
> > not know whether the first full gc has been successful, you would need
> > another one. Since at that point the return from the last JNI CS would
> > trigger a young gc, you have two extra gcs in the common case...).
> >
> > Another serious problem is that the stall_until_clear() code is
> > written to be only called in a Java thread. I do not think it works in
> > the VM thread without lots of more thought (and at least necessary
> > changes to asserts). As Jiangli mentions, there is likely also an
> > issue with ordering of Heap_lock and JNI_critical_lock.
> >
> > The GCLocker::is_active() loop is a bit better as it makes sure that
> > all threads left the JNI critical section and since we are in a
> > safepoint, when exiting the VM call these threads will block (and not
> > be able to go into another critical section).
> >
> > So the full gc will likely be able to move away all objects from eden
> > regions.
> >
> > However:
> > - this part of the code is bound to break with a change of G1 to pin
> > (eden) regions instead of using the gclocker (I am aware that it is
> > very g1 specific already).
> >
> > One option to decrease the direct reliance on the GCLocker a bit could
> > be to have CollectedHeap::collect_as_vm_thread return a bool about
> > whether the collection could be executed (or there is a pinned eden
> > region?). Since the only reason to abort is the GCLocker for all STW
> > gcs at that point, this boils down to the same loop without directly
> > using the GCLocker here.
> >
> > - one other issue I think there is with this piece code is that
> > safepoints/blocks for another reason while being in a critical
> > section. Then we have a deadlock.
> >
> > This is prohibited by the JNI spec ("In a code segment enclosed by
> > GetRelease*Critical the native code must not ... or cause the current
> > thread to block"), but would be very hard to analyze so I recommend to
> > at least error out (or in this case continue with a suboptimal heap)
> > after too many retries, or at least start logging some warning that
> > you are stalled because of GCLocker being held.
> > Otherwise this error state will be too hard to understand and likely
> > very very hard to reproduce.
> >
> > - this seems to be an optimization only if I understand the situation
> > correctly.
> >
> > So Overall, for this change I would tend to suggest to only fix the
> > bug, i.e. remove the retries of any kind and so not try to optimize
> > memory layout here and think about this in an extra CR.
> >
> > Thanks,
> >   Thomas
> Hi Thomas & Jiangli,
>
> Thanks for the comments. I have created an updated webrev:
>
> http://cr.openjdk.java.net/~iklam/jdk15/8245925-g1-eden-after-cds-gc.v04/
>
> Now the code works the same way as jcmd GC.heap_dump, which also check
> for GCLocker before calling collect_as_vm_thread():
>
> http://hg.openjdk.java.net/jdk/jdk/file/523a504bb062/src/hotspot/share/gc/shared/gcVMOperations.cpp#l124
>
> Question to Thomas -- is it necessary to call
> Universe::heap()->ensure_parsability(false)?
>
> Have removed the loop and just print a warning message that the archived
> heap may be sub-optimal.
>
> Jiangli> Could you please describe the specific issue that you want to
> address with GCLocker?
>
> With the current code base, we execute very simple Java code during
> -Xshare:dump, so after all classes are loaded, GCLocker should not be
> active. As noted in the code comments, GCLocker will be active only in
> the unlikely scenario where the Java code in core lib has been modified
> to do some sort of clean up that involves JNI code.
>

Ok, thanks. It would involve JNI critical, agreed that it's unlikely
for static CDS dumping.

> What do you think?
>
> Thanks
> - Ioi
>
>
> ======
> [PS] I am adding an assert like this to run tier1-4 in mach5 to see if
> this can happen.
>
> +  assert(!GCLocker::is_active(), "huh");
>     if (GCLocker::is_active()) {
>
>

Thanks, that will give some confidence before we can help do more
tests for this.

Best,
Jiangli

>
>
>
>
>