RFR(S): 8215893: Add better abstraction for pinning G1 concurrent marking bitmaps.

Thu Oct 17 21:28:10 UTC 2019

Hi Stefan,

> -----Original Message-----
> From: Stefan Johansson [mailto:stefan.johansson at oracle.com]
> Sent: Thursday, October 17, 2019 4:34 AM
> To: Kharbas, Kishor <kishor.kharbas at intel.com>; sangheon.kim at oracle.com
> Cc: hotspot-gc-dev at openjdk.java.net
> Subject: Re: RFR(S): 8215893: Add better abstraction for pinning G1
> concurrent marking bitmaps.
> 
> Hi Kishor,
> 
> On 2019-10-17 03:39, Kharbas, Kishor wrote:
> > Hi Sangheon,
> >
> > *From:*sangheon.kim at oracle.com [mailto:sangheon.kim at oracle.com]
> > *Sent:* Wednesday, October 16, 2019 11:03 AM
> > *To:* Kharbas, Kishor <kishor.kharbas at intel.com>
> > *Cc:* hotspot-gc-dev at openjdk.java.net; Stefan Johansson
> > <stefan.johansson at oracle.com>
> > *Subject:* Re: RFR(S): 8215893: Add better abstraction for pinning G1
> > concurrent marking bitmaps.
> >
> >> Hi Kishor,
> >>
> >> Before reviewing webrev.02, could you remind us what was the
> >> motivation of pinning the bitmap mappers here?
> >> In addition to explanations of the problematic situation, any logs /
> >> stack-trace also may help.
> >>
> >> We think that understanding of the root cause should be considered first.
> >
> > Unfortunately, I do not have log/stack-trace of the problem I had faced.
> >
> > I am trying to reproduce it by running SPECjbb workload over and over
> again.
> >
> > I haven't looked at GC code since end of last year. So I am having a
> > difficult time pinning what the problem was.
> >
> > I am looking at G1ClearBitMapTask which iterates over bitmap for all
> > available regions. I am not sure when this task is performed.
> >
> > There is comment in HeapRegionManager::par_iterate() as shown below,
> >
> > /// This also (potentially) iterates over regions newly allocated
> > during GC. This/
> >
> > /  // is no problem except for some extra work./
> >
> > This method is eventually called from G1ClearBitMapTask. The comment
> > suggests that regions are allocated concurrently when the function is
> > run. This also means with AllocateOldGenAt flag enabled, regions can
> > also be un-committed.
> 
> I don't understand how AllocateOldGenAt would make any difference,
> regions can be un-committed without it as well and there are mechanisms in
> place to make sure only the correct parts of the side structures are un-
> committed when that happens.

In the regular code un-commit is only done by VM thread during safepoint. Un-commit of region also causes its corresponding bitmap to be un-committed.
But it never happens that CM threads are iterating over bitmap while regions are being un-committed concurrently.

Whereas when AllocateOldGenAt is used, because of the way regions are managed between
dram and nvdimms, regions can be un-committed by mutator threads and GC threads.
1. Mutator threads - during mutator region allocation and humongous region allocation.
2. GC worker threads - during survivor region and old region allocation.
3. VMThread - heap size adjustment as in default and after full GC to allocate enough regions in dram for young gen (may require to un-commit some regions from nvdimm).

Could any of these be running concurrently when CM threads are iterating over the bitmap?

> 
> I want to reiterate what Sangheon said about identifying the root cause.
> If we don't know why this is needed and can't reproduce any failures without
> the special pinning of the bitmaps, I would rather see that we remove the
> pinning code to make things work more like normal G1.

I am trying to reproduce but as you can imagine it is very rare and hard-to-reproduce bug, if it is.

Thanks,
Kishor
> 
> Thanks,
> Stefan
> 
> 
> >
> > Pardon me if my understanding is incorrect.
> >
> > Regards,
> >
> > Kishor