RFR(S): 8215893: Add better abstraction for pinning G1 concurrent marking bitmaps.
Kharbas, Kishor
kishor.kharbas at intel.com
Tue Oct 22 07:22:59 UTC 2019
Hi Stefan,
> -----Original Message-----
> From: Stefan Johansson [mailto:stefan.johansson at oracle.com]
> Sent: Friday, October 18, 2019 7:32 AM
> To: Kharbas, Kishor <kishor.kharbas at intel.com>
> Cc: sangheon.kim at oracle.com; hotspot-gc-dev at openjdk.java.net
> Subject: Re: RFR(S): 8215893: Add better abstraction for pinning G1
> concurrent marking bitmaps.
>
> Hi Kishor,
>
> > 17 okt. 2019 kl. 23:28 skrev Kharbas, Kishor <kishor.kharbas at intel.com>:
> >
> > Hi Stefan,
> >
> >> -----Original Message-----
> >> From: Stefan Johansson [mailto:stefan.johansson at oracle.com]
> >> Sent: Thursday, October 17, 2019 4:34 AM
> >> To: Kharbas, Kishor <kishor.kharbas at intel.com>;
> >> sangheon.kim at oracle.com
> >> Cc: hotspot-gc-dev at openjdk.java.net
> >> Subject: Re: RFR(S): 8215893: Add better abstraction for pinning G1
> >> concurrent marking bitmaps.
> >>
> >> Hi Kishor,
> >>
> >> On 2019-10-17 03:39, Kharbas, Kishor wrote:
> >>> Hi Sangheon,
> >>>
> >>> *From:*sangheon.kim at oracle.com [mailto:sangheon.kim at oracle.com]
> >>> *Sent:* Wednesday, October 16, 2019 11:03 AM
> >>> *To:* Kharbas, Kishor <kishor.kharbas at intel.com>
> >>> *Cc:* hotspot-gc-dev at openjdk.java.net; Stefan Johansson
> >>> <stefan.johansson at oracle.com>
> >>> *Subject:* Re: RFR(S): 8215893: Add better abstraction for pinning
> >>> G1 concurrent marking bitmaps.
> >>>
> >>>> Hi Kishor,
> >>>>
> >>>> Before reviewing webrev.02, could you remind us of the motivation
> >>>> for pinning the bitmap mappers here?
> >>>> In addition to explanations of the problematic situation, any logs
> >>>> / stack-trace also may help.
> >>>>
> >>>> We think that understanding the root cause should be considered first.
> >>>
> >>> Unfortunately, I do not have log/stack-trace of the problem I had faced.
> >>>
> >>> I am trying to reproduce it by running the SPECjbb workload over and
> >>> over again.
> >>>
> >>> I haven't looked at the GC code since the end of last year, so I am
> >>> having a difficult time pinning down what the problem was.
> >>>
> >>> I am looking at G1ClearBitMapTask, which iterates over the bitmap for
> >>> all available regions. I am not sure when this task is performed.
> >>>
> >>> There is a comment in HeapRegionManager::par_iterate() as shown below:
> >>>
> >>> // This also (potentially) iterates over regions newly allocated
> >>> // during GC. This is no problem except for some extra work.
> >>> This method is eventually called from G1ClearBitMapTask. The comment
> >>> suggests that regions are allocated concurrently while the function
> >>> runs. This also means that with the AllocateOldGenAt flag enabled,
> >>> regions can also be un-committed concurrently.
> >>
> >> I don't understand how AllocateOldGenAt would make any difference;
> >> regions can be un-committed without it as well, and there are
> >> mechanisms in place to make sure only the correct parts of the side
> >> structures are un-committed when that happens.
> >
> > In the regular code, un-commit is only done by the VM thread during a
> > safepoint. Un-committing a region also causes its corresponding bitmap
> > to be un-committed. But it never happens that CM threads are iterating
> > over the bitmap while regions are being un-committed concurrently.
> >
> > Whereas when AllocateOldGenAt is used, because of the way regions are
> > managed between dram and nvdimm, regions can be un-committed by
> > mutator threads and GC threads:
> > 1. Mutator threads - during mutator region allocation and humongous
> >    region allocation.
>
> This is the problem. I managed to reproduce it by adding a short sleep in
> the clearing code and forcing back-to-back concurrent cycles in SPECjvm2008
> with a 2g heap. I think this is only a problem for humongous allocations,
> because we should never allocate more young regions than we have already
> made available at the end of the previous GC. But humongous allocations
> can very well happen while we clear the bitmaps in the concurrent cycle, so
> that is probably why the pinning was added.
>
> Thinking more about this, a different solution would be to not un-commit
> memory in this situation. This all depends on how one views the amount of
> committed memory when using AllocateOldGenAt: should the amount
> committed on dram + nvdimm never be more than Xmx, or is the important
> thing that the number of regions in use never exceeds Xmx? I think I'm
> leaning towards the latter, but there might be reasons I haven't thought
> about here. This would break the current invariant:
> assert(total_committed_before == total_regions_committed(), "invariant
> not met");
>
> But that might be ok. With that approach, instead of un-committing
> (shrink_dram), you would just remove from the freelist the same number of
> regions that you expand on nvdimm. The removed, unused regions need to be
> tracked so we can add them back during the GC. To me this is more or less
> the same concept we use when borrowing regions during the GC. There
> might be issues with this approach, but I think it would be interesting to
> explore.
>
[Kharbas, Kishor] Thank you for looking into this and reproducing the bug. I think I follow
your suggestion. I will try to work on a solution using this.
> I also wonder if we should ever need to expand_dram during
> allocate_new_region. I see that it happens now during GC, and that is
> probably because we do this at the end of the GC:
> _manager->adjust_dram_regions((uint)young_list_target_length() …
>
> If this adjustment included the expected number of survivors as well, we
> should have enough DRAM regions, and if we then end up getting an
> NVDIMM region when asking for a survivor we should return NULL, signaling
> that survivor space is full.
>
> What do you think about that approach?
[Kharbas, Kishor] This approach is simpler to implement. I am afraid that it would
change the behavior with respect to the default case. Still, I will give it a try.
For now, can we close the bug if the abstraction is satisfactory, and continue the
exploration in a separate issue?
Thanks,
Kishor
>
> Thanks,
> Stefan
>
> > 2. GC worker threads - during survivor region and old region allocation.
> > 3. VMThread - heap size adjustment as in the default case, and after full
> >    GC to allocate enough regions in dram for the young gen (which may
> >    require un-committing some regions from nvdimm).
> >
> > Could any of these be running concurrently while CM threads are iterating
> > over the bitmap?
> >
> >>
> >> I want to reiterate what Sangheon said about identifying the root cause.
> >> If we don't know why this is needed and can't reproduce any failures
> >> without the special pinning of the bitmaps, I would rather see that
> >> we remove the pinning code to make things work more like normal G1.
> >
> > I am trying to reproduce it, but as you can imagine it is a very rare and
> > hard-to-reproduce bug, if it is one.
> >
> > Thanks,
> > Kishor
> >>
> >> Thanks,
> >> Stefan
> >>
> >>
> >>>
> >>> Pardon me if my understanding is incorrect.
> >>>
> >>> Regards,
> >>>
> >>> Kishor