RFR (M): 8077144: Concurrent mark initialization takes too long

Thomas Schatzl thomas.schatzl at oracle.com
Tue Mar 15 22:12:41 UTC 2016


Hi Mikael,

  updated webrev at
http://cr.openjdk.java.net/~tschatzl/8077144/webrev.3/ (full)
http://cr.openjdk.java.net/~tschatzl/8077144/webrev.2_to_3/ (diff)

which implements the suggested changes.

It makes the change a lot nicer, thanks for the suggestion. There is
still room for improvement; I would prefer to do the larger
refactorings (particularly to improve testability and code quality,
and to add test cases) after moving all of this related code into a
separate class/file, which is next in JDK-8151386 and maybe one or two
follow-ups.
With that in place, this liveness information can then also be used
with fully coarsened per-region tables in the remembered set.

Testing:
jprt, vm.gc with -XX:+VerifyDuringGC, local checking of marked_bytes
values.

Thanks,
  Thomas

On Mon, 2016-03-14 at 16:37 +0100, Mikael Gerdin wrote:
> Hi,
> 
> I had an IM discussion with Thomas around some issues with the design
> of 
> the aggregation and verification code.
> 
> G1LiveDataClosureBase should become a utility class instead of a base
> class. It is weird that the G1VerifyLiveDataHRClosure embeds another
> HRClosure and calls doHeapRegion on it; it would be better if the
> mark_marked_during_marking and mark_allocated_since_marking methods
> could be called directly.
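> 
> Roughly the shape I have in mind, just as a sketch (the class names,
> signatures and bodies below are illustrative only, not the actual
> webrev code):
> 
>    // Sketch only; names and signatures are illustrative.
>    class G1LiveDataHelper /* utility, was the base class G1LiveDataClosureBase */ {
>    public:
>      size_t mark_marked_during_marking(HeapRegion* hr); // see next paragraph
>      void mark_allocated_since_marking(HeapRegion* hr);
>    };
> 
>    class G1VerifyLiveDataHRClosure : public HeapRegionClosure {
>      G1LiveDataHelper _helper;  // plain member instead of a nested HRClosure
>    public:
>      bool doHeapRegion(HeapRegion* hr) {
>        // Call the helpers directly; no forwarding through another
>        // closure's doHeapRegion().
>        size_t marked_bytes = _helper.mark_marked_during_marking(hr);
>        _helper.mark_allocated_since_marking(hr);
>        // ... verify marked_bytes / the recreated bits here; verification
>        // must not call hr->add_to_marked_bytes() ...
>        return false;  // continue the iteration
>      }
>    };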
> 
> mark_marked_during_marking should return the marked bytes instead, and
> let the caller take care of mutating the heap region and doing the
> yield check. There is currently a bug where the verification code
> calls add_to_marked_bytes on the HeapRegion. There is also an issue
> with how add_to_marked_bytes is called on ContinuesHumongous regions,
> since marked_bytes is accumulated across iterations and the running
> total is then added to the current hr.
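> 
> I.e., something like the following (again just a sketch with
> illustrative names). Calling add_to_marked_bytes exactly once per
> region, with only that region's bytes, also avoids the
> ContinuesHumongous double-counting:
> 
>    // Sketch only: the caller owns the HeapRegion update and the yield
>    // check; the helper just returns the number of marked bytes it found.
>    bool G1LiveDataHRClosure::doHeapRegion(HeapRegion* hr) {
>      size_t marked_bytes = _helper.mark_marked_during_marking(hr);
>      if (marked_bytes > 0) {
>        // Exactly once per region, with only this region's bytes.
>        hr->add_to_marked_bytes(marked_bytes);
>      }
>      _helper.mark_allocated_since_marking(hr);
>      // The regular yield/abort check of the concurrent phase also stays
>      // here with the caller instead of inside the helper.
>      return false;  // continue the iteration
>    }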
> 
> It might be a good idea to hold off on reviews until the updated
> webrev 
> is out.
> 
> /Mikael
> 
> On 2016-03-14 14:15, Thomas Schatzl wrote:
> > Hi all,
> > 
> >    could I have reviews for this from-scratch solution for the
> > problem
> > that G1 startup takes too long?
> > 
> > G1 currently uses per-mark-thread liveness bitmaps that span the
> > entire heap in order to ultimately create information, at card
> > granularity, about which areas of the heap contain any live objects.
> > This information is needed later for scrubbing remembered sets.
> > 
> > Basically, in addition to updating the prev bitmap as required for
> > SATB, every marking thread also marks, for each live object, all the
> > bits corresponding to the area that object covers in its own
> > per-thread liveness bitmap.
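> > 
> > In simplified pseudo-code, the extra per-object work each marking
> > thread currently does is roughly the following (names and the exact
> > bitmap granularity are illustrative only, not the actual code):
> > 
> >    // Simplified sketch of the current scheme. Every marking thread
> >    // owns a liveness bitmap spanning the whole heap.
> >    void count_live_object(BitMap* my_liveness_bm,
> >                           HeapWord* obj_start,
> >                           size_t obj_size_in_words) {
> >      // In addition to setting the object's start bit in the shared prev
> >      // bitmap (needed for SATB), flag the whole range the object covers
> >      // in the thread-local liveness bitmap.
> >      size_t start_idx = bit_index_for(obj_start);  // hypothetical helper
> >      size_t end_idx   = bit_index_for(obj_start + obj_size_in_words - 1);
> >      for (size_t i = start_idx; i <= end_idx; i++) {
> >        my_liveness_bm->set_bit(i);
> >      }
> >    }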
> > 
> > During the remark pause, this information is aggregated into (two)
> > global bitmaps ("Liveness Count Data"), then in the cleanup pause
> > augmented with some more liveness information, and then used for
> > scrubbing the remembered sets.
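> > 
> > The aggregation in the remark pause is then roughly (again a
> > simplified sketch):
> > 
> >    // Simplified sketch of the remark-time aggregation this change
> >    // removes; its cost is proportional to num_mark_threads * bitmap size.
> >    void aggregate_live_data(BitMap* global_bm,
> >                             BitMap** per_thread_bms,
> >                             uint num_mark_threads) {
> >      for (uint t = 0; t < num_mark_threads; t++) {
> >        // OR thread t's liveness bits into the global count data; the
> >        // per-region marked bytes are summed up in a similar fashion.
> >        global_bm->set_union(*per_thread_bms[t]);
> >      }
> >    }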
> > 
> > The main problems with that solution:
> > 
> > - the per-mark-thread data structures take up a lot of space. E.g.
> > with 64 mark threads, these data structures together are as large as
> > the Java heap itself. You typically only need that many mark threads
> > when the heap is big, and at those heap sizes needing that much
> > additional memory hurts a lot.
> > 
> > - managing these additional data structures is costly: it takes a
> > long time to initialize them and to regularly clear them. The
> > increased startup time is what actually prompted this issue.
> > 
> > - it takes a significant amount of time to aggregate this data in
> > the
> > remark pause.
> > 
> > - it slows down marking: the combined bitmap update (the prev bitmap
> > plus these per-thread bitmaps) is slower than doing these phases
> > separately.
> > 
> > The proposed solution removes the additional per-thread mark bitmaps
> > and recreates this information from the (complete) prev bitmap in an
> > extra concurrent phase after the Remark pause.
> > 
> > This can be done since the prev bitmap does not change any more after
> > Remark.
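> > 
> > Conceptually, the new concurrent phase boils down to something like
> > this (simplified sketch, not the actual code):
> > 
> >    // Simplified sketch of the new concurrent phase after Remark. The
> >    // prev bitmap is stable at this point, so the liveness information
> >    // can be recreated from it in a single walk, concurrently with the
> >    // application.
> >    void create_live_data_from_prev_bitmap(BitMap* prev_bm, BitMap* live_bm) {
> >      size_t limit = prev_bm->size();
> >      for (size_t bit = prev_bm->get_next_one_offset(0, limit);
> >           bit < limit;
> >           bit = prev_bm->get_next_one_offset(bit + 1, limit)) {
> >        // 'bit' corresponds to the start of a live object in the prev
> >        // bitmap; flag the range it covers in the global liveness data.
> >        mark_object_range_live(live_bm, bit);  // hypothetical helper
> >      }
> >    }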
> > 
> > In total, this separation of tasks is faster (it lowers total
> > concurrent cycle time) than doing all of this work at once, for the
> > following reasons:
> > 
> >    - I did not observe any throughput regressions with this change;
> > actually, throughput of some large applications even increases with
> > it (not taking into account that you could now increase the heap
> > size, since not as much memory is taken up by these additional
> > bitmaps).
> > 
> >    - the concurrent phase to prepare for the next marking is much
> > shorter now, since we do not need to clear lots of memory any more.
> > 
> >    - the remark pause can be much faster (I have measurements showing
> > a decrease of an order of magnitude on large applications, where this
> > aggregation phase dominates the remark pause).
> > 
> >    - startup time and footprint naturally decrease significantly,
> > particularly on large systems.
> > 
> > As a nice side effect, the change removes a significant number of
> > lines of code.
> > 
> > There is a follow-up change to move (and later clean up) the still
> > remaining data structures required for scrubbing into extra
> > classes,
> > since they will be used more cleverly in the future (JDK-8151386).
> > 
> > There will be another follow-up change (no CR yet) to fix the use of
> > an excessive number of parallel GC threads for clearing the liveness
> > count data.
> > 
> > The change is based on JDK-8151614, JDK-8151126 (I do not think it
> > conflicts with that actually), and JDK-8151534 (array allocator
> > refactoring).
> > 
> > CR:
> > https://bugs.openjdk.java.net/browse/JDK-8077144
> > Webrev:
> > http://cr.openjdk.java.net/~tschatzl/8077144/webrev.2/
> > Testing:
> > jprt, vm.gc, kitchensink, some perf benchmarks
> > 
> > Thanks,
> >    Thomas
> > 


