RFR (M): 8077144: Concurrent mark initialization takes too long
Thomas Schatzl
thomas.schatzl at oracle.com
Mon Mar 14 13:15:34 UTC 2016
Hi all,
could I have reviews for this from-scratch solution to the problem that
G1 startup takes too long?
G1 currently uses per-mark-thread liveness bitmaps that span the entire
heap in order to ultimately determine, on a per-card basis, which areas
of the heap contain live objects.
This information is needed later for scrubbing the remembered sets.
Basically, in addition to updating the prev bitmap required for SATB,
every marking thread also marks, for every live object, all bits
corresponding to the area the object covers on its own per-thread
liveness bitmap.
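As a minimal sketch of that per-object step (MarkBitmap, set_range and
note_live_object are illustrative names for this example, not the actual
HotSpot code):

  // Sketch only: what each marking thread does per live object, in
  // addition to setting the single prev-bitmap bit at the object start.
  #include <cstddef>
  #include <vector>

  struct MarkBitmap {
    std::vector<bool> bits;                      // one bit per heap word
    void set_range(size_t from, size_t to) {     // marks words [from, to)
      for (size_t w = from; w < to; ++w) bits[w] = true;
    }
  };

  // Called by a marking thread on its own bitmap for every live object:
  void note_live_object(MarkBitmap& my_bitmap,
                        size_t obj_start_word, size_t obj_size_words) {
    // Unlike the prev bitmap (one bit at the object start), the
    // per-thread liveness bitmap covers the whole object extent.
    my_bitmap.set_range(obj_start_word, obj_start_word + obj_size_words);
  }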
During the remark pause, this information is aggregated into (two)
global bitmaps ("Liveness Count Data"), augmented with some more
liveness information in the cleanup pause, and finally used for
scrubbing the remembered sets.
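Conceptually, that aggregation folds the per-thread bitmaps together
while counting live words per card; a rough sketch under that assumption
(all names are illustrative, and the actual code uses two global bitmaps
rather than the single one shown here):

  #include <cstddef>
  #include <vector>

  // Sketch only: fold the per-thread bitmaps into one global bitmap and
  // a per-card live word count. This walk over N heap-sized bitmaps is
  // the work that can dominate the remark pause.
  void aggregate_live_data(const std::vector<std::vector<bool> >& per_thread,
                           std::vector<bool>& global_bitmap,
                           std::vector<unsigned>& live_words_per_card,
                           size_t words_per_card) {
    for (size_t t = 0; t < per_thread.size(); ++t) {
      for (size_t w = 0; w < per_thread[t].size(); ++w) {
        if (per_thread[t][w] && !global_bitmap[w]) {
          global_bitmap[w] = true;
          live_words_per_card[w / words_per_card]++;
        }
      }
    }
  }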
The main problems with that solution:
- the per-mark-thread data structures take up a lot of space. E.g. with
64 mark threads, they sum to the same size as the Java heap (see the
arithmetic right after this list). And you only need that many mark
threads when the heap is big; at those heap sizes, needing that much
more memory hurts a lot.
- management of these additional data structures is costly: it takes a
long time to initialize them and to regularly clear them. The increased
startup time is actually what prompted this issue.
- it takes a significant amount of time to aggregate this data in the
remark pause.
- it slows down marking: the combined bitmap update (of the prev bitmap
and these per-thread bitmaps) is slower than doing these phases
separately.
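To make the footprint in the first point concrete (assuming the usual
one bit per 64-bit heap word): each heap-spanning bitmap needs
heap_size / 64 bytes, so 64 per-thread bitmaps sum to
64 * heap_size / 64 = heap_size, i.e. as much additional memory as the
Java heap itself. On a 100 GB heap that is another 100 GB just for
these bitmaps.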
The proposed solution removes the additional per-thread mark bitmaps and
recreates this information from the (complete) prev bitmap in an extra
concurrent phase after the Remark pause.
This can be done because the prev bitmap does not change any more after
Remark.
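A rough sketch of that extra phase, assuming an object-size lookup is
available for marked object starts (all names here are illustrative, not
the actual webrev code):

  #include <cstddef>
  #include <vector>

  // Sketch only: after Remark the prev bitmap is stable, so one walk
  // over its marked object starts can recreate the per-card live data
  // without any per-thread bitmaps.
  void rebuild_live_data(const std::vector<bool>& prev_bitmap, // 1 bit/word,
                                                               // set at object starts
                         size_t (*obj_size_in_words)(size_t),  // assumed lookup
                         std::vector<unsigned>& live_words_per_card,
                         size_t words_per_card) {
    for (size_t w = 0; w < prev_bitmap.size(); ++w) {
      if (!prev_bitmap[w]) continue;             // not a live object start
      size_t end = w + obj_size_in_words(w);
      for (size_t i = w; i < end; ++i)           // credit every covered card
        live_words_per_card[i / words_per_card]++;
    }
  }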
In total, this separation of the tasks is faster (it lowers concurrent
cycle time) than doing all of this work at once, for the following
reasons:
- I did not observe any throughput regressions with this change:
actually, throughput of some large applications even increases with it
(not taking into account that you could now increase the heap size,
since not so much memory is taken up by these additional bitmaps).
- the concurrent phase to prepare for the next marking is much
shorter now, since we do not need to clear lots of memory any more.
- the remark pause can be much faster (I have measurements showing a
decrease of an order of magnitude on large applications, where this
aggregation phase dominates the remark pause).
- startup time and footprint naturally decrease significantly,
particularly on large systems.
As a nice side-effect, the change effectively removes a significant
amount of LOC.
There is a follow-up change to move (and later clean up) the still
remaining data structures required for scrubbing into extra classes,
since they will be used more cleverly in the future (JDK-8151386).
There will be another follow-up change, without a CR yet, to fix the use
of an excessive number of parallel GC threads for clearing the liveness
count data.
The change is based on JDK-8151614, JDK-8151126 (I do not think it
conflicts with that actually), and JDK-8151534 (array allocator
refactoring).
CR:
https://bugs.openjdk.java.net/browse/JDK-8077144
Webrev:
http://cr.openjdk.java.net/~tschatzl/8077144/webrev.2/
Testing:
jprt, vm.gc, kitchensink, some perf benchmarks
Thanks,
Thomas