RFR (XXL): 8213108: Improve work distribution during remembered set scan

Fri May 24 12:58:22 UTC 2019

Hi all,

  can I have reviews for this change to the way we scan cards from the
various heap roots (remembered sets, log buffers, hcc) during garbage
collection to improve performance?

Currently, threads iterate over the remembered sets of all regions in
the current collection set increment, try to claim batches of cards,
and only just before scanning a card try to detect whether that card
has already been scanned. With this change, we instead first create a
combined set of heap roots (i.e. cards to scan), and then do work
distribution/iteration over them by claiming areas within this combined
set of heap roots.
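
To illustrate the two steps described above (merge all heap roots into
one set, then let workers claim areas of it), here is a minimal sketch.
It is not code from the webrev: CardTable, merge_heap_roots,
scan_claimed_chunks and the chunk size are hypothetical names chosen for
illustration, and the merge is single-threaded for brevity.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified model; none of these names come from the webrev.
enum CardValue : std::uint8_t { kClean = 0, kToScan = 1 };

struct CardTable {
  std::vector<std::uint8_t> cards;         // one byte per card
  std::atomic<std::size_t> next_chunk{0};  // index of the next unclaimed chunk
  std::size_t chunk_size;                  // number of cards per claimable chunk

  CardTable(std::size_t num_cards, std::size_t cards_per_chunk)
      : cards(num_cards, kClean), chunk_size(cards_per_chunk) {
    assert(cards_per_chunk > 0);
  }
};

// Step 1: merge every source of heap roots (remembered sets, log buffers, ...)
// into the card table. Marking is idempotent, so a card that appears in
// several sources ends up in the combined set only once.
void merge_heap_roots(CardTable& ct,
                      const std::vector<std::vector<std::size_t>>& sources) {
  for (const auto& source : sources) {
    for (std::size_t card : source) {
      assert(card < ct.cards.size());
      ct.cards[card] = kToScan;
    }
  }
}

// Step 2: a worker repeatedly claims the next chunk of the card table with an
// atomic increment and "scans" (here: counts) the marked cards inside it.
// Returns the number of cards this worker scanned.
std::size_t scan_claimed_chunks(CardTable& ct) {
  const std::size_t num_chunks =
      (ct.cards.size() + ct.chunk_size - 1) / ct.chunk_size;
  std::size_t scanned = 0;
  for (;;) {
    const std::size_t chunk =
        ct.next_chunk.fetch_add(1, std::memory_order_relaxed);
    if (chunk >= num_chunks) {
      break;  // every chunk has been claimed by some worker
    }
    const std::size_t begin = chunk * ct.chunk_size;
    const std::size_t end = std::min(begin + ct.chunk_size, ct.cards.size());
    for (std::size_t i = begin; i < end; ++i) {
      if (ct.cards[i] == kToScan) {
        ++scanned;
      }
    }
  }
  return scanned;
}
```

Since every chunk is claimed by exactly one worker, no per-card "already
scanned" check is needed in this sketch; several threads could call
scan_claimed_chunks on the same table at the same time.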

This is similar to other collectors that just scan the old gen card
table. However G1 can do a little better, as we also collect
approximate location of cards to scan (like in a second-level card
table), and skip large areas guaranteed to not contain any
interesting cards.
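
As a rough illustration of the "second-level card table" idea, the
sketch below keeps one flag per chunk of cards and skips chunks whose
flag is not set. This is an assumption-laden model, not the data
structure in the webrev: TwoLevelCards, ScanStats, scan and the chunk
size of 64 cards are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical chunk size; the real granularity is not taken from the webrev.
constexpr std::size_t kCardsPerChunk = 64;

struct TwoLevelCards {
  std::vector<std::uint8_t> cards;   // first level: one byte per card
  std::vector<bool> chunk_has_work;  // second level: one flag per chunk

  explicit TwoLevelCards(std::size_t num_cards)
      : cards(num_cards, 0),
        chunk_has_work((num_cards + kCardsPerChunk - 1) / kCardsPerChunk,
                       false) {}

  // Record a card to scan and remember its approximate location.
  void mark(std::size_t card) {
    assert(card < cards.size());
    cards[card] = 1;
    chunk_has_work[card / kCardsPerChunk] = true;
  }
};

struct ScanStats {
  std::size_t cards_found;     // marked cards encountered
  std::size_t cards_examined;  // card table entries actually looked at
};

// Walk the second level first; only chunks flagged as containing work are
// examined card by card, everything else is skipped wholesale.
ScanStats scan(const TwoLevelCards& t) {
  ScanStats stats{0, 0};
  for (std::size_t chunk = 0; chunk < t.chunk_has_work.size(); ++chunk) {
    if (!t.chunk_has_work[chunk]) {
      continue;  // guaranteed to contain no interesting cards
    }
    const std::size_t begin = chunk * kCardsPerChunk;
    std::size_t end = begin + kCardsPerChunk;
    if (end > t.cards.size()) {
      end = t.cards.size();
    }
    for (std::size_t i = begin; i < end; ++i) {
      ++stats.cards_examined;
      if (t.cards[i] != 0) {
        ++stats.cards_found;
      }
    }
  }
  return stats;
}
```

With three marked cards spread over three chunks of a 1024-card table,
this scan looks at 192 card entries instead of all 1024.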

Further, this allows G1 to more easily ignore such areas later on.

Implementation wise, this combined heap roots set is materialized on
the card table at this moment. This has the advantage that the card
table is already allocated anyway, it is easy to modify concurrently,
and the change itself does not add any overhead in clearing it later
either, i.e. the number of regions to clean at the end is the same as
before.

This speeds up garbage collection in most cases significantly:
- specjbb2015 critical-jops are improved by >12% in our testing in some
setups
- on the BigRAMTester microbenchmark[0], stressing remembered set
scanning, maximum pause times are reduced by 40%+ for mixed gcs (i.e.
almost halving these pause times), and 20% for normal young gcs.

There are unfortunately some situations where the added, required heap
roots merging phase will cause some significant regressions by just
being there.
These are the cases where total pause time is already very low (3-5ms)
and there is not much to do at all during that root merging, which
still takes 0.1-0.3ms, mainly to spin up and tear down worker threads.
There are plans to fix this by e.g. doing pre-merging of parts of the
heap roots for young-only collection.
However the current remembered set implementation inhibits this; I plan
to fix this shortcoming later in follow-ups of a remembered set data
structure rewrite ([1]).
I believe the change is worth this small (in absolute terms) regression
for cases that other collectors might handle better anyway at this
time, because as soon as pause times are in the range of 10ms (or even
lower if there are *some* heap roots), and there are actually some heap
roots to merge, the additional phase quickly amortizes itself.
The change does not only improve performance when running the VM with
many threads; some runs with low thread counts showed similar
improvements compared to runs without the change.

This change significantly alters the log output for the garbage
collection; there is a comment in the CR [2] that describes it in
detail.

For reviewing, I recommend reading the comment at the start of
g1RemSet.cpp [3] to get acquainted with the algorithm and the new
terminology I used. I will prepare a release note about this.

There will be some minor follow-up changes that micro-optimize the code
a bit (e.g. [4]). They were separated to keep complexity down a bit.

I would really like to get this into jdk13 if possible. :)

CR:
https://bugs.openjdk.java.net/browse/JDK-8213108
Webrev:
http://cr.openjdk.java.net/~tschatzl/8213108/webrev/
Testing:
hs-tier1-5(many times), hs-tier6-8, many perf test runs

Thanks,
  Thomas

[0] https://bugs.openjdk.java.net/browse/JDK-8152438
[1] https://bugs.openjdk.java.net/browse/JDK-8017163
[2] https://bugs.openjdk.java.net/browse/JDK-8213108?focusedCommentId=14266190&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14266190
[3] http://cr.openjdk.java.net/~tschatzl/8213108/webrev/src/hotspot/share/gc/g1/g1RemSet.cpp.frames.html
[4] https://bugs.openjdk.java.net/browse/JDK-8224741