RFR 8087198: G1 card refinement: batching, sorting, joining, prefetching

Mon Jun 15 11:11:24 UTC 2015

Hi,

  some random comments :)

On Thu, 2015-06-11 at 11:20 +0000, Erik Österlund wrote:
> Hi guys,
> 
> I have been experimenting a bit with G1 card refinement. Thought I¹d check
> if anyone finds this interesting. ;)

Sure interesting even if only for the refactoring/cleaning up/batching the
memory barriers part.

> Basically we currently process cards one by one even though they always
> come in groups of at the very least G1UpdateBufferSize. The cards may or
> may not follow a particularly cache friendly order, depending on how lucky
> we are, and need to do things like seek the start of the card quite often,
> as well as cutting arrays at the card intervals, even though many times
> the batches of cards could be linearly scanned and follow a contiguous
> order.

The question is, is this true? I remember doing some experiments on that
years ago, with different applications, and it did not seem to be worth
doing because the cards were not contiguous. Iirc there were almost always
just very little consecutive runs of cards (within a complete buffer) that did
not amount to much.
So sorting seemed more expensive than the gain due to batching them for
reference analysis.

I remember more recent experiments on the refinement buffers generated by
the evacuation (in G1 remembered set updates during GC are not immediately
reflected in the remembered set, but G1 creates refinement entries instead),
and there the situation seemed even worse.

There is always the advantage of batching them to do less memory barriers
like CMS does.

> What my patch does is to first clean all the G1UpdateBufferSize cards and

This is just a comment from briefly looking at the change: I would have
first only sorted and merged, grouped into regions, and then applied the
various (mostly region-based filters) on the resulting MemRegions.

While it is possible to do unnecessary work here, the region-based
filtering seem to not apply very often, mostly covering races. Resulting
in a net win if we can reduce the cards to a few MemRegions (and that
filtering would be applied less often).

But I do not really know what is better.

> potentially swap some with the hot card cache. Then after that, the cards
> are sorted in address order for nicer cache interactions.

> Then the cards
> are processed 16 at a time. The cards are joined together if consecutive,
> and intervals with re-dirtied cards are split so that finding the start of
> the card does not happen so often.

The buffering and chunking itself (and the code refactoring) may help on its
own.

>  The start of the next scanning interval
> is prefetched while the current interval is scanned for references.

I am not sure prefetching helps a lot if I understand correctly, considering
that the amount of work done even within a single card is pretty large. I.e.
there is a certain risk that prefetching just fills up hardware resources,
e.g. load buffers. But I do not know :)

> Webrev with demo:
> http://cr.openjdk.java.net/~eosterlund/g1_experiments/card_refinement/webre
> v.00/
> 
> RFE ID:
> https://bugs.openjdk.java.net/browse/JDK-8087198
> 
> In the DaCapo benchmarks I could find for instance 3.2% speedup in both fop
> (configured to achieve 1 ms latency with a small 64M heap) and h2 3%
> speedup running default settings on 512M heap. The measurements were taken
> after 40 warmup iterations, and then sampled the average of 10 iterations.
> 
> If somebody is interested in sponsoring something like this, it would be
> interesting to see how this performs in your favourite benchmarks. I
> suspect this kind of stuff is
> more important when you deal with larger heaps.

I will give it a try on some other benchmarks. I totally unscientifically
tried it on the example program in JDK-8062128 (10 runs each with and
without the changes), and that seemed to improve ~2.5%, barely statistically
significant at alpha=0.05.
I expect this to be more like an upper bound in improvement.

Thanks,
  Thomas