RFR 8087198: G1 card refinement: batching, sorting, joining, prefetching

Erik Österlund erik.osterlund at lnu.se
Mon Jun 15 15:08:12 UTC 2015


Hi Thomas,

Thanks for looking at my patch!

On 15/06/15 04:11, Thomas Schatzl <thomas.schatzl at oracle.com> wrote:

>Hi,
>
>  some random comments :)
>
>On Thu, 2015-06-11 at 11:20 +0000, Erik Österlund wrote:
>> Hi guys,
>> 
>> I have been experimenting a bit with G1 card refinement. Thought I'd
>>check
>> if anyone finds this interesting. ;)
>
>Sure, interesting even if only for the refactoring/cleaning up/batching the
>memory barriers part.

Thanks! I forgot to mention some of the cleanup that comes as a bonus. For
instance, the closure used for refinement was passed around everywhere and
globally accessible from the G1 heap, for a single reason: forwarding a
getter and setter for whether concurrency is active from the G1 heap to
the closure. That seemed really messed up to me, so I made it a flag on
the G1 heap instead, removed the closure as a global variable, and let the
closure ask the G1 heap for the flag. I don't know why it was done this
way, but I assumed it was a mistake and fixed it while I was at it. I also
need the closures to have private state for buffering.
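Roughly the shape I ended up with, as a quick sketch (class and method
names here are made up for illustration, not the actual HotSpot
identifiers):

// Sketch only: illustrative names, not the real G1 classes.
class G1Heap {
  bool _refinement_is_concurrent = false;
public:
  void set_refinement_concurrent(bool c) { _refinement_is_concurrent = c; }
  bool refinement_is_concurrent() const  { return _refinement_is_concurrent; }
};

// The refinement closure is no longer globally accessible; each worker
// owns one and asks the heap for the concurrency flag instead of having
// the flag forwarded through the heap to a global closure.
class RefineCardClosure {
  G1Heap* _heap;
  // Private per-closure buffering state can live here too.
public:
  explicit RefineCardClosure(G1Heap* heap) : _heap(heap) {}
  void do_card(unsigned char* card) {
    bool concurrent = _heap->refinement_is_concurrent();
    // ... refine the card, taking 'concurrent' into account ...
    (void)card; (void)concurrent;
  }
};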

>
>> Basically we currently process cards one by one even though they always
>> come in groups of at the very least G1UpdateBufferSize. The cards may or
>> may not follow a particularly cache friendly order, depending on how
>>lucky
>> we are, and need to do things like seek the start of the card quite
>>often,
>> as well as cutting arrays at the card intervals, even though many times
>> the batches of cards could be linearly scanned and follow a contiguous
>> order.
>
>The question is, is this true? I remember doing some experiments on that
>years ago, with different applications, and it did not seem to be worth
>doing because the cards were not contiguous. Iirc there were almost always
>just very few consecutive runs of cards (within a complete buffer)
>that did
>not amount to much.
>So sorting seemed more expensive than the gain due to batching them for
>reference analysis.

I don't know if it is always true that the intervals are contiguous. In
some applications they are (yes, I checked; for instance h2 in DaCapo has
almost only contiguous cards from humongous arrays that seem to be swept
over), which is sweet.

And when they are not and the intervals are broken, I use prefetching
instead. So regardless of the case, caching interactions are improved.

I benchmarked the cost of the sorting and could not find a difference at
all. I find it very cheap, as all the data is local to the closure and L1
cached. The cost of missing cache lines when scanning cards that could
have been contiguous intervals, however… ouch. (I know, I'm exaggerating.
But still…!)
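For a sense of what the sort amounts to, here is a minimal sketch (plain
C++, just illustrating the idea, not the patch itself): the batch is only
G1UpdateBufferSize card pointers sitting in the closure's local buffer, so
sorting it by address stays in L1.

#include <algorithm>
#include <cstddef>

// Sketch: sort a local batch of card addresses so the heap words behind
// them are later scanned in address order. The batch fits easily in L1,
// so the sort is cheap compared to the cache misses it saves.
void sort_card_batch(unsigned char** cards, size_t n) {
  std::sort(cards, cards + n);
}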

>I remember more recent experiments on the refinement buffers generated by
>the evacuation (in G1 remembered set updates during GC are not immediately
>reflected in the remembered set, but G1 creates refinement entries
>instead),
>and there the situation seemed even worse.

My patch currently only applies to card refinement by the concurrent
refinement threads and mutators. I didn’t touch the code for scanning the
cards during evacuation pauses. But I expect similar stuff can be done
there.

Also, maybe there could be a flag for reflecting the updates immediately
so this work doesn't have to be done, for users who care more about
performance than about perfect latencies. Do you happen to know, roughly
speaking, how slow the updates to the remembered sets are compared to the
cost of scanning the cards? I'm curious where the bottlenecks are so I
know where to focus optimizations.

>There is always the advantage of batching them to do less memory barriers
>like CMS does.

Yeah, I also experimented with some ADS tricks that allowed me to remove
the G1 write barrier fences and join 2 conditional jumps into 1, which led
to further nice speedups on my machine. But Dave Dice told me fences are
becoming really cheap and ADS doesn't scale well on multi-socket machines,
so I left that out of the patch, assuming my old MacBook isn't
representative of what customers will have in a few years' time. :p

So even this amortized fencing may or may not actually matter in a few
years, but I couldn't help doing it, because it felt wrong to fence after
cleaning every card instead of once for the whole batch of cards. My fence
phobia kicked in!
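To make the amortized fencing concrete, here is a rough sketch of the idea
(standard C++ atomics and stand-in types, not the actual barrier code in
the patch): clean the whole batch of cards first, issue a single full
fence, and only then scan the covered heap ranges.

#include <atomic>
#include <cstddef>

typedef unsigned char CardValue;
static const CardValue kCleanCard = 0;

// Sketch: one StoreLoad-style fence per batch of cleaned cards instead of
// one fence after cleaning every individual card.
void clean_batch_then_fence(CardValue** cards, size_t n) {
  for (size_t i = 0; i < n; i++) {
    *cards[i] = kCleanCard;  // mark each card in the batch clean
  }
  std::atomic_thread_fence(std::memory_order_seq_cst);  // single fence
  // ... now scan the heap ranges covered by the batch ...
}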

>
>> What my patch does is to first clean all the G1UpdateBufferSize cards
>>and
>
>This is just a comment from briefly looking at the change: I would have
>first only sorted and merged, grouped into regions, and then applied the
>various (mostly region-based filters) on the resulting MemRegions.
>
>While it is possible to do unnecessary work here, the region-based
>filtering seems to not apply very often, mostly covering races. Resulting
>in a net win if we can reduce the cards to a few MemRegions (and that
>filtering would be applied less often).
>
>But I do not really know what is better.

I think you might be right, if I understand you correctly. It was a bit
awkward to extract from the current code though, as it first does the
region checks and then exchanges cards with the hot card cache, meaning
that the cards sent in are not the ones you get back out. That makes such
a change possible but a bit awkward: if we preprocess the card intervals
first to reduce the time spent on region checks and then get other cards
back from the cache, we have to process those again. Meh! So I didn't do
it for now, but maybe it's possible with a bit of rearranging.

>
>> potentially swap some with the hot card cache. Then after that, the
>>cards
>> are sorted in address order for nicer cache interactions.
>
>> Then the cards
>> are processed 16 at a time. The cards are joined together if
>>consecutive,
>> and intervals with re-dirtied cards are split so that finding the start
>>of
>> the card does not happen so often.
>
>The buffering and chunking itself (and the code refactoring) may help on
>its
>own.

:)
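For concreteness, here is a rough sketch of the joining step described
above (stand-in types, not the patch code): after sorting, consecutive
cards are merged into intervals that can be scanned linearly, with the
interval length capped so the work stays in small chunks. Splitting out
re-dirtied cards is omitted here.

#include <cstddef>
#include <vector>

// Stand-in for a MemRegion: a [start, end) interval of card indices.
struct Interval { size_t start; size_t end; };

static const size_t kMaxCardsPerChunk = 16;  // process 16 cards at a time

std::vector<Interval> join_cards(const std::vector<size_t>& sorted_cards) {
  std::vector<Interval> out;
  for (size_t i = 0; i < sorted_cards.size(); ) {
    Interval iv = { sorted_cards[i], sorted_cards[i] + 1 };
    ++i;
    // Extend the interval while the next card is adjacent and the cap
    // has not been reached.
    while (i < sorted_cards.size() && sorted_cards[i] == iv.end &&
           iv.end - iv.start < kMaxCardsPerChunk) {
      iv.end++;
      ++i;
    }
    out.push_back(iv);
  }
  return out;
}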

>
>>  The start of the next scanning interval
>> is prefetched while the current interval is scanned for references.
>
>I am not sure prefetching helps a lot if I understand correctly,
>considering
>that the amount of work done even within a single card is pretty large.
>I.e.
>there is a certain risk that prefetching just fills up hardware resources,
>e.g. load buffers. But I do not know :)

I tried with and without prefetching; it was faster with prefetching in my
experiments. The reasoning is that it needs to load this data anyway,
there is no getting around that. So it might as well prefetch it so at
least it doesn't have to stall.
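The overlap looks roughly like this (sketch only; __builtin_prefetch is
the GCC/Clang intrinsic, and the types are stand-ins): the start of
interval i+1 is prefetched before interval i is scanned, so its loads are
in flight while the current interval's references are being analysed.

#include <cstddef>

struct Interval { const void* start; const void* end; };

// Placeholder for the reference analysis over [iv.start, iv.end).
void scan_interval(const Interval& iv) { (void)iv; }

void scan_with_prefetch(const Interval* intervals, size_t n) {
  for (size_t i = 0; i < n; i++) {
    if (i + 1 < n) {
      __builtin_prefetch(intervals[i + 1].start);  // start fetching i+1
    }
    scan_interval(intervals[i]);  // scan i while i+1 is loading
  }
}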

>
>> Webrev with demo:
>> 
>>http://cr.openjdk.java.net/~eosterlund/g1_experiments/card_refinement/web
>>re
>> v.00/
>> 
>> RFE ID:
>> https://bugs.openjdk.java.net/browse/JDK-8087198
>> 
>> In the DaCapo benchmarks I could find for instance 3.2% speedup in both
>>fop
>> (configured to achieve 1 ms latency with a small 64M heap) and h2 3%
>> speedup running default settings on 512M heap. The measurements were
>>taken
>> after 40 warmup iterations, and then sampled the average of 10
>>iterations.
>> 
>> If somebody is interested in sponsoring something like this, it would be
>> interesting to see how this performs in your favourite benchmarks. I
>> suspect this kind of stuff is
>> more important when you deal with larger heaps.
>
>I will give it a try on some other benchmarks. I totally unscientifically
>tried it on the example program in JDK-8062128 (10 runs each with and
>without the changes), and that seemed to improve ~2.5%, barely
>statistically
>significant at alpha=0.05.
>I expect this to be more like an upper bound in improvement.

2.5% is a good start, eh! Thanks Thomas for taking my stuff for a spin! :)

Oh, my totally scientific gut feeling tells me that lock bottlenecks on
remembered sets might be a problem in that bug, and that perhaps card
batching could help there too: batch the cards, locally record the changes
that need to be made to each remembered set, then lock the sets one by one
and perform a bunch of updates in each critical section instead of taking
the lock for every reference. I trust my gut feeling!
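Purely as a sketch of that gut feeling (hypothetical types and names, not
anything in G1 today): accumulate the pending entries per remembered set
locally, then take each set's lock once and apply all of its entries in a
single critical section.

#include <cstddef>
#include <map>
#include <mutex>
#include <vector>

struct RemSet {
  std::mutex lock;
  void add_entry(const void* from) { (void)from; /* record the reference */ }
};

// Sketch: one lock acquisition per remembered set per batch, instead of
// locking for every reference individually.
void apply_batched_updates(
    const std::map<RemSet*, std::vector<const void*> >& pending) {
  for (const auto& entry : pending) {
    RemSet* rs = entry.first;
    std::lock_guard<std::mutex> guard(rs->lock);
    for (const void* from : entry.second) {
      rs->add_entry(from);
    }
  }
}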

Thanks,
/Erik

>
>Thanks,
>  Thomas
>
>
>
>



