RFR 8087198: G1 card refinement: batching, sorting, joining, prefetching

Thomas Schatzl thomas.schatzl at oracle.com
Tue Jun 16 09:08:19 UTC 2015


Hi Erik,

On Mon, 2015-06-15 at 15:08 +0000, Erik Österlund wrote:
> Hi Thomas,
> 
> Thanks for looking at my patch!

np.

> 
> On 15/06/15 04:11, Thomas Schatzl <thomas.schatzl at oracle.com> wrote:
> 
> >Hi,
> >
> >  some random comments :)
> >
> >On Thu, 2015-06-11 at 11:20 +0000, Erik Österlund wrote:
> >> Hi guys,
> >> 
> >> I have been experimenting a bit with G1 card refinement. Thought I'd
> >> check if anyone finds this interesting. ;)
> >
> >Sure, interesting, even if only for the refactoring/cleanup/memory
> >barrier batching part.
> 
> Thanks! I forgot to mention some of the cleanup that comes as a bonus.
> For instance, the closure used for refinement was passed around
> everywhere and globally accessible from the G1 heap, for a single
> reason: forwarding a getter and setter for whether concurrency is
> activated from the G1 heap to the closure. This seemed really convoluted
> to me, so I made it a flag of the G1 heap instead, removed the closure
> as a global variable, and let the closure ask the G1 heap for this flag.
> I don’t know why it was done this way, but I assumed it was a mistake
> and fixed it while fixing other stuff. I also need the closures to have
> private state for buffering.

I think there is a CR for removing the closure.

> >> Basically, we currently process cards one by one even though they
> >> always come in groups of at least G1UpdateBufferSize. The cards may
> >> or may not follow a particularly cache-friendly order, depending on
> >> how lucky we are, and we need to do things like seek the start of the
> >> card quite often, as well as cut arrays at the card intervals, even
> >> though many times the batches of cards follow a contiguous order and
> >> could be scanned linearly.
> >
> >The question is, is this true? I remember doing some experiments on
> >that years ago, with different applications, and it did not seem to be
> >worth doing because the cards were not contiguous. IIRC there were
> >almost always only very short consecutive runs of cards (within a
> >complete buffer), which did not amount to much.
> >So sorting seemed more expensive than the gain from batching them for
> >reference analysis.
> 
> I don’t know if it is always true that intervals are contiguous. In some
> applications they are (yes, I checked; for instance, h2 of DaCapo has
> almost only contiguous cards from humongous arrays that seem to be swept
> over), which is sweet.
> 
> And when they are not contiguous and the intervals are broken, I use
> prefetching instead. So in either case, caching behaviour is improved.
> 
> I benchmarked the cost of the sorting and could not find a difference at
> all. I find it very cheap, as all the data is local to the closure and
> L1-cached. The cost of missing cache lines when scanning cards that
> could have been contiguous intervals is much higher, however.

I expect that the hardware prefetcher will automatically pull in the
next line(s) anyway.
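
For reference, the sort-and-merge step you describe could look roughly
like the following. This is a minimal sketch with made-up types and card
size, not the actual patch code:

  #include <algorithm>
  #include <cstddef>
  #include <vector>

  const size_t card_size = 512;          // bytes covered by one card

  struct Interval {
    char* start;
    char* end;                           // exclusive
  };

  // Sort a batch of dirty card addresses and merge adjacent cards
  // into contiguous intervals that can then be scanned linearly.
  std::vector<Interval> merge_cards(std::vector<char*>& cards) {
    std::vector<Interval> intervals;
    if (cards.empty()) return intervals;
    std::sort(cards.begin(), cards.end());
    Interval cur = { cards[0], cards[0] + card_size };
    for (size_t i = 1; i < cards.size(); i++) {
      if (cards[i] == cur.end) {
        cur.end += card_size;            // adjacent: extend the run
      } else if (cards[i] < cur.end) {
        // duplicate card within the current run: skip it
      } else {
        intervals.push_back(cur);        // gap: start a new interval
        cur.start = cards[i];
        cur.end = cards[i] + card_size;
      }
    }
    intervals.push_back(cur);
    return intervals;
  }

How much this buys then hinges on how long the merged runs actually
turn out to be, which is exactly the contiguity question above.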

> >I remember more recent experiments on the refinement buffers generated
> >by the evacuation (in G1, remembered set updates during GC are not
> >immediately reflected in the remembered set; instead, G1 creates
> >refinement entries), and there the situation seemed even worse.
> 
> My patch currently only applies to card refinement by the concurrent
> refinement threads and mutators. I didn’t touch the code for scanning the
> cards during evacuation pauses. But I expect similar stuff can be done
> there.
> 
> Also, maybe there could be a flag for reflecting the updates immediately
> so the work doesn’t have to be done,

That has been available before, but removed because it was buggy
(JDK-8052172). It could have been fixed, but it was not worth it.

> for users who care more about
> optimizing performance than about perfect latencies. Do you happen to
> know how slow the updates to the remembered sets are compared to the
> cost of scanning the cards, roughly speaking? I am curious where the
> bottlenecks are so I know where to focus optimizations.

Remembered set updates are extremely slow in some cases, particularly
under contention. It's almost always better to let the locking happen
in the concurrent phase :)

Some details: by now you should know that the current design has three
levels: for every source region we store remembered set entries either
in the sparse, fine or coarse per-region remembered set. First we try to
fill up the sparse rset, then the fine one, and finally we fall back to
coarse.
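
As a rough illustration of that fallback, here is a sketch; the types,
capacities and promotion policy are made up for illustration and are
not the actual HotSpot code:

  #include <bitset>
  #include <cstddef>

  const size_t sparse_capacity  = 4;     // hypothetical capacities,
  const size_t cards_per_region = 1024;  // chosen only for the sketch

  // Per-source-region remembered set with three precision levels.
  struct PerRegionRemSet {
    size_t sparse[sparse_capacity];      // level 1: a few card indices
    size_t sparse_len = 0;
    std::bitset<cards_per_region>* fine = nullptr; // level 2: card bitmap
    bool coarse = false;                 // level 3: whole-region marker

    void add_card(size_t card) {
      if (coarse) return;                // coarsest level: nothing to do
      if (sparse_len < sparse_capacity) {
        sparse[sparse_len++] = card;     // level 1: sparse table
        return;
      }
      if (fine == nullptr) {             // promote: copy sparse entries
        fine = new std::bitset<cards_per_region>();
        for (size_t i = 0; i < sparse_len; i++) fine->set(sparse[i]);
      }
      fine->set(card);                   // level 2: set the card's bit
      // A real implementation coarsens once too many fine-grained
      // tables exist; the sketch leaves that out.
    }
  };

Note that the sketch also leaves out the locking entirely, which is
exactly where the problem below comes in.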

Most per-region remembered sets are sparse rsets. The problem is that
adding to one requires the current implementation to take a lock, which
performs very badly when contended.

It should be doable to remove it, i.e. replace it with a lock-free
implementation (e.g. by making a mechanism like the one in the proposed
fix for JDK-8047328 lock-free), and you may want to have a try at that.
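
A lock-free sparse add could, for example, claim slots with an atomic
counter. A sketch, assuming entries are only ever added concurrently;
promotion to the fine level and safe publication to concurrent readers
are deliberately left out:

  #include <atomic>
  #include <cstddef>

  const size_t sparse_capacity = 4;

  struct LockFreeSparse {
    std::atomic<size_t> len{0};
    size_t cards[sparse_capacity];

    // Returns false when full; the caller then falls back to the
    // fine-grained level. The slot is claimed first and filled
    // afterwards, so readers must tolerate a claimed-but-not-yet-
    // written slot (e.g. by pre-filling slots with a sentinel value).
    bool try_add(size_t card) {
      size_t i = len.fetch_add(1, std::memory_order_relaxed);
      if (i >= sparse_capacity) return false;
      cards[i] = card;
      return true;
    }
  };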

However, we are also looking into remembered sets right now as the
current implementation has a few other scalability problems that we
would like to have addressed for the future.

> >There is always the advantage of batching them to issue fewer memory
> >barriers, like CMS does.
> 
> Yeah, I also experimented with some ADS tricks that allowed me to remove
> the G1 write barrier fences and join two conditional jumps into one,
> which led to further nice speedups on my machine. But Dave Dice told me
> fences are becoming really cheap and ADS doesn’t scale well on
> multi-socket machines, so I left that out of the patch, assuming my old
> MacBook isn’t representative of what customers will have in a few years’
> time. :p

My workstation has 40 threads :)

> So even this amortized fencing may or may not actually matter in a few
> years, but I couldn’t help doing it, because it felt wrong to fence
> after cleaning every card instead of once for the whole batch of cards.
> My fence phobia kicked in!
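
For what it is worth, the amortized fencing you describe amounts to
hoisting the StoreLoad barrier out of the per-card loop. Schematically,
with a made-up card type rather than the actual barrier code:

  #include <atomic>
  #include <cstddef>

  typedef volatile unsigned char CardValue;
  const unsigned char clean_card = 0;

  // Stand-in for the actual reference scanning work on one card.
  void scan_card(CardValue* card) { (void)card; }

  // Before: one StoreLoad fence per card.
  void refine_one_by_one(CardValue** cards, size_t n) {
    for (size_t i = 0; i < n; i++) {
      *cards[i] = clean_card;                              // clean
      std::atomic_thread_fence(std::memory_order_seq_cst); // fence
      scan_card(cards[i]);                                 // scan
    }
  }

  // After: clean the whole batch, fence once, then scan everything.
  void refine_batched(CardValue** cards, size_t n) {
    for (size_t i = 0; i < n; i++) {
      *cards[i] = clean_card;
    }
    std::atomic_thread_fence(std::memory_order_seq_cst);
    for (size_t i = 0; i < n; i++) {
      scan_card(cards[i]);
    }
  }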
> 
> >
> >> What my patch does is to first clean all the G1UpdateBufferSize cards
> >>and
> >
> >This is just a comment from briefly looking at the change: I would have
> >first only sorted and merged, grouped into regions, and then applied
> >the various (mostly region-based) filters on the resulting MemRegions.
> >
> >While it is possible to do unnecessary work here, the region-based
> >filtering seems to not apply very often, mostly covering races. That
> >should result in a net win if we can reduce the cards to a few
> >MemRegions (and that filtering would be applied less often).
> >
> >But I do not really know what is better.
> 
> I think you might be right, if I understand you correctly. It was a bit
> awkward to extract from the current code, though, as it first does
> region checks and then exchanges cards with the hot card cache, meaning
> that the cards sent in are not the ones you get back out. If we
> preprocess the card intervals first to reduce the time for region
> checks and then get other cards back from the cache, we have to process
> those again. Meh! So I didn’t do it for now, but maybe it’s possible
> after rearranging things a bit.

It's fine as it is I think.

[...]

> >>  The start of the next scanning interval
> >> is prefetched while the current interval is scanned for references.
> >
> >I am not sure prefetching helps a lot, if I understand correctly,
> >considering that the amount of work done even within a single card is
> >pretty large. I.e. there is a certain risk that prefetching just fills
> >up hardware resources, e.g. load buffers. But I do not know :)
> 
> I tried with and without prefetching; it was faster with prefetching in
> my experiments. The reasoning is that the data needs to be loaded
> anyway, there is no getting around that. So it might as well be
> prefetched so that at least the scan doesn’t have to stall.
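
Concretely, the overlap you describe looks something like this; a
sketch using the GCC/Clang __builtin_prefetch intrinsic (HotSpot has
its own prefetch wrappers):

  #include <cstddef>

  struct Interval { char* start; char* end; };

  // Stand-in for scanning one contiguous interval for references.
  void scan_interval(char* start, char* end) { (void)start; (void)end; }

  void scan_with_prefetch(const Interval* intervals, size_t n) {
    for (size_t i = 0; i < n; i++) {
      if (i + 1 < n) {
        // Start pulling the next interval's first cache line in
        // while we are still busy scanning the current interval.
        __builtin_prefetch(intervals[i + 1].start, 0 /* read */);
      }
      scan_interval(intervals[i].start, intervals[i].end);
    }
  }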

> >> Webrev with demo:
> >> 
> >> http://cr.openjdk.java.net/~eosterlund/g1_experiments/card_refinement/webrev.00/
> >> 
> >> RFE ID:
> >> https://bugs.openjdk.java.net/browse/JDK-8087198
> >> 
> >> In the DaCapo benchmarks I could find, for instance, a 3.2% speedup
> >> in fop (configured to achieve 1 ms latency with a small 64M heap) and
> >> a 3% speedup in h2 running default settings on a 512M heap. The
> >> measurements were taken after 40 warmup iterations, and then sampled
> >> as the average of 10 iterations.
> >> 
> >> If somebody is interested in sponsoring something like this, it would
> >> be interesting to see how it performs in your favourite benchmarks. I
> >> suspect this kind of thing matters more when you deal with larger
> >> heaps.
> >
> >I will give it a try on some other benchmarks. I totally
> >unscientifically tried it on the example program in JDK-8062128 (10
> >runs each with and without the changes), and that seemed to improve by
> >~2.5%, barely statistically significant at alpha=0.05.
> >I expect this to be more like an upper bound on the improvement.
> 
> 2.5% is a good start eh! Thanks Thomas for taking my stuff for a spin! :)
> 
> Oh, my totally scientific gut feeling tells me that lock bottlenecks on
> remembered sets might be a problem in that bug and that perhaps card
> batching could help there too: batch the cards, locally record the
> changes that need to be made to each remembered set, then lock the
> remembered sets one by one and perform a bunch of updates per critical
> section instead of one per reference. I trust my gut feeling!

That might be worth a try, but I do not expect a lot of improvements. If
you want to give it a spin, the microbenchmark for JDK-8069273
(http://cr.openjdk.java.net/~redestad/8069273/G1HotCardBench.java) might
be useful. It should also give good numbers with this change.
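
Your batching idea would then amount to something like the following
sketch, where the grouping key and the lock granularity are my
assumptions rather than code from the bug:

  #include <cstddef>
  #include <map>
  #include <mutex>
  #include <vector>

  // Hypothetical per-region remembered set guarded by a lock.
  struct RemSet {
    std::mutex lock;
    std::vector<size_t> cards;
  };

  // Instead of taking a lock once per reference, group the pending
  // updates by target remembered set and lock each one exactly once.
  void flush_batched(std::map<RemSet*, std::vector<size_t> >& pending) {
    for (auto& entry : pending) {
      RemSet* rs = entry.first;
      std::vector<size_t>& updates = entry.second;
      std::lock_guard<std::mutex> guard(rs->lock);   // one acquisition
      rs->cards.insert(rs->cards.end(), updates.begin(), updates.end());
    }
    pending.clear();
  }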

Other than that, see above.

I will need to think a little about what benchmark to run and how to
test for improvements.

Thanks,
  Thomas




