RFR 8087198: G1 card refinement: batching, sorting, joining, prefetching

Erik Österlund erik.osterlund at lnu.se
Tue Jun 16 18:19:05 UTC 2015


Hi Thomas,

On 16/06/15 02:08, Thomas Schatzl <thomas.schatzl at oracle.com> wrote:

>Hi Erik,
>
>On Mon, 2015-06-15 at 15:08 +0000, Erik Österlund wrote:
>
>> 
>> On 15/06/15 04:11, Thomas Schatzl <thomas.schatzl at oracle.com> wrote:
>> 
>>
>> >> Basically we currently process cards one by one even though they always
>> >> come in groups of at the very least G1UpdateBufferSize. The cards may or
>> >> may not follow a particularly cache friendly order, depending on how lucky
>> >> we are, and need to do things like seek the start of the card quite often,
>> >> as well as cutting arrays at the card intervals, even though many times
>> >> the batches of cards could be linearly scanned and follow a contiguous
>> >> order.
>> >
>> >The question is, is this true? I remember doing some experiments on that
>> >years ago, with different applications, and it did not seem to be worth
>> >doing because the cards were not contiguous. Iirc there were almost always
>> >just very little consecutive runs of cards (within a complete buffer) that
>> >did not amount to much.
>> >So sorting seemed more expensive than the gain due to batching them for
>> >reference analysis.
>> 
>> I don't know if it is always true that intervals are contiguous. In some
>> applications they are (yes I checked, for instance h2 of Dacapo has almost
>> only contiguous cards from humongous arrays that seem to be swept over)
>> which is sweet.
>> 
>> And when they are not and intervals are broken - I use prefetching
>> instead. So regardless of the case, caching interactions are improved.
>> 
>> I benchmarked the cost of the sorting and could not find a difference at
>> all. I find it very cheap as all the data is local to the closure and L1
>> cached. The cost however of missing cache lines when scanning cards when
>> they could have been contiguous intervals
>
>I expect that hardware prefetch of the next line(s) will automatically
>prefetch them.

Are you saying I should, after sorting, skip the explicit prefetch of the
contiguous intervals, since hardware should implicitly do it anyway? Hmm,
I guess this might be possible. But I think it's still better to be
explicit about it, as we see other code do as well. The hardware probably
doesn't know how long the contiguous interval is to be prefetched, but we
do.
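
To make that concrete, here is roughly the shape of what I mean - just an
illustrative sketch in plain C++, not the actual patch; the constants and
the card-to-heap mapping below are made-up placeholders:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

static const size_t kCardSize      = 512; // heap bytes covered per card
static const size_t kCacheLine     = 64;
static const size_t kPrefetchAhead = 2;   // cache lines to stay ahead of the scan

// Scan one contiguous run of cards [first, limit), prefetching ahead because
// we, unlike the hardware, know exactly where the run ends.
static void scan_run(const uint8_t* heap_base, size_t first, size_t limit) {
  const uint8_t* p   = heap_base + first * kCardSize;
  const uint8_t* end = heap_base + limit * kCardSize;
  for (; p < end; p += kCacheLine) {
    const uint8_t* pf = p + kPrefetchAhead * kCacheLine;
    if (pf < end) {
      __builtin_prefetch(pf, /*rw=*/0, /*locality=*/3);
    }
    // ... process the references found in this cache line ...
  }
}

// Sort the buffered card indices, merge adjacent ones into runs, scan each run.
void refine_buffer(const uint8_t* heap_base, std::vector<size_t>& cards) {
  std::sort(cards.begin(), cards.end());
  for (size_t i = 0; i < cards.size(); ) {
    size_t first = cards[i];
    size_t limit = first + 1;
    while (i + 1 < cards.size() && cards[i + 1] == limit) {
      ++limit;
      ++i;
    }
    scan_run(heap_base, first, limit);
    ++i;
  }
}

For contiguous runs the hardware prefetcher would likely keep up anyway,
but being explicit lets us stop prefetching exactly at the run boundary.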

>
>> >I remember more recent experiments on the refinement buffers generated by
>> >the evacuation (in G1 remembered set updates during GC are not immediately
>> >reflected in the remembered set, but G1 creates refinement entries instead),
>> >and there the situation seemed even worse.
>> 
>> My patch currently only applies to card refinement by the concurrent
>> refinement threads and mutators. I didn't touch the code for scanning the
>> cards during evacuation pauses. But I expect similar stuff can be done
>> there.
>> 
>> Also, maybe there could be a flag for reflecting the updates immediately
>> so the work doesn't have to be done,
>
>That has been available before, but removed because it was buggy
>(JDK-8052172). It could have been fixed, but it was not worth it.

I can imagine. I didn't think it would be worth it either so I didn't
bother.

>
>> for users that care more about
>> optimizing performance than for having perfect latencies. Do you happen to
>> know how slow the updates to the remembered sets are compared to the cost
>> of scanning the cards, roughly speaking? Curious where the bottlenecks are
>> so I know where to focus optimizations.
>
>Remembered set updates are extremely slow in some cases, particularly in
>the contended case. It's almost always better to let the locking happen
>in the concurrent phase :)
>
>Some details: By now you should know that the current design has three
>levels, for every source region we store remembered set entries either
>in the sparse, fine or coarse per-region remembered set. First we try to
>fill up the sparse rset, then fine and finally use coarse.
>
>Most per-region remembered sets are sparse rsets. The problem is, adding
>to it requires the current implementation to take a lock, whose
>performance is very bad if contended.
>
>It is doable to remove it, i.e. replace it with a lock-free
>implementation (like making a mechanism as in the proposed fix for
>JDK-8047328 lock-free), and you may want to have a try at it.
>
>However, we are also looking into remembered sets right now as the
>current implementation has a few other scalability problems that we
>would like to have addressed for the future.

Very interesting information, thanks. Replacing lock-based solutions with
lock-free ones can often help.
However, even lock-free solutions do not completely remedy contention
symptoms; they just get rid of blocking (which, to be fair, can be
disastrous). But unless the lock-free data structure gets enough space,
which we probably fundamentally cannot afford for sparse entries, all
threads will contend for the same data, still impeding scalability to some
extent.

How about doing something like try_add_reference(), which uses try_lock()
instead of blocking locking? If a lock is contended, we remember the
reference and deal with it later instead. The idea is that under contention
we are probably better off doing something more useful elsewhere, where
there is no contention, than insisting on either blocking or having
lock-free contention with other threads trying to modify the same small
data. In other words, instead of making contention cheaper, try to avoid
contention in the first place.
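
Roughly what I am picturing, as a sketch only - plain C++ with a std::mutex
standing in for the real per-region lock, and all names made up rather than
taken from the existing HotSpot code:

#include <mutex>
#include <utility>
#include <vector>

struct PerRegionRSet {
  std::mutex lock;
  std::vector<const void*> entries;   // stands in for the sparse/fine tables

  // Returns false instead of blocking if the lock is contended.
  bool try_add_reference(const void* from) {
    if (!lock.try_lock()) {
      return false;
    }
    entries.push_back(from);
    lock.unlock();
    return true;
  }
};

struct RefinementWorker {
  std::vector<std::pair<PerRegionRSet*, const void*> > deferred;

  void add_reference(PerRegionRSet* rset, const void* from) {
    if (!rset->try_add_reference(from)) {
      // Contended: remember the reference and go do useful,
      // uncontended work instead of fighting over the same data.
      deferred.push_back(std::make_pair(rset, from));
    }
  }

  // Later (e.g. at the end of a buffer) retry the deferred references,
  // or hand them to whichever worker ends up owning that region.
  void flush_deferred() {
    for (size_t i = 0; i < deferred.size(); i++) {
      while (!deferred[i].first->try_add_reference(deferred[i].second)) {
        // a real implementation would re-defer rather than spin here
      }
    }
    deferred.clear();
  }
};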

It's also possible to batch the problematic references that failed at
try_add_reference() due to contention and distribute them to different
worker threads in such a way that there will not be any contention between
those worker threads. It is possible to do this without any write-write
contention between the workers, but I'll spare you the details as it could
possibly be a longish story.

>
>> >There is always the advantage of batching them to do less memory barriers
>> >like CMS does.
>> 
>> Yeah, I also experimented with some ADS tricks that allowed me to remove
>> the G1 write barrier fences and join 2 conditional jumps into 1, which
>> led to further nice speed ups on my machine. But Dave Dice told me fences
>> are becoming really cheap and ADS doesn't scale well on multiple socket
>> machines so I left that out of the patch, assuming my old macbook isn't
>> representative of what customers will have in a few years time. :p
>
>My workstation has 40 threads :)

Yeah. Yeeaaahhhhhh… :p

>> 
>> Oh, my totally scientific gut feeling tells me that lock bottlenecks on
>> remembered sets might be a problem in that bug and that perhaps card
>> batching could help there too: batch the cards, locally record the changes
>> that need to be made to each remembered set, then lock them one by one and
>> perform a bunch of updates for each critical section instead of each
>> reference one by one. I trust my gut feeling!
>
>That might be worth a try, but I do not expect a lot of improvements. If
>you want to give it a spin, the microbenchmark for JDK-8069273
>(http://cr.openjdk.java.net/~redestad/8069273/G1HotCardBench.java) might
>be useful. It should also give good numbers with this change.
>
>Other than that, see above.

Another idea I have that I am pretty sure would solve at least the
problems in that micro benchmark is to disable remembered sets completely
when they are not needed and lazily reconstruct them when they are
actually needed during concurrent mark cycles. Now every region assumes
that it could be selected for the CSet at the next pause and therefore
aggressively maintains all incoming edges. Sometimes this is simply
impossible and sometimes it is possible in theory yet very unlikely. I'd
like to try removing RSets for those two cases, at the very least when
it's impossible.

First of all, a concurrent mark must happen to estimate liveness of
the old regions before they can become targets for the CSet. During this
concurrent marking, remembered sets can be reconstructed as needed. So to
me it doesn't make sense to have the remembered set remember anything at
all until at least the first concurrent mark after the old region was
allocated, which will activate it and reconstruct it. If we don't need to
collect old, there is no point in maintaining its remembered sets. So a
flag on each remembered set tells if it's active or not. In this benchmark
it would never be active since there is no need to actually collect old,
and hence nothing would be added to any remembered set.
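
In sketch form - again just illustrative C++ with invented names, not the
existing RSet code:

#include <atomic>
#include <vector>

struct RegionRSet {
  std::atomic<bool> active{false};   // stays off until a concurrent mark needs it
  std::vector<const void*> entries;  // stands in for the sparse/fine/coarse tables

  void add_reference(const void* from) {
    // Inactive: the region cannot be picked for the CSet before the next
    // concurrent mark anyway, so refinement does no work at all.
    if (!active.load(std::memory_order_relaxed)) {
      return;
    }
    entries.push_back(from);
  }

  // Called when the region becomes a candidate: turn the rset on and let the
  // following concurrent mark reconstruct the missing entries as it scans.
  void activate_for_reconstruction() {
    active.store(true, std::memory_order_release);
  }
};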

And then, it would be useful to similarly track the velocity of
degradation of remembered sets and use this to find out if it's unlikely
that the region will be selected for the CSet. If a region contains old
data that is permanently alive and never dies, it would be easy to see
between two concurrent marks that its liveness is too high for it to be
collected and remains unchanged from the previous concurrent mark. So it
seems unlikely to be of any interest to reclaim it any time soon, and its
remembered set can be released. If this changes, then the RSet can be
reactivated and reconstructed for the next concurrent mark after it
becomes eligible.
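
The decision itself could be as simple as something like this (a sketch;
the 95% cutoff is a number I just made up, not a real G1 constant):

#include <cstddef>

struct RegionLivenessInfo {
  size_t live_bytes_prev_mark;  // liveness recorded by the previous concurrent mark
  size_t live_bytes_curr_mark;  // liveness recorded by the most recent one
  size_t region_size;
};

// True if the region looks permanently live and unchanged, i.e. unlikely to
// be selected for the CSet any time soon, so its remembered set can be released.
bool can_release_rset(const RegionLivenessInfo& r) {
  const double kTooLiveToCollect = 0.95;  // assumed cutoff
  bool very_live = r.live_bytes_curr_mark >
                   (size_t)(kTooLiveToCollect * (double)r.region_size);
  bool unchanged = r.live_bytes_curr_mark == r.live_bytes_prev_mark;
  return very_live && unchanged;
}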

What do you think of this idea? I guess I need to prototype it and try it
out…

Anyway, I'll see what I can do. I'm at FCRC right now though, so I might be
a bit unavailable for a while, making it hard to prototype this right now!

>I will need to think a little about what benchmark to run and how to
>test for improvements.

Okay, thanks for looking into this and helping me out Thomas!

Thanks,
/Erik

>
>Thanks,
>  Thomas
>
>



