RFR 8087198: G1 card refinement: batching, sorting, joining, prefetching

Thu Jun 18 15:58:32 UTC 2015

Hi,

  some notes about the changes: the change does not compile as is, it
does not compile without precompiled headers, or in (fast/slow)debug
mode.

The change still contains a debug printf (using non-Hotspot printf) - I
hope you did not do your benchmarking with the change you sent (That
fails compilation btw with gcc 4.8.2, i.e. the release compiler version,
on Linux).

Further, in fastdebug mode it crashes with a fatal error when running
specjbb2000 (but probably other applications too): the message is
"should not call operator new[]" somewhere in the
BufferedRefineCardTableEntryClosure constructor.

The other patch about the dynamic barrier elision also does not compile
without precompiled headers (run configure with
--disable-precompiled-headers).

We really appreciate your work, but it would be a lot less work for
us/me if suggested changes compiled out of the box.

Not sure why using Xcode (assuming you are using OS X) or gcc (if you
happened to use linux) did not complain in any of these cases.

I will answer to the other questions later.

Thanks,
  Thomas

On Tue, 2015-06-16 at 18:19 +0000, Erik Österlund wrote:
> Hi Thomas,
> 
> Den 16/06/15 02:08 skrev Thomas Schatzl <thomas.schatzl at oracle.com>:
> 
> >Hi Erik,
> >
> >On Mon, 2015-06-15 at 15:08 +0000, Erik Österlund wrote:
> >
> >> 
> >> Den 15/06/15 04:11 skrev Thomas Schatzl <thomas.schatzl at oracle.com>:
> >> 
> >>
> >> >> Basically we currently process cards one by one even though they
> >>always
> >> >> come in groups of at the very least G1UpdateBufferSize. The cards
> >>may or
> >> >> may not follow a particularly cache friendly order, depending on how
> >> >>lucky
> >> >> we are, and need to do things like seek the start of the card quite
> >> >>often,
> >> >> as well as cutting arrays at the card intervals, even though many
> >>times
> >> >> the batches of cards could be linearly scanned and follow a
> >>contiguous
> >> >> order.
> >> >
> >> >The question is, is this true? I remember doing some experiments on
> >>that
> >> >years ago, with different applications, and it did not seem to be worth
> >> >doing because the cards were not contiguous. Iirc there were almost
> >>always
> >> >just very little consecutive runs of cards (within a complete buffer)
> >> >that did
> >> >not amount to much.
> >> >So sorting seemed more expensive than the gain due to batching them for
> >> >reference analysis.
> >> 
> >> I don¹t know if it is always true that intervals are contiguous. In some
> >> applications they are (yes I checked, for instance h2 of Dacapo has
> >>almost
> >> only contiguous cards from humongous arrays that seem to be sweeped
> >>over)
> >> which is sweet.
> >> 
> >> And when they are not and this is not true and intervals are broken - I
> >> use prefetching instead. So regardless of the case, caching interactions
> >> are improved.
> >> 
> >> I benchmarked the cost of the sorting and could not find a difference at
> >> all. I find it very cheap as all the data is local to the closure and L1
> >> cached. The cost however of missing cache lines when scanning cards when
> >> they could have been contiguous intervals
> >
> >I expect that hardware prefetch of the next line(s) will automatically
> >prefetch them.
> 
> Are you saying I should, after sorting, skip the explicit prefetch of the
> contiguous intervals, since hardware should implicitly do it anyway? Hmm,
> I guess this might be possible. But I think it¹s still better to be
> explicit about it, as we see other code do as well. The hardware probably
> doesn¹t know how long the contiguous interval is to be prefetched, but we
> do.
> 
> >
> >> >I remember more recent experiments on the refinement buffers generated
> >>by
> >> >the evacuation (in G1 remembered set updates during GC are not
> >>immediately
> >> >reflected in the remembered set, but G1 creates refinement entries
> >> >instead),
> >> >and there the situation seemed even worse.
> >> 
> >> My patch currently only applies to card refinement by the concurrent
> >> refinement threads and mutators. I didn¹t touch the code for scanning
> >>the
> >> cards during evacuation pauses. But I expect similar stuff can be done
> >> there.
> >> 
> >> Also, maybe there could be a flag for reflecting the updates immediately
> >> so the work doesn¹t have to be done,
> >
> >That has been available before, but removed because it was buggy
> >(JDK-8052172). It could have been fixed, but it was not worth it.
> 
> I can imagine. I didn¹t think it would be worth it either so I didn¹t
> bother.
> 
> >
> >> for users that care more about
> >> optimizing performance than for having perfect latencies. Do you happen
> >>to
> >> know how slow the updates to the remembered sets are compared to the
> >>cost
> >> of scanning the cards roughly speaking? Curious where the bottlenecks
> >>are
> >> so I know where to focus optimizations.
> >
> >Remembered set updates are extremely slow in some cases, particularly in
> >the contended case. It's almost always better to let the locking happen
> >in the concurrent phase :)
> >
> >Some details: By now you should know that the current design has three
> >levels, for every source region we store remembered set entries either
> >in the sparse, fine or coarse per-region remembered set. First we try to
> >fill up the sparse rset, then fine and finally use coarse.
> >
> >Most per-region remembered sets are sparse rsets. The problem is, adding
> >to it requires the current implementation to take a lock, which
> >performance is very bad if contended.
> >
> >It is doable to remove it, i.e. replace it with a lock-free
> >implementation (like making a mechanism as in the proposed fix for
> >JDK-8047328 lock-free), and you may want to have a try at it.
> >
> >However, we are also looking into remembered sets right now as the
> >current implementation has a few other scalability problems that we
> >would like to have addressed for the future.
> 
> Very interesting information, thanks. Replacing lock-based solutions with
> lock-free ones can help many times.
> However, even with lock-free solutions, they do not completely remedy
> contention symptoms, they just get rid of blocking (which can be
> disastrous to be fair). But unless the lock-free data structure gets
> enough space, which we probably fundamentally cannot afford for sparse
> entries, all threads will contend for the same data, still impeding
> scalability to some extent.
> 
> How about doing things like try_add_reference() instead which does
> try_lock() instead of blocking locking? So if any lock is contended, we
> remember the reference and deal with it later instead. The idea is that if
> there is contention, we are probably better off doing something more
> useful elsewhere where there is no contention than insisting on either
> blocking or having lock-free contention with other threads trying to
> modify the same small data. In other words, instead of making contention
> cheaper, try and avoid contention in the first place.
> 
> It¹s also possible to batch and distribute the problematic references that
> failed at try_add_reference() due to contention in a smarter way to
> different worker threads in such a way that there will not be any
> contention between those worker threads. It is possible to do this without
> having any write-write contention between the workers, but I¹ll spare you
> the details as it could possibly be a longish story.
> 
> >
> >> >There is always the advantage of batching them to do less memory
> >>barriers
> >> >like CMS does.
> >> 
> >> Yeah, I also experimented with some ADS tricks that allowed me to remove
> >> the G1 write barrier fences and joining 2 conditional jumps into 1,
> >>which
> >> led to further nice speed ups on my machine. But Dave Dice told me
> >>fences
> >> are becoming really cheap and ADS doesn¹t scale well on multiple socket
> >> machines so I left that out of the patch, assuming my old macbook isn¹t
> >> representative of what customers will have in a few years time. :p
> >
> >My workstation has 40 threads :)
> 
> Yeah. YeeaaahhhhhhŠ :p
> 
> >> 
> >> Oh, my totally scientific gut feeling tells me that lock bottlenecks on
> >> remembered sets might be a problem in that bug and that perhaps card
> >> batching could help there too: batch the cards, locally record the
> >>changes
> >> that need to be made to each remembered set, then lock them one by one
> >>and
> >> perform a bunch of updates for each critical section instead of each
> >> reference one by one. I trust my gut feeling!
> >
> >That might be worth a try, but I do not expect a lot of improvements. If
> >you want to give it a spin, the microbenchmark for JDK-8069273
> >(http://cr.openjdk.java.net/~redestad/8069273/G1HotCardBench.java) might
> >be useful. It should also give good numbers with this change.
> >
> >Other than that, see above.
> 
> Another idea I have that I am pretty sure would solve at least the
> problems in that micro benchmark is to disable remembered sets completely
> when they are not needed and lazily reconstruct them when they are
> actually needed during concurrent mark cycles. Now every region assumes
> that it could be selected for the CSet at the next pause and therefore
> aggressively maintains all incoming edges. Sometimes this is simply
> impossible and sometimes it is possible in theory yet very unlikely. I¹d
> like to try removing RSets for those two cases, at the very least when
> it¹s impossible.
> 
> First of all, a concurrent mark must happen first to estimate liveness of
> the old regions before they can become targets for the CSet. During this
> concurrent marking, remembered sets can be reconstructed as needed. So to
> me it doesn¹t make sense to have the remembered set remember anything at
> all until at least the first concurrent mark after the old region was
> allocated that will activate it and reconstruct it. If we don¹t need to
> collect old, there is no point in maintaining its remembered sets. So a
> flag on each remembered set tells if it¹s active or not. In this benchmark
> it would never be active since there is no need to actually collect old,
> and hence nothing would be added to any remembered set.
> 
> And then, it would be useful to similarly track the velocity of
> degradation of remembered sets and use this to find out if it¹s unlikely
> that the region will be selected for the CSet. If they contain old data
> that is permanently alive and never dies, it would be easy to see between
> two concurrent marks that the liveness is too high to be collected and
> remains unchanged from previous concurrent mark. So it seems unlikely to
> be of any interest to reclaim it any time soon, and its remembered set can
> be released. If this changes, then the RSet can be reactivated and
> reconstructed for the next concurrent mark after it becomes eligible.
> 
> What do you think of this idea? I guess I need to prototype it and try it
> outŠ
> 
> Anyway, I¹ll see what I can do. I¹m at FCRC right now though so might be a
> bit unavailable for a while making it hard to prototype this right now!
> 
> >I will need to think a little about what benchmark to run and how to
> >test for improvements.
> 
> Okay, thanks for looking into this and helping me out Thomas!
> 
> Thanks,
> /Erik
> 
> >
> >Thanks,
> >  Thomas
> >
> >
>