G1 question: concurrent cleaning of dirty cards

Fri Sep 6 16:54:53 UTC 2013

Hi,

thanks for sharing your ideas.
Queuing up buffers and releasing them for refinement at a safepoint sounds like a really good solution.

I believe that the improvement of G1 barriers is already planned. E.g. there's RFE 6816756.
And it should be possible to port the C2 compiler's CMS barrier elision (see GraphKit::write_barrier_post) to G1 as well.
This could reduce the frequency of overflowing buffers.

(I'm not against experimenting with additional filtering code, but this seems to be rather unpopular at the moment,
because people are already concerned about the size of the barrier code.)

May I ask for a bug id or something which allows tracking of this issue?
Hopefully, it can be addressed during development of hotspot 25.

Best regards,
Martin

-----Original Message-----
From: hotspot-gc-dev-bounces at openjdk.java.net [mailto:hotspot-gc-dev-bounces at openjdk.java.net] On Behalf Of Igor Veresov
Sent: Thursday, July 18, 2013 21:36
To: Thomas Schatzl
Cc: hotspot-gc-dev at openjdk.java.net; Braun, Matthias
Subject: Re: G1 question: concurrent cleaning of dirty cards

I think I tried something like that a while ago (an additional filtering technique similar to what you're proposing). The problem is that even if the table entry is in cache, the additional check bloats the already huge barrier code. But that doesn't mean you can't try again; maybe it's an adequate price to pay now for correctness.
Although, the cardtable-based filtering is still going to be there for the old gen, right? So you'll need a StoreLoad barrier for it to work.
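
To make the ordering problem concrete, here is a minimal, single-interleaving model of the race, not HotSpot code. It assumes one card covering one field, a mutator whose store can sit in a store buffer, and a refinement thread that cleans the card before scanning. All names (Model, run, with_storeload) are made up for illustration.

```python
CLEAN, DIRTY = 0, 1

class Model:
    """One card covering one object field, plus a mutator store buffer."""
    def __init__(self):
        self.field = "old"     # value visible to other threads
        self.card = DIRTY      # card was dirtied by an earlier store
        self.pending = None    # mutator store not yet globally visible
        self.enqueued = False  # whether the barrier re-enqueued the card
        self.scanned = None    # value the refinement thread saw

    def drain(self):
        # Make the buffered mutator store globally visible.
        if self.pending is not None:
            self.field = self.pending
            self.pending = None

    def mutator_store(self, value):
        self.pending = value

    def mutator_barrier(self, with_storeload):
        if with_storeload:
            self.drain()       # StoreLoad: store is visible before the card is read
        if self.card != DIRTY: # card-table filter
            self.card = DIRTY
            self.enqueued = True

    def refine(self):
        self.card = CLEAN      # clean first, then scan
        self.scanned = self.field

def run(with_storeload):
    """Return True if the mutator's store is lost in this interleaving."""
    m = Model()
    m.mutator_store("new")
    m.mutator_barrier(with_storeload)  # sees DIRTY, so the card is filtered
    m.refine()                         # cleans the card and scans the field
    m.drain()                          # store becomes visible (if still pending)
    return m.field == "new" and m.scanned != "new" and not m.enqueued
```

Without the StoreLoad the store ends up in the heap with a clean card and no queue entry; with it, the refinement thread sees the new value when it scans.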

The alternative approach that I outlined before doesn't need any barrier modification, although it would require a bunch of runtime changes.
It would work as follows:
- you execute normally, producing the buffers with modified cards. But the buffers produced will not be available to the conc refinement threads; they're just left to accumulate for a while.
- when you reach a certain number of buffers (say, in the "yellow zone"), you have a special safepoint during which you grab all the buffers and put them on another queue that is accessible to the conc refinement threads.
- you iterate over the cards in the buffers that you just grabbed and clean them (which solves the original problem); this can also be done very fast in parallel.
- after the execution resumes, the conc refinement threads start processing buffers (from that second queue), using the existing card caching, which would become more important. Mutators can also participate (as they do now) if the number of buffers in the second queue reaches the "red zone".
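
The steps above could be sketched roughly as follows. This is a toy model under stated assumptions, not the G1 implementation: the class Heap, the two deques, and the YELLOW_ZONE value are hypothetical, and a "buffer" is simply a list of card indices.

```python
from collections import deque

CLEAN, DIRTY = 0, 1
YELLOW_ZONE = 3  # hypothetical threshold, counted in buffers

class Heap:
    def __init__(self, num_cards):
        self.cards = [CLEAN] * num_cards
        self.accumulating = deque()  # filled by mutators, hidden from refinement
        self.refinable = deque()     # visible to the conc refinement threads

    def mutator_enqueue(self, buffer):
        # Mutators dirty cards and park completed buffers on the hidden queue.
        for card in buffer:
            self.cards[card] = DIRTY
        self.accumulating.append(list(buffer))

    def needs_safepoint(self):
        return len(self.accumulating) >= YELLOW_ZONE

    def safepoint_handoff(self):
        # Runs with all mutators stopped, so cleaning cannot race with stores.
        while self.accumulating:
            buffer = self.accumulating.popleft()
            for card in buffer:
                self.cards[card] = CLEAN
            self.refinable.append(buffer)

    def refine_one(self, scan):
        # Concurrent refinement: scan the cards of one handed-off buffer.
        if not self.refinable:
            return False
        for card in self.refinable.popleft():
            scan(card)
        return True
```

Since the cards are cleaned only while mutators are stopped, the refinement side never has to clean a card concurrently; it only scans what was handed over.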

Maybe both approaches should be tried and evaluated?

igor

On Jul 17, 2013, at 4:20 AM, Thomas Schatzl <thomas.schatzl at oracle.com> wrote:

> Hi,
> 
>  trying to revive that somewhat dying thread with some suggestions...
> 
> On Fri, 2013-06-28 at 16:02 -0700, Igor Veresov wrote:
>> The mutator processing doesn't solve it. The card clearing event is
>> still asynchronous with respect to possible mutations in other
>> threads. While one mutator thread is processing buffers and clearing
>> cards the other can sneak in and do the store to the same object that
>> will go unnoticed. So I'm afraid it's either a store-load barrier, or
>> we need to stop all mutator threads to prevent this race, or worse..
> 
> One option to reduce the overhead of the store-load barrier is to only
> execute it if it is needed; actually a large part of the memory accesses
> are to the young gen.
> These accesses are going to be filtered out by the existing mechanism
> anyway, are always dirty, and never reset to clean.
> 
> An (e.g. per-region) auxiliary table could be used that indicates that
> for a particular region we will actually need the card mark and the
> storeload barrier or not.
> 
> Outside of safepoints, entries to that table are only ever marked dirty,
> never reset to clean. This could be done without synchronization I
> think, as in the worst case a thread will see from the card table that
> the corresponding regions' cards are dirty (i.e. will be filtered
> anyway).
> 
> The impact of the additional cost in the barrier might be offset by the
> cache bandwidth saved by not accessing the card table to some degree
> (and avoiding the StoreLoad barrier for most accesses). The per-region
> table should be small (a byte per region would be sufficient).
> 
> Actually one could run tests where the actual card table lookup is
> completely disabled and just always handle mutations in the areas not
> covered by this table.
> If this area is sufficiently small, this could be an option.
> 
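
A minimal sketch of how the per-region table could sit in front of the card mark, purely as an illustration. The 1 MB region size, the 512-byte card size, the Barrier class, and the fences counter (which merely stands in for the StoreLoad barrier) are assumptions for this example, not the actual G1 barrier.

```python
REGION_SHIFT = 20  # assumed: 1 MB regions
CARD_SHIFT = 9     # assumed: 512-byte cards
NEEDS_BARRIER, FILTERED = 0, 1
CLEAN, DIRTY = 0, 1

class Barrier:
    def __init__(self, heap_bytes, young_regions):
        # One byte per region; outside safepoints entries only go to FILTERED.
        self.region_table = bytearray(heap_bytes >> REGION_SHIFT)
        for region in young_regions:
            self.region_table[region] = FILTERED
        self.cards = bytearray(heap_bytes >> CARD_SHIFT)
        self.queue = []   # cards recorded for refinement
        self.fences = 0   # counts how often the StoreLoad would execute

    def post_write(self, addr):
        # Cheap per-region check first: young regions skip everything.
        if self.region_table[addr >> REGION_SHIFT] == FILTERED:
            return
        self.fences += 1  # stands in for the StoreLoad barrier
        card = addr >> CARD_SHIFT
        if self.cards[card] != DIRTY:
            self.cards[card] = DIRTY
            self.queue.append(card)
```

In this model, stores into a young region touch neither the card table nor the fence; only stores into other regions pay for both.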
>> On Jun 28, 2013, at 1:53 PM, John Cuthbertson
>> <john.cuthbertson at oracle.com> wrote:
>> 
>>> Hi Igor,
>>> 
>>> Yeah G1 has that facility right now. In fact you added it. :) When
>>> the number of completed buffers is below the green zone upper limit,
>>> none of the refinement threads are refining buffers. That is, the
>>> green zone upper limit is the number of buffers that we expect to be
>>> able to process during the GC without it going over some percentage
>>> of the pause time (I think the default is 10%). When the number of
>>> buffers grows above the green zone upper limit, the refinement
>>> threads start processing the buffers in a stepped manner.
>>> 
>>> So during the safepoint we would process N - green-zone-upper-limit
>>> completed buffers. In fact we could have a watcher task that
>>> monitors the number of completed buffers and triggers a safepoint
>>> when the number of completed buffers becomes sufficiently high - say
>>> above the yellow-zone upper limit.
>>> 
>>> That does away with the whole notion of concurrent refinement but
>>> will remove a lot of the nasty complicated code that gets executed
>>> by the mutators or refinement threads.
> 
> I think it is possible to only reset the card table at the safepoint;
> the buffers that were filled before taking the snapshot can still be
> processed concurrently afterwards.
> 
> (That is also Igor's suggestion from the other email I think).
> 
> That may be somewhat expensive for very large heaps; but as you mention
> that effort could be limited by only cleaning the cards that have a
> completed buffer entry.
> 
>>> My main concern is that we would potentially be increasing the
>>> number and duration of non-GC safepoints, which cause issues with
>>> latency-sensitive apps. For those workloads that only care about 90%
>>> of the transactions this approach would probably be fine.
>>> 
>>> We would need to evaluate the performance of each approach. 
> 
> Hth,
> Thomas
> 
>