G1 question: concurrent cleaning of dirty cards

Mikael Gerdin mikael.gerdin at oracle.com
Mon Sep 9 08:44:57 UTC 2013


Martin,

On 2013-09-06 18:54, Doerr, Martin wrote:
> Hi,
>
> thanks for sharing your ideas.
> Queuing up buffers and releasing them for refinement at a safepoint sounds like a really good solution.
>
> I believe that improving the G1 barriers is already planned; e.g. there's RFE 6816756.
> And it should be possible to port the C2 compiler's CMS barrier elision (see GraphKit::write_barrier_post) to G1 as well.
> This could reduce the frequency of overflowing buffers.
>
> (I'm not against experimenting with additional filtering code, but this seems to be kind of infamous at the moment,
> because people are already concerned about large barrier code.)

I have prototyped both a version of the filtering barrier and the 
"special safepoint" variant.

The filtering barrier costs a few percent of performance on jbb2013, and 
my prototype of the "special safepoint/checkpointing" variant performs 
horribly (-60%) on jbb2013.

The checkpointing change needs a lot more work on tweaking the limits 
and policies for triggering the safepoint and checkpointing the buffers. 
I basically just wanted to get it to work without crashing and see a 
ballpark performance number.

I don't have a particular preference for any of the possible solutions, 
but I'm not sure if I have the time to get the checkpointing variant 
into shape for the JDK 8 Zero Bug Bounce, which is Oct 24th [1].

One possible approach would be to do the filtering change now and work 
on the checkpointing variant as a future task (or in parallel by someone 
else).

Webrevs (caution, wear safety glasses! The code is _not_ pretty):
http://cr.openjdk.java.net/~mgerdin/g1-conc-clearing/checkpointing/webrev/
http://cr.openjdk.java.net/~mgerdin/g1-conc-clearing/filtering/webrev/

>
> May I ask for a bug id or something which allows tracking of this issue?
> Hopefully, it can be addressed during development of hotspot 25.

I am currently working on this issue under bug id 8014555. Unfortunately 
that bug's description contains internal information and is therefore 
not visible on bugs.sun.com.
On the other hand, most of the information in the bug consists of 
analysis of the crashes and not any discussion about the actual memory 
ordering problem. In fact, I've not been able to prove that the crashes 
in the bug are caused by this problem, but if I run the test with any of 
my attempted fixes the crash does not happen.

/Mikael

[1] http://openjdk.java.net/projects/jdk8/milestones

>
> Best regards,
> Martin
>
>
> -----Original Message-----
> From: hotspot-gc-dev-bounces at openjdk.java.net [mailto:hotspot-gc-dev-bounces at openjdk.java.net] On Behalf Of Igor Veresov
> Sent: Donnerstag, 18. Juli 2013 21:36
> To: Thomas Schatzl
> Cc: hotspot-gc-dev at openjdk.java.net; Braun, Matthias
> Subject: Re: G1 question: concurrent cleaning of dirty cards
>
> I think I tried something like that a while ago (an additional filtering technique similar to what you're proposing). The problem is that even if the table entry is in cache, the additional check bloats the already huge barrier code. But that doesn't mean you can't try again; maybe it's an adequate price to pay now for correctness.
> Although, the cardtable-based filtering is still going to be there for the old gen, right? So you'll need a StoreLoad barrier for it to work.
>
> The alternative approach that I outlined before doesn't need any barrier modification, although it would require a bunch of runtime changes.
> It would work as follows:
> - you execute normally, producing the buffers with modified cards. But the buffers produced are not made available to the conc refinement threads; they're just allowed to accumulate for a while.
> - when you reach a certain number of buffers (say, in the "yellow zone"), you have a special safepoint during which you grab all the buffers and put them on another queue that is accessible to the conc refinement threads.
> - you iterate over the cards in the buffers that you just grabbed and clean them (which solves the original problem); this can also be done very fast in parallel.
> - after execution resumes, the conc refinement threads start processing buffers (from that second queue), using the existing card caching, which would become more important. Mutators can also participate (as they do now) if the number of buffers in the second queue reaches the "red zone".
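
For reference, a rough sketch of what that checkpointing safepoint 
operation could look like (hypothetical code, invented names; much 
simpler than any real implementation):

   // Runs in the VM thread while all mutators are stopped.
   void checkpoint_dirty_card_buffers() {
     // Grab every completed buffer; mutators keep their partially
     // filled thread-local buffers.
     BufferList grabbed = mutator_dcq_set.take_all_completed();

     // Clean the cards the grabbed buffers point to. Mutators are
     // stopped and the buffers are not yet visible to the refinement
     // threads, so no ordering is needed; this loop can also be
     // split among parallel workers.
     for (Buffer* b : grabbed) {
       for (jbyte* card : *b) {   // each entry is a card address
         *card = clean_card_val;
       }
     }

     // Only now publish the buffers; they are refined concurrently
     // after execution resumes.
     refinement_dcq_set.append(grabbed);
   }

The hard part is the policy: when to trigger the safepoint and how many 
buffers to hand off in one checkpoint.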
>
> Maybe both approaches should be tried and evaluated?
>
> igor
>
> On Jul 17, 2013, at 4:20 AM, Thomas Schatzl <thomas.schatzl at oracle.com> wrote:
>
>> Hi,
>>
>>   trying to revive that somewhat dying thread with some suggestions...
>>
>> On Fri, 2013-06-28 at 16:02 -0700, Igor Veresov wrote:
>>> The mutator processing doesn't solve it. The card clearing event is
>>> still asynchronous with respect to possible mutations in other
>>> threads. While one mutator thread is processing buffers and clearing
>>> cards, another can sneak in and do a store to the same object that
>>> will go unnoticed. So I'm afraid it's either a store-load barrier, or
>>> we need to stop all mutator threads to prevent this race, or worse...
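
To make the race concrete, here is a minimal model of it (hypothetical 
code, not the actual barrier; 1 means dirty, 0 means clean):

   #include <atomic>

   std::atomic<void*> field;  // a reference field in the old gen
   std::atomic<char>  card;   // the card covering it

   void mutator_store(void* ref) {
     field.store(ref, std::memory_order_relaxed);           // (1) store
     std::atomic_thread_fence(std::memory_order_seq_cst);   // StoreLoad
     if (card.load(std::memory_order_relaxed) == 1) return; // (2) filter
     card.store(1, std::memory_order_relaxed);              // re-dirty
     // ... enqueue the card in the dirty card queue ...
   }

   void refiner() {
     card.store(0, std::memory_order_relaxed);              // (3) clean
     std::atomic_thread_fence(std::memory_order_seq_cst);   // StoreLoad
     void* v = field.load(std::memory_order_relaxed);       // (4) rescan
     // ... update remembered sets based on v ...
   }

Without the two fences, (2) can read the stale "dirty" value from before 
(3), so the mutator skips the enqueue, while the rescan at (4) misses 
the store at (1): the new reference is never reported.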
>>
>> One option to reduce the overhead of the store-load barrier is to only
>> execute it when it is needed; in fact, a large part of the memory
>> accesses are to the young gen.
>> These accesses are filtered out by the existing mechanism anyway: their
>> cards are always dirty and never reset to clean.
>>
>> An auxiliary (e.g. per-region) table could be used that indicates
>> whether, for a particular region, we will actually need the card mark
>> and the StoreLoad barrier or not.
>>
>> Outside of safepoints, entries in that table are only ever marked dirty,
>> never reset to clean. This could be done without synchronization, I
>> think, as in the worst case a thread will see from the card table that
>> the corresponding region's cards are dirty (i.e. they will be filtered
>> anyway).
>>
>> The additional cost in the barrier might be offset to some degree by
>> the cache bandwidth saved by not accessing the card table (and by
>> avoiding the StoreLoad barrier for most accesses). The per-region table
>> should be small (a byte per region would be sufficient).
>>
>> Actually, one could run tests where the card table lookup is completely
>> disabled and mutations in the areas not covered by this table are
>> always handled.
>> If this area is sufficiently small, this could be an option.
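
In barrier form, the idea would look roughly like this (a sketch with 
invented names; the existing null and cross-region filters of the G1 
post barrier are omitted):

   inline void post_write_barrier(void* field_addr) {
     // One byte per region: does a store into this region need a
     // card mark (and hence the StoreLoad fence) at all?
     size_t idx = (uintptr_t)field_addr >> log_region_size;
     if (region_needs_card_mark[idx] == 0) {
       return;  // e.g. young regions: their cards stay dirty anyway
     }
     OrderAccess::fence();  // StoreLoad, only on the unfiltered path
     volatile jbyte* card = card_for(field_addr);
     if (*card != dirty_card_val) {
       *card = dirty_card_val;
       enqueue(card);       // hand the card to the dirty card queue
     }
   }

The table read adds a load to the common path, but for young-gen stores 
it replaces both the card table access and the fence.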
>>
>>> On Jun 28, 2013, at 1:53 PM, John Cuthbertson
>>> <john.cuthbertson at oracle.com> wrote:
>>>
>>>> Hi Igor,
>>>>
>>>> Yeah, G1 has that facility right now. In fact you added it. :) When
>>>> the number of completed buffers is below the green zone upper limit,
>>>> none of the refinement threads are refining buffers. That is, the
>>>> green zone upper limit is the number of buffers that we expect to be
>>>> able to process during the GC without going over some percentage
>>>> of the pause time (I think the default is 10%). When the number of
>>>> buffers grows above the green zone upper limit, the refinement
>>>> threads start processing the buffers in a stepped manner.
>>>>
>>>> So during the safepoint we would process N - green-zone-upper-limit
>>>> completed buffers. In fact we could have a watcher task that
>>>> monitors the number of completed buffers and triggers a safepoint
>>>> when the number of completed buffers becomes sufficiently high - say
>>>> above the yellow-zone upper limit.
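
Such a watcher could be as simple as the following (hypothetical sketch; 
VM_CheckpointBuffers stands for a safepoint operation along the lines 
sketched earlier in the thread):

   // Periodic check, e.g. from a refinement control thread.
   void watch_completed_buffers() {
     size_t n = dirty_card_queue_set.completed_buffers_num();
     if (n > yellow_zone_upper_limit) {
       // One stop-the-world checkpoint hands off (and cleans)
       // everything above the green zone upper limit.
       VM_CheckpointBuffers op(n - green_zone_upper_limit);
       VMThread::execute(&op);
     }
   }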
>>>>
>>>> That does away with the whole notion of concurrent refinement but
>>>> will remove a lot of the nasty complicated code that gets executed
>>>> by the mutators or refinement threads.
>>
>> I think it is possible to only reset the card table at the safepoint;
>> the buffers that were filled before taking the snapshot can still be
>> processed concurrently afterwards.
>>
>> (That is also Igor's suggestion from the other email I think).
>>
>> That may be somewhat expensive for very large heaps, but as you
>> mention, that effort could be limited by only cleaning the cards that
>> have a completed buffer entry.
>>
>>>> My main concern is that we would potentially be increasing the
>>>> number and duration of non-GC safepoints, which cause issues with
>>>> latency-sensitive apps. For workloads that only care about 90%
>>>> of the transactions this approach would probably be fine.
>>>>
>>>> We would need to evaluate the performance of each approach.
>>
>> Hth,
>> Thomas
>>
>>
>


