G1 question: concurrent cleaning of dirty cards

Doerr, Martin martin.doerr at sap.com
Tue Sep 10 14:46:45 UTC 2013


Hi Mikael,

great. Thanks for trying.

Btw.: The comment below should state "if ... card is dirty" rather than "clean".

Martin


-----Original Message-----
From: Mikael Gerdin [mailto:mikael.gerdin at oracle.com] 
Sent: Dienstag, 10. September 2013 16:42
To: Doerr, Martin
Cc: Igor Veresov; Thomas Schatzl; hotspot-gc-dev at openjdk.java.net; Braun, Matthias
Subject: Re: G1 question: concurrent cleaning of dirty cards

Martin,

On 2013-09-10 16:30, Doerr, Martin wrote:
> Hi Mikael,
>
> for performance measurements, only the GraphKit part should be relevant.
> So you can try the code below, if you like.

Thanks.

> We definitely need the reload and second comparison, because omitting
> the card marking is only safe if the card which has been loaded after the
> MemBarVolatile is clean.
> I guess the additional branch leads to more branch prediction misses and
> it probably depends on the benchmark and processor if it pays off or not.

Agreed. I'll try it just out of curiosity. I have a few runs going so 
it'll probably be a few days before I get the results.

/Mikael

>
> Best regards,
> Martin
>
>
>
>          __ if_then(card_val, BoolTest::ne, young_card); {
>
>            // Omitting g1_mark_card is only allowed if sequentially consistent version of card is clean.
>            Node* not_already_dirty = __ make_label(1);
>            __ if_then(card_val, BoolTest::ne, dirty_card); {
>              __ goto_(not_already_dirty);
>            } __ end_if();
>
>            sync_kit(ideal);
>            insert_mem_bar(Op_MemBarVolatile, oop_store);
>            __ sync_kit(this);
>
>            card_val = __ load(__ ctrl(), card_adr, TypeInt::INT, T_BYTE, Compile::AliasIdxRaw);
>            __ if_then(card_val, BoolTest::ne, dirty_card); {
>              __ bind(not_already_dirty);
>              g1_mark_card(ideal, card_adr, oop_store, alias_idx, index, index_adr, buffer, tf);
>            } __ end_if();
>          } __ end_if();
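>
> For reference, the control flow this corresponds to would be roughly the
> following (just a sketch, not the exact generated code; StoreLoad() and
> enqueue() stand in for the MemBarVolatile and for what g1_mark_card emits):
>
>          if (card != g1_young_card_val()) {   // stores into young regions need no barrier work
>            if (card == dirty_card_val()) {    // only an already-dirty card needs the membar
>              StoreLoad();                     //   and the re-check after it
>              card = *card_adr;                // reload after the membar
>            }
>            if (card != dirty_card_val()) {    // clean, or cleaned concurrently after the membar
>              *card_adr = dirty_card_val();    // mark ...
>              enqueue(card_adr);               // ... and enqueue (g1_mark_card)
>            }
>          }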
>
>
>
>
> -----Original Message-----
> From: Mikael Gerdin [mailto:mikael.gerdin at oracle.com]
> Sent: Montag, 9. September 2013 16:32
> To: Doerr, Martin
> Cc: Igor Veresov; Thomas Schatzl; hotspot-gc-dev at openjdk.java.net; Braun, Matthias
> Subject: Re: G1 question: concurrent cleaning of dirty cards
>
> Martin,
>
> On 2013-09-09 12:35, Doerr, Martin wrote:
>> Hi Mikael,
>>
>> thanks for this information. We are glad that you're working on this issue.
>>
>> And we appreciate both of your proposals.
>> I was hoping we could avoid memory barriers in the fast paths, but we can live with it as long as the performance penalty and the additional code size are not too bad.
>> I like the card table based filtering of young objects.
>>
>> Just an additional comment on this filtering technique:
>> The membar does not need to be executed if we are going to mark & enqueue anyway. If the case in which the barrier encounters clean cards occurs often, we could skip the membar in that case.
>> Here's a SPARC example:
>> __ cmp_and_br_short(O2, CardTableModRefBS::g1_young_card_val(), Assembler::equal, Assembler::pt, young_card);
>> __ cmp_and_br_short(O2, CardTableModRefBS::dirty_card_val(), Assembler::notEqual, Assembler::pt, not_already_dirty);
>> __ membar(Assembler::Membar_mask_bits(Assembler::StoreLoad));
>> ... reload
>> I guess it won't fix the performance penalty. I just wanted to share this idea with you.
>
> Right, if the card value is clean_card_val we don't need to take the
> membar. On the other hand this adds another conditional branch before
> the membar in the barrier; should we then take another conditional
> branch depending on the reloaded value?
> I'm already stretching my abilities in poking around in the code
> generation parts of the VM but I could probably do some performance runs
> if you want to provide a patch to add the additional conditionals.
>
> I don't know if the trade-off is worth it or not.
>
>>
>> Hopefully the checkpointing approach will perform better in the long term.
>
> I agree, it would be nice to slim down the barriers instead of inflating
> them further.
>
> /Mikael
>
>>
>> Best regards,
>> Martin
>>
>>
>> -----Original Message-----
>> From: Mikael Gerdin [mailto:mikael.gerdin at oracle.com]
>> Sent: Montag, 9. September 2013 10:45
>> To: Doerr, Martin
>> Cc: Igor Veresov; Thomas Schatzl; hotspot-gc-dev at openjdk.java.net; Braun, Matthias
>> Subject: Re: G1 question: concurrent cleaning of dirty cards
>>
>> Martin,
>>
>> On 2013-09-06 18:54, Doerr, Martin wrote:
>>> Hi,
>>>
>>> thanks for sharing your ideas.
>>> Queuing up buffers and releasing them for refinement at a safepoint sounds like a really good solution.
>>>
>>> I believe that the improvement of G1 barriers is already planned. E.g. there's RFE 6816756.
>>> And it should be possible to port the C2 compiler CMS barrier elision (see GraphKit::write_barrier_post) for G1 as well.
>>> This could reduce the frequency of overflowing buffers.
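>>>
>>> For reference, the elision in GraphKit::write_barrier_post is essentially
>>> the following check (quoted from memory, so details may differ):
>>>
>>>   if (use_ReduceInitialCardMarks()
>>>       && obj == just_allocated_object(control())) {
>>>     // stores into the object just allocated in this compiled method can
>>>     // skip the card mark; the runtime takes compensating steps for
>>>     // slow-path allocations to keep this elision safe
>>>     return;   // no post barrier emitted
>>>   }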
>>>
>>> (I'm not against experimenting with additional filtering code, but this seems to be kind of infamous at the moment,
>>> because people are already concerned about large barrier code.)
>>
>> I have prototyped both a version of the filtering barrier and the
>> "special safepoint" variant.
>>
>> The filtering barrier costs a few % of performance on jbb2013, and my
>> prototype of the "special safepoint/checkpointing" variant has horrible
>> (-60%) performance on jbb2013.
>>
>> The checkpointing change needs a lot more work on tweaking the limits
>> and policies for triggering the safepoint and checkpointing the buffers.
>> I basically just wanted to get it to work without crashing and see a
>> ballpark performance number.
>>
>> I don't have a special preference for any of the possible solutions, but
>> I'm not sure if I have the time to get the checkpointing variant into
>> shape for JDK 8 Zero bug bounce, which is Oct 24th[1].
>>
>> One possible approach would be to do the filtering change now and work
>> on the checkpointing variant as a future task (or in parallel by someone
>> else).
>>
>> Webrevs (caution, wear safety glasses! The code is _not_ pretty):
>> http://cr.openjdk.java.net/~mgerdin/g1-conc-clearing/checkpointing/webrev/
>> http://cr.openjdk.java.net/~mgerdin/g1-conc-clearing/filtering/webrev/
>>
>>>
>>> May I ask for a bug id or something which allows tracking of this issue?
>>> Hopefully, it can be addressed during development of hotspot 25.
>>
>> I am currently working on this issue under bug id 8014555. Unfortunately
>> that bug's description contains internal information and is therefore
>> not visible on bugs.sun.com.
>> On the other hand, most of the information in the bug consists of
>> analysis of the crashes and not any discussion about the actual memory
>> ordering problem. In fact, I've not been able to prove that the crashes
>> in the bug are caused by this problem, but if I run the test with any
>> of my attempted fixes the crash does not happen.
>>
>> /Mikael
>>
>> [1] http://openjdk.java.net/projects/jdk8/milestones
>>
>>>
>>> Best regards,
>>> Martin
>>>
>>>
>>> -----Original Message-----
>>> From: hotspot-gc-dev-bounces at openjdk.java.net [mailto:hotspot-gc-dev-bounces at openjdk.java.net] On Behalf Of Igor Veresov
>>> Sent: Donnerstag, 18. Juli 2013 21:36
>>> To: Thomas Schatzl
>>> Cc: hotspot-gc-dev at openjdk.java.net; Braun, Matthias
>>> Subject: Re: G1 question: concurrent cleaning of dirty cards
>>>
>>> I think I tried something like that a while ago (an additional filtering technique similar to what you're proposing). The problem is that even if the table entry is in cache, the additional check bloats the size of the already huge barrier code. But it doesn't mean you can't try again; maybe it's an adequate price to pay now for correctness.
>>> Although, the cardtable-based filtering is still going to be there for the old gen, right? So you'll need a StoreLoad barrier for it to work.
>>>
>>> The alternative approach that I outlined before doesn't need any barrier modification, although would require a bunch of runtime changes.
>>> It would work as follows:
>>> - you execute normally, producing the buffers with modified cards. But the buffers produced will not be available to the conc refinement threads; they are just allowed to accumulate for a while.
>>> - when you reach a certain number of buffers (say, in the "yellow zone"), you have a special safepoint during which you grab all the buffers and put them on another queue that is accessible to the conc refinement threads.
>>> - you iterate over the cards in the buffers that you just grabbed and clean them (which solves the original problem); also can be done very fast in parallel.
>>> - after the execution resumes the conc refinement threads start processing buffers (from that second queue), using the existing card caching which would become more important. Mutators can also participate (as they do now) if the number of the buffers in the second queue would be in the "red zone".
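>>>
>>> A very rough sketch of that special safepoint step, just to illustrate the
>>> idea (all names below are invented for the sketch, not the real HotSpot API):
>>>
>>>   // hypothetical VM operation executed at the checkpoint safepoint
>>>   void VM_CheckpointDirtyCardBuffers::doit() {
>>>     // 1. grab all completed buffers; mutators are stopped, so no store
>>>     //    can race with the card cleaning below
>>>     BufferList grabbed = dirty_card_queue_set()->take_completed_buffers();
>>>     // 2. clean every card referenced by the grabbed buffers
>>>     //    (can be split among worker threads)
>>>     for (Buffer* b = grabbed.head(); b != NULL; b = b->next()) {
>>>       for (size_t i = 0; i < b->length(); i++) {
>>>         *b->card_at(i) = CardTableModRefBS::clean_card_val();
>>>       }
>>>     }
>>>     // 3. hand the buffers over to a second queue that the concurrent
>>>     //    refinement threads (and, in the red zone, mutators) process
>>>     refinement_input_queue()->append(grabbed);
>>>   }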
>>>
>>> Maybe both approaches should be tried and evaluated?
>>>
>>> igor
>>>
>>> On Jul 17, 2013, at 4:20 AM, Thomas Schatzl <thomas.schatzl at oracle.com> wrote:
>>>
>>>> Hi,
>>>>
>>>>     trying to revive that somewhat dying thread with some suggestions...
>>>>
>>>> On Fri, 2013-06-28 at 16:02 -0700, Igor Veresov wrote:
>>>>> The mutator processing doesn't solve it. The card clearing event is
>>>>> still asynchronous with respect to possible mutations in other
>>>>> threads. While one mutator thread is processing buffers and clearing
>>>>> cards the other can sneak in and do the store to the same object that
>>>>> will go unnoticed. So I'm afraid it's either a store-load barrier, or
>>>>> we need to stop all mutator threads to prevent this race, or worse..
>>>>
>>>> One option to reduce the overhead of the store-load barrier is to only
>>>> execute it if it is needed; actually a large part of the memory accesses
>>>> are to the young gen.
>>>> These accesses are going to be filtered out by the existing mechanism
>>>> anyway, are always dirty, and never reset to clean.
>>>>
>>>> An auxiliary (e.g. per-region) table could be used that indicates
>>>> whether, for a particular region, we will actually need the card mark
>>>> and the StoreLoad barrier or not.
>>>>
>>>> Outside of safepoints, entries to that table are only ever marked dirty,
>>>> never reset to clean. This could be done without synchronization I
>>>> think, as in the worst case a thread will see from the card table that
>>>> the corresponding regions' cards are dirty (i.e. will be filtered
>>>> anyway).
>>>>
>>>> The impact of the additional cost in the barrier might be offset by the
>>>> cache bandwidth saved by not accessing the card table to some degree
>>>> (and avoiding the StoreLoad barrier for most accesses). The per-region
>>>> table should be small (a byte per region would be sufficient).
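>>>>
>>>> As a sketch (table name and indexing invented for illustration), the
>>>> barrier entry would then become something like:
>>>>
>>>>   if (region_always_dirty_table[region_index(card_adr)] != 0) {
>>>>     // e.g. young regions: cards are always dirty and never reset to
>>>>     // clean, so neither the card mark nor the StoreLoad is needed
>>>>   } else {
>>>>     // fall back to the existing barrier: card table check, StoreLoad
>>>>     // and card mark / enqueue
>>>>   }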
>>>>
>>>> Actually one could run tests where the card table lookup is completely
>>>> disabled and mutations in the areas not covered by this table are
>>>> always handled.
>>>> If this area is sufficiently small, this could be an option.
>>>>
>>>>> On Jun 28, 2013, at 1:53 PM, John Cuthbertson
>>>>> <john.cuthbertson at oracle.com> wrote:
>>>>>
>>>>>> Hi Igor,
>>>>>>
>>>>>> Yeah G1 has that facility right now. In fact you added it. :) When
>>>>>> the number of completed buffers is below the green zone upper limit,
>>>>>> none of the refinement threads are refining buffers. That is, the
>>>>>> green zone upper limit is the number of buffers that we expect to be
>>>>>> able to process during the GC without going over some percentage
>>>>>> of the pause time (I think the default is 10%). When the number of
>>>>>> buffers grows above the green zone upper limit, the refinement
>>>>>> threads start processing the buffers in a stepped manner.
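>>>>>>
>>>>>> Roughly (a sketch, not the exact code), each refinement worker gets
>>>>>> its own activation threshold, so workers are woken up one by one as
>>>>>> the number of completed buffers keeps growing:
>>>>>>
>>>>>>   // hypothetical check for deciding whether to activate worker i
>>>>>>   size_t threshold = green_zone_upper_limit + (i + 1) * step;
>>>>>>   if (num_completed_buffers > threshold) {
>>>>>>     activate_refinement_worker(i);
>>>>>>   }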
>>>>>>
>>>>>> So during the safepoint we would process N - green-zone-upper-limit
>>>>>> completed buffers. In fact we could have a watcher task that
>>>>>> monitors the number of completed buffers and triggers a safepoint
>>>>>> when the number of completed buffers becomes sufficiently high - say
>>>>>> above the yellow-zone upper limit.
>>>>>>
>>>>>> That does away with the whole notion of concurrent refinement but
>>>>>> will remove a lot of the nasty complicated code that gets executed
>>>>>> by the mutators or refinement threads.
>>>>
>>>> I think it is possible to only reset the card table at the safepoint;
>>>> the buffers that were filled before taking the snapshot can still be
>>>> processed concurrently afterwards.
>>>>
>>>> (That is also Igor's suggestion from the other email I think).
>>>>
>>>> That may be somewhat expensive for very large heaps; but as you mention
>>>> that effort could be limited by only cleaning the cards that have a
>>>> completed buffer entry.
>>>>
>>>>>> My main concern is that we would potentially be increasing the
>>>>>> number and duration of non-GC safepoints, which cause issues with
>>>>>> latency-sensitive apps. For those workloads that only care about 90%
>>>>>> of the transactions, this approach would probably be fine.
>>>>>>
>>>>>> We would need to evaluate the performance of each approach.
>>>>
>>>> Hth,
>>>> Thomas
>>>>
>>>>
>>>


