RFR(M) JDK-8014555: G1: Memory ordering problem with Conc refinement and card marking
Mikael Gerdin
mikael.gerdin at oracle.com
Tue Oct 8 01:05:23 PDT 2013
Martin,
On 10/07/2013 06:04 PM, Doerr, Martin wrote:
> Hi Mikael,
>
> it's even better this way.
Thanks for reviewing!
/Mikael
>
> Thanks for adding my proposal so quickly,
> Martin
>
> -----Original Message-----
> From: Mikael Gerdin [mailto:mikael.gerdin at oracle.com]
> Sent: Montag, 7. Oktober 2013 17:49
> To: Doerr, Martin; Vitaly Davidovich
> Cc: David Holmes; hotspot-dev developers
> Subject: Re: RFR(M) JDK-8014555: G1: Memory ordering problem with Conc refinement and card marking
>
> Hi Martin,
>
> On 10/07/2013 04:13 PM, Doerr, Martin wrote:
>> Hi Mikael,
>>
>> your proposed webrev looks good. It does prevent the problem which I had reported.
>> And I like the card table based recognition of young regions.
>>
>> I have only one comment on the new implementation of G1SATBCardTableLoggingModRefBS::invalidate.
>> It's not necessary to execute the store-load-barrier inside the loops. E.g. it could be implemented this way:
>> // Skip young gen.
>> while ((byte <= last_byte) && (*byte == g1_young_gen)) byte++;
>> if (byte <= last_byte) {
>> // Old gen is included. Must fence before checking cards.
>> OrderAccess::storeload();
>> for (; byte <= last_byte; byte++) {
>> if (*byte == g1_young_gen) {
>> continue;
>> }
>> if (*byte != dirty_card) {
>> *byte = dirty_card;
>> jt->dirty_card_queue().enqueue(byte);
>> }
>> }
>> }
>
> Ahh, you are correct. I like your idea but I modified it slightly.
>
> Incremental webrev:
> http://cr.openjdk.java.net/~mgerdin/8014555/webrev.0.to.1/
> Full webrev:
> http://cr.openjdk.java.net/~mgerdin/8014555/webrev.1/
>
> /Mikael
>
>>
>> (I guess the MutexLockerEx already contains some kind of store-load-ordering, but I'd keep the additional
>> OrderAccess::storeload() in the non-java-thread case to be on the safe side.)
>> It's probably not a big deal. Just wanted to share this thought.
>
> You are probably correct, but I'd like to keep the code consistent and I
> dont think calling invalidate() from the VM thread is a hot path.
>
> /Mikael
>
>>
>> @Vitaly: The original problem was discussed in the thread "G1 question: concurrent cleaning of dirty cards".
>> I had posted the following scenario:
>> Java Thread (mutator) Refinement Thread (G1RemSet::concurrentRefineOneCard_impl calls oops_on_card_seq_iterate_careful)
>>
>> (1) store(oop)
>> ( StoreLoad required here ?)
>> (2) load(card==dirty)
>>
>> (1') store(card==clean)
>> (2') StoreLoad barrier
>> (3') load(oop)
>>
>> Best regards,
>> Martin
>>
>>
>> -----Original Message-----
>> From: hotspot-dev-bounces at openjdk.java.net [mailto:hotspot-dev-bounces at openjdk.java.net] On Behalf Of Mikael Gerdin
>> Sent: Montag, 7. Oktober 2013 14:25
>> To: Vitaly Davidovich
>> Cc: David Holmes; hotspot-dev developers
>> Subject: Re: RFR(M) JDK-8014555: G1: Memory ordering problem with Conc refinement and card marking
>>
>> Vitaly,
>>
>> On 10/04/2013 03:32 PM, Vitaly Davidovich wrote:
>>> Mikael,
>>>
>>> What if you were to read the card value before the store and then check
>>> if it's clean and mark it dirty + enqueue? This would be a loadstore
>>> fence which can be noop in some archs. Or am I missing something?
>>
>> The problem is that we need to make sure that the refinement thread
>> finds our reference when it cleans our card.
>> In order to determine that we need to read the card value from memory
>> _after_ our store to the java object is known to be visible.
>> That way we know that the refinement has not started scanning the card
>> and therefore won't miss our reference.
>>
>> If we read the card value before the store, found that it was already
>> dirty and skipped enqueuing then we are enlarging the window for when
>> the refinement thread may start scanning the card and scan past the
>> address where we would store to the java object.
>>
>> /Mikael
>>
>>>
>>> Thanks
>>>
>>> Sent from my phone
>>>
>>> On Oct 4, 2013 8:18 AM, "Mikael Gerdin" <mikael.gerdin at oracle.com
>>> <mailto:mikael.gerdin at oracle.com>> wrote:
>>>
>>> David,
>>>
>>> On 10/04/2013 11:15 AM, David Holmes wrote:
>>>
>>> Hi Mikael,
>>>
>>> src/cpu/sparc/vm/c1_Runtime1___sparc.cpp:
>>>
>>> I don't quite follow the logic for this:
>>>
>>> __ set(rs, cardtable); // cardtable := <card
>>> table base>
>>> __ ldub(addr, cardtable, tmp); // tmp := [addr +
>>> cardtable]
>>>
>>> + __ cmp_and_br_short(tmp,
>>> G1SATBCardTableModRefBS::g1___young_card_val(), Assembler::equal,
>>> Assembler::pt, young_card);
>>> +
>>> + __
>>> membar(Assembler::Membar_mask___bits(Assembler::StoreLoad));
>>> + __ ldub(addr, cardtable, tmp); // tmp := [addr +
>>> cardtable]
>>>
>>> You load the cardtable with no barrier, checks its value and if
>>> it seems
>>> "wrong" (?) you then issue the storeload barrier and try again.
>>> Why not
>>> just insert the storeload barrier between the store and the load?
>>>
>>> Similar comment for src/cpu/sparc/vm/__macroAssembler_sparc.cpp.
>>>
>>> ---
>>>
>>> src/share/vm/gc___implementation/g1/__g1SATBCardTableModRefBS.cpp:
>>>
>>> void
>>> G1SATBCardTableLoggingModRefBS__::write_ref_field_work(void*
>>> field,
>>> oop
>>> new_val) {
>>> ! volatile jbyte* byte = byte_for(field);
>>> ! if (*byte == g1_young_gen) {
>>> ! return;
>>> ! }
>>> ! OrderAccess::storeload();
>>>
>>> where is the store that you are ordering with?
>>>
>>> Similarly:
>>>
>>> ! for (; byte <= last_byte; byte++) {
>>> ! if (*byte == g1_young_gen) {
>>> ! continue;
>>> ! }
>>> ! OrderAccess::storeload();
>>> if (*byte != dirty_card) {
>>> *byte = dirty_card;
>>> jt->dirty_card_queue().__enqueue(byte);
>>> }
>>> }
>>>
>>> the store seems to be after the barrier ?
>>>
>>>
>>>
>>> Let me take one step back and describe the situation more thouroghly.
>>> The post-barrier for G1 has the following structure before my change
>>> (o is a java object, f is a field in o and x is another java object)
>>>
>>> o.f = x
>>> if (*card_for(o) == dirty) then goto done
>>> *card_for(o) = dirty
>>> enqueue(card_for(o))
>>> done:
>>>
>>> The memory ordering problem can occur between the store
>>> o.f = x and the read of *card_for(o)
>>> This is because the concurrent refinement thread will set a card to
>>> "clean" before it starts scanning the card for references, but that
>>> store is not guaranteed to be visible to the thread executing the
>>> post-barrier.
>>> So if the CPU re-orders the load from card_for(o) to before the
>>> write of o.f the refinement thread may have scanned past the address
>>> of o.f but the application thread sees the stale "dirty" value and
>>> skips the barrier.
>>>
>>>
>>> The simple fix for this race is then to insert a storeload between
>>> the store and the load:
>>> o.f = x
>>> #StoreLoad
>>> if (*card_for(o) == dirty) then goto done
>>> *card_for(o) = dirty
>>> enqueue(card_for(o))
>>> done:
>>>
>>> Inserting the storeload is too expensive since this code is in the
>>> object field write fast-path. So I introduced a way to exclude some
>>> writes by relying on a new card value which I know is not suspect to
>>> concurrent updates:
>>>
>>> o.f = x
>>> if (*card_for(o) == g1_young) then goto done
>>> #StoreLoad (because I can only rely on the value "dirty" if the
>>> store was committed before the following read)
>>> if (*card_for(o) == dirty) then goto done
>>> *card_for(o) = dirty
>>> enqueue(card_for(o))
>>> done:
>>>
>>> I hope this clarifies the issue and the reasoning behind my changes.
>>>
>>> /Mikael
>>>
>>>
>>> Cheers,
>>> David
>>>
>>> On 2/10/2013 10:28 PM, Mikael Gerdin wrote:
>>>
>>> Hi
>>>
>>> Please review my fix for the issue discussed in the "G1
>>> question:
>>> concurrent cleaning of dirty cards" thread on the
>>> hotspot-gc-dev mailing
>>> list.
>>>
>>> I'd like someone from the compiler (and runtime? the
>>> interpreter uses
>>> macroAssembler_*, right?) teams to at least look at the
>>> changes to:
>>> macroAssembler_*.cpp
>>> c1_Runtime_*.cpp
>>> graphKit.cpp
>>>
>>> Problem description:
>>> G1 has a race where the concurrent refinement thread may
>>> miss object
>>> references in a dirty card.
>>> The problem arises if the CPU re-orders the load of the old
>>> card value
>>> (which G1 checks to determine if it can skip the barrier)
>>> before the store to the actual object.
>>> If that occurs the concurrent refinement thread may have set
>>> the card to
>>> "clean" and proceeded to scan the card but the java thread
>>> may have seen
>>> the "dirty" value and skipped the post barrier.
>>>
>>> Suggested fix:
>>> * Add a memory fence between the store to a java object and
>>> the reading
>>> of the previous card value.
>>> * Modify the code for handling young regions so that all
>>> writes to young
>>> regions can skip the fence (since it will never be needed
>>> for such
>>> writes). This introduces a new value in the card table for
>>> G1 which
>>> indicates a young region.
>>>
>>> Performance impact:
>>> * This fix has a negative throughput performance impact of
>>> 1-1.5%
>>> (tested on x86-AMD x86-Intel and SPARC).
>>> * We may want to try to get rid of this race at some point by
>>> redesigning G1's post barrier but there is not enough time
>>> to do that
>>> for JDK8.
>>>
>>> Performance numbers for x86 platforms can be seen here:
>>> http://cr.openjdk.java.net/~__mgerdin/8014555/perf.txt
>>> <http://cr.openjdk.java.net/~mgerdin/8014555/perf.txt>
>>>
>>> Unfortunately the JIRA issue is not externally visible, but
>>> the major
>>> parts of the discussions about this are present in the
>>> mailing list
>>> thread. The bug mostly contains my analysis of the crashes
>>> which seems
>>> to have been caused by this bug.
>>>
>>> Bug link: https://bugs.openjdk.java.net/__browse/JDK-8014555
>>> <https://bugs.openjdk.java.net/browse/JDK-8014555>
>>>
>>> Webrev:
>>> http://cr.openjdk.java.net/~__mgerdin/8014555/webrev.0
>>> <http://cr.openjdk.java.net/~mgerdin/8014555/webrev.0>
>>>
>>>
>>
>
More information about the hotspot-dev
mailing list