RFR(M) JDK-8014555: G1: Memory ordering problem with Conc refinement and card marking
Mikael Gerdin
mikael.gerdin at oracle.com
Mon Oct 7 08:49:02 PDT 2013
Hi Martin,
On 10/07/2013 04:13 PM, Doerr, Martin wrote:
> Hi Mikael,
>
> your proposed webrev looks good. It does prevent the problem which I had reported.
> And I like the card table based recognition of young regions.
>
> I have only one comment on the new implementation of G1SATBCardTableLoggingModRefBS::invalidate.
> It's not necessary to execute the store-load-barrier inside the loops. E.g. it could be implemented this way:
> // Skip young gen.
> while ((byte <= last_byte) && (*byte == g1_young_gen)) byte++;
> if (byte <= last_byte) {
> // Old gen is included. Must fence before checking cards.
> OrderAccess::storeload();
> for (; byte <= last_byte; byte++) {
> if (*byte == g1_young_gen) {
> continue;
> }
> if (*byte != dirty_card) {
> *byte = dirty_card;
> jt->dirty_card_queue().enqueue(byte);
> }
> }
> }
Ahh, you are correct. I like your idea but I modified it slightly.
Incremental webrev:
http://cr.openjdk.java.net/~mgerdin/8014555/webrev.0.to.1/
Full webrev:
http://cr.openjdk.java.net/~mgerdin/8014555/webrev.1/
/Mikael
>
> (I guess the MutexLockerEx already contains some kind of store-load-ordering, but I'd keep the additional
> OrderAccess::storeload() in the non-java-thread case to be on the safe side.)
> It's probably not a big deal. Just wanted to share this thought.
You are probably correct, but I'd like to keep the code consistent and I
dont think calling invalidate() from the VM thread is a hot path.
/Mikael
>
> @Vitaly: The original problem was discussed in the thread "G1 question: concurrent cleaning of dirty cards".
> I had posted the following scenario:
> Java Thread (mutator) Refinement Thread (G1RemSet::concurrentRefineOneCard_impl calls oops_on_card_seq_iterate_careful)
>
> (1) store(oop)
> ( StoreLoad required here ?)
> (2) load(card==dirty)
>
> (1') store(card==clean)
> (2') StoreLoad barrier
> (3') load(oop)
>
> Best regards,
> Martin
>
>
> -----Original Message-----
> From: hotspot-dev-bounces at openjdk.java.net [mailto:hotspot-dev-bounces at openjdk.java.net] On Behalf Of Mikael Gerdin
> Sent: Montag, 7. Oktober 2013 14:25
> To: Vitaly Davidovich
> Cc: David Holmes; hotspot-dev developers
> Subject: Re: RFR(M) JDK-8014555: G1: Memory ordering problem with Conc refinement and card marking
>
> Vitaly,
>
> On 10/04/2013 03:32 PM, Vitaly Davidovich wrote:
>> Mikael,
>>
>> What if you were to read the card value before the store and then check
>> if it's clean and mark it dirty + enqueue? This would be a loadstore
>> fence which can be noop in some archs. Or am I missing something?
>
> The problem is that we need to make sure that the refinement thread
> finds our reference when it cleans our card.
> In order to determine that we need to read the card value from memory
> _after_ our store to the java object is known to be visible.
> That way we know that the refinement has not started scanning the card
> and therefore won't miss our reference.
>
> If we read the card value before the store, found that it was already
> dirty and skipped enqueuing then we are enlarging the window for when
> the refinement thread may start scanning the card and scan past the
> address where we would store to the java object.
>
> /Mikael
>
>>
>> Thanks
>>
>> Sent from my phone
>>
>> On Oct 4, 2013 8:18 AM, "Mikael Gerdin" <mikael.gerdin at oracle.com
>> <mailto:mikael.gerdin at oracle.com>> wrote:
>>
>> David,
>>
>> On 10/04/2013 11:15 AM, David Holmes wrote:
>>
>> Hi Mikael,
>>
>> src/cpu/sparc/vm/c1_Runtime1___sparc.cpp:
>>
>> I don't quite follow the logic for this:
>>
>> __ set(rs, cardtable); // cardtable := <card
>> table base>
>> __ ldub(addr, cardtable, tmp); // tmp := [addr +
>> cardtable]
>>
>> + __ cmp_and_br_short(tmp,
>> G1SATBCardTableModRefBS::g1___young_card_val(), Assembler::equal,
>> Assembler::pt, young_card);
>> +
>> + __
>> membar(Assembler::Membar_mask___bits(Assembler::StoreLoad));
>> + __ ldub(addr, cardtable, tmp); // tmp := [addr +
>> cardtable]
>>
>> You load the cardtable with no barrier, checks its value and if
>> it seems
>> "wrong" (?) you then issue the storeload barrier and try again.
>> Why not
>> just insert the storeload barrier between the store and the load?
>>
>> Similar comment for src/cpu/sparc/vm/__macroAssembler_sparc.cpp.
>>
>> ---
>>
>> src/share/vm/gc___implementation/g1/__g1SATBCardTableModRefBS.cpp:
>>
>> void
>> G1SATBCardTableLoggingModRefBS__::write_ref_field_work(void*
>> field,
>> oop
>> new_val) {
>> ! volatile jbyte* byte = byte_for(field);
>> ! if (*byte == g1_young_gen) {
>> ! return;
>> ! }
>> ! OrderAccess::storeload();
>>
>> where is the store that you are ordering with?
>>
>> Similarly:
>>
>> ! for (; byte <= last_byte; byte++) {
>> ! if (*byte == g1_young_gen) {
>> ! continue;
>> ! }
>> ! OrderAccess::storeload();
>> if (*byte != dirty_card) {
>> *byte = dirty_card;
>> jt->dirty_card_queue().__enqueue(byte);
>> }
>> }
>>
>> the store seems to be after the barrier ?
>>
>>
>>
>> Let me take one step back and describe the situation more thouroghly.
>> The post-barrier for G1 has the following structure before my change
>> (o is a java object, f is a field in o and x is another java object)
>>
>> o.f = x
>> if (*card_for(o) == dirty) then goto done
>> *card_for(o) = dirty
>> enqueue(card_for(o))
>> done:
>>
>> The memory ordering problem can occur between the store
>> o.f = x and the read of *card_for(o)
>> This is because the concurrent refinement thread will set a card to
>> "clean" before it starts scanning the card for references, but that
>> store is not guaranteed to be visible to the thread executing the
>> post-barrier.
>> So if the CPU re-orders the load from card_for(o) to before the
>> write of o.f the refinement thread may have scanned past the address
>> of o.f but the application thread sees the stale "dirty" value and
>> skips the barrier.
>>
>>
>> The simple fix for this race is then to insert a storeload between
>> the store and the load:
>> o.f = x
>> #StoreLoad
>> if (*card_for(o) == dirty) then goto done
>> *card_for(o) = dirty
>> enqueue(card_for(o))
>> done:
>>
>> Inserting the storeload is too expensive since this code is in the
>> object field write fast-path. So I introduced a way to exclude some
>> writes by relying on a new card value which I know is not suspect to
>> concurrent updates:
>>
>> o.f = x
>> if (*card_for(o) == g1_young) then goto done
>> #StoreLoad (because I can only rely on the value "dirty" if the
>> store was committed before the following read)
>> if (*card_for(o) == dirty) then goto done
>> *card_for(o) = dirty
>> enqueue(card_for(o))
>> done:
>>
>> I hope this clarifies the issue and the reasoning behind my changes.
>>
>> /Mikael
>>
>>
>> Cheers,
>> David
>>
>> On 2/10/2013 10:28 PM, Mikael Gerdin wrote:
>>
>> Hi
>>
>> Please review my fix for the issue discussed in the "G1
>> question:
>> concurrent cleaning of dirty cards" thread on the
>> hotspot-gc-dev mailing
>> list.
>>
>> I'd like someone from the compiler (and runtime? the
>> interpreter uses
>> macroAssembler_*, right?) teams to at least look at the
>> changes to:
>> macroAssembler_*.cpp
>> c1_Runtime_*.cpp
>> graphKit.cpp
>>
>> Problem description:
>> G1 has a race where the concurrent refinement thread may
>> miss object
>> references in a dirty card.
>> The problem arises if the CPU re-orders the load of the old
>> card value
>> (which G1 checks to determine if it can skip the barrier)
>> before the store to the actual object.
>> If that occurs the concurrent refinement thread may have set
>> the card to
>> "clean" and proceeded to scan the card but the java thread
>> may have seen
>> the "dirty" value and skipped the post barrier.
>>
>> Suggested fix:
>> * Add a memory fence between the store to a java object and
>> the reading
>> of the previous card value.
>> * Modify the code for handling young regions so that all
>> writes to young
>> regions can skip the fence (since it will never be needed
>> for such
>> writes). This introduces a new value in the card table for
>> G1 which
>> indicates a young region.
>>
>> Performance impact:
>> * This fix has a negative throughput performance impact of
>> 1-1.5%
>> (tested on x86-AMD x86-Intel and SPARC).
>> * We may want to try to get rid of this race at some point by
>> redesigning G1's post barrier but there is not enough time
>> to do that
>> for JDK8.
>>
>> Performance numbers for x86 platforms can be seen here:
>> http://cr.openjdk.java.net/~__mgerdin/8014555/perf.txt
>> <http://cr.openjdk.java.net/~mgerdin/8014555/perf.txt>
>>
>> Unfortunately the JIRA issue is not externally visible, but
>> the major
>> parts of the discussions about this are present in the
>> mailing list
>> thread. The bug mostly contains my analysis of the crashes
>> which seems
>> to have been caused by this bug.
>>
>> Bug link: https://bugs.openjdk.java.net/__browse/JDK-8014555
>> <https://bugs.openjdk.java.net/browse/JDK-8014555>
>>
>> Webrev:
>> http://cr.openjdk.java.net/~__mgerdin/8014555/webrev.0
>> <http://cr.openjdk.java.net/~mgerdin/8014555/webrev.0>
>>
>>
>
More information about the hotspot-dev
mailing list