RFR(M) JDK-8014555: G1: Memory ordering problem with Conc refinement and card marking

Tue Oct 8 01:05:23 PDT 2013

Martin,

On 10/07/2013 06:04 PM, Doerr, Martin wrote:
> Hi Mikael,
>
> it's even better this way.

Thanks for reviewing!
/Mikael

>
> Thanks for adding my proposal so quickly,
> Martin
>
> -----Original Message-----
> From: Mikael Gerdin [mailto:mikael.gerdin at oracle.com]
> Sent: Montag, 7. Oktober 2013 17:49
> To: Doerr, Martin; Vitaly Davidovich
> Cc: David Holmes; hotspot-dev developers
> Subject: Re: RFR(M) JDK-8014555: G1: Memory ordering problem with Conc refinement and card marking
>
> Hi Martin,
>
> On 10/07/2013 04:13 PM, Doerr, Martin wrote:
>> Hi Mikael,
>>
>> your proposed webrev looks good. It does prevent the problem which I had reported.
>> And I like the card table based recognition of young regions.
>>
>> I have only one comment on the new implementation of G1SATBCardTableLoggingModRefBS::invalidate.
>> It's not necessary to execute the store-load-barrier inside the loops. E.g. it could be implemented this way:
>> // Skip young gen.
>> while ((byte <= last_byte) && (*byte == g1_young_gen)) byte++;
>> if (byte <= last_byte) {
>>     // Old gen is included. Must fence before checking cards.
>>     OrderAccess::storeload();
>>     for (; byte <= last_byte; byte++) {
>>       if (*byte == g1_young_gen) {
>>         continue;
>>       }
>>       if (*byte != dirty_card) {
>>         *byte = dirty_card;
>>         jt->dirty_card_queue().enqueue(byte);
>>       }
>>     }
>> }
>
> Ahh, you are correct. I like your idea but I modified it slightly.
>
> Incremental webrev:
> http://cr.openjdk.java.net/~mgerdin/8014555/webrev.0.to.1/
> Full webrev:
> http://cr.openjdk.java.net/~mgerdin/8014555/webrev.1/
>
> /Mikael
>
>>
>> (I guess the MutexLockerEx already contains some kind of store-load-ordering, but I'd keep the additional
>> OrderAccess::storeload() in the non-java-thread case to be on the safe side.)
>> It's probably not a big deal. Just wanted to share this thought.
>
> You are probably correct, but I'd like to keep the code consistent and I
> dont think calling invalidate() from the VM thread is a hot path.
>
> /Mikael
>
>>
>> @Vitaly: The original problem was discussed in the thread "G1 question: concurrent cleaning of dirty cards".
>> I had posted the following scenario:
>> Java Thread (mutator)              Refinement Thread (G1RemSet::concurrentRefineOneCard_impl calls oops_on_card_seq_iterate_careful)
>>
>> (1)  store(oop)
>> ( StoreLoad required here ?)
>> (2)  load(card==dirty)
>>
>>                                      (1') store(card==clean)
>>                                      (2') StoreLoad barrier
>>                                      (3') load(oop)
>>
>> Best regards,
>> Martin
>>
>>
>> -----Original Message-----
>> From: hotspot-dev-bounces at openjdk.java.net [mailto:hotspot-dev-bounces at openjdk.java.net] On Behalf Of Mikael Gerdin
>> Sent: Montag, 7. Oktober 2013 14:25
>> To: Vitaly Davidovich
>> Cc: David Holmes; hotspot-dev developers
>> Subject: Re: RFR(M) JDK-8014555: G1: Memory ordering problem with Conc refinement and card marking
>>
>> Vitaly,
>>
>> On 10/04/2013 03:32 PM, Vitaly Davidovich wrote:
>>> Mikael,
>>>
>>> What if you were to read the card value before the store and then check
>>> if it's clean and mark it dirty + enqueue? This would be a loadstore
>>> fence which can be noop in some archs.  Or am I missing something?
>>
>> The problem is that we need to make sure that the refinement thread
>> finds our reference when it cleans our card.
>> In order to determine that we need to read the card value from memory
>> _after_ our store to the java object is known to be visible.
>> That way we know that the refinement has not started scanning the card
>> and therefore won't miss our reference.
>>
>> If we read the card value before the store, found that it was already
>> dirty and skipped enqueuing then we are enlarging the window for when
>> the refinement thread may start scanning the card and scan past the
>> address where we would store to the java object.
>>
>> /Mikael
>>
>>>
>>> Thanks
>>>
>>> Sent from my phone
>>>
>>> On Oct 4, 2013 8:18 AM, "Mikael Gerdin" <mikael.gerdin at oracle.com
>>> <mailto:mikael.gerdin at oracle.com>> wrote:
>>>
>>>       David,
>>>
>>>       On 10/04/2013 11:15 AM, David Holmes wrote:
>>>
>>>           Hi Mikael,
>>>
>>>           src/cpu/sparc/vm/c1_Runtime1___sparc.cpp:
>>>
>>>           I don't quite follow the logic for this:
>>>
>>>                       __ set(rs, cardtable);         // cardtable := <card
>>>           table base>
>>>                       __ ldub(addr, cardtable, tmp); // tmp := [addr +
>>>           cardtable]
>>>
>>>           +         __ cmp_and_br_short(tmp,
>>>           G1SATBCardTableModRefBS::g1___young_card_val(), Assembler::equal,
>>>           Assembler::pt, young_card);
>>>           +
>>>           +         __
>>>           membar(Assembler::Membar_mask___bits(Assembler::StoreLoad));
>>>           +         __ ldub(addr, cardtable, tmp); // tmp := [addr +
>>>           cardtable]
>>>
>>>           You load the cardtable with no barrier, checks its value and if
>>>           it seems
>>>           "wrong" (?) you then issue the storeload barrier and try again.
>>>           Why not
>>>           just insert the storeload barrier between the store and the load?
>>>
>>>           Similar comment for src/cpu/sparc/vm/__macroAssembler_sparc.cpp.
>>>
>>>           ---
>>>
>>>           src/share/vm/gc___implementation/g1/__g1SATBCardTableModRefBS.cpp:
>>>
>>>               void
>>>               G1SATBCardTableLoggingModRefBS__::write_ref_field_work(void*
>>>           field,
>>>                                                                    oop
>>>           new_val) {
>>>           !   volatile jbyte* byte = byte_for(field);
>>>           !   if (*byte == g1_young_gen) {
>>>           !     return;
>>>           !   }
>>>           !   OrderAccess::storeload();
>>>
>>>           where is the store that you are ordering with?
>>>
>>>           Similarly:
>>>
>>>           !       for (; byte <= last_byte; byte++) {
>>>           !         if (*byte == g1_young_gen) {
>>>           !           continue;
>>>           !         }
>>>           !         OrderAccess::storeload();
>>>                       if (*byte != dirty_card) {
>>>                         *byte = dirty_card;
>>>                         jt->dirty_card_queue().__enqueue(byte);
>>>                       }
>>>                     }
>>>
>>>           the store seems to be after the barrier ?
>>>
>>>
>>>
>>>       Let me take one step back and describe the situation more thouroghly.
>>>       The post-barrier for G1 has the following structure before my change
>>>       (o is a java object, f is a field in o and x is another java object)
>>>
>>>       o.f = x
>>>       if (*card_for(o) == dirty) then goto done
>>>       *card_for(o) = dirty
>>>       enqueue(card_for(o))
>>>       done:
>>>
>>>       The memory ordering problem can occur between the store
>>>       o.f = x and the read of *card_for(o)
>>>       This is because the concurrent refinement thread will set a card to
>>>       "clean" before it starts scanning the card for references, but that
>>>       store is not guaranteed to be visible to the thread executing the
>>>       post-barrier.
>>>       So if the CPU re-orders the load from card_for(o) to before the
>>>       write of o.f the refinement thread may have scanned past the address
>>>       of o.f but the application thread sees the stale "dirty" value and
>>>       skips the barrier.
>>>
>>>
>>>       The simple fix for this race is then to insert a storeload between
>>>       the store and the load:
>>>       o.f = x
>>>       #StoreLoad
>>>       if (*card_for(o) == dirty) then goto done
>>>       *card_for(o) = dirty
>>>       enqueue(card_for(o))
>>>       done:
>>>
>>>       Inserting the storeload is too expensive since this code is in the
>>>       object field write fast-path. So I introduced a way to exclude some
>>>       writes by relying on a new card value which I know is not suspect to
>>>       concurrent updates:
>>>
>>>       o.f = x
>>>       if (*card_for(o) == g1_young) then goto done
>>>       #StoreLoad (because I can only rely on the value "dirty" if the
>>>       store was committed before the following read)
>>>       if (*card_for(o) == dirty) then goto done
>>>       *card_for(o) = dirty
>>>       enqueue(card_for(o))
>>>       done:
>>>
>>>       I hope this clarifies the issue and the reasoning behind my changes.
>>>
>>>       /Mikael
>>>
>>>
>>>           Cheers,
>>>           David
>>>
>>>           On 2/10/2013 10:28 PM, Mikael Gerdin wrote:
>>>
>>>               Hi
>>>
>>>               Please review my fix for the issue discussed in the "G1
>>>               question:
>>>               concurrent cleaning of dirty cards" thread on the
>>>               hotspot-gc-dev mailing
>>>               list.
>>>
>>>               I'd like someone from the compiler (and runtime? the
>>>               interpreter uses
>>>               macroAssembler_*, right?) teams to at least look at the
>>>               changes to:
>>>               macroAssembler_*.cpp
>>>               c1_Runtime_*.cpp
>>>               graphKit.cpp
>>>
>>>               Problem description:
>>>               G1 has a race where the concurrent refinement thread may
>>>               miss object
>>>               references in a dirty card.
>>>               The problem arises if the CPU re-orders the load of the old
>>>               card value
>>>               (which G1 checks to determine if it can skip the barrier)
>>>               before the store to the actual object.
>>>               If that occurs the concurrent refinement thread may have set
>>>               the card to
>>>               "clean" and proceeded to scan the card but the java thread
>>>               may have seen
>>>               the "dirty" value and skipped the post barrier.
>>>
>>>               Suggested fix:
>>>               * Add a memory fence between the store to a java object and
>>>               the reading
>>>               of the previous card value.
>>>               * Modify the code for handling young regions so that all
>>>               writes to young
>>>               regions can skip the fence (since it will never be needed
>>>               for such
>>>               writes). This introduces a new value in the card table for
>>>               G1 which
>>>               indicates a young region.
>>>
>>>               Performance impact:
>>>               * This fix has a negative throughput performance impact of
>>>               1-1.5%
>>>               (tested on x86-AMD x86-Intel and SPARC).
>>>               * We may want to try to get rid of this race at some point by
>>>               redesigning G1's post barrier but there is not enough time
>>>               to do that
>>>               for JDK8.
>>>
>>>               Performance numbers for x86 platforms can be seen here:
>>>               http://cr.openjdk.java.net/~__mgerdin/8014555/perf.txt
>>>               <http://cr.openjdk.java.net/~mgerdin/8014555/perf.txt>
>>>
>>>               Unfortunately the JIRA issue is not externally visible, but
>>>               the major
>>>               parts of the discussions about this are present in the
>>>               mailing list
>>>               thread. The bug mostly contains my analysis of the crashes
>>>               which seems
>>>               to have been caused by this bug.
>>>
>>>               Bug link: https://bugs.openjdk.java.net/__browse/JDK-8014555
>>>               <https://bugs.openjdk.java.net/browse/JDK-8014555>
>>>
>>>               Webrev:
>>>               http://cr.openjdk.java.net/~__mgerdin/8014555/webrev.0
>>>               <http://cr.openjdk.java.net/~mgerdin/8014555/webrev.0>
>>>
>>>
>>
>