RFR(M) JDK-8014555: G1: Memory ordering problem with Conc refinement and card marking

Mon Oct 7 08:49:02 PDT 2013

Hi Martin,

On 10/07/2013 04:13 PM, Doerr, Martin wrote:
> Hi Mikael,
>
> your proposed webrev looks good. It does prevent the problem which I had reported.
> And I like the card table based recognition of young regions.
>
> I have only one comment on the new implementation of G1SATBCardTableLoggingModRefBS::invalidate.
> It's not necessary to execute the store-load-barrier inside the loops. E.g. it could be implemented this way:
> // Skip young gen.
> while ((byte <= last_byte) && (*byte == g1_young_gen)) byte++;
> if (byte <= last_byte) {
>    // Old gen is included. Must fence before checking cards.
>    OrderAccess::storeload();
>    for (; byte <= last_byte; byte++) {
>      if (*byte == g1_young_gen) {
>        continue;
>      }
>      if (*byte != dirty_card) {
>        *byte = dirty_card;
>        jt->dirty_card_queue().enqueue(byte);
>      }
>    }
> }

Ahh, you are correct. I like your idea but I modified it slightly.

Incremental webrev:
http://cr.openjdk.java.net/~mgerdin/8014555/webrev.0.to.1/
Full webrev:
http://cr.openjdk.java.net/~mgerdin/8014555/webrev.1/

/Mikael

>
> (I guess the MutexLockerEx already contains some kind of store-load-ordering, but I'd keep the additional
> OrderAccess::storeload() in the non-java-thread case to be on the safe side.)
> It's probably not a big deal. Just wanted to share this thought.

You are probably correct, but I'd like to keep the code consistent and I 
dont think calling invalidate() from the VM thread is a hot path.

/Mikael

>
> @Vitaly: The original problem was discussed in the thread "G1 question: concurrent cleaning of dirty cards".
> I had posted the following scenario:
> Java Thread (mutator)              Refinement Thread (G1RemSet::concurrentRefineOneCard_impl calls oops_on_card_seq_iterate_careful)
>
> (1)  store(oop)
> ( StoreLoad required here ?)
> (2)  load(card==dirty)
>
>                                     (1') store(card==clean)
>                                     (2') StoreLoad barrier
>                                     (3') load(oop)
>
> Best regards,
> Martin
>
>
> -----Original Message-----
> From: hotspot-dev-bounces at openjdk.java.net [mailto:hotspot-dev-bounces at openjdk.java.net] On Behalf Of Mikael Gerdin
> Sent: Montag, 7. Oktober 2013 14:25
> To: Vitaly Davidovich
> Cc: David Holmes; hotspot-dev developers
> Subject: Re: RFR(M) JDK-8014555: G1: Memory ordering problem with Conc refinement and card marking
>
> Vitaly,
>
> On 10/04/2013 03:32 PM, Vitaly Davidovich wrote:
>> Mikael,
>>
>> What if you were to read the card value before the store and then check
>> if it's clean and mark it dirty + enqueue? This would be a loadstore
>> fence which can be noop in some archs.  Or am I missing something?
>
> The problem is that we need to make sure that the refinement thread
> finds our reference when it cleans our card.
> In order to determine that we need to read the card value from memory
> _after_ our store to the java object is known to be visible.
> That way we know that the refinement has not started scanning the card
> and therefore won't miss our reference.
>
> If we read the card value before the store, found that it was already
> dirty and skipped enqueuing then we are enlarging the window for when
> the refinement thread may start scanning the card and scan past the
> address where we would store to the java object.
>
> /Mikael
>
>>
>> Thanks
>>
>> Sent from my phone
>>
>> On Oct 4, 2013 8:18 AM, "Mikael Gerdin" <mikael.gerdin at oracle.com
>> <mailto:mikael.gerdin at oracle.com>> wrote:
>>
>>      David,
>>
>>      On 10/04/2013 11:15 AM, David Holmes wrote:
>>
>>          Hi Mikael,
>>
>>          src/cpu/sparc/vm/c1_Runtime1___sparc.cpp:
>>
>>          I don't quite follow the logic for this:
>>
>>                      __ set(rs, cardtable);         // cardtable := <card
>>          table base>
>>                      __ ldub(addr, cardtable, tmp); // tmp := [addr +
>>          cardtable]
>>
>>          +         __ cmp_and_br_short(tmp,
>>          G1SATBCardTableModRefBS::g1___young_card_val(), Assembler::equal,
>>          Assembler::pt, young_card);
>>          +
>>          +         __
>>          membar(Assembler::Membar_mask___bits(Assembler::StoreLoad));
>>          +         __ ldub(addr, cardtable, tmp); // tmp := [addr +
>>          cardtable]
>>
>>          You load the cardtable with no barrier, checks its value and if
>>          it seems
>>          "wrong" (?) you then issue the storeload barrier and try again.
>>          Why not
>>          just insert the storeload barrier between the store and the load?
>>
>>          Similar comment for src/cpu/sparc/vm/__macroAssembler_sparc.cpp.
>>
>>          ---
>>
>>          src/share/vm/gc___implementation/g1/__g1SATBCardTableModRefBS.cpp:
>>
>>              void
>>              G1SATBCardTableLoggingModRefBS__::write_ref_field_work(void*
>>          field,
>>                                                                   oop
>>          new_val) {
>>          !   volatile jbyte* byte = byte_for(field);
>>          !   if (*byte == g1_young_gen) {
>>          !     return;
>>          !   }
>>          !   OrderAccess::storeload();
>>
>>          where is the store that you are ordering with?
>>
>>          Similarly:
>>
>>          !       for (; byte <= last_byte; byte++) {
>>          !         if (*byte == g1_young_gen) {
>>          !           continue;
>>          !         }
>>          !         OrderAccess::storeload();
>>                      if (*byte != dirty_card) {
>>                        *byte = dirty_card;
>>                        jt->dirty_card_queue().__enqueue(byte);
>>                      }
>>                    }
>>
>>          the store seems to be after the barrier ?
>>
>>
>>
>>      Let me take one step back and describe the situation more thouroghly.
>>      The post-barrier for G1 has the following structure before my change
>>      (o is a java object, f is a field in o and x is another java object)
>>
>>      o.f = x
>>      if (*card_for(o) == dirty) then goto done
>>      *card_for(o) = dirty
>>      enqueue(card_for(o))
>>      done:
>>
>>      The memory ordering problem can occur between the store
>>      o.f = x and the read of *card_for(o)
>>      This is because the concurrent refinement thread will set a card to
>>      "clean" before it starts scanning the card for references, but that
>>      store is not guaranteed to be visible to the thread executing the
>>      post-barrier.
>>      So if the CPU re-orders the load from card_for(o) to before the
>>      write of o.f the refinement thread may have scanned past the address
>>      of o.f but the application thread sees the stale "dirty" value and
>>      skips the barrier.
>>
>>
>>      The simple fix for this race is then to insert a storeload between
>>      the store and the load:
>>      o.f = x
>>      #StoreLoad
>>      if (*card_for(o) == dirty) then goto done
>>      *card_for(o) = dirty
>>      enqueue(card_for(o))
>>      done:
>>
>>      Inserting the storeload is too expensive since this code is in the
>>      object field write fast-path. So I introduced a way to exclude some
>>      writes by relying on a new card value which I know is not suspect to
>>      concurrent updates:
>>
>>      o.f = x
>>      if (*card_for(o) == g1_young) then goto done
>>      #StoreLoad (because I can only rely on the value "dirty" if the
>>      store was committed before the following read)
>>      if (*card_for(o) == dirty) then goto done
>>      *card_for(o) = dirty
>>      enqueue(card_for(o))
>>      done:
>>
>>      I hope this clarifies the issue and the reasoning behind my changes.
>>
>>      /Mikael
>>
>>
>>          Cheers,
>>          David
>>
>>          On 2/10/2013 10:28 PM, Mikael Gerdin wrote:
>>
>>              Hi
>>
>>              Please review my fix for the issue discussed in the "G1
>>              question:
>>              concurrent cleaning of dirty cards" thread on the
>>              hotspot-gc-dev mailing
>>              list.
>>
>>              I'd like someone from the compiler (and runtime? the
>>              interpreter uses
>>              macroAssembler_*, right?) teams to at least look at the
>>              changes to:
>>              macroAssembler_*.cpp
>>              c1_Runtime_*.cpp
>>              graphKit.cpp
>>
>>              Problem description:
>>              G1 has a race where the concurrent refinement thread may
>>              miss object
>>              references in a dirty card.
>>              The problem arises if the CPU re-orders the load of the old
>>              card value
>>              (which G1 checks to determine if it can skip the barrier)
>>              before the store to the actual object.
>>              If that occurs the concurrent refinement thread may have set
>>              the card to
>>              "clean" and proceeded to scan the card but the java thread
>>              may have seen
>>              the "dirty" value and skipped the post barrier.
>>
>>              Suggested fix:
>>              * Add a memory fence between the store to a java object and
>>              the reading
>>              of the previous card value.
>>              * Modify the code for handling young regions so that all
>>              writes to young
>>              regions can skip the fence (since it will never be needed
>>              for such
>>              writes). This introduces a new value in the card table for
>>              G1 which
>>              indicates a young region.
>>
>>              Performance impact:
>>              * This fix has a negative throughput performance impact of
>>              1-1.5%
>>              (tested on x86-AMD x86-Intel and SPARC).
>>              * We may want to try to get rid of this race at some point by
>>              redesigning G1's post barrier but there is not enough time
>>              to do that
>>              for JDK8.
>>
>>              Performance numbers for x86 platforms can be seen here:
>>              http://cr.openjdk.java.net/~__mgerdin/8014555/perf.txt
>>              <http://cr.openjdk.java.net/~mgerdin/8014555/perf.txt>
>>
>>              Unfortunately the JIRA issue is not externally visible, but
>>              the major
>>              parts of the discussions about this are present in the
>>              mailing list
>>              thread. The bug mostly contains my analysis of the crashes
>>              which seems
>>              to have been caused by this bug.
>>
>>              Bug link: https://bugs.openjdk.java.net/__browse/JDK-8014555
>>              <https://bugs.openjdk.java.net/browse/JDK-8014555>
>>
>>              Webrev:
>>              http://cr.openjdk.java.net/~__mgerdin/8014555/webrev.0
>>              <http://cr.openjdk.java.net/~mgerdin/8014555/webrev.0>
>>
>>
>