RFR(M) JDK-8014555: G1: Memory ordering problem with Conc refinement and card marking

Mikael Gerdin mikael.gerdin at oracle.com
Fri Oct 4 06:03:05 PDT 2013


David,

On 10/04/2013 02:32 PM, David Holmes wrote:
> Hi Mikael,
>
> Thanks for the explanation. See inline ...
>
> On 4/10/2013 10:17 PM, Mikael Gerdin wrote:
>> David,
>>
>> On 10/04/2013 11:15 AM, David Holmes wrote:
>>> Hi Mikael,
>>>
>>> src/cpu/sparc/vm/c1_Runtime1_sparc.cpp:
>>>
>>> I don't quite follow the logic for this:
>>>
>>>            __ set(rs, cardtable);         // cardtable := <card table base>
>>>            __ ldub(addr, cardtable, tmp); // tmp := [addr + cardtable]
>>>
>>> +         __ cmp_and_br_short(tmp, G1SATBCardTableModRefBS::g1_young_card_val(), Assembler::equal, Assembler::pt, young_card);
>>> +
>>> +         __ membar(Assembler::Membar_mask_bits(Assembler::StoreLoad));
>>> +         __ ldub(addr, cardtable, tmp); // tmp := [addr + cardtable]
>>>
>>> You load the card table entry with no barrier, check its value, and if
>>> it seems "wrong" (?) you then issue the storeload barrier and try again.
>>> Why not just insert the storeload barrier between the store and the load?
>>>
>>> Similar comment for src/cpu/sparc/vm/macroAssembler_sparc.cpp.
>>>
>
> I can see how what you explained fits in here but ...
>
>>> ---
>>>
>>> src/share/vm/gc_implementation/g1/g1SATBCardTableModRefBS.cpp:
>>>
>>>    void
>>>    G1SATBCardTableLoggingModRefBS::write_ref_field_work(void* field,
>>>                                                         oop new_val) {
>>> !   volatile jbyte* byte = byte_for(field);
>>> !   if (*byte == g1_young_gen) {
>>> !     return;
>>> !   }
>>> !   OrderAccess::storeload();
>>>
>>> where is the store that you are ordering with?
>
> not here - where is the store?

write_ref_field_work() and invalidate() are the C++ implementations of 
the write barrier.
The store in this case is done in a calling frame; the call path looks 
something like:

G1SATBCardTableLoggingModRefBS::write_ref_field_work
BarrierSet::write_ref_field
update_barrier_set
oop_store

The store you are looking for is the encode_store_heap_oop here:
template <class T> inline void oop_store(T* p, oop v) {
  if (always_do_update_barrier) {
    oop_store((volatile T*)p, v);
  } else {
    update_barrier_set_pre(p, v);
    oopDesc::encode_store_heap_oop(p, v);
    update_barrier_set((void*)p, v);  // cast away type
  }
}
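
To make the ordering concrete, here is a minimal, self-contained C++ 
sketch of what the patched barrier relies on. It uses plain std::atomic 
and a fence instead of OrderAccess, and the names write_ref_field, 
card_for, enqueue_dirty_card and the single stand-in card are simplified 
placeholders, not HotSpot code. The point is only that the reference 
store happens first (in the caller, via encode_store_heap_oop) and the 
StoreLoad fence keeps the card re-read from floating above it:

#include <atomic>
#include <cstdint>

enum : uint8_t { dirty_card = 0, g1_young_gen = 1, clean_card = 0xff };

static std::atomic<uint8_t> g_card{clean_card};   // stand-in for the card covering o
static std::atomic<void*>   g_field{nullptr};     // stand-in for the field o.f

static std::atomic<uint8_t>* card_for(void*) { return &g_card; }
static void enqueue_dirty_card(std::atomic<uint8_t>*) { /* hand the card to refinement */ }

void write_ref_field(void* new_val) {
  g_field.store(new_val, std::memory_order_relaxed);    // o.f = x, done by oop_store in the caller

  std::atomic<uint8_t>* byte = card_for(&g_field);
  if (byte->load(std::memory_order_relaxed) == g1_young_gen) {
    return;                                              // young cards are never refined concurrently
  }
  std::atomic_thread_fence(std::memory_order_seq_cst);   // the #StoreLoad: o.f = x must be visible
                                                         // before the card value below can be trusted
  if (byte->load(std::memory_order_relaxed) != dirty_card) {
    byte->store(dirty_card, std::memory_order_relaxed);
    enqueue_dirty_card(byte);                            // log the card for concurrent refinement
  }
}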

>
>>> Similarly:
>>>
>>> !       for (; byte <= last_byte; byte++) {
>>> !         if (*byte == g1_young_gen) {
>>> !           continue;
>>> !         }
>>> !         OrderAccess::storeload();
>>>            if (*byte != dirty_card) {
>>>              *byte = dirty_card;
>>>              jt->dirty_card_queue().enqueue(byte);
>>>            }
>>>          }
>>>
>>> the store seems to be after the barrier ?
>
> Nor here - which store and load are we ordering?

In the case of invalidate(), the call path is, for example:

G1SATBCardTableLoggingModRefBS::invalidate
G1SATBCardTableLoggingModRefBS::write_ref_array_work
BarrierSet::write_ref_array
ObjArrayKlass::do_copy

where the writes you are looking for are in Copy::conjoint_oops_atomic 
or in the inline copy loop.
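
The array case can be sketched the same way, again with plain 
std::atomic rather than HotSpot code and with simplified stand-in names 
(write_ref_array here collapses the do_copy/write_ref_array/invalidate 
path into one function for illustration). The stores being ordered are 
the element copies, and each card that is not young gets the StoreLoad 
before its value is trusted, matching the loop quoted above:

#include <atomic>
#include <cstddef>
#include <cstdint>

enum : uint8_t { dirty_card = 0, g1_young_gen = 1, clean_card = 0xff };

static const std::size_t num_cards = 4;
static std::atomic<uint8_t> g_cards[num_cards];            // stand-in slice of the card table

static void enqueue_dirty_card(std::atomic<uint8_t>*) { /* hand the card to refinement */ }

void write_ref_array(std::atomic<void*>* dst, void* const* src, std::size_t len,
                     std::size_t first_card, std::size_t last_card) {
  for (std::size_t i = 0; i < len; i++) {                  // the stores: the oop copy loop
    dst[i].store(src[i], std::memory_order_relaxed);       // (Copy::conjoint_oops_atomic in HotSpot)
  }
  for (std::size_t c = first_card; c <= last_card; c++) {  // invalidate(): walk the covered cards
    std::atomic<uint8_t>* byte = &g_cards[c];
    if (byte->load(std::memory_order_relaxed) == g1_young_gen) {
      continue;                                            // young card: no concurrent refinement, skip the fence
    }
    std::atomic_thread_fence(std::memory_order_seq_cst);   // #StoreLoad: the copies above are visible
                                                           // before deciding if the card is already dirty
    if (byte->load(std::memory_order_relaxed) != dirty_card) {
      byte->store(dirty_card, std::memory_order_relaxed);
      enqueue_dirty_card(byte);
    }
  }
}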

/Mikael

>
> Thanks,
> David
> -----
>
>>
>> Let me take one step back and describe the situation more thoroughly.
>> The post-barrier for G1 has the following structure before my change
>> (o is a Java object, f is a field in o and x is another Java object)
>>
>> o.f = x
>> if (*card_for(o) == dirty) then goto done
>> *card_for(o) = dirty
>> enqueue(card_for(o))
>> done:
>>
>> The memory ordering problem can occur between the store o.f = x and
>> the read of *card_for(o).
>> This is because the concurrent refinement thread will set a card to
>> "clean" before it starts scanning the card for references, but that
>> store is not guaranteed to be visible to the thread executing the
>> post-barrier.
>> So if the CPU re-orders the load from card_for(o) to before the write
>> of o.f, the refinement thread may have scanned past the address of o.f
>> but the application thread sees the stale "dirty" value and skips the
>> barrier.
>>
>>
>> The simple fix for this race is then to insert a storeload between the
>> store and the load:
>> o.f = x
>> #StoreLoad
>> if (*card_for(o) == dirty) then goto done
>> *card_for(o) = dirty
>> enqueue(card_for(o))
>> done:
>>
>> Inserting the storeload is too expensive since this code is in the
>> object field write fast-path. So I introduced a way to exclude some
>> writes by relying on a new card value which I know is not subject to
>> concurrent updates:
>>
>> o.f = x
>> if (*card_for(o) == g1_young) then goto done
>> #StoreLoad (because I can only rely on the value "dirty" if the store
>> was committed before the following read)
>> if (*card_for(o) == dirty) then goto done
>> *card_for(o) = dirty
>> enqueue(card_for(o))
>> done:
>>
>> I hope this clarifies the issue and the reasoning behind my changes.
>>
>> /Mikael
>>
>>>
>>> Cheers,
>>> David
>>>
>>> On 2/10/2013 10:28 PM, Mikael Gerdin wrote:
>>>> Hi
>>>>
>>>> Please review my fix for the issue discussed in the "G1 question:
>>>> concurrent cleaning of dirty cards" thread on the hotspot-gc-dev
>>>> mailing
>>>> list.
>>>>
>>>> I'd like someone from the compiler (and runtime? the interpreter uses
>>>> macroAssembler_*, right?) teams to at least look at the changes to:
>>>> macroAssembler_*.cpp
>>>> c1_Runtime1_*.cpp
>>>> graphKit.cpp
>>>>
>>>> Problem description:
>>>> G1 has a race where the concurrent refinement thread may miss object
>>>> references in a dirty card.
>>>> The problem arises if the CPU re-orders the load of the old card value
>>>> (which G1 checks to determine if it can skip the barrier)
>>>> before the store to the actual object.
>>>> If that occurs, the concurrent refinement thread may have set the
>>>> card to "clean" and proceeded to scan the card, but the Java thread
>>>> may have seen the "dirty" value and skipped the post barrier.
>>>>
>>>> Suggested fix:
>>>> * Add a memory fence between the store to a Java object and the
>>>> reading of the previous card value.
>>>> * Modify the code for handling young regions so that all writes to
>>>> young regions can skip the fence (since it will never be needed for
>>>> such writes). This introduces a new value in the card table for G1
>>>> which indicates a young region.
>>>>
>>>> Performance impact:
>>>> * This fix has a negative throughput performance impact of 1-1.5%
>>>> (tested on x86-AMD, x86-Intel and SPARC).
>>>> * We may want to try to get rid of this race at some point by
>>>> redesigning G1's post barrier, but there is not enough time to do
>>>> that for JDK8.
>>>>
>>>> Performance numbers for x86 platforms can be seen here:
>>>> http://cr.openjdk.java.net/~mgerdin/8014555/perf.txt
>>>>
>>>> Unfortunately, the JIRA issue is not externally visible, but the major
>>>> parts of the discussions about this are present in the mailing list
>>>> thread. The bug mostly contains my analysis of the crashes, which seem
>>>> to have been caused by this bug.
>>>>
>>>> Bug link: https://bugs.openjdk.java.net/browse/JDK-8014555
>>>>
>>>> Webrev: http://cr.openjdk.java.net/~mgerdin/8014555/webrev.0
>>>>
>>


