RFR: JDK-8058209: Race in G1 card scanning could allow scanning of memory covered by PLABs

Mikael Gerdin mikael.gerdin at oracle.com
Thu Nov 13 10:19:43 UTC 2014


Hi David,

On 2014-11-13 05:54, David Holmes wrote:
> Hi Mikael,
>
> Without knowing the details it is hard to determine the correctness of
> this. What you describe below sounds reasonable - but what about the
> opposite problem in the new code: what if you read an old top() then a
> new timestamp, before top() is updated? Will that work correctly or will
> the region between the old-top and new-top be missed?

I realize that not everyone is up to speed on the specifics of this 
code, but I appreciate you feedback on the general reasoning.

Reading an old _top value is safe, and in fact we must enforce that the 
only _top values we ever return from this functions were set before the 
GC occurred.

Reading a too recent _top value is the cause of the crash in this bug, 
since if this function returns a a recently updated _top value that is 
because another GC worker has allocated into this region and is in the 
process of copying objects into it. The point of the timestamp value is 
to only return old values of _top and if the timestamp is current it 
should return another value.

I've updated the webrev slightly due to off-list feedback that I should 
attempt to avoid reading the time stamp more than once (for the assert).
I've also noticed that I messed up the indentation of a curly brace so I 
fixed that as well.

Webrev: http://cr.openjdk.java.net/~mgerdin/8058209/webrev.1/

Incremental webrev:
http://cr.openjdk.java.net/~mgerdin/8058209/webrev.0_to_1/

/Mikael

>
> Cheers,
> David H.
>
> On 12/11/2014 1:18 AM, Mikael Gerdin wrote:
>> Hi all,
>>
>> I've sent this to hotspot-dev instead of just hotspot-gc-dev in the hope
>> of getting some extra feedback from our resident concurrency experts.
>>
>> Please review this subtle change to the order in which we read fields in
>> G1OffsetTableContigSpace::saved_mark_word, original included here for
>> reference:
>> 1003
>> 1004 HeapWord* G1OffsetTableContigSpace::saved_mark_word() const {
>> 1005   G1CollectedHeap* g1h = G1CollectedHeap::heap();
>> 1006   assert( _gc_time_stamp <= g1h->get_gc_time_stamp(), "invariant" );
>> 1007   if (_gc_time_stamp < g1h->get_gc_time_stamp())
>> 1008     return top();
>> 1009   else
>> 1010     return Space::saved_mark_word();
>> 1011 }
>> 1012
>>
>> When getting a new gc alloc region several stores are performed where
>> store ordering needs to be enforced and several synchronization points
>> occur.
>> [write path]
>> ST(_saved_mark_word)
>> #StoreStore
>> ST(_gc_time_stamp)
>> ST(_top) // satisfying alloc request
>> #StoreStore
>> ST(_alloc_region) // publishing to other gc workers
>> #MonitorEnter
>> ST(_top) // potential further allocations
>> #MonitorExit
>> #MonitorEnter
>> ST(_top) // potential further allocations
>> #MonitorExit
>>
>> When we inspect a region during remembered set scanning we need to
>> ensure that we never read memory which have been allocated by a GC
>> worker thread for the purpose of copying objects into.
>> The way this works is that a time stamp field is supposed to signal to a
>> scanning thread that it should look at addresses below _top if the time
>> stamp is old or addresses below _saved_mark_word if the time stamp is
>> current.
>>
>> The current code does (as seen above)
>> [read path]
>> LD(_gc_time_stamp)
>> LD(_top)
>> or (depending on time stamp)
>> LD(_saved_mark_word)
>>
>> Because these values are written to without full mutual exclusion we
>> need to be very careful about the order in which we read these values,
>> and this is where I argue that the current code is incorrect.
>> In order to observe a consistent view of the ordered stores in the
>> [write path] above we need to load the values in the reverse order they
>> were written, with proper #LoadLoad ordering enforced.
>>
>> The problem which we've observed here is that after we've read the time
>> stamp as below the heap time stamp the top pointer can be updated by a
>> GC worker allocating objects into this region. To make sure that the top
>> value we see is in fact valid we must read it before we read the time
>> stamp which determines which value we should return from the
>> saved_mark_word function.
>>
>> My suggested fix is to load _top first and enforce #LoadLoad ordering
>> enforced:
>> HeapWord* G1OffsetTableContigSpace::saved_mark_word() const {
>>    G1CollectedHeap* g1h = G1CollectedHeap::heap();
>>    assert( _gc_time_stamp <= g1h->get_gc_time_stamp(), "invariant" );
>>    HeapWord* local_top = top();
>>    OrderAccess::loadload();
>>    if (_gc_time_stamp < g1h->get_gc_time_stamp()) {
>>      return local_top;
>>    } else {
>>       return Space::saved_mark_word();
>>    }
>> }
>>
>> I've successfully reproduced the crash with the original code by adding
>> some random sleep calls between the load of the time stamp and the load
>> of top so I'm fairly certain that this resolves the issue. I've also
>> verified that the fix I'm proposing does resolve the bug for the team
>> which encountered the issue, even if I can't reproduce that crash
>> locally.
>>
>> I also plan to attempt design around some of the races in this code to
>> reduce its complexity, but for the sake of backporting the fix to 8u40
>> I'd like to start with just adding the minimal fix.
>>
>>
>> Bug: https://bugs.openjdk.java.net/browse/JDK-8058209
>> Webrev: http://cr.openjdk.java.net/~mgerdin/8058209/webrev.0/
>> Testing: JPRT, local kitchensink (4 hours), gc test suite
>>
>> Thanks
>> /Mikael


More information about the hotspot-dev mailing list