RFR: JDK-8058209: Race in G1 card scanning could allow scanning of memory covered by PLABs

Daniel D. Daugherty daniel.daugherty at oracle.com
Thu Nov 13 14:35:46 UTC 2014


 > Webrev: http://cr.openjdk.java.net/~mgerdin/8058209/webrev.1/

src/share/vm/gc_implementation/g1/heapRegion.cpp
     nit line 1007: assert(local_time_stamp <= g1h->get_gc_time_stamp(), 
"invariant" );
         You got rid of the space after the '('. Can you also get
         rid of the space before ');'?

     Thanks for putting _gc_time_stamp in a local so that it's
     not fetched more than once.

     Would it be useful to move the assert() after the setting of
     local_top? That would give someone debugging the assertion
     failure core file a little more context. Dunno... your call.

Dan


On 11/13/14 3:19 AM, Mikael Gerdin wrote:
> Hi David,
>
> On 2014-11-13 05:54, David Holmes wrote:
>> Hi Mikael,
>>
>> Without knowing the details it is hard to determine the correctness of
>> this. What you describe below sounds reasonable - but what about the
>> opposite problem in the new code: what if you read an old top() then a
>> new timestamp, before top() is updated? Will that work correctly or will
>> the region between the old-top and new-top be missed?
>
> I realize that not everyone is up to speed on the specifics of this 
> code, but I appreciate you feedback on the general reasoning.
>
> Reading an old _top value is safe, and in fact we must enforce that 
> the only _top values we ever return from this functions were set 
> before the GC occurred.
>
> Reading a too recent _top value is the cause of the crash in this bug, 
> since if this function returns a a recently updated _top value that is 
> because another GC worker has allocated into this region and is in the 
> process of copying objects into it. The point of the timestamp value 
> is to only return old values of _top and if the timestamp is current 
> it should return another value.
>
> I've updated the webrev slightly due to off-list feedback that I 
> should attempt to avoid reading the time stamp more than once (for the 
> assert).
> I've also noticed that I messed up the indentation of a curly brace so 
> I fixed that as well.
>
> Webrev: http://cr.openjdk.java.net/~mgerdin/8058209/webrev.1/
>
> Incremental webrev:
> http://cr.openjdk.java.net/~mgerdin/8058209/webrev.0_to_1/
>
> /Mikael
>
>>
>> Cheers,
>> David H.
>>
>> On 12/11/2014 1:18 AM, Mikael Gerdin wrote:
>>> Hi all,
>>>
>>> I've sent this to hotspot-dev instead of just hotspot-gc-dev in the 
>>> hope
>>> of getting some extra feedback from our resident concurrency experts.
>>>
>>> Please review this subtle change to the order in which we read 
>>> fields in
>>> G1OffsetTableContigSpace::saved_mark_word, original included here for
>>> reference:
>>> 1003
>>> 1004 HeapWord* G1OffsetTableContigSpace::saved_mark_word() const {
>>> 1005   G1CollectedHeap* g1h = G1CollectedHeap::heap();
>>> 1006   assert( _gc_time_stamp <= g1h->get_gc_time_stamp(), 
>>> "invariant" );
>>> 1007   if (_gc_time_stamp < g1h->get_gc_time_stamp())
>>> 1008     return top();
>>> 1009   else
>>> 1010     return Space::saved_mark_word();
>>> 1011 }
>>> 1012
>>>
>>> When getting a new gc alloc region several stores are performed where
>>> store ordering needs to be enforced and several synchronization points
>>> occur.
>>> [write path]
>>> ST(_saved_mark_word)
>>> #StoreStore
>>> ST(_gc_time_stamp)
>>> ST(_top) // satisfying alloc request
>>> #StoreStore
>>> ST(_alloc_region) // publishing to other gc workers
>>> #MonitorEnter
>>> ST(_top) // potential further allocations
>>> #MonitorExit
>>> #MonitorEnter
>>> ST(_top) // potential further allocations
>>> #MonitorExit
>>>
>>> When we inspect a region during remembered set scanning we need to
>>> ensure that we never read memory which have been allocated by a GC
>>> worker thread for the purpose of copying objects into.
>>> The way this works is that a time stamp field is supposed to signal 
>>> to a
>>> scanning thread that it should look at addresses below _top if the time
>>> stamp is old or addresses below _saved_mark_word if the time stamp is
>>> current.
>>>
>>> The current code does (as seen above)
>>> [read path]
>>> LD(_gc_time_stamp)
>>> LD(_top)
>>> or (depending on time stamp)
>>> LD(_saved_mark_word)
>>>
>>> Because these values are written to without full mutual exclusion we
>>> need to be very careful about the order in which we read these values,
>>> and this is where I argue that the current code is incorrect.
>>> In order to observe a consistent view of the ordered stores in the
>>> [write path] above we need to load the values in the reverse order they
>>> were written, with proper #LoadLoad ordering enforced.
>>>
>>> The problem which we've observed here is that after we've read the time
>>> stamp as below the heap time stamp the top pointer can be updated by a
>>> GC worker allocating objects into this region. To make sure that the 
>>> top
>>> value we see is in fact valid we must read it before we read the time
>>> stamp which determines which value we should return from the
>>> saved_mark_word function.
>>>
>>> My suggested fix is to load _top first and enforce #LoadLoad ordering
>>> enforced:
>>> HeapWord* G1OffsetTableContigSpace::saved_mark_word() const {
>>>    G1CollectedHeap* g1h = G1CollectedHeap::heap();
>>>    assert( _gc_time_stamp <= g1h->get_gc_time_stamp(), "invariant" );
>>>    HeapWord* local_top = top();
>>>    OrderAccess::loadload();
>>>    if (_gc_time_stamp < g1h->get_gc_time_stamp()) {
>>>      return local_top;
>>>    } else {
>>>       return Space::saved_mark_word();
>>>    }
>>> }
>>>
>>> I've successfully reproduced the crash with the original code by adding
>>> some random sleep calls between the load of the time stamp and the load
>>> of top so I'm fairly certain that this resolves the issue. I've also
>>> verified that the fix I'm proposing does resolve the bug for the team
>>> which encountered the issue, even if I can't reproduce that crash
>>> locally.
>>>
>>> I also plan to attempt design around some of the races in this code to
>>> reduce its complexity, but for the sake of backporting the fix to 8u40
>>> I'd like to start with just adding the minimal fix.
>>>
>>>
>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8058209
>>> Webrev: http://cr.openjdk.java.net/~mgerdin/8058209/webrev.0/
>>> Testing: JPRT, local kitchensink (4 hours), gc test suite
>>>
>>> Thanks
>>> /Mikael



More information about the hotspot-dev mailing list