RFR: JDK-8058209: Race in G1 card scanning could allow scanning of memory covered by PLABs

Thu Nov 13 11:50:27 UTC 2014

On 13/11/2014 8:19 PM, Mikael Gerdin wrote:
> Hi David,
>
> On 2014-11-13 05:54, David Holmes wrote:
>> Hi Mikael,
>>
>> Without knowing the details it is hard to determine the correctness of
>> this. What you describe below sounds reasonable - but what about the
>> opposite problem in the new code: what if you read an old top() then a
>> new timestamp, before top() is updated? Will that work correctly or will
>> the region between the old-top and new-top be missed?
>
> I realize that not everyone is up to speed on the specifics of this
> code, but I appreciate you feedback on the general reasoning.
>
> Reading an old _top value is safe, and in fact we must enforce that the
> only _top values we ever return from this functions were set before the
> GC occurred.
>
> Reading a too recent _top value is the cause of the crash in this bug,
> since if this function returns a a recently updated _top value that is
> because another GC worker has allocated into this region and is in the
> process of copying objects into it. The point of the timestamp value is
> to only return old values of _top and if the timestamp is current it
> should return another value.

Ok. I assumed there was some kind of monotonic progression that made 
this okay but as I said I'm not at all familiar with the code.

> I've updated the webrev slightly due to off-list feedback that I should
> attempt to avoid reading the time stamp more than once (for the assert).

I noticed that too but I was looking at the reading of 
g1h->get_gc_time_stamp() twice, not the reading of _gc_time_stamp. 
Shouldn't you read them both once eg:

     HeapWord* local_top = top();
     OrderAccess::loadload();
     const unsigned local_time_stamp = _gc_time_stamp;
     const unsigned gc_time_stamp = g1h->get_gc_time_stamp();
     assert(local_time_stamp <= gc_time_stamp, "invariant" );
     if (local_time_stamp < gc_time_stamp) {
       return local_top;
     } else {
       return Space::saved_mark_word();
     }

?

David
-----

> I've also noticed that I messed up the indentation of a curly brace so I
> fixed that as well.
>
> Webrev: http://cr.openjdk.java.net/~mgerdin/8058209/webrev.1/
>
> Incremental webrev:
> http://cr.openjdk.java.net/~mgerdin/8058209/webrev.0_to_1/
>
> /Mikael
>
>>
>> Cheers,
>> David H.
>>
>> On 12/11/2014 1:18 AM, Mikael Gerdin wrote:
>>> Hi all,
>>>
>>> I've sent this to hotspot-dev instead of just hotspot-gc-dev in the hope
>>> of getting some extra feedback from our resident concurrency experts.
>>>
>>> Please review this subtle change to the order in which we read fields in
>>> G1OffsetTableContigSpace::saved_mark_word, original included here for
>>> reference:
>>> 1003
>>> 1004 HeapWord* G1OffsetTableContigSpace::saved_mark_word() const {
>>> 1005   G1CollectedHeap* g1h = G1CollectedHeap::heap();
>>> 1006   assert( _gc_time_stamp <= g1h->get_gc_time_stamp(),
>>> "invariant" );
>>> 1007   if (_gc_time_stamp < g1h->get_gc_time_stamp())
>>> 1008     return top();
>>> 1009   else
>>> 1010     return Space::saved_mark_word();
>>> 1011 }
>>> 1012
>>>
>>> When getting a new gc alloc region several stores are performed where
>>> store ordering needs to be enforced and several synchronization points
>>> occur.
>>> [write path]
>>> ST(_saved_mark_word)
>>> #StoreStore
>>> ST(_gc_time_stamp)
>>> ST(_top) // satisfying alloc request
>>> #StoreStore
>>> ST(_alloc_region) // publishing to other gc workers
>>> #MonitorEnter
>>> ST(_top) // potential further allocations
>>> #MonitorExit
>>> #MonitorEnter
>>> ST(_top) // potential further allocations
>>> #MonitorExit
>>>
>>> When we inspect a region during remembered set scanning we need to
>>> ensure that we never read memory which have been allocated by a GC
>>> worker thread for the purpose of copying objects into.
>>> The way this works is that a time stamp field is supposed to signal to a
>>> scanning thread that it should look at addresses below _top if the time
>>> stamp is old or addresses below _saved_mark_word if the time stamp is
>>> current.
>>>
>>> The current code does (as seen above)
>>> [read path]
>>> LD(_gc_time_stamp)
>>> LD(_top)
>>> or (depending on time stamp)
>>> LD(_saved_mark_word)
>>>
>>> Because these values are written to without full mutual exclusion we
>>> need to be very careful about the order in which we read these values,
>>> and this is where I argue that the current code is incorrect.
>>> In order to observe a consistent view of the ordered stores in the
>>> [write path] above we need to load the values in the reverse order they
>>> were written, with proper #LoadLoad ordering enforced.
>>>
>>> The problem which we've observed here is that after we've read the time
>>> stamp as below the heap time stamp the top pointer can be updated by a
>>> GC worker allocating objects into this region. To make sure that the top
>>> value we see is in fact valid we must read it before we read the time
>>> stamp which determines which value we should return from the
>>> saved_mark_word function.
>>>
>>> My suggested fix is to load _top first and enforce #LoadLoad ordering
>>> enforced:
>>> HeapWord* G1OffsetTableContigSpace::saved_mark_word() const {
>>>    G1CollectedHeap* g1h = G1CollectedHeap::heap();
>>>    assert( _gc_time_stamp <= g1h->get_gc_time_stamp(), "invariant" );
>>>    HeapWord* local_top = top();
>>>    OrderAccess::loadload();
>>>    if (_gc_time_stamp < g1h->get_gc_time_stamp()) {
>>>      return local_top;
>>>    } else {
>>>       return Space::saved_mark_word();
>>>    }
>>> }
>>>
>>> I've successfully reproduced the crash with the original code by adding
>>> some random sleep calls between the load of the time stamp and the load
>>> of top so I'm fairly certain that this resolves the issue. I've also
>>> verified that the fix I'm proposing does resolve the bug for the team
>>> which encountered the issue, even if I can't reproduce that crash
>>> locally.
>>>
>>> I also plan to attempt design around some of the races in this code to
>>> reduce its complexity, but for the sake of backporting the fix to 8u40
>>> I'd like to start with just adding the minimal fix.
>>>
>>>
>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8058209
>>> Webrev: http://cr.openjdk.java.net/~mgerdin/8058209/webrev.0/
>>> Testing: JPRT, local kitchensink (4 hours), gc test suite
>>>
>>> Thanks
>>> /Mikael