RFR: JDK-8058209: Race in G1 card scanning could allow scanning of memory covered by PLABs

Mikael Gerdin mikael.gerdin at oracle.com
Thu Nov 13 12:08:58 UTC 2014


On 2014-11-13 12:50, David Holmes wrote:
> On 13/11/2014 8:19 PM, Mikael Gerdin wrote:
>> Hi David,
>>
>> On 2014-11-13 05:54, David Holmes wrote:
>>> Hi Mikael,
>>>
>>> Without knowing the details it is hard to determine the correctness of
>>> this. What you describe below sounds reasonable - but what about the
>>> opposite problem in the new code: what if you read an old top() then a
>>> new timestamp, before top() is updated? Will that work correctly or will
>>> the region between the old-top and new-top be missed?
>>
>> I realize that not everyone is up to speed on the specifics of this
>> code, but I appreciate you feedback on the general reasoning.
>>
>> Reading an old _top value is safe, and in fact we must enforce that the
>> only _top values we ever return from this functions were set before the
>> GC occurred.
>>
>> Reading a too recent _top value is the cause of the crash in this bug,
>> since if this function returns a a recently updated _top value that is
>> because another GC worker has allocated into this region and is in the
>> process of copying objects into it. The point of the timestamp value is
>> to only return old values of _top and if the timestamp is current it
>> should return another value.
>
> Ok. I assumed there was some kind of monotonic progression that made
> this okay but as I said I'm not at all familiar with the code.
>
>> I've updated the webrev slightly due to off-list feedback that I should
>> attempt to avoid reading the time stamp more than once (for the assert).
>
> I noticed that too but I was looking at the reading of
> g1h->get_gc_time_stamp() twice, not the reading of _gc_time_stamp.
> Shouldn't you read them both once eg:
>
>      HeapWord* local_top = top();
>      OrderAccess::loadload();
>      const unsigned local_time_stamp = _gc_time_stamp;
>      const unsigned gc_time_stamp = g1h->get_gc_time_stamp();
>      assert(local_time_stamp <= gc_time_stamp, "invariant" );
>      if (local_time_stamp < gc_time_stamp) {
>        return local_top;
>      } else {
>        return Space::saved_mark_word();
>      }
>
> ?

The G1CollectedHeap time stamp is incremented once before any gc workers 
are notified, so it should not be possible to observe different values 
of the time stamp during a collection.
There is also a OrderAccess::fence call after incrementing the timestamp 
which I'm not sure is needed but I'll leave it in regardless.

The time stamp is also incremented after the collection has completed 
but those time stamps should not be visible to gc workers either.

/Mikael

>
> David
> -----
>
>> I've also noticed that I messed up the indentation of a curly brace so I
>> fixed that as well.
>>
>> Webrev: http://cr.openjdk.java.net/~mgerdin/8058209/webrev.1/
>>
>> Incremental webrev:
>> http://cr.openjdk.java.net/~mgerdin/8058209/webrev.0_to_1/
>>
>> /Mikael
>>
>>>
>>> Cheers,
>>> David H.
>>>
>>> On 12/11/2014 1:18 AM, Mikael Gerdin wrote:
>>>> Hi all,
>>>>
>>>> I've sent this to hotspot-dev instead of just hotspot-gc-dev in the
>>>> hope
>>>> of getting some extra feedback from our resident concurrency experts.
>>>>
>>>> Please review this subtle change to the order in which we read
>>>> fields in
>>>> G1OffsetTableContigSpace::saved_mark_word, original included here for
>>>> reference:
>>>> 1003
>>>> 1004 HeapWord* G1OffsetTableContigSpace::saved_mark_word() const {
>>>> 1005   G1CollectedHeap* g1h = G1CollectedHeap::heap();
>>>> 1006   assert( _gc_time_stamp <= g1h->get_gc_time_stamp(),
>>>> "invariant" );
>>>> 1007   if (_gc_time_stamp < g1h->get_gc_time_stamp())
>>>> 1008     return top();
>>>> 1009   else
>>>> 1010     return Space::saved_mark_word();
>>>> 1011 }
>>>> 1012
>>>>
>>>> When getting a new gc alloc region several stores are performed where
>>>> store ordering needs to be enforced and several synchronization points
>>>> occur.
>>>> [write path]
>>>> ST(_saved_mark_word)
>>>> #StoreStore
>>>> ST(_gc_time_stamp)
>>>> ST(_top) // satisfying alloc request
>>>> #StoreStore
>>>> ST(_alloc_region) // publishing to other gc workers
>>>> #MonitorEnter
>>>> ST(_top) // potential further allocations
>>>> #MonitorExit
>>>> #MonitorEnter
>>>> ST(_top) // potential further allocations
>>>> #MonitorExit
>>>>
>>>> When we inspect a region during remembered set scanning we need to
>>>> ensure that we never read memory which have been allocated by a GC
>>>> worker thread for the purpose of copying objects into.
>>>> The way this works is that a time stamp field is supposed to signal
>>>> to a
>>>> scanning thread that it should look at addresses below _top if the time
>>>> stamp is old or addresses below _saved_mark_word if the time stamp is
>>>> current.
>>>>
>>>> The current code does (as seen above)
>>>> [read path]
>>>> LD(_gc_time_stamp)
>>>> LD(_top)
>>>> or (depending on time stamp)
>>>> LD(_saved_mark_word)
>>>>
>>>> Because these values are written to without full mutual exclusion we
>>>> need to be very careful about the order in which we read these values,
>>>> and this is where I argue that the current code is incorrect.
>>>> In order to observe a consistent view of the ordered stores in the
>>>> [write path] above we need to load the values in the reverse order they
>>>> were written, with proper #LoadLoad ordering enforced.
>>>>
>>>> The problem which we've observed here is that after we've read the time
>>>> stamp as below the heap time stamp the top pointer can be updated by a
>>>> GC worker allocating objects into this region. To make sure that the
>>>> top
>>>> value we see is in fact valid we must read it before we read the time
>>>> stamp which determines which value we should return from the
>>>> saved_mark_word function.
>>>>
>>>> My suggested fix is to load _top first and enforce #LoadLoad ordering
>>>> enforced:
>>>> HeapWord* G1OffsetTableContigSpace::saved_mark_word() const {
>>>>    G1CollectedHeap* g1h = G1CollectedHeap::heap();
>>>>    assert( _gc_time_stamp <= g1h->get_gc_time_stamp(), "invariant" );
>>>>    HeapWord* local_top = top();
>>>>    OrderAccess::loadload();
>>>>    if (_gc_time_stamp < g1h->get_gc_time_stamp()) {
>>>>      return local_top;
>>>>    } else {
>>>>       return Space::saved_mark_word();
>>>>    }
>>>> }
>>>>
>>>> I've successfully reproduced the crash with the original code by adding
>>>> some random sleep calls between the load of the time stamp and the load
>>>> of top so I'm fairly certain that this resolves the issue. I've also
>>>> verified that the fix I'm proposing does resolve the bug for the team
>>>> which encountered the issue, even if I can't reproduce that crash
>>>> locally.
>>>>
>>>> I also plan to attempt design around some of the races in this code to
>>>> reduce its complexity, but for the sake of backporting the fix to 8u40
>>>> I'd like to start with just adding the minimal fix.
>>>>
>>>>
>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8058209
>>>> Webrev: http://cr.openjdk.java.net/~mgerdin/8058209/webrev.0/
>>>> Testing: JPRT, local kitchensink (4 hours), gc test suite
>>>>
>>>> Thanks
>>>> /Mikael


More information about the hotspot-dev mailing list