RFR: JDK-8058209: Race in G1 card scanning could allow scanning of memory covered by PLABs
Bengt Rutisson
bengt.rutisson at oracle.com
Fri Nov 14 08:10:35 UTC 2014
Hi Mikael,
Latest webrev looks good to me too.
Thanks,
Bengt
On 2014-11-13 18:11, Daniel D. Daugherty wrote:
> On 11/13/14 8:30 AM, Mikael Gerdin wrote:
>> Hi Dan,
>>
>> On 2014-11-13 15:35, Daniel D. Daugherty wrote:
>>> > Webrev: http://cr.openjdk.java.net/~mgerdin/8058209/webrev.1/
>>>
>>> src/share/vm/gc_implementation/g1/heapRegion.cpp
>>> nit line 1007: assert(local_time_stamp <=
>>> g1h->get_gc_time_stamp(),
>>> "invariant" );
>>> You got rid of the space after the '('. Can you also get
>>> rid of the space before ');'?
>>
>> I'll fix that before I push the change.
>
> Thanks!
>
>
>>
>>>
>>> Thanks for putting _gc_time_stamp in a local so that it's
>>> not fetched more than once.
>>>
>>> Would it be useful to move the assert() after the setting of
>>> local_top? That would give someone debugging the assertion
>>> failure core file a little more context. Dunno... your call.
>>
>> The assert is not strictly related to the _top field, if the time
>> stamp assert fails we're most likely experiencing a completely
>> different problem.
>> I'd rather leave the assert where it is.
>
> No problem.
>
> Dan
>
>
>>
>> /Mikael
>>
>>>
>>> Dan
>>>
>>>
>>> On 11/13/14 3:19 AM, Mikael Gerdin wrote:
>>>> Hi David,
>>>>
>>>> On 2014-11-13 05:54, David Holmes wrote:
>>>>> Hi Mikael,
>>>>>
>>>>> Without knowing the details it is hard to determine the
>>>>> correctness of
>>>>> this. What you describe below sounds reasonable - but what about the
>>>>> opposite problem in the new code: what if you read an old top()
>>>>> then a
>>>>> new timestamp, before top() is updated? Will that work correctly
>>>>> or will
>>>>> the region between the old-top and new-top be missed?
>>>>
>>>> I realize that not everyone is up to speed on the specifics of this
>>>> code, but I appreciate you feedback on the general reasoning.
>>>>
>>>> Reading an old _top value is safe, and in fact we must enforce that
>>>> the only _top values we ever return from this functions were set
>>>> before the GC occurred.
>>>>
>>>> Reading a too recent _top value is the cause of the crash in this bug,
>>>> since if this function returns a a recently updated _top value that is
>>>> because another GC worker has allocated into this region and is in the
>>>> process of copying objects into it. The point of the timestamp value
>>>> is to only return old values of _top and if the timestamp is current
>>>> it should return another value.
>>>>
>>>> I've updated the webrev slightly due to off-list feedback that I
>>>> should attempt to avoid reading the time stamp more than once (for the
>>>> assert).
>>>> I've also noticed that I messed up the indentation of a curly brace so
>>>> I fixed that as well.
>>>>
>>>> Webrev: http://cr.openjdk.java.net/~mgerdin/8058209/webrev.1/
>>>>
>>>> Incremental webrev:
>>>> http://cr.openjdk.java.net/~mgerdin/8058209/webrev.0_to_1/
>>>>
>>>> /Mikael
>>>>
>>>>>
>>>>> Cheers,
>>>>> David H.
>>>>>
>>>>> On 12/11/2014 1:18 AM, Mikael Gerdin wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I've sent this to hotspot-dev instead of just hotspot-gc-dev in the
>>>>>> hope
>>>>>> of getting some extra feedback from our resident concurrency
>>>>>> experts.
>>>>>>
>>>>>> Please review this subtle change to the order in which we read
>>>>>> fields in
>>>>>> G1OffsetTableContigSpace::saved_mark_word, original included here
>>>>>> for
>>>>>> reference:
>>>>>> 1003
>>>>>> 1004 HeapWord* G1OffsetTableContigSpace::saved_mark_word() const {
>>>>>> 1005 G1CollectedHeap* g1h = G1CollectedHeap::heap();
>>>>>> 1006 assert( _gc_time_stamp <= g1h->get_gc_time_stamp(),
>>>>>> "invariant" );
>>>>>> 1007 if (_gc_time_stamp < g1h->get_gc_time_stamp())
>>>>>> 1008 return top();
>>>>>> 1009 else
>>>>>> 1010 return Space::saved_mark_word();
>>>>>> 1011 }
>>>>>> 1012
>>>>>>
>>>>>> When getting a new gc alloc region several stores are performed
>>>>>> where
>>>>>> store ordering needs to be enforced and several synchronization
>>>>>> points
>>>>>> occur.
>>>>>> [write path]
>>>>>> ST(_saved_mark_word)
>>>>>> #StoreStore
>>>>>> ST(_gc_time_stamp)
>>>>>> ST(_top) // satisfying alloc request
>>>>>> #StoreStore
>>>>>> ST(_alloc_region) // publishing to other gc workers
>>>>>> #MonitorEnter
>>>>>> ST(_top) // potential further allocations
>>>>>> #MonitorExit
>>>>>> #MonitorEnter
>>>>>> ST(_top) // potential further allocations
>>>>>> #MonitorExit
>>>>>>
>>>>>> When we inspect a region during remembered set scanning we need to
>>>>>> ensure that we never read memory which have been allocated by a GC
>>>>>> worker thread for the purpose of copying objects into.
>>>>>> The way this works is that a time stamp field is supposed to signal
>>>>>> to a
>>>>>> scanning thread that it should look at addresses below _top if
>>>>>> the time
>>>>>> stamp is old or addresses below _saved_mark_word if the time
>>>>>> stamp is
>>>>>> current.
>>>>>>
>>>>>> The current code does (as seen above)
>>>>>> [read path]
>>>>>> LD(_gc_time_stamp)
>>>>>> LD(_top)
>>>>>> or (depending on time stamp)
>>>>>> LD(_saved_mark_word)
>>>>>>
>>>>>> Because these values are written to without full mutual exclusion we
>>>>>> need to be very careful about the order in which we read these
>>>>>> values,
>>>>>> and this is where I argue that the current code is incorrect.
>>>>>> In order to observe a consistent view of the ordered stores in the
>>>>>> [write path] above we need to load the values in the reverse
>>>>>> order they
>>>>>> were written, with proper #LoadLoad ordering enforced.
>>>>>>
>>>>>> The problem which we've observed here is that after we've read
>>>>>> the time
>>>>>> stamp as below the heap time stamp the top pointer can be updated
>>>>>> by a
>>>>>> GC worker allocating objects into this region. To make sure that the
>>>>>> top
>>>>>> value we see is in fact valid we must read it before we read the
>>>>>> time
>>>>>> stamp which determines which value we should return from the
>>>>>> saved_mark_word function.
>>>>>>
>>>>>> My suggested fix is to load _top first and enforce #LoadLoad
>>>>>> ordering
>>>>>> enforced:
>>>>>> HeapWord* G1OffsetTableContigSpace::saved_mark_word() const {
>>>>>> G1CollectedHeap* g1h = G1CollectedHeap::heap();
>>>>>> assert( _gc_time_stamp <= g1h->get_gc_time_stamp(),
>>>>>> "invariant" );
>>>>>> HeapWord* local_top = top();
>>>>>> OrderAccess::loadload();
>>>>>> if (_gc_time_stamp < g1h->get_gc_time_stamp()) {
>>>>>> return local_top;
>>>>>> } else {
>>>>>> return Space::saved_mark_word();
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> I've successfully reproduced the crash with the original code by
>>>>>> adding
>>>>>> some random sleep calls between the load of the time stamp and
>>>>>> the load
>>>>>> of top so I'm fairly certain that this resolves the issue. I've also
>>>>>> verified that the fix I'm proposing does resolve the bug for the
>>>>>> team
>>>>>> which encountered the issue, even if I can't reproduce that crash
>>>>>> locally.
>>>>>>
>>>>>> I also plan to attempt design around some of the races in this
>>>>>> code to
>>>>>> reduce its complexity, but for the sake of backporting the fix to
>>>>>> 8u40
>>>>>> I'd like to start with just adding the minimal fix.
>>>>>>
>>>>>>
>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8058209
>>>>>> Webrev: http://cr.openjdk.java.net/~mgerdin/8058209/webrev.0/
>>>>>> Testing: JPRT, local kitchensink (4 hours), gc test suite
>>>>>>
>>>>>> Thanks
>>>>>> /Mikael
>>>
>
More information about the hotspot-dev
mailing list