RFR: JDK-8058209: Race in G1 card scanning could allow scanning of memory covered by PLABs
Daniel D. Daugherty
daniel.daugherty at oracle.com
Thu Nov 13 17:11:30 UTC 2014
On 11/13/14 8:30 AM, Mikael Gerdin wrote:
> Hi Dan,
>
> On 2014-11-13 15:35, Daniel D. Daugherty wrote:
>> > Webrev: http://cr.openjdk.java.net/~mgerdin/8058209/webrev.1/
>>
>> src/share/vm/gc_implementation/g1/heapRegion.cpp
>> nit line 1007: assert(local_time_stamp <= g1h->get_gc_time_stamp(),
>> "invariant" );
>> You got rid of the space after the '('. Can you also get
>> rid of the space before ');'?
>
> I'll fix that before I push the change.
Thanks!
>
>>
>> Thanks for putting _gc_time_stamp in a local so that it's
>> not fetched more than once.
>>
>> Would it be useful to move the assert() after the setting of
>> local_top? That would give someone debugging the assertion
>> failure core file a little more context. Dunno... your call.
>
> The assert is not strictly related to the _top field, if the time
> stamp assert fails we're most likely experiencing a completely
> different problem.
> I'd rather leave the assert where it is.
No problem.
Dan
>
> /Mikael
>
>>
>> Dan
>>
>>
>> On 11/13/14 3:19 AM, Mikael Gerdin wrote:
>>> Hi David,
>>>
>>> On 2014-11-13 05:54, David Holmes wrote:
>>>> Hi Mikael,
>>>>
>>>> Without knowing the details it is hard to determine the correctness of
>>>> this. What you describe below sounds reasonable - but what about the
>>>> opposite problem in the new code: what if you read an old top() then a
>>>> new timestamp, before top() is updated? Will that work correctly or
>>>> will
>>>> the region between the old-top and new-top be missed?
>>>
>>> I realize that not everyone is up to speed on the specifics of this
>>> code, but I appreciate you feedback on the general reasoning.
>>>
>>> Reading an old _top value is safe, and in fact we must enforce that
>>> the only _top values we ever return from this functions were set
>>> before the GC occurred.
>>>
>>> Reading a too recent _top value is the cause of the crash in this bug,
>>> since if this function returns a a recently updated _top value that is
>>> because another GC worker has allocated into this region and is in the
>>> process of copying objects into it. The point of the timestamp value
>>> is to only return old values of _top and if the timestamp is current
>>> it should return another value.
>>>
>>> I've updated the webrev slightly due to off-list feedback that I
>>> should attempt to avoid reading the time stamp more than once (for the
>>> assert).
>>> I've also noticed that I messed up the indentation of a curly brace so
>>> I fixed that as well.
>>>
>>> Webrev: http://cr.openjdk.java.net/~mgerdin/8058209/webrev.1/
>>>
>>> Incremental webrev:
>>> http://cr.openjdk.java.net/~mgerdin/8058209/webrev.0_to_1/
>>>
>>> /Mikael
>>>
>>>>
>>>> Cheers,
>>>> David H.
>>>>
>>>> On 12/11/2014 1:18 AM, Mikael Gerdin wrote:
>>>>> Hi all,
>>>>>
>>>>> I've sent this to hotspot-dev instead of just hotspot-gc-dev in the
>>>>> hope
>>>>> of getting some extra feedback from our resident concurrency experts.
>>>>>
>>>>> Please review this subtle change to the order in which we read
>>>>> fields in
>>>>> G1OffsetTableContigSpace::saved_mark_word, original included here for
>>>>> reference:
>>>>> 1003
>>>>> 1004 HeapWord* G1OffsetTableContigSpace::saved_mark_word() const {
>>>>> 1005 G1CollectedHeap* g1h = G1CollectedHeap::heap();
>>>>> 1006 assert( _gc_time_stamp <= g1h->get_gc_time_stamp(),
>>>>> "invariant" );
>>>>> 1007 if (_gc_time_stamp < g1h->get_gc_time_stamp())
>>>>> 1008 return top();
>>>>> 1009 else
>>>>> 1010 return Space::saved_mark_word();
>>>>> 1011 }
>>>>> 1012
>>>>>
>>>>> When getting a new gc alloc region several stores are performed where
>>>>> store ordering needs to be enforced and several synchronization
>>>>> points
>>>>> occur.
>>>>> [write path]
>>>>> ST(_saved_mark_word)
>>>>> #StoreStore
>>>>> ST(_gc_time_stamp)
>>>>> ST(_top) // satisfying alloc request
>>>>> #StoreStore
>>>>> ST(_alloc_region) // publishing to other gc workers
>>>>> #MonitorEnter
>>>>> ST(_top) // potential further allocations
>>>>> #MonitorExit
>>>>> #MonitorEnter
>>>>> ST(_top) // potential further allocations
>>>>> #MonitorExit
>>>>>
>>>>> When we inspect a region during remembered set scanning we need to
>>>>> ensure that we never read memory which have been allocated by a GC
>>>>> worker thread for the purpose of copying objects into.
>>>>> The way this works is that a time stamp field is supposed to signal
>>>>> to a
>>>>> scanning thread that it should look at addresses below _top if the
>>>>> time
>>>>> stamp is old or addresses below _saved_mark_word if the time stamp is
>>>>> current.
>>>>>
>>>>> The current code does (as seen above)
>>>>> [read path]
>>>>> LD(_gc_time_stamp)
>>>>> LD(_top)
>>>>> or (depending on time stamp)
>>>>> LD(_saved_mark_word)
>>>>>
>>>>> Because these values are written to without full mutual exclusion we
>>>>> need to be very careful about the order in which we read these
>>>>> values,
>>>>> and this is where I argue that the current code is incorrect.
>>>>> In order to observe a consistent view of the ordered stores in the
>>>>> [write path] above we need to load the values in the reverse order
>>>>> they
>>>>> were written, with proper #LoadLoad ordering enforced.
>>>>>
>>>>> The problem which we've observed here is that after we've read the
>>>>> time
>>>>> stamp as below the heap time stamp the top pointer can be updated
>>>>> by a
>>>>> GC worker allocating objects into this region. To make sure that the
>>>>> top
>>>>> value we see is in fact valid we must read it before we read the time
>>>>> stamp which determines which value we should return from the
>>>>> saved_mark_word function.
>>>>>
>>>>> My suggested fix is to load _top first and enforce #LoadLoad ordering
>>>>> enforced:
>>>>> HeapWord* G1OffsetTableContigSpace::saved_mark_word() const {
>>>>> G1CollectedHeap* g1h = G1CollectedHeap::heap();
>>>>> assert( _gc_time_stamp <= g1h->get_gc_time_stamp(), "invariant" );
>>>>> HeapWord* local_top = top();
>>>>> OrderAccess::loadload();
>>>>> if (_gc_time_stamp < g1h->get_gc_time_stamp()) {
>>>>> return local_top;
>>>>> } else {
>>>>> return Space::saved_mark_word();
>>>>> }
>>>>> }
>>>>>
>>>>> I've successfully reproduced the crash with the original code by
>>>>> adding
>>>>> some random sleep calls between the load of the time stamp and the
>>>>> load
>>>>> of top so I'm fairly certain that this resolves the issue. I've also
>>>>> verified that the fix I'm proposing does resolve the bug for the team
>>>>> which encountered the issue, even if I can't reproduce that crash
>>>>> locally.
>>>>>
>>>>> I also plan to attempt design around some of the races in this
>>>>> code to
>>>>> reduce its complexity, but for the sake of backporting the fix to
>>>>> 8u40
>>>>> I'd like to start with just adding the minimal fix.
>>>>>
>>>>>
>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8058209
>>>>> Webrev: http://cr.openjdk.java.net/~mgerdin/8058209/webrev.0/
>>>>> Testing: JPRT, local kitchensink (4 hours), gc test suite
>>>>>
>>>>> Thanks
>>>>> /Mikael
>>
More information about the hotspot-dev
mailing list