RFR(S): 8170409: CMS: Crash in CardTableModRefBSForCTRS::process_chunk_boundaries

Tue Nov 29 18:26:04 UTC 2016

On Tue, Nov 29, 2016 at 4:55 PM, Mikael Gerdin <mikael.gerdin at oracle.com> wrote:
> Hi Volker,
>
> On 2016-11-29 14:59, Volker Simonis wrote:
>>
>> On Tue, Nov 29, 2016 at 11:58 AM, Mikael Gerdin
>> <mikael.gerdin at oracle.com> wrote:
>>>
>>> Hi Volker,
>>>
>>> On 2016-11-28 19:33, Volker Simonis wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> can I please have a review and sponsor for the following fix submitted
>>>> by gunter.haug at sap.com:
>>>>
>>>> http://cr.openjdk.java.net/~simonis/webrevs/2016/8170409/
>>>
>>>
>>>
>>> I haven't looked too closely on the change but one thing I'm curious
>>> about
>>> just to further my understanding of the issue:
>>> Would an alternate solution be to always acquire the ParGCRareEvent_lock
>>> before reading _last_LNC_resizing_collection[i] since in that case the
>>> mutex
>>> semantics would ensure proper ordering of the memory accesses?
>>>
>
> Ok, thanks for confirming my thoughts around this.
>
> I think the change seems ok but what do you think about making
> _last_LNC_resizing_collection volatile to further ensure that there's not
> funny compiler optimizations going on?
>

In general we agree that it would be fine to also make
_last_LNC_resizing_collection volatile and here's an updated webrev
which does exactly that:

http://cr.openjdk.java.net/~simonis/webrevs/2016/8170409.v2/

Our initial reasoning was that the aquiring the ParGCRareEvent_lock
implies a C++ "sequence point" which should prevent the compiler from
caching the value read from _last_LNC_resizing_collection. But with
the "volatile" modifier the code is defintely clearer.

@Thomas: can you please use the new webrev for pushing?

Thanks,
Volker

> /Mikael
>
>
>>
>> Hi Mikael,
>>
>> we've discussed that as well and we think you are right in that always
>> acquiring the ParGCRareEvent_lock before reading
>> _last_LNC_resizing_collection[i] would solve the problem as well. But
>> as ParGCRareEvent_lock is potentially a heavy-weight OS lock that
>> would not be a practical solution.
>>
>> Thanks,
>> Volker
>>
>>> Thanks
>>> /Mikael
>>>
>>>
>>>> https://bugs.openjdk.java.net/browse/JDK-8170409
>>>>
>>>> We've observed a crash (see bug report for a stack trace) in
>>>> CardTableModRefBSForCTRS::process_chunk_boundaries() from time to time
>>>> since several years now, but only on non TSO platforms:
>>>>
>>>> - It only happens in opt builds.
>>>> - Analysis of the assembly code revealed the actual crash site to be
>>>> an array store to a pointer (_lowest_non_clean) which is an argument
>>>> to process_chunk_boundaries()
>>>> - The pointer is actually calculated in
>>>> CardTableModRefBS::get_LNC_array_for_space() and passed as argument to
>>>> CardTableModRefBSForCTRS::process_chunk_boundaries()
>>>> - CardTableModRefBS::get_LNC_array_for_space() doesn't enforce TSO on
>>>> _last_LNC_resizing_collection[i] so the pointer to an uninitialized
>>>> structure (i.e._lowest_non_clean) could become visible to other
>>>> threads before the value of _last_LNC_resizing_collection[i].
>>>>
>>>> Solution:
>>>>
>>>> Use OrderAccess::load_acquire and OrderAccess::release_store for
>>>> accessing _last_LNC_resizing_collection[i] in
>>>> CardTableModRefBSForCTRS::get_LNC_array_for_space()
>>>>
>>>> Thanks you and best regards,
>>>> Volker
>>>>
>>>
>