RFR(S): 8170409: CMS: Crash in CardTableModRefBSForCTRS::process_chunk_boundaries

Tue Nov 29 22:37:27 UTC 2016

This looks good to me too, but I'm just wondering if the 2nd 
load_acquire inside the lock is really needed.

dl

On 11/29/16 10:26 AM, Volker Simonis wrote:
> On Tue, Nov 29, 2016 at 4:55 PM, Mikael Gerdin <mikael.gerdin at oracle.com> wrote:
>> Hi Volker,
>>
>> On 2016-11-29 14:59, Volker Simonis wrote:
>>> On Tue, Nov 29, 2016 at 11:58 AM, Mikael Gerdin
>>> <mikael.gerdin at oracle.com> wrote:
>>>> Hi Volker,
>>>>
>>>> On 2016-11-28 19:33, Volker Simonis wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> can I please have a review and sponsor for the following fix submitted
>>>>> by gunter.haug at sap.com:
>>>>>
>>>>> http://cr.openjdk.java.net/~simonis/webrevs/2016/8170409/
>>>>
>>>>
>>>> I haven't looked too closely on the change but one thing I'm curious
>>>> about
>>>> just to further my understanding of the issue:
>>>> Would an alternate solution be to always acquire the ParGCRareEvent_lock
>>>> before reading _last_LNC_resizing_collection[i] since in that case the
>>>> mutex
>>>> semantics would ensure proper ordering of the memory accesses?
>>>>
>> Ok, thanks for confirming my thoughts around this.
>>
>> I think the change seems ok but what do you think about making
>> _last_LNC_resizing_collection volatile to further ensure that there's not
>> funny compiler optimizations going on?
>>
> In general we agree that it would be fine to also make
> _last_LNC_resizing_collection volatile and here's an updated webrev
> which does exactly that:
>
> http://cr.openjdk.java.net/~simonis/webrevs/2016/8170409.v2/
>
> Our initial reasoning was that the aquiring the ParGCRareEvent_lock
> implies a C++ "sequence point" which should prevent the compiler from
> caching the value read from _last_LNC_resizing_collection. But with
> the "volatile" modifier the code is defintely clearer.
>
> @Thomas: can you please use the new webrev for pushing?
>
> Thanks,
> Volker
>
>> /Mikael
>>
>>
>>> Hi Mikael,
>>>
>>> we've discussed that as well and we think you are right in that always
>>> acquiring the ParGCRareEvent_lock before reading
>>> _last_LNC_resizing_collection[i] would solve the problem as well. But
>>> as ParGCRareEvent_lock is potentially a heavy-weight OS lock that
>>> would not be a practical solution.
>>>
>>> Thanks,
>>> Volker
>>>
>>>> Thanks
>>>> /Mikael
>>>>
>>>>
>>>>> https://bugs.openjdk.java.net/browse/JDK-8170409
>>>>>
>>>>> We've observed a crash (see bug report for a stack trace) in
>>>>> CardTableModRefBSForCTRS::process_chunk_boundaries() from time to time
>>>>> since several years now, but only on non TSO platforms:
>>>>>
>>>>> - It only happens in opt builds.
>>>>> - Analysis of the assembly code revealed the actual crash site to be
>>>>> an array store to a pointer (_lowest_non_clean) which is an argument
>>>>> to process_chunk_boundaries()
>>>>> - The pointer is actually calculated in
>>>>> CardTableModRefBS::get_LNC_array_for_space() and passed as argument to
>>>>> CardTableModRefBSForCTRS::process_chunk_boundaries()
>>>>> - CardTableModRefBS::get_LNC_array_for_space() doesn't enforce TSO on
>>>>> _last_LNC_resizing_collection[i] so the pointer to an uninitialized
>>>>> structure (i.e._lowest_non_clean) could become visible to other
>>>>> threads before the value of _last_LNC_resizing_collection[i].
>>>>>
>>>>> Solution:
>>>>>
>>>>> Use OrderAccess::load_acquire and OrderAccess::release_store for
>>>>> accessing _last_LNC_resizing_collection[i] in
>>>>> CardTableModRefBSForCTRS::get_LNC_array_for_space()
>>>>>
>>>>> Thanks you and best regards,
>>>>> Volker
>>>>>