RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()

Daniel D. Daugherty daniel.daugherty at oracle.com
Tue Sep 1 14:49:29 UTC 2015


On 9/1/15 12:13 AM, David Holmes wrote:
> Hi Dan,
>
> Hope you had that coffee :)

Just got that second cup... :-)


>
> On 1/09/2015 8:51 AM, Daniel D. Daugherty wrote:
>> Greetings,
>>
>> I've updated the "fix" for this bug based on code review comments
>> received in round 0.
>>
>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>
>> Webrev URL: 
>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/1-jdk9-hs-rt/
>>
>> The easiest way to re-review is to download the two patch files
>> and view them in your favorite file merge tool:
>>
>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/hotspot.patch 
>>
>>
>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/1-jdk9-hs-rt/hotspot.patch 
>>
>>
>>
>> Testing: Aurora Adhoc RT-SVC nightly batch (in process)
>>           Aurora Adhoc vm.tmtools batch (in process)
>>           Kim's repro sequence for JDK-8049304
>>           Kim's repro sequence for JDK-8129978
>>           JPRT -testset hotspot
>>
>> Changes between round 0 and round 1:
>>
>> - add an 'is_safe' parameter to the OM_PERFDATA_OP macro;
>
> The sense of safepoint-safe is a bit confusing to me. If the thread is 
> in a safepoint-safe-state then it seems safepoint-safe==false ??

I went back and forth on the naming of this parameter. When 'is_safe'
is true, the calling thread can safely and directly check has_PerfData:
since we are not at a safepoint, we cannot be racing with a normal
exit. In a normal exit, has_PerfData is only changed (and the memory
freed) at a safepoint.

When 'is_safe' is false, the code path we're on could be executing in
parallel with a safepoint, so we need to be more careful when accessing
the has_PerfData flag...
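
In sketch form, the macro ends up shaped roughly like this (simplified,
not the exact webrev code; has_PerfData_acquire() is just a placeholder
name here for however the load-acquire read gets exposed):

  #define OM_PERFDATA_OP(f, op, is_safe)                                   \
    do {                                                                   \
      /* safepoint-safe callers read the flag directly; */                 \
      /* everyone else pays for a load-acquire */                          \
      bool has_data = (is_safe) ? PerfDataManager::has_PerfData()          \
                                : PerfDataManager::has_PerfData_acquire(); \
      if (has_data && ObjectMonitor::_sync_##f != NULL) {                  \
        ObjectMonitor::_sync_##f->op;                                      \
      }                                                                    \
    } while (0)

A call site like OM_PERFDATA_OP(FutileWakeups, inc(), false) then only
touches _sync_FutileWakeups when both the monitor-specific counter and
the PerfDataManager flag say it is still safe to do so.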

If you have a better name for the parameter...


>
>>    safepoint-safe callers can access _has_PerfData flag directly;
>>    non-safepoint-safe callers use a load-acquire to fetch the
>>    current _has_PerfData flag value
>
> load-acquire makes no difference to the "currentness" of _has_PerfData 
> and has no affect on the size of the "race window". So I don't see any 
> point to this is-safe distinction.

Hmmm... I think I got this idea when I re-read orderAccess.hpp where
it talks about release-store and load-acquire pairs. I originally
had the transition of _has_PerfData from '1' -> '0' done by a
release_store() so I was using a load_acquire() as part of the
pair. I've since switched to a direct setting of "_has_PerfData = 0"
followed by the big hammer fence(), but I think the load_acquire()
still pairs with that fence().


>
>> - change PerfDataManager::destroy() to simply set _has_PerfData
>>    to zero (field is volatile) and then use a fence() to prevent
>>    any reordering of operations in any direction; it's only done
>>    once during VM shutdown so...
>
> See previous email :)

Yup. I did and I'd still like to keep it for the reasons stated
in my reply to the previous e-mail. You were good with this code
as a release_store() of '0' so I kind of expected you to be OK
with the stronger fence().
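
For concreteness, the shutdown side now looks roughly like this (a
sketch, not the exact webrev diff):

  void PerfDataManager::destroy() {
    if (_all == NULL) {
      // destroy() already called, or initialization never happened
      return;
    }

    _has_PerfData = 0;     // plain store to the volatile flag
    OrderAccess::fence();  // nothing moves across this in either direction;
                           // this is what I expect the reader-side
                           // load_acquire() to pair with

    // ... existing code that walks _all and frees the PerfData memory ...
  }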


>
>> - change perfMemory_exit() to only call PerfDataManager::destroy()
>>    when called at a safepoint and when the StatSampler is not
>>    running; this means when the VM is aborting, we no longer have
>>    a race between the original crash report and this code path.
>
> Sounds reasonable.

Thanks. I hope it makes Kim happy also.
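
So perfMemory_exit() ends up looking something like this (sketch only;
the existing early returns are unchanged):

  void perfMemory_exit() {
    if (!UsePerfData) return;
    // ... other existing early returns unchanged ...

    // Only free the PerfData memory on a normal exit, i.e., when we are
    // at a safepoint and the StatSampler is not active. On an abnormal
    // exit we skip destroy() and simply leak the memory, so we no longer
    // race with the error reporting path.
    if (SafepointSynchronize::is_at_safepoint() && !StatSampler::is_active()) {
      PerfDataManager::destroy();
    }

    // remove the persistent external resources as before
    PerfMemory::destroy();
  }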


>
> I have few specific comments/nits:
>
> src/share/vm/runtime/perfData.cpp
>
>  43 volatile jint   PerfDataManager::_has_PerfData = false;
>
> Nit: jint should be initialized with 0 not false.

Will fix; oops... that made it past round 0...
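
I.e., the declaration simply becomes:

  volatile jint PerfDataManager::_has_PerfData = 0;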


>
>  287   // manipulation before we free the memory. The two alternatives
>  288   // are 1) leak the PerfData memory or 2) do some form of ordered
>  289   // access before every PerfData operation.
>
> No amount of ordered access will fix the race - only synchronization 
> of some form.

I thought I had changed that comment. Will fix.


>
>  315 void PerfDataManager::add_item(PerfData* p, bool sampled) {
>  316
>  317   MutexLocker ml(PerfDataManager_lock);
>  318
>  319   if (_all == NULL) {
>  320     _all = new PerfDataList(100);
>  321     OrderAccess::release_store(&_has_PerfData, 1);
>  322   }
>
> I can't see any purpose for a release_store here. You have one-time 
> initialization code guarded with a mutex. A release_store adds nothing 
> here - other than future head-scratching from maintainers of this code.

Agreed. Not sure how I missed the fact that we were in the
locked block. Oops... that made it past round 0...
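
For the record, the corrected one-time initialization is just (sketch):

  void PerfDataManager::add_item(PerfData* p, bool sampled) {

    MutexLocker ml(PerfDataManager_lock);

    if (_all == NULL) {
      _all = new PerfDataList(100);
      _has_PerfData = 1;   // plain store; PerfDataManager_lock already
                           // guards the one-time initialization
    }

    // ... rest of add_item unchanged ...
  }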

Thanks again for the review. Will generate another webrev...

Dan


>
> Thanks,
> David
>
>> I believe that I've addressed all comments from round 0.
>>
>> Thanks, in advance, for any comments, questions or suggestions.
>>
>> Dan
>>
>>
>> On 8/25/15 3:08 PM, Daniel D. Daugherty wrote:
>>> Greetings,
>>>
>>> I have a "fix" for a long standing race between JVM shutdown and the
>>> JVM statistics subsystem:
>>>
>>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>>
>>> Webrev URL:
>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/
>>>
>>> Testing: Aurora Adhoc RT-SVC nightly batch
>>>          Aurora Adhoc vm.tmtools batch
>>>          Kim's repro sequence for JDK-8049304
>>>          Kim's repro sequence for JDK-8129978
>>>          JPRT -testset hotspot
>>>
>>> This "fix":
>>>
>>> - adds a volatile flag to record whether PerfDataManager is holding
>>>   data (PerfData objects)
>>> - adds PerfDataManager::has_PerfData() to return the flag
>>> - changes the Java monitor subsystem's use of PerfData to
>>>   check both allocation of the monitor subsystem specific
>>>   PerfData object and the new PerfDataManager::has_PerfData()
>>>   return value
>>>
>>> If the global 'UsePerfData' option is false, the system works as
>>> it did before. If 'UsePerfData' is true (the default on non-embedded
>>> systems), the Java monitor subsystem will allocate a number of
>>> PerfData objects to record information. The objects will record
>>> information about the Java monitor subsystem until the JVM shuts down.
>>>
>>> When the JVM starts to shutdown, the new PerfDataManager flag will
>>> change to false and the Java monitor subsystem will stop using the
>>> PerfData objects. This is the new behavior. As noted in the comments
>>> I added to the code, the race is still present; I'm just changing
>>> the order and the timing to reduce the likelihood of the crash.
>>>
>>> Thanks, in advance, for any comments, questions or suggestions.
>>>
>>> Dan
>>>
>>>
>>>
>>


