RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()

Daniel D. Daugherty daniel.daugherty at oracle.com
Thu Sep 3 17:14:16 UTC 2015


Greetings,

A number of minor changes were made during code review round 2:

- 3 comment additions
- changing _has_PerfData from 'jint' -> 'bool'

These aren't worth another webrev, so for the record I'm
including the context diffs below. Because I'm paranoid,
I've rerun the bug-specific tests from JDK-8049304 and
JDK-8129978 with the fix disabled (reproducing the crashes)
and with the fix enabled (where the tests passed).

Thanks again to Kim for the debug code that makes these
two failure modes so easy to reproduce. I'm pushing this
fix to RT_Baseline shortly. We should have one round of
nightly testing on these bits before I head off on vacation.

Dan

$ diff -c src/share/vm/runtime/objectMonitor.cpp{.cr2,}
*** src/share/vm/runtime/objectMonitor.cpp.cr2  Tue Sep  1 19:23:35 2015
--- src/share/vm/runtime/objectMonitor.cpp      Thu Sep  3 09:53:48 2015
***************
*** 572,577 ****
--- 572,579 ----
       // That is by design - we trade "lossy" counters which are exposed to
       // races during updates for a lower probe effect.
       TEVENT(Inflated enter - Futile wakeup);
+     // This PerfData object can be used in parallel with a safepoint.
+     // See the work around in PerfDataManager::destroy().
       OM_PERFDATA_OP(FutileWakeups, inc());
       ++nWakeups;

***************
*** 744,749 ****
--- 746,753 ----
       // *must* retry  _owner before parking.
       OrderAccess::fence();

+     // This PerfData object can be used in parallel with a safepoint.
+     // See the work around in PerfDataManager::destroy().
       OM_PERFDATA_OP(FutileWakeups, inc());
     }
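
For anyone re-reviewing from just this mail: OM_PERFDATA_OP is the
guard macro in objectMonitor.hpp from the webrev. Roughly, it only
touches a counter when the monitor-specific PerfData object was
allocated and PerfDataManager::has_PerfData() is still true. A minimal
sketch of that pattern (not the literal webrev text):

// Sketch only -- see objectMonitor.hpp in the webrev for the real macro.
// The counter is touched only when the monitor-specific PerfData object
// was allocated *and* PerfDataManager still has live PerfData, so the
// counters stay "lossy" by design instead of taking a lock.
#define OM_PERFDATA_OP(f, op_str)                      \
  do {                                                 \
    if (ObjectMonitor::_sync_##f != NULL &&            \
        PerfDataManager::has_PerfData()) {             \
      ObjectMonitor::_sync_##f->op_str;                \
    }                                                  \
  } while (0)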


$ diff -c src/share/vm/runtime/perfData.cpp{.cr2,}
*** src/share/vm/runtime/perfData.cpp.cr2       Tue Sep  1 19:39:17 2015
--- src/share/vm/runtime/perfData.cpp   Thu Sep  3 09:49:38 2015
***************
*** 39,45 ****
   PerfDataList*   PerfDataManager::_all = NULL;
   PerfDataList*   PerfDataManager::_sampled = NULL;
   PerfDataList*   PerfDataManager::_constants = NULL;
! volatile jint   PerfDataManager::_has_PerfData = 0;

   /*
    * The jvmstat global and subsystem jvmstat counter name spaces. The top
--- 39,45 ----
   PerfDataList*   PerfDataManager::_all = NULL;
   PerfDataList*   PerfDataManager::_sampled = NULL;
   PerfDataList*   PerfDataManager::_constants = NULL;
! volatile bool   PerfDataManager::_has_PerfData = 0;

   /*
    * The jvmstat global and subsystem jvmstat counter name spaces. The top
***************
*** 286,292 ****
     // manipulation before we free the memory. The two alternatives are
     // 1) leak the PerfData memory or 2) do some form of synchronized
     // access or check before every PerfData operation.
!   _has_PerfData = 0;
     os::naked_short_sleep(1);  // 1ms sleep to let other thread(s) run

     for (int index = 0; index < _all->length(); index++) {
--- 286,292 ----
     // manipulation before we free the memory. The two alternatives are
     // 1) leak the PerfData memory or 2) do some form of synchronized
     // access or check before every PerfData operation.
!   _has_PerfData = false;
     os::naked_short_sleep(1);  // 1ms sleep to let other thread(s) run

     for (int index = 0; index < _all->length(); index++) {
***************
*** 309,315 ****

     if (_all == NULL) {
       _all = new PerfDataList(100);
!     _has_PerfData = 1;
     }

     assert(!_all->contains(p->name()), "duplicate name added");
--- 309,315 ----

     if (_all == NULL) {
       _all = new PerfDataList(100);
!     _has_PerfData = true;
     }

     assert(!_all->contains(p->name()), "duplicate name added");
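
For context on the "work around" that the new objectMonitor.cpp comments
point at, here is the relevant part of PerfDataManager::destroy() pulled
together from the hunks above, as a sketch with the surrounding cleanup
elided (see perfData.cpp in the webrev for the complete method):

void PerfDataManager::destroy() {
  // ... early return when no PerfData objects were ever created (elided)

  // Clear the flag first so new OM_PERFDATA_OP users back off, then give
  // any thread that already passed its has_PerfData() check a moment to
  // finish before the PerfData objects are freed. This narrows the race;
  // it does not close it.
  _has_PerfData = false;
  os::naked_short_sleep(1);  // 1ms sleep to let other thread(s) run

  for (int index = 0; index < _all->length(); index++) {
    PerfData* p = _all->at(index);
    delete p;
  }

  // ... delete the _all/_sampled/_constants lists (elided)
}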

$ diff -c src/share/vm/runtime/perfData.hpp{.cr2,}
*** src/share/vm/runtime/perfData.hpp.cr2       Tue Sep  1 19:23:35 2015
--- src/share/vm/runtime/perfData.hpp   Thu Sep  3 09:40:53 2015
***************
*** 668,674 ****
       static PerfDataList* _sampled;
       static PerfDataList* _constants;
       static const char* _name_spaces[];
!     volatile static jint _has_PerfData;

       // add a PerfData item to the list(s) of know PerfData objects
       static void add_item(PerfData* p, bool sampled);
--- 668,674 ----
       static PerfDataList* _sampled;
       static PerfDataList* _constants;
       static const char* _name_spaces[];
!     static volatile bool _has_PerfData;

       // add a PerfData item to the list(s) of know PerfData objects
       static void add_item(PerfData* p, bool sampled);
***************
*** 870,876 ****
       }

       static void destroy();
!     static bool has_PerfData() { return _has_PerfData != 0; }
   };

   // Useful macros to create the performance counters
--- 870,876 ----
       }

       static void destroy();
!     static bool has_PerfData() { return _has_PerfData; }
   };

   // Useful macros to create the performance counters

$ diff -c src/share/vm/runtime/perfMemory.cpp{.cr2,}
*** src/share/vm/runtime/perfMemory.cpp.cr2     Tue Sep  1 19:39:45 2015
--- src/share/vm/runtime/perfMemory.cpp Thu Sep  3 09:50:05 2015
***************
*** 70,76 ****
     // objects that are currently being used by running JavaThreads
     // or the StatSampler. This method is invoked while we are not at
     // a safepoint during a VM abort so leaving the PerfData objects
!   // around may also help diagnose the failure.
     //
    if (SafepointSynchronize::is_at_safepoint() && !StatSampler::is_active()) {
       PerfDataManager::destroy();
--- 70,78 ----
     // objects that are currently being used by running JavaThreads
     // or the StatSampler. This method is invoked while we are not at
     // a safepoint during a VM abort so leaving the PerfData objects
!   // around may also help diagnose the failure. In rare cases,
!   // PerfData objects are used in parallel with a safepoint. See
!   // the work around in PerfDataManager::destroy().
     //
    if (SafepointSynchronize::is_at_safepoint() && !StatSampler::is_active()) {
       PerfDataManager::destroy();
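
For the perfMemory.cpp change, the guard those comments sit next to looks
roughly like this (a sketch; the early returns and the PerfMemory teardown
are elided -- see perfMemory.cpp in the webrev for the exact code):

void perfMemory_exit() {
  // ... early returns when PerfData is not in use (elided)

  // Only destroy the PerfData objects on an orderly shutdown: at a
  // safepoint with the StatSampler no longer active. On a VM abort we
  // deliberately leave the objects around rather than race with threads
  // that may still be incrementing counters (and they may help diagnose
  // the failure).
  if (SafepointSynchronize::is_at_safepoint() && !StatSampler::is_active()) {
    PerfDataManager::destroy();
  }

  // ... tear down the backing PerfMemory resources (elided)
}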



On 9/1/15 10:45 PM, Daniel D. Daugherty wrote:
> Greetings,
>
> I've updated the "fix" for this bug based on code review comments
> received in round 1. I've dropped most of the changes from round 1
> with a couple of exceptions.
>
> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
> https://bugs.openjdk.java.net/browse/JDK-8049304
>
> Webrev URL: 
> http://cr.openjdk.java.net/~dcubed/8049304-webrev/2-jdk9-hs-rt/
>
> The easiest way to re-review is to download the two patch files
> (round 0 and round 2) and view them in your favorite file merge tool:
>
> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/hotspot.patch 
>
> http://cr.openjdk.java.net/~dcubed/8049304-webrev/2-jdk9-hs-rt/hotspot.patch 
>
>
> Testing: Aurora Adhoc RT-SVC nightly batch (in process)
>          Aurora Adhoc vm.tmtools batch (in process)
>          Kim's repro sequence for JDK-8049304
>          Kim's repro sequence for JDK-8129978
>          JPRT -testset hotspot
>
> Changes between round 0 and round 2:
>
> - clarify a few comments
> - init _has_PerfData flag with '0' (instead of false)
> - drop unnecessary use of OrderAccess::release_store() to set
>   _has_PerfData to '1' (we're in a Mutex)
> - change perfMemory_exit() to only call PerfDataManager::destroy()
>   when called at a safepoint and when the StatSampler is not
>   running; this means when the VM is aborting, we no longer have
>   a race between the original crash report and this code path.
>
> Changes between round 1 and round 2:
>
> - clarify a few comments
> - drop is_safe parameter to OM_PERFDATA_OP macro
> - init _has_PerfData flag with '0' (instead of false)
> - drop OrderAccess::fence() call before os::naked_short_sleep() call
> - drop PerfDataManager::has_PerfData_with_acquire()
> - drop unnecessary use of OrderAccess::release_store() to set
>   _has_PerfData to '1' (we're in a Mutex)
>
> I believe that I've addressed all comments from round 0 and
> from round 1.
>
> Thanks, in advance, for any comments, questions or suggestions.
>
> Dan
>
>
> On 8/31/15 4:51 PM, Daniel D. Daugherty wrote:
>> Greetings,
>>
>> I've updated the "fix" for this bug based on code review comments
>> received in round 0.
>>
>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>
>> Webrev URL: 
>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/1-jdk9-hs-rt/
>>
>> The easiest way to re-review is to download the two patch files
>> and view them in your favorite file merge tool:
>>
>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/hotspot.patch 
>>
>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/1-jdk9-hs-rt/hotspot.patch 
>>
>>
>> Testing: Aurora Adhoc RT-SVC nightly batch (in process)
>>          Aurora Adhoc vm.tmtools batch (in process)
>>          Kim's repro sequence for JDK-8049304
>>          Kim's repro sequence for JDK-8129978
>>          JPRT -testset hotspot
>>
>> Changes between round 0 and round 1:
>>
>> - add an 'is_safe' parameter to the OM_PERFDATA_OP macro;
>>   safepoint-safe callers can access the _has_PerfData flag directly;
>>   non-safepoint-safe callers use a load-acquire to fetch the
>>   current _has_PerfData flag value
>> - change PerfDataManager::destroy() to simply set _has_PerfData
>>   to zero (field is volatile) and then use a fence() to prevent
>>   any reordering of operations in any direction; it's only done
>>   once during VM shutdown so...
>> - change perfMemory_exit() to only call PerfDataManager::destroy()
>>   when called at a safepoint and when the StatSampler is not
>>   running; this means when the VM is aborting, we no longer have
>>   a race between the original crash report and this code path.
>>
>> I believe that I've addressed all comments from round 0.
>>
>> Thanks, in advance, for any comments, questions or suggestions.
>>
>> Dan
>>
>>
>> On 8/25/15 3:08 PM, Daniel D. Daugherty wrote:
>>> Greetings,
>>>
>>> I have a "fix" for a long-standing race between JVM shutdown and the
>>> JVM statistics subsystem:
>>>
>>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>>
>>> Webrev URL: 
>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/
>>>
>>> Testing: Aurora Adhoc RT-SVC nightly batch
>>>          Aurora Adhoc vm.tmtools batch
>>>          Kim's repro sequence for JDK-8049304
>>>          Kim's repro sequence for JDK-8129978
>>>          JPRT -testset hotspot
>>>
>>> This "fix":
>>>
>>> - adds a volatile flag to record whether PerfDataManager is holding
>>>   data (PerfData objects)
>>> - adds PerfDataManager::has_PerfData() to return the flag
>>> - changes the Java monitor subsystem's use of PerfData to
>>>   check both allocation of the monitor subsystem specific
>>>   PerfData object and the new PerfDataManager::has_PerfData()
>>>   return value
>>>
>>> If the global 'UsePerfData' option is false, the system works as
>>> it did before. If 'UsePerfData' is true (the default on non-embedded
>>> systems), the Java monitor subsystem will allocate a number of
>>> PerfData objects and use them to record information about the
>>> Java monitor subsystem until the JVM shuts down.
>>>
>>> When the JVM starts to shut down, the new PerfDataManager flag will
>>> change to false and the Java monitor subsystem will stop using the
>>> PerfData objects. This is the new behavior. As noted in the comments
>>> I added to the code, the race is still present; I'm just changing
>>> the order and the timing to reduce the likelihood of the crash.
>>>
>>> Thanks, in advance, for any comments, questions or suggestions.
>>>
>>> Dan
>>>
>>>
>>>
>>
>>
>>
>
>
>


