RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()
Daniel D. Daugherty
daniel.daugherty at oracle.com
Thu Aug 27 15:51:19 UTC 2015
Hi David!
Thanks for chiming in on this thread!
Replies embedded below as usual...
On 8/26/15 10:26 PM, David Holmes wrote:
> Hi Dan,
>
> On 26/08/2015 7:08 AM, Daniel D. Daugherty wrote:
>> Greetings,
>>
>> I have a "fix" for a long standing race between JVM shutdown and the
>> JVM statistics subsystem:
>>
>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>
>> Webrev URL:
>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/
>>
>> Testing: Aurora Adhoc RT-SVC nightly batch
>> Aurora Adhoc vm.tmtools batch
>> Kim's repro sequence for JDK-8049304
>> Kim's repro sequence for JDK-8129978
>> JPRT -testset hotspot
>>
>> This "fix":
>>
>> - adds a volatile flag to record whether PerfDataManager is holding
>> data (PerfData objects)
>> - adds PerfDataManager::has_PerfData() to return the flag
>> - changes the Java monitor subsystem's use of PerfData to
>> check both allocation of the monitor subsystem specific
>> PerfData object and the new PerfDataManager::has_PerfData()
>> return value
>>
>> If the global 'UsePerfData' option is false, the system works as
>> it did before. If 'UsePerfData' is true (the default on non-embedded
>> systems), the Java monitor subsystem will allocate a number of
>> PerfData objects to record information. The objects will record
>> information about Java monitor subsystem until the JVM shuts down.
>>
>> When the JVM starts to shutdown, the new PerfDataManager flag will
>> change to false and the Java monitor subsystem will stop using the
>> PerfData objects. This is the new behavior. As noted in the comments
>> I added to the code, the race is still present; I'm just changing
>> the order and the timing to reduce the likelihood of the crash.
>
> Right. To sum up: the basic problem is that the PerfData objects are
> deallocated at the safepoint established for VM termination, but those
> objects can actually be used by threads that are in a safepoint-safe
> state: in particular within the low-level synchronization code.
>
> As you say this fix narrows the window where a crash can occur, but
> can not close it. If a thread is descheduled after the check of
> hasPerfData it can still access the PerfData object when it resumes,
> which may be after the object was deallocated. There's no true fix
> here without introducing synchronization (which would have to be even
> lower-level to avoid reentrant use of the same code we're fixing!) and
> the overhead of that would be prohibitive for these perf counters.
>
> In response to Kim's concern about other code that uses PerfData
> objects I think you would have to examine those uses to see which, if
> any, can occur from either a non-JavaThread, or from within the code
> where a thread is considered safepoint-safe. I'm inclined to agree
> that given we have not seen issues with such code, either it does not
> exist or is extremely unlikely to hit this issue. Given the "fix" is
> itself only narrowing the window it doesn't seem necessary to address
> code that already has a narrower window.
>
> That all said "leaking" the PerfData objects seems no less unpleasant
> a "fix". There are so many obstacles in the way of being able to
> unload and re-load the JVM that I do not think this makes the position
> measurably worse. In fact I can imagine that if we were to allow for
> such behaviour we would need to be able to terminate threads and
> reclaim all their resources (like Monitor instances), at which point
> it would also become easy to deallocate shared memory like PerfData
> objects.
Here's what I wrote in the bug report before I started this review cycle:
Daniel Daugherty added a comment - 2015-08-21 20:40
Continued investigating VM shutdown race:
JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
JDK-8129978 SIGSEGV when parsing command line options
- Thanks to Kim for providing easy reproduction instructions for
both bugs; I've tweaked the repro code a bit
- The "correct" solution is to add a locking/memory ordering
mechanism to ensure that PerfData is only used when valid.
The locking/memory ordering would slow down the PerfData
mechanism for every update. Ouch!
- The "fast but safe" solution is to leak the PerfData memory
and not clean them up at VM shutdown. We're trying to clean
up the code base so the idea of intentionally leaking memory
makes me cringe.
- The solution I'm investigating is between "fast but safe"
and "correct". I'm adding a PerfDataManager.has_PerfData()
function that returns true when PerfDataManager is holding
PerfData objects and false when none have been allocated
or when they have been freed at VM shutdown. The flag
holding the state is volatile and I use release_store()
to change it so that publication is visible more quickly.
On the VM shutdown path, I also do a 1ms sleep after setting
the flag and before freeing the memory.
- The idea is that the 1ms sleep will give any threads that
saw PerfDataManager.has_PerfData() == true a chance to do
their operation on the PerfData object before VM shutdown
thread frees it.
So I think we're all roughly on the same page here:
1) We don't like the current system because we keep getting
these shutdown race crashes. Of course, a new one came
in early this AM:
JDK-8134566 java/lang/invoke/LFCaching/LFMultiThreadCachingTest.java
crashes in monitor synchronization code
https://bugs.openjdk.java.net/browse/JDK-8134566
2) We don't like the "correct" solution because it would slow down
the performance counters and possibly skew the very data we are
trying to gather. Kim has also pointed out that adding more
locking in a subsystem used by higher level locking is risky.
3) We don't like the "fast but safe" solution of leaking the PerfData
memory. We try to make ourselves feel better about this by saying
there are plenty of other leaks in the VM... slippery slope?
4) We don't like the proposed solution because the race still exists
and we could continue to see failures like these. Only they would
be more rare and possibly harder to spot.
5) Off-thread Kim and I have been talking about adding logic to the
signal handler filters to detect a SIGSEGV that comes from use of
a now freed PerfData object. We're mulling on the idea, but have
not determined if it is even possible or an acceptable idea...
Hopefully, the above accurately sums up our options...
> I'll leave it up to you which way to go. As it stands this is Reviewed.
Thanks!
Here's my proposed plan:
1) I'd like to move forward with this change in order to reduce the
occurrences of this crasher. Yes, I'm getting tired of seeing and
analyzing them.
2) If we see PerfData crashes in the future with non-monitor subsystem
PerfData usage, then we look at adding has_PerfData() calls to that
subsystem.
3) If we see PerfData crashes in the future in the monitor subsystem,
then that indicates that the theoretical race is real or I missed
protecting a PerfData usage with has_PerfData(). If the race is
real, then we examine these alternatives:
- leak the PerfData objects on the JVM shutdown path, i.e.,
switch to the "fast but safe" solution
- add signal handler support to make PerfData SIGSEGVs benign
What do folks think?
Dan
>
> Thanks,
> David
>
>> Thanks, in advance, for any comments, questions or suggestions.
>>
>> Dan
>>
More information about the hotspot-runtime-dev
mailing list