RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()
Daniel D. Daugherty
daniel.daugherty at oracle.com
Thu Sep 3 15:50:57 UTC 2015
Kim,
Wow... Not at all where I was expecting this review to go. Normally
I continue a review cycle until I can negotiate agreement amongst
all the reviewers, but I can see that's just not going to happen
with this review.
Short version:
I'll take some of the feedback from this review and I'll make
some minor changes, but I have no plans to change the code to
address most of these comments.
Slightly longer version:
The fix that I have in hand resolves the issues that I was
trying to fix. It also resolves the issue described in
JDK-8129978. Additional changes will require a new bug or,
if you wish, you can reopen this bug:
JDK-8129978 SIGSEGV when parsing command line options
https://bugs.openjdk.java.net/browse/JDK-8129978
and use it to make additional changes. I will be on vacation
next week and David Holmes will be on vacation for a month
starting next week. Any further work in this area should
wait until David returns since he has strong opinions on
this code.
Responses are embedded below.
Thanks for your review.
Dan
On 9/2/15 8:15 PM, Kim Barrett wrote:
> On Sep 2, 2015, at 12:45 AM, Daniel D. Daugherty <daniel.daugherty at oracle.com> wrote:
>> Greetings,
>>
>> I've updated the "fix" for this bug based on code review comments
>> received in round 1. I've dropped most of the changes from round 1
>> with a couple of exceptions.
>>
>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>
>> Webrev URL: http://cr.openjdk.java.net/~dcubed/8049304-webrev/2-jdk9-hs-rt/
>>
>> The easiest way to re-review is to download the two patch files
>> (round 0 and round 2) and view them in your favorite file merge tool:
>>
>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/hotspot.patch
>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/2-jdk9-hs-rt/hotspot.patch
> ------------------------------------------------------------------------------
> src/share/vm/runtime/perfMemory.cpp
> 68 // Only destroy PerfData objects if we're at a safepoint and the
> 69 // StatSampler is not active. Otherwise, we risk removing PerfData
> 70 // objects that are currently being used by running JavaThreads
> 71 // or the StatSampler. This method is invoked while we are not at
> 72 // a safepoint during a VM abort so leaving the PerfData objects
> 73 // around may also help diagnose the failure.
> 74 //
> 75 if (SafepointSynchronize::is_at_safepoint() && !StatSampler::is_active()) {
> 76 PerfDataManager::destroy();
> 77 }
>
> I don't think that's the right predicate to use to control the call to
> PerfDataManager::destroy.
>
> We may be at a safepoint because (for example) we're in a GC pause,
> but there are multiple GC worker threads running that might attempt to
> touch the PerfData, and we get here because of a VM abort.
As has been pointed out before, we have no sightings of a crash using
PerfData outside of the monitor subsystem. If we spot such a crash in
the GC code, then the GC worker thread code should be updated to use
the new has_PerfData() check to reduce the likelihood of such a crash.
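For illustration only, such a GC-side guard would follow the same idiom
as the monitor changes in this webrev; the counter name below is invented:

    // Hypothetical sketch; "some_gc_counter" is a made-up name.
    void update_some_gc_counter(PerfCounter* some_gc_counter) {
      if (some_gc_counter != NULL && PerfDataManager::has_PerfData()) {
        some_gc_counter->inc();
      }
    }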
> Another problematic example is the one which got me here in the first
> place, e.g. vm_exit_during_initialization. At that point we haven't
> started any Java threads to be brought to a safepoint.
Buried in the notes for this bug (8049304) is data that showed that
with the test case for JDK-8129978, we are not at a safepoint (i.e.,
SafepointSynchronize::is_at_safepoint() returns false). With the
current fix in place, your test case from JDK-8129978 passed two
different 1000 iteration runs when it used to fail every time with
the fix not in place.
> That is, I think normal exit implies we're at a safepoint here, but
> the converse is not true.
I think you're missing the point of the safepoint check here. When
we are at a safepoint, most PerfData usage is safe so we can set the
flag, free the PerfData and finish shutting down. I've identified
two places in the Java monitor subsystem where PerfData usage may
happen while we are at a safepoint, and I've flagged both of those
spots with comments just in case the workaround fails down the road
and someone has to go looking.
The point of that check is that it is (mostly) safe to free this
memory while we're at a safepoint and not otherwise. This check
doesn't care whether we are shutting down normally or abnormally
and I really don't want it to care.
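Roughly, the shutdown side looks like this (a simplified sketch, not
the literal patch): at the safepoint we clear the flag first and only
then free the objects, so any code that honors has_PerfData() backs
off before the memory goes away.

    // Simplified sketch of the shutdown ordering; not the literal code.
    void PerfDataManager::destroy() {
      _has_PerfData = 0;   // new PerfData operations now back off
      // ... then walk the PerfData lists and delete the objects ...
    }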
Aside: I thought you'd be happy that I finally gave in and allowed
this memory to not be freed in the non-safepoint (implied abnormal
exit) case.
> And what we need to know is whether we're in a VM abort or a normal
> exit. [And vm_shutdown_during_initialization is an additional case
> that I don't fully understand. I'm not sure whether that should be
> treated as a normal exit or a VM abort.]
I think using safepoint state is a better predicate because it covers
more PerfData usage by default. I really don't think this code needs
to know why we are exiting.
> I think a reason the testing of these changes didn't show any of these
> problem cases is that the has_PerfData check has been added to monitor
> code that doesn't strictly need it.
I'm having trouble with the above sentence. You are saying that the
reason the problem cases didn't show up is that unnecessary checks
were added to the code. If not having the "unnecessary" checks
would make the problem cases show up, then aren't they necessary, by
definition?
> Based on your analysis, I think
> the additional protection from PerfData deletion should only be needed
> in the two FutileWakeups cases. Applying it to other cases imposes an
> extra performance cost and masks some cases of the wrong test being
> used to decide whether to delete the PerfData. [See also below
> regarding the OM_PERFDATA_OP macro.]
My preference was to add the has_PerfData() check to all uses of
PerfData in the Java monitor subsystem. I think this is safer, more
consistent and more robust in the face of future changes. And later
on you agree that they won't cost much. :-)
Again, we'll have to disagree about "the wrong test being used to decide
whether to delete the PerfData".
> I think if we test for VM abort context then the
> StatSampler::is_active() test is also unnecessary. The comment in the
> old code says that normally the StatSampler is disengaged, but not
> during VM abort, e.g. that part of the test is to dodge a similar
> problem that would also be solved by distinguishing between VM abort
> and normal exit.
Again, my preference is to check for the most likely PerfData usage
conflicts. If we're at a safepoint, then most code is not using
PerfData objects. If the StatSampler is not active, then it is not
using PerfData objects.
Abnormal VM exit is not well defined currently and I don't want to
go down the path of trying to solidify determination of that state.
If that state is somehow solidified in the future, then I can see
adding that check to the predicate, but I don't want to use it as
the sole predicate.
> If trying to distinguish which case we're in based on call stack and
> passing around an argument proves difficult or overly intrusive, a
> different approach would be to use an (atomic) flag set when we know
> we're in VM abort context, and checked in perfMemory_exit.
Call stack walking? Not going there. We are actively trying to get
rid of cases where we have to determine state based on call stack
walking. It is expensive. We could add a flag that indicates when
the VM is aborting, but it would be error prone and that's not
happening with this bug fix. Nor do I think using abort state as
the sole predicate is the right choice.
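For the archive, the kind of flag Kim describes would presumably look
something like the sketch below. This is purely hypothetical and is not
part of this fix; the names are invented:

    // Hypothetical sketch of an abort flag; not part of this fix.
    static volatile bool _vm_is_aborting = false;

    void record_vm_abort() {        // would be set on the abort path
      _vm_is_aborting = true;
    }

    // perfMemory_exit() could then skip the free on an abnormal exit:
    //   if (!_vm_is_aborting) {
    //     PerfDataManager::destroy();
    //   }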
> ------------------------------------------------------------------------------
> src/share/vm/runtime/objectMonitor.hpp
> 181 // Only perform a PerfData operation if the PerfData object has been
> 182 // allocated and if the PerfDataManager has not freed the PerfData
> 183 // objects which can happen at normal VM shutdown.
> 184 //
> 185 #define OM_PERFDATA_OP(f, op_str) \
> 186 do { \
> 187 if (ObjectMonitor::_sync_ ## f != NULL && \
> 188 PerfDataManager::has_PerfData()) { \
> 189 ObjectMonitor::_sync_ ## f->op_str; \
> 190 } \
> 191 } while (0)
>
> This part looks good to me.
>
> We should consider adding PerfDataManager::has_PerfData checks to all
> uses of PerfData. This can be done in a followup.
The different consumers of the PerfData subsystem need to decide
for themselves what kind of checks they need.
> --TLDR--
Is this "Too Long Don't Read" or
is this "Too Long Didn't Read"? :-)
> I liked the rev1 version's is_safe argument, though I think the
> implementation could be simpler with the right predicate being used
> for calling PerfDataManager::destroy (as discussed above). Note that
> I agree with all the discussion that led to the elimination of
> OrderAccess or Atomic usage. So I think it could be written as
>
> #define OM_PERFDATA_OP(f, op_str, is_safe) \
> do { \
> if (ObjectMonitor::_sync_ ## f != NULL && \
> ((is_safe) || PerfDataManager::has_PerfData())) { \
> ObjectMonitor::_sync_ ## f->op_str; \
> } \
> } while (0)
Definitely a cleaner version of the macro, but David convinced
me that the 'is_safe' flag is not useful. This idea has been
abandoned for this fix (8049304). However, you can always revive
it (or something similar) in another fix proposal. You will have
to convince David (and me) that is_safe is useful.
> Distinguishing the two problem cases eliminates unnecessary overhead
> from the others, only doing the extra work where it is actually
> required.
The problem with not always checking PerfDataManager::has_PerfData()
is that the code is not as robust in the face of change. When I
placed all the "is_safe" values, that was based on my safepoint
state analysis of those call sites. Somewhere down the road, a
change might be made to the surrounding safepoint state and the
is_safe == true value for an OM_PERFDATA_OP may no longer be
right. And there would be no easy way for the person changing the
code to know that their change broke the is_safe parameter.
> Distinguishing the two problem cases would also let us consider a more
> expensive implementation for them, in order to eliminate the race
> condition. I think an atomic "counter" can do the trick, but I want
> to think about it some more and do some more literature searching. I
> also think this extra step of eliminating the race could be deferred
> to a followup (if done at all; see below).
I'm definitely done trying to eliminate the race with this bug fix
(8049304). Any such work would have to be done with a new bug.
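(For the archive only: the shape of such a scheme would presumably be
a use counter that readers bump around each PerfData access and that
the destroy path drains before freeing. A sketch with standard atomics,
purely as an illustration and not a proposal for this or any other fix:)

    // Illustration only; standard atomics rather than HotSpot's Atomic class.
    #include <atomic>

    static std::atomic<int>  _perfdata_users(0);
    static std::atomic<bool> _perfdata_available(true);

    bool try_enter_perfdata() {            // reader side
      _perfdata_users.fetch_add(1);
      if (!_perfdata_available.load()) {   // destroy already started
        _perfdata_users.fetch_sub(1);
        return false;
      }
      return true;                         // safe to touch PerfData
    }

    void exit_perfdata() {
      _perfdata_users.fetch_sub(1);
    }

    void destroy_perfdata() {              // destroyer side, at the safepoint
      _perfdata_available.store(false);
      while (_perfdata_users.load() != 0) {
        // wait for in-flight readers to drain
      }
      // now safe to free the PerfData objects
    }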
> I recall there being comments that the "is_safe" name and description
> were confusing. I don't recall whether that discussion went anywhere
> before the decision was made to abandon that part. I think Dan's
> response:
>
> When 'is_safe' is true, the calling thread can safely and directly
> check has_PerfData because we are not at a safepoint so we can't be
> racing with a normal exit. In a normal exit, has_PerfData is only
> changed (and memory freed) at a safepoint.
>
> When 'is_safe' is false, then the code path we're on could be
> executing in parallel with a safepoint so we need to be more careful
> with accessing the has_PerfData flag...
>
> contained useful information that wasn't clear from the original
> description of the argument. If the argument were to be reinstated,
> the above quoted text should be mined to improve the code comment.
Keep it in mind if you choose to make more changes in this area.
However, you'll have to convince David H (and me) that the is_safe
parameter is useful.
> All that said, I'm now going to argue the contrary position.
Now I'm starting to understand TLDR :-)
> 1. The cost of the extra test is pretty small.
>
> 2. In combination with the delayed destruction, the extra test seems
> likely to be effectively sufficient. While there is a theoretical
> possibility of delayed destruction winning the race and leading to a
> post-destruction PerfData access failure, the likelihood of that
> happening seems quite small, especially since we're at a safepoint if
> we're going to destroy the PerfData. The extra cost of a solution to
> the race, both in code complexity and performance, may not be
> worthwhile.
>
> 3. We should consider adding PerfDataManager::has_PerfData checks to
> all uses of PerfData. That's a nice and simple idiom. Doing analysis
> similar to what Dan did for the monitor cases would likely be hard in
> at least some cases, might also be fragile, and probably not worth the
> effort. Adding the test can be done in a followup.
>
> All this is my long-winded way of saying this part of rev2 looks good
> to me.
So you've just taken us on a long journey arguing both sides of
several different arguments... Wow. I think you have me beat for
longest code reviews :-) Good stuff to have in the archive to
document what we've discussed. Hard for folks to keep in their
heads.
> ------------------------------------------------------------------------------
> src/share/vm/runtime/perfData.hpp
> 671 volatile static jint _has_PerfData;
> 873 static bool has_PerfData() { return _has_PerfData != 0; }
>
> Given that _has_PerfData is being treated as an ordinary boolean, with
> no atomic operations involved to limit the possible type, shouldn't
> the type just be bool? Also need to update initialization and
> assignments.
The "jint" is a carry over from when OrderAccess functions were being
used to manage the flag. I'll take a look at cleaning this up.
> Also, I think the more usual order is "static volatile", both in
> hotspot and in other code bases I'm familiar with. Presently in
> hotspot/src there are 94 occurrences of "static volatile", and only
> one "volatile static".
I'll change "volatile static" -> "static volatile".
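Something along these lines (a sketch only; the actual cleanup may
end up slightly different):

    // perfData.hpp -- sketch of the cleanup, not the final change:
    static volatile bool _has_PerfData;
    static bool has_PerfData() { return _has_PerfData; }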
> ------------------------------------------------------------------------------
> src/share/vm/runtime/objectMonitor.cpp
> 1752 void ObjectMonitor::notify(TRAPS) {
> ...
> 1760 OM_PERFDATA_OP(Notifications, inc(1));
>
> Minor style inconsistency: This uses "inc(1)" where all other places
> use "inc()". Of course, I expect the generated code to be identical.
> It just looked odd.
"inc(1)" was what the original code used so I kept it.
>
> ------------------------------------------------------------------------------
>