RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()

Thu Sep 3 02:15:41 UTC 2015

On Sep 2, 2015, at 12:45 AM, Daniel D. Daugherty <daniel.daugherty at oracle.com> wrote:
> 
> Greetings,
> 
> I've updated the "fix" for this bug based on code review comments
> received in round 1. I've dropped most of the changes from round 1
> with a couple of exceptions.
> 
> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
> https://bugs.openjdk.java.net/browse/JDK-8049304
> 
> Webrev URL: http://cr.openjdk.java.net/~dcubed/8049304-webrev/2-jdk9-hs-rt/
> 
> The easiest way to re-review is to download the two patch files
> (round 0 and round 2) and view them in your favorite file merge tool:
> 
> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/hotspot.patch
> http://cr.openjdk.java.net/~dcubed/8049304-webrev/2-jdk9-hs-rt/hotspot.patch

------------------------------------------------------------------------------ 
src/share/vm/runtime/perfMemory.cpp
  68   // Only destroy PerfData objects if we're at a safepoint and the
  69   // StatSampler is not active. Otherwise, we risk removing PerfData
  70   // objects that are currently being used by running JavaThreads
  71   // or the StatSampler. This method is invoked while we are not at
  72   // a safepoint during a VM abort so leaving the PerfData objects
  73   // around may also help diagnose the failure.
  74   //
  75   if (SafepointSynchronize::is_at_safepoint() && !StatSampler::is_active()) {
  76     PerfDataManager::destroy();
  77   }

I don't think that's the right predicate to use to control the call to
PerfDataManager::destroy.

We may be at a safepoint because (for example) we're in a GC pause,
but there are multiple GC worker threads running that might attempt to
touch the PerfData, and we get here because of a VM abort.

Another problematic example is the one which got me here in the first
place, e.g. vm_exit_during_initialization.  At that point we haven't
started any Java threads to be brought to a safepoint.

That is, I think normal exit implies we're at a safepoint here, but
the converse is not true.

And what we need to know is whether we're in a VM abort or a normal
exit.  [And vm_shutdown_during_initialization is an additional case
that I don't fully understand.  I'm not sure whether that should be
treated as a normal exit or a VM abort.]

I think a reason the testing of these changes didn't show any of these
problem cases is that the has_PerfData check has been added to monitor
code that doesn't strictly need it.  Based on your analysis, I think
the additional protection from PerfData deletion should only be needed
in the two FutileWakeups cases.  Applying it to other cases imposes an
extra performance cost and masks some cases of the wrong test being
used to decide whether to delete the PerfData.  [See also below
regarding the OM_PERFDATA_OP macro.]

I think if we test for VM abort context then the
StatSampler::is_active() test is also unnecessary.  The comment in the
old code says that normally the StatSampler is disengaged, but not
during VM abort, e.g. that part of the test is to dodge a similar
problem that would also be solved by distinguishing between VM abort
and normal exit.

If trying to distinguish which case we're in based on call stack and
passing around an argument proves difficult or overly intrusive, a
different approach would be to use an (atomic) flag set when we know
we're in VM abort context, and checked in perfMemory_exit.

------------------------------------------------------------------------------ 
src/share/vm/runtime/objectMonitor.hpp
 181   // Only perform a PerfData operation if the PerfData object has been
 182   // allocated and if the PerfDataManager has not freed the PerfData
 183   // objects which can happen at normal VM shutdown.
 184   //
 185   #define OM_PERFDATA_OP(f, op_str)              \
 186     do {                                         \
 187       if (ObjectMonitor::_sync_ ## f != NULL &&  \
 188           PerfDataManager::has_PerfData()) {     \
 189         ObjectMonitor::_sync_ ## f->op_str;      \
 190       }                                          \
 191     } while (0)

This part looks good to me.

We should consider adding PerfDataManager::has_PerfData checks to all
uses of PerfData.  This can be done in a followup.

--TLDR--

I liked the rev1 version's is_safe argument, though I think the
implementation could be simpler with the right predicate being used
for calling PerfDataManager::destroy (as discussed above).  Note that
I agree with all the discussion that led to the elimination of
OrderAccess or Atomic usage.  So I think it could be written as

#define OM_PERFDATA_OP(f, opt_str, is_safe)                \
  do {                                                     \
    if (ObjectMonitor::_sync_ ## f != NULL &&              \
        ((is_safe) || PerfDataManager::has_PerfData())) {  \
      ObjectMonitor::_sync_ ## f->op_str;                  \
    }                                                      \
  } while (0)

Distinguishing the two problem cases eliminates unnecessary overhead
from the others, only doing the extra work where it is actually
required. 

Distinguishing the two problem cases would also let us consider a more
expensive implementation for them, in order to eliminate the race
condition.  I think an atomic "counter" can do the trick, but I want
to think about it some more and do some more literature seaching.  I
also think this extra step of eliminating the race could be deferred
to a followup (if done at all; see below).

I recall there being comments that the "is_safe" name and description
were confusing.  I don't recall whether that discussion went anywhere
before the decision was made to abandon that part.  I think Dan's
response:

  When 'is_safe' is true, the calling thread can safely and directly
  check has_PerfData because we are not at a safepoint so we can't be
  racing with a normal exit. In a normal exit, has_PerfData is only
  changed (and memory freed) at a safepoint.

  When 'is_safe' is false, then the code path we're on could be
  executing in parallel with a safepoint so we need to be more careful
  with accessing the has_PerfData flag...

contained useful information that wasn't clear from the original
description of the argument.  If the argument were to be reinstated,
the above quoted text should be mined to improve the code comment.

All that said, I'm now going to argue the contrary position.

1. The cost of the extra test is pretty small.

2. In combination with the delayed destruction, the extra test seems
likely to be effectively sufficient.  While there is a theoretical
possibility of delayed destruction winning the race and leading to a
post-destruction PerfData access failure, the likelyhood of that
happening seems quite small, especially since we're at a safepoint if
we're going to destroy the PerfData.  The extra cost of a solution to
the race, both in code complexity and performance, may not be
worthwhile.

3. We should consider adding PerfDataManager::has_PerfData checks to
all uses of PerfData.  That's a nice and simple idiom.  Doing analysis
similar to what Dan did for the monitor cases would likely be hard in
at least some cases, might also be fragile, and probably not worth the
effort.  Adding the test can be done in a followup.

All this is my long-winded way of saying this part of rev2 looks good
to me.

------------------------------------------------------------------------------ 
src/share/vm/runtime/perfData.hpp
 671     volatile static jint _has_PerfData;
 873     static bool has_PerfData() { return _has_PerfData != 0; }

Given that _has_PerfData is being treated as an ordinary boolean, with
no atomic operations involved to limit the possible type, shouldn't
the type just be bool?  Also need to update initialization and
assignments. 

Also, I think the more usual order is "static volatile", both in
hotspot and in other code bases I'm familiar with.  Presently in
hotspot/src there are 94 occurrences of "static volatile", and only
one "volatile static".

------------------------------------------------------------------------------
src/share/vm/runtime/objectMonitor.cpp
1752 void ObjectMonitor::notify(TRAPS) {
...
1760   OM_PERFDATA_OP(Notifications, inc(1));

Minor style inconsistency: This uses "inc(1)" where all other places
use "inc()".  Of course, I expect the generated code to be identical.
It just looked odd.

------------------------------------------------------------------------------