RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()

Fri Aug 28 16:27:39 UTC 2015

On 8/28/15 8:45 AM, Tom Benson wrote:
> Hi,
> One more pair of eyes on this.  8^)

Hi Tom!

Thanks for reviewing and welcome to the party...

>
> On 8/27/2015 8:16 PM, Kim Barrett wrote:
>> On Aug 27, 2015, at 5:42 PM, Daniel D. Daugherty 
>> <daniel.daugherty at oracle.com> wrote:
>>> Sorry for starting another e-mail thread fork in an already complicated
>>> review...
>> OK, that was fascinating.  No, really, I mean it.
>>
>> It made me realize that we've been arguing and talking past each other
>> in part because we're really dealing with two distinct though closely
>> related bugs here.
>>
>> I've been primarily thinking about the case where we're calling
>> vm_abort / os::abort, where the we presently delete the PerfData
>> memory even though there can be arbitrary other threads running.  This
>> was the case in JDK-8129978, which is how I got involved here in the
>> first place.  In that bug we were in vm_exit_during_initialization and
>> had called perfMemory_exit when some thread attempted to inflate a
>> monitor (which is not one of the conflicting cases discussed by Dan).
>>
>> The problem Dan has been looking at, JDK-8049304, is about a "normal"
>> VM shutdown.  In this case, the problem is that we believe it is safe
>> to delete the PerfData, because we've safepointed, and yet some thread
>> unexpectedly runs and attempts to touch the deleted data anyway.
>>
>> I think Dan's proposed fix (mostly) avoids the specific instance of
>> JDK-8129978, but doesn't solve the more general problem of abnormal
>> exit deleting the PerfData while some running thread is touching some
>> non-monitor-related part of that data.  My proposal to leave it to the
>> OS to deal with memory cleanup on process exit would deal with this
>> case.
>>
>> I think Dan's proposed fix (mostly) avoids problems like JDK-8049304.
>> And the approach I've been talking about doesn't help at all for this
>> case.  But I wonder if Dan's proposed fix can be improved.  A "futile
>> wakeup" case doesn't seem to me like one which requires super-high
>> performance.  Would it be ok, in the two problematic cases that Dan
>> identified, to use some kind of atomic / locking protocol with the
>> cleanup?  Or is the comment for the counter increment in EnterI (and
>> only there) correct that it's important to avoid a lock or atomics
>> here (and presumably in ReenterI too).
>>
>
> I notice that EnteriI/ReenterI both end with OrderAccess::fence(). Can 
> the potential update of _sync_FutileWakeups be delayed until that 
> point, to take advantage of the fence to make the sync hole even smaller?

Not easily with EnterI() since there is one optional optimization
between the OM_PERFDATA_OP(FutileWakeups, inc()) call and the
OrderAccess::fence() call and that would result in lost FutileWakeups
increments.

Not easily in ReenterI(), the OM_PERFDATA_OP(FutileWakeups, inc()) call
is at the bottom of the for-loop and the OrderAccess::fence() call at
the end of the function is outside the loop. This would result in lost
FutileWakeups increments.

So in ReenterI() the OM_PERFDATA_OP(FutileWakeups, inc()) call immediately
follows an OrderAccess::fence() call. Doesn't that make that increment as
"safe" as it can be without having a real lock?

> You've got a release() (and and short nap!) with the store in 
> PerfDataManager::destroy() to try to close the window somewhat.

Yes, I modeled that after:

src/share/vm/runtime/perfMemory.cpp:

     83  void PerfMemory::initialize() {
<snip>
    156    OrderAccess::release_store(&_initialized, 1);
    157  }

> But I think rather than the release_store()  you used, you want a 
> store, followed by a release().   release_store() puts a fence before 
> the store to ensure earlier updates are seen before the current one, no?

Yup, and I see I got my reasoning wrong. The code I modeled
is right because you want to flush all the inits and it's OK
if the _initialized transition from '0' -> '1' is lazily seen.

For my shutdown use, we are transitioning from '1' -> '0' and
we need that to be seen proactively so:

     OrderAccess::release_store(&_has_PerfData, 0);
     OrderAccess::storeload();

which is modeled after _owner field transitions from non-zeo
-> NULL in ObjectMonitor.cpp

>
> Also, I think the comment above that release_store() could be 
> clarified.  It is fine as is if you're familiar with this bug report 
> and discussion, but...  I think it should explicitly say there is 
> still a very small window for the lack of true synchronization to 
> cause a failure.  And perhaps that the release_store() (or 
> store/release())  is not half of an acquire/release pair.

Here's the existing comment:

    286    // Clear the flag before we free the PerfData counters. Thus 
begins
    287    // the race between this thread and another thread that has just
    288    // queried PerfDataManager::has_PerfData() and gotten back 
'true'.
    289    // The hope is that the other thread will finish its PerfData
    290    // manipulation before we free the memory. The two alternatives
    291    // are 1) leak the PerfData memory or 2) do some form of ordered
    292    // access before every PerfData operation.

I think it pretty clearly states that there is still a race here.
And I think that option 2 covers that we're not doing completely
safe ordered access. I'm not sure how to make this comment more
clear, but if you have specific suggestions...

Dan

>
> Tom