RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()
Tom Benson
tom.benson at oracle.com
Fri Aug 28 17:57:39 UTC 2015
Hi Dan,
On 8/28/2015 12:27 PM, Daniel D. Daugherty wrote:
> On 8/28/15 8:45 AM, Tom Benson wrote:
>> Hi,
>> One more pair of eyes on this. 8^)
>
> Hi Tom!
>
> Thanks for reviewing and welcome to the party...
>
>
>>
>> On 8/27/2015 8:16 PM, Kim Barrett wrote:
>>> On Aug 27, 2015, at 5:42 PM, Daniel D. Daugherty
>>> <daniel.daugherty at oracle.com> wrote:
>>>> Sorry for starting another e-mail thread fork in an already
>>>> complicated
>>>> review...
>>> OK, that was fascinating. No, really, I mean it.
>>>
>>> It made me realize that we've been arguing and talking past each other
>>> in part because we're really dealing with two distinct though closely
>>> related bugs here.
>>>
>>> I've been primarily thinking about the case where we're calling
>>> vm_abort / os::abort, where we presently delete the PerfData
>>> memory even though there can be arbitrary other threads running. This
>>> was the case in JDK-8129978, which is how I got involved here in the
>>> first place. In that bug we were in vm_exit_during_initialization and
>>> had called perfMemory_exit when some thread attempted to inflate a
>>> monitor (which is not one of the conflicting cases discussed by Dan).
>>>
>>> The problem Dan has been looking at, JDK-8049304, is about a "normal"
>>> VM shutdown. In this case, the problem is that we believe it is safe
>>> to delete the PerfData, because we've safepointed, and yet some thread
>>> unexpectedly runs and attempts to touch the deleted data anyway.
>>>
>>> I think Dan's proposed fix (mostly) avoids the specific instance of
>>> JDK-8129978, but doesn't solve the more general problem of abnormal
>>> exit deleting the PerfData while some running thread is touching some
>>> non-monitor-related part of that data. My proposal to leave it to the
>>> OS to deal with memory cleanup on process exit would deal with this
>>> case.
>>>
>>> I think Dan's proposed fix (mostly) avoids problems like JDK-8049304.
>>> And the approach I've been talking about doesn't help at all for this
>>> case. But I wonder if Dan's proposed fix can be improved. A "futile
>>> wakeup" case doesn't seem to me like one which requires super-high
>>> performance. Would it be ok, in the two problematic cases that Dan
>>> identified, to use some kind of atomic / locking protocol with the
>>> cleanup? Or is the comment for the counter increment in EnterI (and
>>> only there) correct that it's important to avoid a lock or atomics
>>> here (and presumably in ReenterI too)?
>>>
>>
>> I notice that EnterI/ReenterI both end with OrderAccess::fence().
>> Can the potential update of _sync_FutileWakeups be delayed until that
>> point, to take advantage of the fence to make the sync hole even
>> smaller?
>
> Not easily with EnterI() since there is an optional optimization (an
> early exit path) between the OM_PERFDATA_OP(FutileWakeups, inc()) call
> and the OrderAccess::fence() call, so moving the increment after the
> fence would result in lost FutileWakeups increments.
>
> Not easily in ReenterI() either; the OM_PERFDATA_OP(FutileWakeups, inc())
> call is at the bottom of the for-loop and the OrderAccess::fence() call
> at the end of the function is outside the loop, so moving the increment
> there would also result in lost FutileWakeups increments.
Yes, you'd have to keep a local count and then add the total outside
the loop, after the fence, for both EnterI/ReenterI (roughly as in the
sketch below). But I see what you mean about the other exit paths in
EnterI. (The more I look at this code, the more I remember it... BTW,
are those knob_ setting defaults ever going to be moved to a
platform-specific module? That was my beef (well, one of them) in two
different ports. Or is the goal to make monitors so well self-tuning
that they can go away? Sorry for the digression... 8^))
At any rate, as you say, perhaps it's not worth it to leverage the
fences, though it could be done.
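Here's roughly what I mean, as a sketch only (the local tally variable
is something I'm making up for illustration, the loop body is heavily
elided, and I'm assuming the FutileWakeups counter supports inc(n)):

    int futile_wakeups = 0;            // hypothetical local tally, no shared writes
    for (;;) {
      // ... park(), wake up, retry ...
      if (TryLock(Self) > 0) break;    // got the monitor; leave the loop
      futile_wakeups++;                // was: OM_PERFDATA_OP(FutileWakeups, inc())
    }
    // ...
    OrderAccess::fence();              // the fence already at the end of EnterI/ReenterI
    if (futile_wakeups > 0) {
      // publish the whole tally once, after the fence
      OM_PERFDATA_OP(FutileWakeups, inc(futile_wakeups));
    }

The early-exit paths in EnterI() would just need the same "flush the
tally" step before they leave, which is where it starts to get ugly.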
>
> So in ReenterI() the OM_PERFDATA_OP(FutileWakeups, inc()) call
> immediately
> follows an OrderAccess::fence() call. Doesn't that make that increment as
> "safe" as it can be without having a real lock?
>
>
>> You've got a release() (and a short nap!) with the store in
>> PerfDataManager::destroy() to try to close the window somewhat.
>
> Yes, I modeled that after:
>
> src/share/vm/runtime/perfMemory.cpp:
>
> 83 void PerfMemory::initialize() {
> <snip>
> 156 OrderAccess::release_store(&_initialized, 1);
> 157 }
>
>
>> But I think rather than the release_store() you used, you want a
>> store, followed by a release(). release_store() puts a fence before
>> the store to ensure earlier updates are seen before the current one, no?
>
> Yup, and I see I got my reasoning wrong. The code I modeled
> is right because you want to flush all the inits and it's OK
> if the _initialized transition from '0' -> '1' is lazily seen.
>
> For my shutdown use, we are transitioning from '1' -> '0' and
> we need that to be seen proactively so:
>
> OrderAccess::release_store(&_has_PerfData, 0);
> OrderAccess::storeload();
>
> which is modeled after the _owner field transitions from non-zero
> -> NULL in ObjectMonitor.cpp.
>
It's not clear to me why the store needs to be a release_store() in this
case, as long as the storeload() follows it. You're not protecting any
earlier writes, are you?
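To make the question concrete (a sketch, not the webrev; _has_PerfData
and the OrderAccess calls are just as quoted above):

    // As in the proposed fix: release_store() + storeload().
    // The release orders *earlier* stores before the flag update, which
    // is the part I don't think we need here.
    OrderAccess::release_store(&_has_PerfData, 0);
    OrderAccess::storeload();   // flag store visible before our later loads

    // What seems sufficient to me, since no earlier writes are being
    // published to readers of the flag:
    _has_PerfData = 0;          // plain store of the '1' -> '0' transition
    OrderAccess::storeload();

Either way, as I understand the intent, the storeload() is what forces
the cleared flag to become visible before this thread moves on to
freeing the counters.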
>
>>
>> Also, I think the comment above that release_store() could be
>> clarified. It is fine as is if you're familiar with this bug report
>> and discussion, but... I think it should explicitly say there is
>> still a very small window for the lack of true synchronization to
>> cause a failure. And perhaps that the release_store() (or
>> store/release()) is not half of an acquire/release pair.
>
> Here's the existing comment:
>
> 286 // Clear the flag before we free the PerfData counters. Thus begins
> 287 // the race between this thread and another thread that has just
> 288 // queried PerfDataManager::has_PerfData() and gotten back 'true'.
> 289 // The hope is that the other thread will finish its PerfData
> 290 // manipulation before we free the memory. The two alternatives
> 291 // are 1) leak the PerfData memory or 2) do some form of ordered
> 292 // access before every PerfData operation.
>
> I think it pretty clearly states that there is still a race here.
> And I think that option 2 covers that we're not doing completely
> safe ordered access. I'm not sure how to make this comment more
> clear, but if you have specific suggestions...
>
OK, I guess it's subjective.
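For anyone else following the thread, my reading of the guard being
raced against is roughly the following; the macro shape here is my
reconstruction for illustration, not a quote of the webrev:

    // Assumed shape of the guarded counter access -- illustrative only.
    #define OM_PERFDATA_OP(f, op_str)                              \
      do {                                                         \
        if (ObjectMonitor::_sync_##f != NULL &&                    \
            PerfDataManager::has_PerfData()) {                     \
          ObjectMonitor::_sync_##f->op_str;                        \
        }                                                          \
      } while (0)

    // The window: a thread can pass the has_PerfData() check just before
    // PerfDataManager::destroy() clears the flag and frees the counters,
    // and then touch freed memory. The flag store plus the short nap only
    // shrink that window; they don't close it.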
Tom
> Dan
>
>
>>
>> Tom
>