RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()
Daniel D. Daugherty
daniel.daugherty at oracle.com
Fri Aug 28 18:12:03 UTC 2015
On 8/28/15 11:57 AM, Tom Benson wrote:
> Hi Dan,
>
> On 8/28/2015 12:27 PM, Daniel D. Daugherty wrote:
>> On 8/28/15 8:45 AM, Tom Benson wrote:
>>> Hi,
>>> One more pair of eyes on this. 8^)
>>
>> Hi Tom!
>>
>> Thanks for reviewing and welcome to the party...
>>
>>
>>>
>>> On 8/27/2015 8:16 PM, Kim Barrett wrote:
>>>> On Aug 27, 2015, at 5:42 PM, Daniel D. Daugherty
>>>> <daniel.daugherty at oracle.com> wrote:
>>>>> Sorry for starting another e-mail thread fork in an already
>>>>> complicated
>>>>> review...
>>>> OK, that was fascinating. No, really, I mean it.
>>>>
>>>> It made me realize that we've been arguing and talking past each other
>>>> in part because we're really dealing with two distinct though closely
>>>> related bugs here.
>>>>
>>>> I've been primarily thinking about the case where we're calling
>>>> vm_abort / os::abort, where we presently delete the PerfData
>>>> memory even though there can be arbitrary other threads running. This
>>>> was the case in JDK-8129978, which is how I got involved here in the
>>>> first place. In that bug we were in vm_exit_during_initialization and
>>>> had called perfMemory_exit when some thread attempted to inflate a
>>>> monitor (which is not one of the conflicting cases discussed by Dan).
>>>>
>>>> The problem Dan has been looking at, JDK-8049304, is about a "normal"
>>>> VM shutdown. In this case, the problem is that we believe it is safe
>>>> to delete the PerfData, because we've safepointed, and yet some thread
>>>> unexpectedly runs and attempts to touch the deleted data anyway.
>>>>
>>>> I think Dan's proposed fix (mostly) avoids the specific instance of
>>>> JDK-8129978, but doesn't solve the more general problem of abnormal
>>>> exit deleting the PerfData while some running thread is touching some
>>>> non-monitor-related part of that data. My proposal to leave it to the
>>>> OS to deal with memory cleanup on process exit would deal with this
>>>> case.
>>>>
>>>> I think Dan's proposed fix (mostly) avoids problems like JDK-8049304.
>>>> And the approach I've been talking about doesn't help at all for this
>>>> case. But I wonder if Dan's proposed fix can be improved. A "futile
>>>> wakeup" case doesn't seem to me like one which requires super-high
>>>> performance. Would it be ok, in the two problematic cases that Dan
>>>> identified, to use some kind of atomic / locking protocol with the
>>>> cleanup? Or is the comment for the counter increment in EnterI (and
>>>> only there) correct that it's important to avoid a lock or atomics
>>>> here (and presumably in ReenterI too)?
>>>>
>>>
>>> I notice that EnterI/ReenterI both end with OrderAccess::fence().
>>> Can the potential update of _sync_FutileWakeups be delayed until
>>> that point, to take advantage of the fence to make the sync hole
>>> even smaller?
>>
>> Not easily with EnterI(), since there is one optional optimization
>> between the OM_PERFDATA_OP(FutileWakeups, inc()) call and the
>> OrderAccess::fence() call, and that would result in lost FutileWakeups
>> increments.
>>
>> Not easily in ReenterI() either: the OM_PERFDATA_OP(FutileWakeups, inc())
>> call is at the bottom of the for-loop, and the OrderAccess::fence() call
>> at the end of the function is outside the loop. This would also result
>> in lost FutileWakeups increments.
> Yes, you'd have to keep a local count, and then add the total outside
> the loop for both Enter/Reenter, after the fence. But I see what you
> mean about the other exit paths in Enter. (The more I look at this
> code, the more I remember it...)
I'm sure that is a wonderfully loving memory too!
> BTW, are those knob_ setting defaults ever going to be moved to a
> platform-specific module? That was my beef (well, one of them) in 2
> different ports. Or is the goal to make monitors so well self-tuning
> that they can go away? Sorry for the digression... 8^))
Dice is working on another idea to move tuning to a separate loadable
module, which is why we deferred the "adaptive spin" and "SpinPause on
SPARC" buckets for the Contended Locking project.
> At any rate, as you say, perhaps it's not worth it to leverage the
> fences, though it could be done.
OK, so we're agreed on no change here.
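For what it's worth, here is a standalone sketch of the local-count idea
Tom describes, with std::atomic standing in for HotSpot's OrderAccess and
PerfData machinery. All of the names below are illustrative, not the
actual ObjectMonitor code:

    #include <atomic>

    // Illustrative stand-in for the _sync_FutileWakeups PerfData counter.
    static std::atomic<long> sync_FutileWakeups{0};

    void enter_slow_path_sketch(std::atomic<int>& owner, int self) {
      int futile_wakeups = 0;           // local tally, no shared writes yet
      for (;;) {
        int expected = 0;
        if (owner.compare_exchange_strong(expected, self)) {
          break;                        // acquired the monitor
        }
        // ... park, wake up, find the monitor still owned, retry ...
        ++futile_wakeups;               // count the futile wakeup locally
      }
      // One fence at the end (EnterI/ReenterI end with OrderAccess::fence()),
      // then publish the whole tally at once.
      std::atomic_thread_fence(std::memory_order_seq_cst);
      if (futile_wakeups != 0) {
        sync_FutileWakeups.fetch_add(futile_wakeups);
      }
    }

As Dan notes above, the optional exit path in EnterI() between the
increment and the fence is exactly what this sketch glosses over, which
is why the thread settles on making no change here.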
>
>>
>> So in ReenterI() the OM_PERFDATA_OP(FutileWakeups, inc()) call
>> immediately
>> follows an OrderAccess::fence() call. Doesn't that make that
>> increment as
>> "safe" as it can be without having a real lock?
>>
>>
>>> You've got a release() (and a short nap!) with the store in
>>> PerfDataManager::destroy() to try to close the window somewhat.
>>
>> Yes, I modeled that after:
>>
>> src/share/vm/runtime/perfMemory.cpp:
>>
>> 83 void PerfMemory::initialize() {
>> <snip>
>> 156 OrderAccess::release_store(&_initialized, 1);
>> 157 }
>>
>>
>>> But I think rather than the release_store() you used, you want a
>>> store, followed by a release(). release_store() puts a fence
>>> before the store to ensure earlier updates are seen before the
>>> current one, no?
>>
>> Yup, and I see I got my reasoning wrong. The code I modeled this
>> after is right because you want to flush all the inits, and it's OK
>> if the _initialized transition from '0' -> '1' is lazily seen.
>>
>> For my shutdown use, we are transitioning from '1' -> '0' and
>> we need that to be seen proactively so:
>>
>> OrderAccess::release_store(&_has_PerfData, 0);
>> OrderAccess::storeload();
>>
>> which is modeled after _owner field transitions from non-zero
>> -> NULL in ObjectMonitor.cpp
>>
>
> It's not clear to me why the store needs to be a release_store in this
> case, as long as the storeload() follows it. You're not protecting
> any earlier writes, are you?
I'm following the model that Dice uses for ObjectMonitors when we
change the _owner field from non-NULL -> NULL. There are some long
comments in the ObjectMonitor.cpp stuff that talk about why it is
done this way. I'm planning to file a cleanup bug for the non-NULL
-> NULL transition stuff because the comments and code are not
consistent.
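To make the two orderings being contrasted concrete, here is a standalone
sketch of both transitions, again with std::atomic standing in for
OrderAccess (hypothetical names; this is not the actual perfMemory.cpp or
perfData.cpp code):

    #include <atomic>

    static std::atomic<int> has_PerfData{0};

    void initialize_sketch() {
      // ... set up the PerfData memory ...
      // 0 -> 1: a release store is enough. All of the setup above must be
      // visible to any thread that observes the '1', but it is OK if the
      // transition itself is seen lazily.
      has_PerfData.store(1, std::memory_order_release);
    }

    void destroy_sketch() {
      // 1 -> 0: the flag change must be seen proactively, so follow the
      // release store (OrderAccess::release_store(&_has_PerfData, 0))
      // with a full fence (OrderAccess::storeload()) before any
      // subsequent loads and the eventual free of the PerfData memory.
      has_PerfData.store(0, std::memory_order_release);
      std::atomic_thread_fence(std::memory_order_seq_cst);
      // ... short nap, then free the PerfData counters (still racy) ...
    }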
>
>>
>>>
>>> Also, I think the comment above that release_store() could be
>>> clarified. It is fine as is if you're familiar with this bug report
>>> and discussion, but... I think it should explicitly say there is
>>> still a very small window for the lack of true synchronization to
>>> cause a failure. And perhaps that the release_store() (or
>>> store/release()) is not half of an acquire/release pair.
>>
>> Here's the existing comment:
>>
>> 286 // Clear the flag before we free the PerfData counters. Thus begins
>> 287 // the race between this thread and another thread that has just
>> 288 // queried PerfDataManager::has_PerfData() and gotten back 'true'.
>> 289 // The hope is that the other thread will finish its PerfData
>> 290 // manipulation before we free the memory. The two alternatives
>> 291 // are 1) leak the PerfData memory or 2) do some form of ordered
>> 292 // access before every PerfData operation.
>>
>> I think it pretty clearly states that there is still a race here.
>> And I think that option 2 covers that we're not doing completely
>> safe ordered access. I'm not sure how to make this comment more
>> clear, but if you have specific suggestions...
>>
> OK, I guess it's subjective.
I'll always take wording tweaks so if something occurs to you
later on... :-)
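For readers without the webrev in front of them, the guarded access that
this comment describes amounts to something like the following sketch
(std::atomic stand-ins again; in the actual fix the check-then-use is
wrapped up in the OM_PERFDATA_OP() macro):

    #include <atomic>

    struct PerfCounter {                      // stand-in for PerfData
      long value = 0;
      void inc() { ++value; }
    };
    static PerfCounter counter;               // freed at shutdown in real life
    static PerfCounter* sync_FutileWakeups = &counter;
    static std::atomic<int> has_PerfData{1};

    void futile_wakeup_inc_sketch() {
      // Check the flag, then touch the counter. The window between the
      // check and the inc() is exactly the race the comment admits to:
      // destroy() can clear the flag and free the counters in between.
      if (has_PerfData.load(std::memory_order_acquire) != 0) {
        sync_FutileWakeups->inc();
      }
    }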
Dan
> Tom
>
>> Dan
>>
>>
>>>
>>> Tom
>>
>