RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()

Daniel D. Daugherty daniel.daugherty at oracle.com
Tue Sep 1 14:07:12 UTC 2015


On 8/31/15 11:11 PM, David Holmes wrote:
> Hi Dan,
>
> On 1/09/2015 2:58 AM, Daniel D. Daugherty wrote:
>> Hi David,
>>
>> Replies embedded below...
>
> Likewise ... but with some trimming (I'm getting RSI hitting page-down).

Yup. I use an Apple MagicMouse and I'm getting tired of
the scrolling drag... :-)


>
>> On 8/30/15 7:30 PM, David Holmes wrote:
>>> On 29/08/2015 2:27 AM, Daniel D. Daugherty wrote:
>>>> On 8/28/15 8:45 AM, Tom Benson wrote:
>>>> For my shutdown use, we are transitioning from '1' -> '0' and
>>>> we need that to be seen proactively so:
>>>
>>> Nit: OrderAccess "barriers" enforce ordering constraints but don't in
>>> general provide any guarantees about visibility - ie they are not
>>> necessarily "flushes". So while it may be true on some platforms by
>>> virtue of the underlying barrier mechanism, in general they don't
>>> change when a write becomes visible and so there is nothing
>>> "proactive" about them.
>>
>> My apologies for being sloppy with my wording.
>>
>> What I'm trying to do is put up a barrier so that once the
>> deleting thread has changed the flag from '1' -> '0' another
>> thread trying to read the flag won't see the '1' when it
>> should see the '0'. In other words, I don't want the other
>> thread's read of the flag to float above the setting of
>> the flag to '0'.
>
> That still suggests to me a notion that the barrier somehow forces 
> visibility and that after the barrier_store returns all other threads 
> are now guaranteed to see the value '0' - which is not the case. All 
> the barriers here do is ensure a global observability ordering: if 
> another thread sees an action that happened after the barrier, then it 
> must see the action that happened before the barrier.

I'm not trying to say that visibility is forced. I was shooting
for something like "happens before" and "happens after", but I
didn't use those terms...


>
> More below ...
>
>>>
>>>>      OrderAccess::release_store(&_has_PerfData, 0);
>>>>      OrderAccess::storeload();
>>>
>>> I agree with Tom that the release is unnecessary - though harmless.
>>
>> I tried to switch the code to:
>>
>>      OrderAccess::store(&_has_PerfData, 0)
>>      OrderAccess::storeload();
>>
>> but that wouldn't compile on Solaris X64:
>>
>> Error: static OrderAccess::store(volatile int*, int) is not accessible
>> from static PerfDataManager::destroy().
>
> Strange - either a bug or an issue with including the .inline.hpp file.

Right. Not worth chasing since I don't really need it.


>
>> But then it occurred to me that I was trying too hard
>> to use OrderAccess functions...
>>
>>> The real ordering constraint here is that we preserve:
>>>
>>> _has_PerfData = 0;
>>> <do actual deallocation>
>>
>> I think I can switch to a straight "_has_PerfData = 0;" here and
>> not use either OrderAccess::release_store() or OrderAccess::store().
>
> Yes. Particularly as OrderAccess::store is just a plain store for 
> 32-bit types.
>
>> I also reread OrderAccess.hpp again and convinced myself that
>> an "OrderAccess::fence()" call is the only way to be sure here.
>> I'm pretty sure that achieves 'storeload|storestore' barrier
>> semantics... Yes, I have a headache again... :-)
>
> Yes it does, as well as loadLoad and loadStore.
>
>>
>>> for which a storeload|storestore barrier after the write seems most
>>> appropriate. Though with the insertion of the sleep after the write
>>> there won't be any reordering anyway so explicit barriers seem 
>>> redundant.
>>
>> I didn't think that os::naked_short_sleep() would do anything
>> to keep another thread's read from floating above the
>> "_has_PerfData = 0;". What did I miss?
>
> This is somewhat of a side-discussion but ...
>
> First your notion of floating reads in other threads is somewhat 
> troubling to me. I kind of get what you mean but ...
>
> Now for the sleep. As always with ordering constraints there are two 
> sides: the writer and the reader. Here we are dealing with the writer. 
> The writer is executing:
>
> _has_PerfData = 0;
> for (int index = 0; index < _all->length(); index++) {
>   PerfData* p = _all->at(index);
>   delete p;
> }
>
> for the sake of discussion let's assume there's only one PerfData 
> object and that deleting it amounts to setting an internal field to 
> NULL. So our code becomes:
>
> _has_PerfData = 0;
> data->ref = NULL;
>
> The ordering constraint is that the above two statements happen in the 
> order written - so we don't delete the data while the data is claimed 
> to be valid. That requires both that the compiler not reorder them, 
> nor the hardware. If we did:
>
> _has_PerfData = 0;
> membar storeload|storestore;
> data->ref = NULL;
>
> then we achieve that goal. Now consider if we instead have:
>
> _has_PerfData = 0;
> os::naked_short_sleep(1);
> data->ref = NULL;
>
> The sleep call leads to a library call and a syscall and ultimately 
> likely a context switch as the current thread gets descheduled for 
> 1ms. In terms of machine instructions the distance between the initial 
> write of _has_PerfData and the load of 'data' and so the store to data 
> is huge. Can those stores be reordered by the hardware? Theoretically 
> yes, but in practice I say not at all likely: the separation is too 
> large and there are likely additional barriers already within the code 
> that will do the sleep. Can the stores be reordered by the compiler? 
> Again theoretically yes, but in practice for a compiler to reorder 
> across the sleep requires intimate knowledge of everything that 
> happens within the sleep - no compiler we currently use would even 
> attempt to try and figure that out. So the sleep, while not a 
> guarantee, is most likely to act to prevent the undesired reordering.
>
> Is that lack of guarantee an issue? No - because we know we already 
> have a race condition. No matter what we do with this section of code, 
> in terms of barriers, we can't avoid the race.

All the way to this point, I was expecting you to say leave the
barrier in...


> Hence my view that adding explicit barriers together with the sleep 
> serves no practical purpose here. It would be different if a barrier 
> forced the write to become visible to all other threads - but that is 
> not guaranteed by OrderAccess.

Not the conclusion I was expecting based on the last couple of
paragraphs. :-)

My preference is to leave the barrier in for these reasons:

- communicates intent to any future maintainer; we could also say that
   this might communicate my silliness at continuing with a lock-free
   algorithm when everyone knows that I hate them.
- does not make assumptions about os::naked_short_sleep(1); it might
   be turned into an inline function someday that the compiler can
   digest and optimize around...
- does no harm; I think that's a paraphrase of what you said in the
   round 0 code review :-)



> Now let's return to the reader side of this; somewhere we have code 
> that is effectively doing something like:
>
> if (_has_PerfData == 1)
>    cur_data = *data->ref;
>
> This is where we see the race: the value of the flag can change the 
> instant after we read it and so we can access data->ref after it is 
> NULL. Going back to the sleep, the purpose of the sleep is to give 
> threads that have seen _has_PerfData==1 a chance to get past the data 
> access before the PerfData object is deleted.

In the monitor subsystem, the check is a little different:

   if (data != NULL && _has_PerfData == 1) {
     cur_data = *data->ref;
   }

so the read of data->ref can't float above the if-statement. It looks
like most (if not all) of the PerfData usage relies on NULL checks of
the PerfData object as the trigger to determine whether to call into
the PerfData object.


> In terms of ordering constraints we don't want the read of data->ref 
> to be moved/float ahead of the read of _has_PerfData, so technically 
> we need a loadload barrier to prevent that:
>
> if (_has_PerfData == 1) {
>   membar loadload;
>   cur_data = *data->ref;
> }
>
> From a hardware perspective I don't think a speculative load can cause 
> a problem - if we read data->ref early and see NULL then we "must" see 
> _has_PerfData==0 (due to ordering at the writer) and so the code is 
> skipped. And I think similarly for compiler reorderings (though my 
> head is hurting now too :)). But in either case it doesn't really 
> matter if this reasoning is wrong because even if the code is kept in 
> the required order the race still exists and we can't tell exactly how 
> things came about. So adding the loadload would achieve little, if 
> anything.

Agreed that a loadload would achieve little, if anything here.


>
> Now feel free to ignore all that, go get a coffee and move on to my 
> reply to your next webrev. :)

Will do. Needed a coffee for this one also...

Dan


>
> Cheers,
> David
>
>> My current plan:
>>
>> - switch from "OrderAccess::release_store(&_has_PerfData, 0);"
>>    to "_has_PerfData = 0;"
>> - keep "OrderAccess::fence();"
>> - retest and send out for another round of review
>>
>> Dan
>>
>>
>>
>>>
>>> Cheers,
>>> David
>>> -----
>>>
>>>> which is modeled after _owner field transitions from non-zero
>>>> -> NULL in ObjectMonitor.cpp
>>>>
>>>>
>>>>>
>>>>> Also, I think the comment above that release_store() could be
>>>>> clarified.  It is fine as is if you're familiar with this bug report
>>>>> and discussion, but...  I think it should explicitly say there is
>>>>> still a very small window for the lack of true synchronization to
>>>>> cause a failure.  And perhaps that the release_store() (or
>>>>> store/release())  is not half of an acquire/release pair.
>>>>
>>>> Here's the existing comment:
>>>>
>>>>     286    // Clear the flag before we free the PerfData counters. Thus begins
>>>>     287    // the race between this thread and another thread that has just
>>>>     288    // queried PerfDataManager::has_PerfData() and gotten back 'true'.
>>>>     289    // The hope is that the other thread will finish its PerfData
>>>>     290    // manipulation before we free the memory. The two alternatives
>>>>     291    // are 1) leak the PerfData memory or 2) do some form of ordered
>>>>     292    // access before every PerfData operation.
>>>>
>>>> I think it pretty clearly states that there is still a race here.
>>>> And I think that option 2 covers that we're not doing completely
>>>> safe ordered access. I'm not sure how to make this comment more
>>>> clear, but if you have specific suggestions...
>>>>
>>>> Dan
>>>>
>>>>
>>>>>
>>>>> Tom
>>>>
>>
