RFR (XS): 8188877: Improper synchronization in offer_termination

Andrew Haley aph at redhat.com
Mon Nov 27 14:53:28 UTC 2017


On 27/11/17 12:30, Andrew Dinn wrote:
> On 22/11/17 09:13, Andrew Haley wrote:
>> On 21/11/17 21:53, White, Derek wrote:
>>> My understanding is that the "acquire" semantics are entirely about
>>> memory ordering within a CPU. In particular, it prevents "following
>>> loads" from executing before the "load acquire".
>>>
>>> There is nothing in the "load acquire" that causes it to synchronize
>>> with the memory system more or less quickly than a naked load.
>>
>> The abstract architecture only specifies things in terms of ordering
>> between loads, but it has to be implemented somehow, and this is MESI
>> or something similar.  Stores cause invalidate messages to be sent,
>> and these are put into the reader's invalidate queue.  When that
>> reader executes a load barrier it marks all the entries currently in
>> its invalidate queue.  The next load will wait until all marked
>> entries have been applied to the reader's cache.
> 
> That's what happens when the reader executes a read barrier. The
> interesting question is what happens when the reader does not execute a
> read barrier.

The invalidate messages still arrive at the reader, but they sit in
the invalidate queue and aren't acted upon immediately.  Eventually
they must be processed, either lazily or because the reader's
invalidate queue fills up.

>>> Either kind of read will eventually notice that its local cached
>>> value has been invalidated and will fetch the updated value.
>>
>> Eventually, yes.
> That's a rather ill-defined span of time ;-)
> 
> I understand that you tested this and found that it took no longer than
> a few hundred microseconds. However, I really have to ask what precisely
> the reader was doing during the test?

Nothing except spinning and loading, and that's a few microseconds'
delay rather than a few hundred.
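
For concreteness, here is a minimal sketch of that sort of test,
written with standard C++11 atomics.  This is an illustration only,
not the actual harness: the reader spins on relaxed ("naked") loads
and we time how long the writer's store takes to become visible.

  #include <atomic>
  #include <chrono>
  #include <cstdio>
  #include <thread>

  int main() {
      std::atomic<bool> flag{false};
      std::chrono::steady_clock::time_point write_time;

      std::thread writer([&] {
          std::this_thread::sleep_for(std::chrono::milliseconds(10));
          write_time = std::chrono::steady_clock::now();
          flag.store(true, std::memory_order_release);
      });

      // Spin on a plain (relaxed) load: no read barrier per iteration.
      while (!flag.load(std::memory_order_relaxed))
          ;
      // This acquire fence exists only so write_time can be read
      // safely.  It runs once, after the change has already been
      // observed, so it does not affect how quickly the spin loop
      // notices the store.
      std::atomic_thread_fence(std::memory_order_acquire);
      auto seen = std::chrono::steady_clock::now();
      writer.join();

      auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                    seen - write_time).count();
      std::printf("store became visible after %lld ns\n", (long long)ns);
  }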

> Specifically, does the time taken to 'eventually' notice a write to the
> LDRed location depend upon what other instructions are executed between
> successive LDRs?

It's really hard to be definite about that.  In practice it may well
be that back-to-back local cache accesses saturate the CPU<->cache
interconnect so much that they delay the processing of invalidate
queue entries, but that's my speculation and it's secret sauce anyway.

It is likely, though, that once you issue a load barrier, the next
load must wait for the contents of the invalidate queue to be applied
to the cache, so that everything happens in order.  So I suspect that
load barriers will cause changes to be seen earlier.  Having said
that, load barriers slow down all reads for a while.
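
To make the contrast concrete, the same spin loop with an acquire
load would look like this (again only a sketch; on AArch64 an acquire
load is typically an LDAR instruction, which is the read barrier
under discussion):

  // Acquire flavour of the same spin: every iteration now carries
  // read-barrier semantics, at the cost of slowing nearby loads.
  while (!flag.load(std::memory_order_acquire))
      ;
  // No separate fence is needed afterwards: the acquire load that
  // finally observes 'true' synchronizes with the writer's release
  // store.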

And one final caveat: I'm talking about MESI, but there are more
elaborate and sophisticated ways of making this stuff work.

-- 
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


