[concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

David Holmes david.holmes at oracle.com
Tue Dec 9 04:35:43 UTC 2014


Hi Martin,

On 9/12/2014 5:17 AM, Martin Buchholz wrote:
> On Mon, Dec 8, 2014 at 12:46 AM, David Holmes <davidcholmes at aapt.net.au> wrote:
>> Martin,
>>
>> The paper you cite is about ARM and Power architectures - why do you think the lack of mention of x86/sparc implies those architectures are multiple-copy-atomic?
>
> Reading some more in the same paper, I see:

Yes - mea culpa - I should have re-read the paper (I wouldn't have had 
to read very far).

> """Returning to the two properties above, in TSO a thread can see its
> own writes before they become visible to other
> threads (by reading them from its write buffer), but any write becomes
> visible to all other threads simultaneously: TSO
> is a multiple-copy atomic model, in the terminology of Collier
> [Col92]. One can also see the possibility of reading
> from the local write buffer as allowing a specific kind of local
> reordering. A program that writes one location x then
> reads another location y might execute by adding the write to x to the
> thread’s buffer, then reading y from memory,
> before finally making the write to x visible to other threads by
> flushing it from the buffer. In this case the thread reads
> the value of y that was in the memory before the new write of x hits memory."""

So I learnt two things from this:

1. The ARM architecture manual definition of "multi-copy atomicity" is 
not the same as the paper's definition of "multiple-copy atomicity". The 
distinction is that the paper allows a thread to read from its own store 
buffer, as long as all other threads see the store at the same time. 
That is quite an important difference when it comes to classifying 
systems.

2. I had thought that the store buffer might be shared - if not across 
cores then at least across different hardware threads on the same core. 
But it seems that is not the case either (based on the paper and my own 
reading of SPARC architecture info - stores attain global visibility at 
the L2 cache).

So given that, yes I agree that sparc and x86 are multiple-copy atomic 
as defined by the paper.
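
As an aside, the read-own-write-early execution the quoted passage 
describes is the classic store-buffering shape. A minimal Java sketch 
(made-up class and field names) of the program it has in mind:

    class StoreBuffering {
        int x, y;        // plain (non-volatile) fields
        int r1, r2;

        void thread1() {
            x = 1;       // store may sit in this core's write buffer
            r1 = y;      // load can complete before the store to x is visible
        }

        void thread2() {
            y = 1;
            r2 = x;
        }
    }

Under TSO the outcome r1 == 0 && r2 == 0 is permitted, because each 
read can be satisfied from memory before the other thread's buffered 
store has drained.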

> So (as you say) with TSO you don't have a total order of stores if you
> read your own writes out of your own CPU's write buffer.  However, my
> interpretation of "multiple-copy atomic" is that the initial
> publishing thread can choose to use an instruction with sufficiently
> strong memory barrier attached (e.g. LOCK;XXX on x86) to write to
> memory so that the write buffer is flushed and then use plain relaxed
> loads everywhere else to read those memory locations and this explains
> the situation on x86 and sparc where volatile writes are expensive and
> volatile reads are "free" and you get sequential consistency for Java
> volatiles.

We don't use lock'd instructions for volatile stores on x86, but the 
trailing mfence achieves the "flushing".
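
To make the division of labour concrete, here is a minimal sketch 
(hypothetical class, not the actual JDK code) of the publication 
pattern Martin describes: the cost falls on the volatile store, which 
gets the trailing StoreLoad barrier (the mfence above), while the 
volatile loads need no extra fencing on x86/sparc:

    class Publisher {
        volatile int ready;   // volatile publication flag
        int data;             // plain payload

        void publish(int v) {
            data = v;         // plain store
            ready = 1;        // volatile store; trailing StoreLoad (mfence) on x86
        }

        int consume() {
            if (ready == 1) { // volatile load; no fence emitted on x86/sparc
                return data;  // guaranteed to see v
            }
            return -1;
        }
    }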

However, this still raised some questions for me. Using an mfence on 
x86, or its equivalent on sparc, is no different from issuing a "DMB 
SYNC" on ARM, or a SYNC on PowerPC. They each ensure TSO for volatile 
stores with global visibility. So when such fences are used the 
resulting system should be multiple-copy atomic - no? (No!**) And there 
seems to be an equivalence between being multiple-copy atomic and 
providing the IRIW property. Yet we know that on ARM/Power, as per the 
paper, TSO with global visibility is not sufficient to achieve IRIW. So 
what is it that x86 and sparc have in addition to TSO that provides for 
IRIW?

I pondered this for quite a while before realizing that the mfence on 
x86 (or equivalent on sparc) is not in fact playing the same role as the 
DMB/SYNC on ARM/PPC. The key property that x86 and sparc have (and we 
can ignore the store buffers) is that stores become globally visible - 
if any other thread sees a store then all other threads see the same 
store. Whereas on ARM/PPC you can imagine a store casually making its 
way through the system, gradually becoming visible to more and more 
threads - unless there is a DMB/SYNC to force a globally consistent 
memory view. Hence for IRIW placing the DMB/SYNC after the store does 
not suffice because prior to the DMB/SYNC the store may be visible to an 
arbitrary subset of threads. Consequently IRIW requires the DMB/SYNC 
between the loads - to ensure that each thread, on its second load, 
sees the value that the other thread saw on its first load (ref 
Section 6.1 of the paper).

** So using DMB/SYNC does not achieve multiple-copy atomicity, because 
until the DMB/SYNC happens different threads can have different views of 
memory.
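
For reference, a minimal Java sketch (hypothetical fields) of the IRIW 
shape from Section 6.1: two writers store to independent locations and 
two readers read them in opposite orders. For volatiles the outcome 
r1 == 1, r2 == 0, r3 == 1, r4 == 0 must be forbidden, which is exactly 
why ARM/PPC need the DMB/SYNC between the two loads in each reader 
rather than just after the stores:

    class IRIW {
        volatile int x, y;
        int r1, r2, r3, r4;

        void writer1() { x = 1; }
        void writer2() { y = 1; }

        void reader1() {
            r1 = x;
            // DMB/SYNC needed here on ARM/PPC
            r2 = y;
        }

        void reader2() {
            r3 = y;
            // DMB/SYNC needed here on ARM/PPC
            r4 = x;
        }
    }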

All of which reinforces my view that IRIW is an undesirable property to 
have to implement. YMMV. (And I also need to re-examine the PPC64 
implementation to see exactly where they add/remove barriers when IRIW 
is enabled.)

Cheers,
David

> http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
>


