[concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics
David Holmes
david.holmes at oracle.com
Tue Dec 9 04:35:43 UTC 2014
Hi Martin,
On 9/12/2014 5:17 AM, Martin Buchholz wrote:
> On Mon, Dec 8, 2014 at 12:46 AM, David Holmes <davidcholmes at aapt.net.au> wrote:
>> Martin,
>>
>> The paper you cite is about ARM and Power architectures - why do you think the lack of mention of x86/sparc implies those architectures are multiple-copy-atomic?
>
> Reading some more in the same paper, I see:
Yes - mea culpa - I should have re-read the paper (I wouldn't have had
to read very far).
> """Returning to the two properties above, in TSO a thread can see its
> own writes before they become visible to other
> threads (by reading them from its write buffer), but any write becomes
> visible to all other threads simultaneously: TSO
> is a multiple-copy atomic model, in the terminology of Collier
> [Col92]. One can also see the possibility of reading
> from the local write buffer as allowing a specific kind of local
> reordering. A program that writes one location x then
> reads another location y might execute by adding the write to x to the
> thread’s buffer, then reading y from memory,
> before finally making the write to x visible to other threads by
> flushing it from the buffer. In this case the thread reads
> the value of y that was in the memory before the new write of x hits memory."""
So I learnt two things from this:
1. The ARM architecture manual's definition of "multi-copy atomicity" is
not the same as the paper's "multiple-copy atomicity". The distinction is
that the paper still allows a thread to read from its own store buffer,
as long as all other threads see the store at the same time. That is
quite an important difference when it comes to classifying systems.
2. I had thought that the store buffer might be shared - if not across
cores then at least across different hardware threads on the same core.
But it seems that is not the case either (based on the paper and my own
reading of SPARC architecture info - stores attain global visibility at
the L2 cache).
So given that, yes, I agree that SPARC and x86 are multiple-copy atomic
as defined by the paper.
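(As an aside, for anyone who wants the concrete shape of what the store
buffer permits even under TSO: below is a rough Java sketch of the classic
store-buffering litmus test. The class and field names are mine, and
spinning up fresh threads per iteration adds so much synchronization that
you are unlikely to actually observe the relaxed result this way - a
harness like jcstress is the right tool - it is only meant to show the
shape of the test.)

// Store-buffering (SB) litmus test, sketched in Java. Plain (non-volatile)
// fields: under TSO each store may sit in its writer's store buffer while
// the other thread's load reads the old value from memory.
public class StoreBufferingDemo {
    static int x, y, r1, r2;

    public static void main(String[] args) throws Exception {
        int bothZero = 0;
        for (int i = 0; i < 100_000; i++) {
            x = 0; y = 0;
            Thread t1 = new Thread(() -> { x = 1; r1 = y; }); // write x, read y
            Thread t2 = new Thread(() -> { y = 1; r2 = x; }); // write y, read x
            t1.start(); t2.start();
            t1.join();  t2.join();
            if (r1 == 0 && r2 == 0) bothZero++;  // allowed even on x86/SPARC
        }
        System.out.println("r1 == 0 && r2 == 0 seen " + bothZero + " times");
    }
}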
> So (as you say) with TSO you don't have a total order of stores if you
> read your own writes out of your own CPU's write buffer. However, my
> interpretation of "multiple-copy atomic" is that the initial
> publishing thread can choose to use an instruction with sufficiently
> strong memory barrier attached (e.g. LOCK;XXX on x86) to write to
> memory so that the write buffer is flushed and then use plain relaxed
> loads everywhere else to read those memory locations and this explains
> the situation on x86 and sparc where volatile writes are expensive and
> volatile reads are "free" and you get sequential consistency for Java
> volatiles.
We don't use lock'd instructions for volatile stores on x86, but the
trailing mfence achieves the "flushing".
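(In terms of the JEP 171 intrinsics, the cookbook-style mapping for a
volatile store looks roughly like the sketch below. To be clear: the class
and method names are made up, this is not what C2 actually emits, and the
reflective Unsafe plumbing is only there so it compiles - the point is just
to show where the one expensive barrier, the trailing StoreLoad, sits on a
TSO machine.)

import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Hand-rolled "volatile-style" publication using the JEP 171 fences.
// A sketch of the cookbook mapping only - not the actual HotSpot codegen.
public class FencedPublish {
    static final Unsafe U = getUnsafe();

    static int data;   // plain payload field
    static int flag;   // plain publication flag

    static void publish(int v) {
        data = v;
        U.storeFence();  // StoreStore: payload ordered before the flag store
        flag = 1;        // the publishing store
        U.fullFence();   // StoreLoad: on x86/SPARC this is the only barrier
                         // that costs anything - the trailing "flush"
    }

    static int tryConsume() {
        if (flag == 1) {
            U.loadFence();   // LoadLoad/LoadStore: a no-op on TSO, which is
            return data;     // why volatile reads are "free" there
        }
        return -1;           // not yet published
    }

    private static Unsafe getUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }
}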
However, this still raised some questions for me. Using an mfence on x86,
or the equivalent on SPARC, is no different from issuing a "DMB SYNC" on
ARM or a SYNC on PowerPC. They each ensure TSO for volatile stores with
global visibility. So when such fences are used the resulting system
should be multiple-copy atomic - no? (No!**) And there seems to be an
equivalence between being multiple-copy atomic and providing the IRIW
property. Yet we know that on ARM/Power, as per the paper, TSO with
global visibility is not sufficient to achieve IRIW. So what is it that
x86 and SPARC have in addition to TSO that provides for IRIW?
I pondered this for quite a while before realizing that the mfence on
x86 (or its equivalent on SPARC) is not in fact playing the same role as
the DMB/SYNC on ARM/PPC. The key property that x86 and SPARC have (and
we can ignore the store buffers) is that stores become globally visible:
if any other thread sees a store then all other threads see that same
store. Whereas on ARM/PPC you can imagine a store casually making its
way through the system, gradually becoming visible to more and more
threads - unless there is a DMB/SYNC to force a globally consistent
memory view. Hence for IRIW, placing the DMB/SYNC after the store does
not suffice, because prior to the DMB/SYNC the store may be visible to
an arbitrary subset of threads. Consequently IRIW requires the DMB/SYNC
between the loads - to ensure that each reading thread, on its second
load, must see the value that the other thread saw on its first load
(ref Section 6.1 of the paper).
** So using DMB/SYNC does not achieve multiple-copy atomicity, because
until the DMB/SYNC happens different threads can have different views of
memory.
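(To spell the IRIW shape out in Java terms - a rough sketch, the class and
field names are mine; with x and y volatile the JMM forbids the mixed
outcome, and on ARM/PPC that is exactly what forces the barrier in between
the two reads in each reader rather than merely after the writes.)

// IRIW (Independent Reads of Independent Writes) litmus test shape.
// With volatile fields the outcome r1=1, r2=0, r3=1, r4=0 is forbidden:
// the two readers must agree on the order in which the writes happened.
public class IriwDemo {
    static volatile int x, y;
    static int r1, r2, r3, r4;

    static void writer1() { x = 1; }
    static void writer2() { y = 1; }

    // On ARM/PPC the heavy barrier has to sit *between* the two reads in
    // each reader; a barrier after each write is not enough, because before
    // it the write may be visible to only a subset of the threads.
    static void reader1() { r1 = x; r2 = y; }
    static void reader2() { r3 = y; r4 = x; }

    public static void main(String[] args) throws Exception {
        Thread[] ts = {
            new Thread(IriwDemo::writer1), new Thread(IriwDemo::writer2),
            new Thread(IriwDemo::reader1), new Thread(IriwDemo::reader2)
        };
        for (Thread t : ts) t.start();
        for (Thread t : ts) t.join();
        System.out.printf("r1=%d r2=%d r3=%d r4=%d%n", r1, r2, r3, r4);
        // The non-sequentially-consistent result would be r1=1 r2=0 r3=1 r4=0.
    }
}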
All of which reinforces to me that IRIW is an undesirable property to
have to implement. YMMV. (And I also need to re-examine the PPC64
implementation to see exactly where they add/remove barriers when IRIW
is enabled.)
Cheers,
David
> http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
>