[concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

Martin Buchholz martinrb at google.com
Tue Dec 9 18:15:03 UTC 2014


On Mon, Dec 8, 2014 at 8:35 PM, David Holmes <david.holmes at oracle.com> wrote:

>> So (as you say) with TSO you don't have a total order of stores if you
>> read your own writes out of your own CPU's write buffer.  However, my
>> interpretation of "multiple-copy atomic" is that the initial
>> publishing thread can choose to use an instruction with a sufficiently
>> strong memory barrier attached (e.g. LOCK;XXX on x86) to write to
>> memory so that the write buffer is flushed, and then use plain relaxed
>> loads everywhere else to read those memory locations.  This explains
>> the situation on x86 and sparc, where volatile writes are expensive,
>> volatile reads are "free", and you get sequential consistency for Java
>> volatiles.
>
>
> We don't use lock'd instructions for volatile stores on x86, but the
> trailing mfence achieves the "flushing".
>
> However this still raised some questions for me. Using an mfence on x86, or
> its equivalent on sparc, is no different from issuing a "DMB SYNC" on ARM, or a
> SYNC on PowerPC. They each ensure TSO for volatile stores with global
> visibility. So when such fences are used the resulting system should be
> multiple-copy atomic - no? (No!**) And there seems to be an equivalence
> between being multiple-copy atomic and providing the IRIW property. Yet we
> know that on ARM/Power, as per the paper, TSO with global visibility is not

ARM/Power don't have TSO.

> sufficient to achieve IRIW. So what is it that x86 and sparc have in
> addition to TSO that provides for IRIW?

We have both been learning... to think in new ways.
I found the second section of Peter Sewell's tutorial,
"2 From Sequential Consistency to Relaxed Memory Models",
to be most useful, especially the diagrams.
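
For concreteness, here's the IRIW (Independent Reads of Independent
Writes) litmus test we keep referring to, as a minimal Java sketch
(the class and field names are mine; in practice you would hunt for
the forbidden outcome with a harness such as jcstress rather than a
raw main):

class IRIW {
    static volatile int x, y;
    static int r1, r2, r3, r4;

    public static void main(String[] args) throws InterruptedException {
        Thread w1 = new Thread(() -> x = 1);                // writer 1
        Thread w2 = new Thread(() -> y = 1);                // writer 2
        Thread ra = new Thread(() -> { r1 = x; r2 = y; });  // reader A
        Thread rb = new Thread(() -> { r3 = y; r4 = x; });  // reader B
        for (Thread t : new Thread[] { w1, w2, ra, rb }) t.start();
        for (Thread t : new Thread[] { w1, w2, ra, rb }) t.join();
        // Forbidden for volatiles: the two readers disagreeing on the
        // order of the two independent writes.
        if (r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0)
            System.out.println("IRIW violation!");
    }
}

Because volatiles must be sequentially consistent, both readers must
agree on the order of the two independent writes - exactly the
property that is cheap on x86/sparc and costly on ARM/Power.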

> I pondered this for quite a while before realizing that the mfence on x86
> (or equivalent on sparc) is not in fact playing the same role as the
> DMB/SYNC on ARM/PPC. The key property that x86 and sparc have (and we can
> ignore the store buffers) is that stores become globally visible - if any
> other thread sees a store then all other threads see the same store. Whereas
> on ARM/PPC you can imagine a store casually making its way through the
> system, gradually becoming visible to more and more threads - unless there
> is a DMB/SYNC to force a globally consistent memory view. Hence for IRIW
> placing the DMB/SYNC after the store does not suffice because prior to the
> DMB/SYNC the store may be visible to an arbitrary subset of threads.
> Consequently IRIW requires the DMB/SYNC between the loads - to ensure that
> each thread, on its second load, sees the value that the other thread saw
> on its first load (ref Section 6.1 of the paper).
>
> ** So using DMB/SYNC does not achieve multiple-copy atomicity, because until
> the DMB/SYNC happens different threads can have different views of memory.

To me, the most desirable property of x86-style TSO is that barriers
are necessary only on stores to achieve sequential consistency - the
publisher gets to decide.  Volatile reads can then be close to free.
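
As a hedged sketch of that cost model (the class is mine, and the
comments describe the usual x86 lowering discussed above, not anything
the JMM itself mandates):

class Publication {
    static int data;                 // plain field
    static volatile boolean ready;   // volatile guard

    static void publisher() {        // runs on thread 1
        data = 42;                   // plain store
        ready = true;                // volatile store: plain store plus a
    }                                // trailing mfence - the publisher pays

    static void consumer() {         // runs on thread 2
        if (ready)                   // volatile load: an ordinary load on x86,
            assert data == 42;       // close to free; happens-before does the rest
    }
}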

> All of which reinforces to me that IRIW is an undesirable property to have
> to implement. YMMV. (And I also need to re-examine the PPC64 implementation
> to see exactly where they add/remove barriers when IRIW is enabled.)

I believe you get a full sync before each volatile read:

#define GET_FIELD_VOLATILE(obj, offset, type_name, v) \
  oop p = JNIHandles::resolve(obj); \
  /* On CPUs that are not multiple-copy atomic (e.g. PPC64), emit a */ \
  /* full two-way fence (sync on PPC) ahead of the volatile load    */ \
  /* whenever IRIW support is enabled. */ \
  if (support_IRIW_for_not_multiple_copy_atomic_cpu) { \
    OrderAccess::fence(); \
  } \
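
(So, as I read it, on a platform like PPC64 where
support_IRIW_for_not_multiple_copy_atomic_cpu is set, the cost model
inverts relative to TSO: every volatile read pays a full sync, instead
of the volatile write paying as on x86/sparc.)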


> Cheers,
> David
>
>> http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf


