[concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

Wed Dec 10 06:01:22 UTC 2014

On 10/12/2014 4:15 AM, Martin Buchholz wrote:
> On Mon, Dec 8, 2014 at 8:35 PM, David Holmes <david.holmes at oracle.com> wrote:
>
>>> So (as you say) with TSO you don't have a total order of stores if you
>>> read your own writes out of your own CPU's write buffer.  However, my
>>> interpretation of "multiple-copy atomic" is that the initial
>>> publishing thread can choose to use an instruction with sufficiently
>>> strong memory barrier attached (e.g. LOCK;XXX on x86) to write to
>>> memory so that the write buffer is flushed and then use plain relaxed
>>> loads everywhere else to read those memory locations and this explains
>>> the situation on x86 and sparc where volatile writes are expensive and
>>> volatile reads are "free" and you get sequential consistency for Java
>>> volatiles.
>>
>>
>> We don't use lock'd instructions for volatile stores on x86, but the
>> trailing mfence achieves the "flushing".
>>
>> However this still raised some questions for me. Using a mfence on x86 or
>> equivalent on sparc, is no different to issuing a "DMB SYNC" on ARM, or a
>> SYNC on PowerPC. They each ensure TSO for volatile stores with global
>> visibility. So when such fences are used the resulting system should be
>> multiple-copy atomic - no? (No!**) And there seems to be an equivalence
>> between being multiple-copy atomic and providing the IRIW property. Yet we
>> know that on ARM/Power, as per the paper, TSO with global visibility is not
>
> ARM/Power don't have TSO.

Yes, we all know that. Please re-read what I wrote.

>> sufficient to achieve IRIW. So what is it that x86 and sparc have in
>> addition to TSO that provides for IRIW?
>
> We have both been learning.... to think in new ways.
>   I found the second section of Peter Sewell's tutorial
> 2 From Sequential Consistency to Relaxed Memory Models
> to be most useful, especially the diagrams.
>
>> I pondered this for quite a while before realizing that the mfence on x86
>> (or equivalent on sparc) is not in fact playing the same role as the
>> DMB/SYNC on ARM/PPC. The key property that x86 and sparc have (and we can
>> ignore the store buffers) is that stores become globally visible - if any
>> other thread sees a store then all other threads see the same store. Whereas
>> on ARM/PPC you can imagine a store casually making its way through the
>> system, gradually becoming visible to more and more threads - unless there
>> is a DMB/SYNC to force a globally consistent memory view. Hence for IRIW
>> placing the DMB/SYNC after the store does not suffice because prior to the
>> DMB/SYNC the store may be visible to an arbitrary subset of threads.
>> Consequently IRIW requires the DMB/SYNC between the loads - to ensure that
>> each thread on their second load, must see the value that the other thread
>> saw on its first load (ref Section 6.1 of the paper).
>>
>> ** So using DMB/SYNC does not achieve multiple-copy atomicity, because until
>> the DMB/SYNC happens different threads can have different views of memory.
>
> To me, the most desirable property of x86-style TSO is that barriers
> are only necessary on stores to achieve sequential consistency - the
> publisher gets to decide.  Volatile reads can then be close to free.

TSO doesn't need store barriers for sequential consistency.

It is somewhat amusing, I think, that the free-ness of volatile reads on
TSO comes from the fact that all writes cause global memory
synchronization. But because we can't turn that off, we can't actually
measure the cost we pay for those synchronizing writes. In contrast, on
non-TSO we have to explicitly cause synchronizing writes and so
potentially require synchronizing reads - and then complain because the
"hidden costs" are no longer hidden :)

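For concreteness, the IRIW shape being discussed can be sketched as a small
litmus test. This is only an illustration, not code from the JDK: it uses C++
seq_cst atomics as a stand-in for Java volatiles, and the names IriwResult,
is_forbidden, run_iriw_once and count_forbidden are made up for this sketch.
Two writers store to independent locations and two readers read them in
opposite orders; the outcome IRIW rules out is the one where the readers
disagree about which store happened first.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical record of what the two readers observed in one trial.
struct IriwResult {
    int r1, r2;  // reader A: loads x, then y
    int r3, r4;  // reader B: loads y, then x
};

// The outcome IRIW forbids: A sees x's store but not y's,
// while B sees y's store but not x's.
inline bool is_forbidden(const IriwResult& r) {
    return r.r1 == 1 && r.r2 == 0 && r.r3 == 1 && r.r4 == 0;
}

// One trial: two independent writers, two readers, all accesses seq_cst.
inline IriwResult run_iriw_once() {
    std::atomic<int> x{0}, y{0};
    IriwResult res{};
    std::thread w1([&] { x.store(1, std::memory_order_seq_cst); });
    std::thread w2([&] { y.store(1, std::memory_order_seq_cst); });
    std::thread ra([&] {
        res.r1 = x.load(std::memory_order_seq_cst);
        res.r2 = y.load(std::memory_order_seq_cst);
    });
    std::thread rb([&] {
        res.r3 = y.load(std::memory_order_seq_cst);
        res.r4 = x.load(std::memory_order_seq_cst);
    });
    w1.join(); w2.join(); ra.join(); rb.join();
    return res;
}

// Count how many trials produced the forbidden outcome.
inline int count_forbidden(int trials) {
    assert(trials >= 0);
    int n = 0;
    for (int i = 0; i < trials; ++i) {
        if (is_forbidden(run_iriw_once())) ++n;
    }
    return n;
}
```

With seq_cst accesses the language guarantees the count is zero; how the
implementation pays for that guarantee on a given CPU is exactly the barrier
placement question in this thread.
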
>> All of which reinforces to me that IRIW is an undesirable property to have
>> to implement. YMMV. (And I also need to re-examine the PPC64 implementation
>> to see exactly where they add/remove barriers when IRIW is enabled.)
>
> I believe you get a full sync between volatile reads.
>
> #define GET_FIELD_VOLATILE(obj, offset, type_name, v) \
>    oop p = JNIHandles::resolve(obj); \
>    if (support_IRIW_for_not_multiple_copy_atomic_cpu) { \
>      OrderAccess::fence(); \
>    } \

Yes, it was more the details of the "remove" part that I was unsure of - I
think they simply remove the trailing fence (i.e. PPC SYNC) from the
volatile writes.
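
To make the two placements concrete, here is a rough sketch using C++
fences. It is not the HotSpot code: the helper names are hypothetical, and
the second scheme encodes only my guess above (trailing fence dropped from
the store, full fence ahead of the load), so treat it as an illustration of
where the barriers sit rather than a statement of what the port does.

```cpp
#include <atomic>
#include <cassert>

// Scheme A (x86/sparc style): the full fence trails the volatile store,
// so the volatile load needs no full fence of its own.
template <typename T>
void store_trailing_fence(std::atomic<T>& field, T v) {
    std::atomic_thread_fence(std::memory_order_release);
    field.store(v, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // plays the mfence role
}

template <typename T>
T load_no_leading_fence(const std::atomic<T>& field) {
    T v = field.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);
    return v;
}

// Scheme B (assumed IRIW-support variant): the full fence moves in front
// of the volatile load, and the store keeps only its release ordering.
template <typename T>
void store_no_trailing_fence(std::atomic<T>& field, T v) {
    std::atomic_thread_fence(std::memory_order_release);
    field.store(v, std::memory_order_relaxed);
}

template <typename T>
T load_leading_fence(const std::atomic<T>& field) {
    std::atomic_thread_fence(std::memory_order_seq_cst);  // plays the PPC sync role
    T v = field.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);
    return v;
}
```

In scheme B each volatile read pays for a full fence, so back-to-back
volatile reads always have one between them, which is the property the IRIW
case needs.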

Thanks,
David

>
>> Cheers,
>> David
>>
>>> http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf