[concurrency-interest] RFR: 8065804: JEP171:Clarifications/corrections for fence intrinsics

Oleksandr Otenko oleksandr.otenko at oracle.com
Tue Dec 9 20:04:16 UTC 2014


On 26/11/2014 02:04, Hans Boehm wrote:
> To be concrete here, on Power, loads can normally be ordered by an 
> address dependency or light-weight fence (lwsync).  However, neither 
> is enough to prevent the questionable outcome for IRIW, since neither 
> ensures that the stores in T1 and T2 will be made visible to 
> other threads in a consistent order.  That outcome can be prevented by 
> using heavyweight fence (sync) instructions between the loads instead.
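
For reference, here is the IRIW shape Hans describes, rewritten with Java
volatile fields. This is only a minimal sketch of my own: the class and
field names are made up, and t1()..t4() are assumed to run on four
separate threads.

    // IRIW, Java version: both shared locations, and hence all four
    // interesting accesses, are volatile.  r1..r4 just record results.
    class IRIW {
        volatile int foo, bar;
        int r1, r2, r3, r4;

        void t1() { foo = 1; }              // writer 1
        void t2() { bar = 1; }              // writer 2
        void t3() { r1 = bar; r2 = foo; }   // reader 1
        void t4() { r3 = foo; r4 = bar; }   // reader 2
    }
    // Volatile accesses are sequentially consistent under the JMM, so the
    // outcome r1 == 1, r2 == 0, r3 == 1, r4 == 0 must be forbidden.  Per
    // Hans's point above, merely ordering each reader's two loads (address
    // dependency or lwsync) is not enough on Power, because the two readers
    // may still see the stores become visible in different orders; the
    // heavyweight sync between the loads is what rules that out.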

Why would they need fences between loads instead of syncing the order of 
stores?


Alex


> Peter Sewell's group concluded that to enforce correct volatile 
> behavior on Power, you essentially need a heavyweight fence between 
> every pair of volatile operations.  That cannot be understood based 
> on simple ordering constraints.
>
> As Stephan pointed out, there are similar issues on ARM, but they're 
> less commonly encountered in a Java implementation.  If you're lucky, 
> you can get to the right implementation recipe by looking only at 
> reordering, I think.
>
>
> On Tue, Nov 25, 2014 at 4:36 PM, David Holmes 
> <davidcholmes at aapt.net.au> wrote:
>
>     Stephan Diestelhorst writes:
>     >
>     > David Holmes wrote:
>     > > Stephan Diestelhorst writes:
>     > > > On Tuesday, 25 November 2014 at 11:15:36, Hans Boehm wrote:
>     > > > > I'm no hardware architect, but fundamentally it seems to me
>     > > > > that
>     > > > >
>     > > > > load x
>     > > > > acquire_fence
>     > > > >
>     > > > > imposes a much more stringent constraint than
>     > > > >
>     > > > > load_acquire x
>     > > > >
>     > > > > Consider the case in which the load from x is an L1 hit, but a
>     > > > > preceding load (from say y) is a long-latency miss.  If we
>     > > > > enforce ordering by just waiting for completion of prior
>     > > > > operations, the former has to wait for the load from y to
>     > > > > complete, while the latter doesn't.  I find it hard to believe
>     > > > > that this doesn't leave an appreciable amount of performance
>     > > > > on the table, at least for some interesting microarchitectures.
>     > > >
>     > > > I agree, Hans, that this is a reasonable assumption.
>     > > > Load_acquire x does allow roach motel, whereas the acquire
>     > > > fence does not.
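
To make that distinction concrete in Java terms, here is a sketch using the
java.lang.invoke.VarHandle API. That API postdates this thread and is not
the Unsafe intrinsics the JEP adds, but it exposes both shapes directly;
the class and method names below are mine.

    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.VarHandle;

    class AcquireShapes {
        int x;   // the location we care about (an L1 hit in Hans's example)
        int y;   // an unrelated location loaded earlier (a long-latency miss)

        static final VarHandle X;
        static {
            try {
                X = MethodHandles.lookup()
                        .findVarHandle(AcquireShapes.class, "x", int.class);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        // "load x; acquire_fence": the fence orders *all* earlier loads
        // against everything that follows, so later operations also end up
        // waiting for the slow load of y.
        int withFence() {
            int ry = y;
            int rx = x;
            VarHandle.acquireFence();   // orders both prior loads
            return rx + ry;
        }

        // "load_acquire x": only the acquiring load itself is ordered
        // against later accesses; the earlier load of y may still finish
        // late (roach motel: it is allowed to sink below the acquire).
        int withLoadAcquire() {
            int ry = y;
            int rx = (int) X.getAcquire(this);
            return rx + ry;
        }
    }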
>     > > >
>     > > > >  In addition, for better or worse, fencing requirements on
>     > > > >  at least Power are actually driven as much by store
>     > > > >  atomicity issues as by the ordering issues discussed in the
>     > > > >  cookbook.  This was not understood in 2005, and unfortunately
>     > > > >  doesn't seem to be amenable to the kind of straightforward
>     > > > >  explanation as in Doug's cookbook.
>     > > >
>     > > > Coming from a strongly ordered architecture to a weakly ordered
>     > > > one myself, I also needed some mental adjustment about store
>     > > > (multi-copy) atomicity.  I can imagine others will be unaware
>     > > > of this difference, too, even in 2014.
>     > >
>     > > Sorry I'm missing the connection between fences and multi-copy
>     > > atomicity.
>     >
>     > One example is the classic IRIW.  With non-multi-copy atomic
>     > stores, but ordered (say through a dependency) loads in the
>     > following example:
>     >
>     > Memory: foo = bar = 0
>     > _T1_         _T2_         _T3_                              _T4_
>     > st (foo),1   st (bar),1   ld r1,(bar)                       ld r3,(foo)
>     >                           <addr dep / local "fence" here>   <addr dep>
>     >                           ld r2,(foo)                       ld r4,(bar)
>     >
>     > You may observe r1 = 1, r2 = 0, r3 = 1, r4 = 0 on non-multi-copy
>     > atomic machines.  On TSO boxes, this is not possible.  That means
>     > that the memory fence that will prevent such a behaviour (DMB on
>     > ARM) needs to carry some additional oomph in ensuring multi-copy
>     > atomicity, or rather prevent you from seeing it (which is the same
>     > thing).
>
>     I take it as given that any code for which you may have ordering
>     constraints must first have basic atomicity properties for loads and
>     stores.  I would not expect any kind of fence to add multi-copy
>     atomicity where there was none.
>
>     David
>
>     > Stephan
>     >



