[concurrency-interest] RFR: 8065804: JEP171:Clarifications/corrections for fence intrinsics

Wed Nov 26 00:36:12 UTC 2014

Stephan Diestelhorst writes:
>
> David Holmes wrote:
> > Stephan Diestelhorst writes:
> > > On Tuesday, 25 November 2014, 11:15:36, Hans Boehm wrote:
> > > > I'm no hardware architect, but fundamentally it seems to me that
> > > >
> > > > load x
> > > > acquire_fence
> > > >
> > > > imposes a much more stringent constraint than
> > > >
> > > > load_acquire x
> > > >
> > > > Consider the case in which the load from x is an L1 hit, but a
> > > > preceding load (from say y) is a long-latency miss.  If we enforce
> > > > ordering by just waiting for completion of prior operations, the
> > > > former has to wait for the load from y to complete; while the
> > > > latter doesn't.  I find it hard to believe that this doesn't leave
> > > > an appreciable amount of performance on the table, at least for
> > > > some interesting microarchitectures.
> > >
> > > I agree, Hans, that this is a reasonable assumption.  Load_acquire x
> > > does allow roach motel, whereas the acquire fence does not.
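To make the two forms concrete, here is a minimal Java sketch. It assumes the java.lang.invoke.VarHandle API of later JDKs (Java 9 and up) rather than the Unsafe fence intrinsics this RFR is about, and the class and field names (AcquireForms, x, y) are purely illustrative. Both methods return the same value; the difference lies only in which reorderings each form is expected to permit.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hypothetical example class; x and y stand in for the locations in the thread.
public class AcquireForms {
    int x = 42;   // the location loaded with acquire semantics
    int y = 7;    // an unrelated location loaded earlier (the possible long-latency miss)

    static final VarHandle X;
    static {
        try {
            X = MethodHandles.lookup().findVarHandle(AcquireForms.class, "x", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // "load x; acquire_fence": the fence orders ALL earlier loads (y as well
    // as x) before any later load or store.
    int loadThenFence() {
        int a = y;                 // earlier load, also constrained by the fence
        int b = x;                 // plain load of x
        VarHandle.acquireFence();
        return a + b;
    }

    // "load_acquire x": only the load of x is ordered before later accesses;
    // the earlier load of y is not constrained by it.
    int loadAcquire() {
        int a = y;
        int b = (int) X.getAcquire(this);
        return a + b;
    }
}
```

In the L1-hit/long-miss scenario above, only loadThenFence() ties later accesses to the completion of the load from y.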
> > >
> > > >  In addition, for better or worse, fencing requirements on at least
> > > >  Power are actually driven as much by store atomicity issues, as by
> > > >  the ordering issues discussed in the cookbook.  This was not
> > > >  understood in 2005, and unfortunately doesn't seem to be
> > > >  amenable to the kind of straightforward explanation as in
> > > >  Doug's cookbook.
> > >
> > > Coming from a strongly ordered architecture to a weakly ordered one
> > > myself, I also needed some mental adjustment about store (multi-copy)
> > > atomicity.  I can imagine others will be unaware of this difference,
> > > too, even in 2014.
> >
> > Sorry, I'm missing the connection between fences and multi-copy
> > atomicity.
>
> One example is the classic IRIW (independent reads of independent
> writes).  With non-multi-copy-atomic stores, but ordered (say through
> a dependency) loads in the following example:
>
> Memory: foo = bar = 0
> _T1_         _T2_         _T3_                              _T4_
> st (foo),1   st (bar),1   ld r1, (bar)                      ld r3,(foo)
>                           <addr dep / local "fence" here>   <addr dep>
>                           ld r2, (foo)                      ld r4, (bar)
>
> You may observe r1 = 1, r2 = 0, r3 = 1, r4 = 0 on non-multi-copy-atomic
> machines.  On TSO boxes, this is not possible.  That means that the
> memory fence that will prevent such a behaviour (DMB on ARM) needs to
> carry some additional oomph in ensuring multi-copy atomicity, or rather
> in preventing you from observing its absence (which is the same thing).
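The IRIW shape can be written down as a small Java harness. This is only a sketch under stated assumptions: it uses the later VarHandle API (Java 9 and up) instead of the Unsafe intrinsics under review, the class and method names (Iriw, runOnce, countDisagreements) are hypothetical, and VarHandle.fullFence() stands in for the "fence with oomph" between the two loads of each reader. It counts how often the two readers disagree about the order of the two stores.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hypothetical IRIW harness; names are illustrative only.
public class Iriw {
    int foo, bar;          // Memory: foo = bar = 0
    int r1, r2, r3, r4;    // values observed by the readers, inspected after join()

    static final VarHandle FOO, BAR;
    static {
        try {
            MethodHandles.Lookup l = MethodHandles.lookup();
            FOO = l.findVarHandle(Iriw.class, "foo", int.class);
            BAR = l.findVarHandle(Iriw.class, "bar", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Runs one trial; returns true if r1 = 1, r2 = 0, r3 = 1, r4 = 0 was seen.
    boolean runOnce() {
        Thread t1 = new Thread(() -> FOO.setOpaque(this, 1));   // T1: st (foo),1
        Thread t2 = new Thread(() -> BAR.setOpaque(this, 1));   // T2: st (bar),1
        Thread t3 = new Thread(() -> {                          // T3
            r1 = (int) BAR.getOpaque(this);
            VarHandle.fullFence();   // assumed to map to a DMB-like barrier
            r2 = (int) FOO.getOpaque(this);
        });
        Thread t4 = new Thread(() -> {                          // T4
            r3 = (int) FOO.getOpaque(this);
            VarHandle.fullFence();
            r4 = (int) BAR.getOpaque(this);
        });
        Thread[] threads = {t1, t2, t3, t4};
        for (Thread t : threads) t.start();
        try {
            for (Thread t : threads) t.join();   // join makes r1..r4 visible here
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0;
    }

    // Repeats the trial on fresh state and counts the disagreeing outcomes.
    static int countDisagreements(int trials) {
        int seen = 0;
        for (int i = 0; i < trials; i++) {
            if (new Iriw().runOnce()) seen++;
        }
        return seen;
    }
}
```

Calling Iriw.countDisagreements(100) on a TSO box should report zero with or without the fences, as noted above. The fence only matters on a non-multi-copy-atomic machine, and a harness this coarse (one fresh set of threads per trial) is unlikely to expose the outcome even there, so it illustrates the shape of the test rather than serving as a real litmus tool.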

I take it as given that any code for which you may have ordering
constraints must first have basic atomicity properties for loads and
stores. I would not expect any kind of fence to add multi-copy atomicity
where there was none.

David

> Stephan