[concurrency-interest] RFR: 8065804: JEP171:Clarifications/corrections for fence intrinsics

David Holmes davidcholmes at aapt.net.au
Tue Dec 9 21:37:55 UTC 2014


See my earlier response to Martin. The reader has to force a consistent view
of memory - the writer can't, as the write escapes before it can issue the
barrier.

David
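
A minimal sketch of the reader-side placement David describes above, in terms
of the VarHandle fence methods that later exposed these intrinsics publicly.
The class and field names are invented here, and the writers deliberately use
plain stores; whether this alone suffices at the Java level is exactly what
the thread is debating - the point is only where the fence has to sit:

    import java.lang.invoke.VarHandle;

    class IriwReaderFence {
        static int foo, bar;                  // plain (non-volatile) fields

        static void writer1() { foo = 1; }    // T1: the write escapes here
        static void writer2() { bar = 1; }    // T2

        static void reader1() {               // T3
            int r1 = bar;
            VarHandle.fullFence();            // the reader issues the barrier
            int r2 = foo;
        }

        static void reader2() {               // T4
            int r3 = foo;
            VarHandle.fullFence();
            int r4 = bar;
        }
    }
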
  -----Original Message-----
  From: concurrency-interest-bounces at cs.oswego.edu [mailto:concurrency-interest-bounces at cs.oswego.edu] On Behalf Of Oleksandr Otenko
  Sent: Wednesday, 10 December 2014 6:04 AM
  To: Hans Boehm; dholmes at ieee.org
  Cc: core-libs-dev; concurrency-interest at cs.oswego.edu
  Subject: Re: [concurrency-interest] RFR: 8065804: JEP171:Clarifications/corrections for fence intrinsics


  On 26/11/2014 02:04, Hans Boehm wrote:

    To be concrete here, on Power, loads can normally be ordered by an
    address dependency or a light-weight fence (lwsync).  However, neither is
    enough to prevent the questionable outcome for IRIW, since neither ensures
    that the stores in T1 and T2 will be made visible to other threads in a
    consistent order.  That outcome can be prevented by using heavyweight
    fence (sync) instructions between the loads instead.
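
In the VarHandle terms that arrived after this thread, the contrast can be
sketched roughly as follows: acquire-ordered loads keep each reader's two
loads in order (much like the address dependency or lwsync above) but are not
specified to rule out the IRIW outcome, whereas volatile-mode loads do rule
it out, and on Power those are the ones that end up paying for a sync.  The
class and field names below are illustrative only:

    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.VarHandle;

    class IriwLoadModes {
        volatile int foo, bar;      // written (as volatile stores) by T1 and T2

        static final VarHandle FOO, BAR;
        static {
            try {
                MethodHandles.Lookup l = MethodHandles.lookup();
                FOO = l.findVarHandle(IriwLoadModes.class, "foo", int.class);
                BAR = l.findVarHandle(IriwLoadModes.class, "bar", int.class);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        // Ordered but acquire-only: seeing r1 == 1, r2 == 0 here while the
        // other reader sees r3 == 1, r4 == 0 is not excluded.
        void readerAcquire() {
            int r1 = (int) BAR.getAcquire(this);
            int r2 = (int) FOO.getAcquire(this);
        }

        // Volatile mode: that combined outcome is excluded; on Power this is
        // where the heavyweight sync shows up.
        void readerVolatile() {
            int r1 = (int) BAR.getVolatile(this);
            int r2 = (int) FOO.getVolatile(this);
        }
    }
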


  Why would they need fences between loads instead of syncing the order of
  stores?


  Alex



    Peter Sewell's group concluded that to enforce correct volatile behavior
    on Power, you essentially need a heavyweight fence between every pair of
    volatile operations.  That cannot be understood based on simple ordering
    constraints.
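
Read literally, the claim is about pairs of volatile accesses rather than
individual ones.  One kind of pair it bites on is a volatile store followed
by a volatile load, as in Dekker-style code; a rough sketch, with invented
field names:

    class DekkerPair {
        static volatile int flagA, flagB;

        static boolean tryEnterA() {
            flagA = 1;             // volatile store
            // Per the observation above, on Power a heavyweight sync is needed
            // between this pair of volatile operations; a lighter-weight lwsync
            // is not enough for this store-load pair.
            return flagB == 0;     // volatile load
        }
    }
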


    As Stephan pointed out, there are similar issues on ARM, but they're
    less commonly encountered in a Java implementation.  If you're lucky, you
    can get to the right implementation recipe by looking only at reordering,
    I think.




    On Tue, Nov 25, 2014 at 4:36 PM, David Holmes <davidcholmes at aapt.net.au> wrote:

      Stephan Diestelhorst writes:
      >
      > David Holmes wrote:
      > > Stephan Diestelhorst writes:
      > > > On Tuesday, 25 November 2014, at 11:15:36, Hans Boehm wrote:
      > > > > I'm no hardware architect, but fundamentally it seems to me that
      > > > >
      > > > > load x
      > > > > acquire_fence
      > > > >
      > > > > imposes a much more stringent constraint than
      > > > >
      > > > > load_acquire x
      > > > >
      > > > > Consider the case in which the load from x is an L1 hit, but a
      > > > > preceding load (from say y) is a long-latency miss.  If we enforce
      > > > > ordering by just waiting for completion of prior operations, the
      > > > > former has to wait for the load from y to complete, while the
      > > > > latter doesn't.  I find it hard to believe that this doesn't leave
      > > > > an appreciable amount of performance on the table, at least for
      > > > > some interesting microarchitectures.
      > > >
      > > > I agree, Hans, that this is a reasonable assumption.  Load_acquire x
      > > > does allow roach motel, whereas the acquire fence does not.
      > > >
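
In JEP 171 / VarHandle terms, the two shapes being contrasted above look
roughly like this (field names invented; acquireFence stands in for the
loadFence intrinsic):

    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.VarHandle;

    class AcquireShapes {
        int x, y;

        static final VarHandle X;
        static {
            try {
                X = MethodHandles.lookup()
                        .findVarHandle(AcquireShapes.class, "x", int.class);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        // "load x; acquire_fence": the fence orders *all* earlier loads,
        // including a possibly long-latency load of y, before everything
        // that follows it.
        int fenced() {
            int vy = y;
            int vx = x;
            VarHandle.acquireFence();
            return vx + vy;
        }

        // "load_acquire x": only this one load is constrained; the earlier
        // load of y may still be outstanding, and earlier accesses may sink
        // below the acquiring load (roach motel).
        int acquiring() {
            int vy = y;
            int vx = (int) X.getAcquire(this);
            return vx + vy;
        }
    }
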
      > > > > In addition, for better or worse, fencing requirements on at least
      > > > > Power are actually driven as much by store atomicity issues as by
      > > > > the ordering issues discussed in the cookbook.  This was not
      > > > > understood in 2005, and unfortunately doesn't seem to be amenable
      > > > > to the kind of straightforward explanation found in Doug's cookbook.
      > > >
      > > > Coming from a strongly ordered architecture to a weakly ordered one
      > > > myself, I also needed some mental adjustment about store (multi-copy)
      > > > atomicity.  I can imagine others will be unaware of this difference,
      > > > too, even in 2014.
      > >
      > > Sorry, I'm missing the connection between fences and multi-copy
      > > atomicity.
      >
      > One example is the classic IRIW.  With non-multi-copy-atomic stores,
      > but ordered (say through a dependency) loads, in the following example:
      >
      > Memory: foo = bar = 0
      > _T1_          _T2_          _T3_                              _T4_
      > st (foo),1    st (bar),1    ld r1,(bar)                       ld r3,(foo)
      >                             <addr dep / local "fence" here>   <addr dep>
      >                             ld r2,(foo)                       ld r4,(bar)
      >
      > You may observe r1 = 1, r2 = 0, r3 = 1, r4 = 0 on non-multi-copy-atomic
      > machines.  On TSO boxes, this is not possible.  That means that the
      > memory fence that will prevent such a behaviour (DMB on ARM) needs to
      > carry some additional oomph in ensuring multi-copy atomicity, or rather
      > prevent you from seeing it (which is the same thing).
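
The Java-level shape of this litmus test, with volatile fields (which do
exclude the outcome below) and names invented here, is roughly:

    class IriwLitmus {
        static volatile int foo, bar;
        static int r1, r2, r3, r4;

        public static void main(String[] args) throws InterruptedException {
            Thread t1 = new Thread(() -> foo = 1);                  // T1
            Thread t2 = new Thread(() -> bar = 1);                  // T2
            Thread t3 = new Thread(() -> { r1 = bar; r2 = foo; });  // T3
            Thread t4 = new Thread(() -> { r3 = foo; r4 = bar; });  // T4
            t1.start(); t2.start(); t3.start(); t4.start();
            t1.join();  t2.join();  t3.join();  t4.join();
            // With volatile accesses, r1 == 1, r2 == 0, r3 == 1, r4 == 0 is
            // forbidden; with plain (or acquire-ordered) loads it would be
            // allowed on a non-multi-copy-atomic machine like the one above.
            System.out.println(r1 + " " + r2 + " " + r3 + " " + r4);
        }
    }
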


      I take it as given that any code for which you may have ordering
      constraints must first have basic atomicity properties for loads and
      stores.  I would not expect any kind of fence to add multi-copy
      atomicity where there was none.

      David


      > Stephan





