[concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics

David Holmes davidcholmes at aapt.net.au
Tue Dec 9 23:08:32 UTC 2014


Yes you are right - forcing global visibility does ensure ordering.

David
  -----Original Message-----
  From: Oleksandr Otenko [mailto:oleksandr.otenko at oracle.com]
  Sent: Wednesday, 10 December 2014 8:59 AM
  To: dholmes at ieee.org; Hans Boehm
  Cc: core-libs-dev; concurrency-interest at cs.oswego.edu
  Subject: Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics


  I see it differently. The issue is ordering - the inability of non-TSO
  platforms to enforce a total order of independent stores. The first loads
  are also independent, and their ordering can neither be enforced nor
  detected. But the following load can detect the lack of a total order over
  the stores and loads, so ordering is enforced there through a heavyweight
  barrier.

  But I now understand why the other barriers won't work. Thank you.

  Alex



  On 09/12/2014 21:59, David Holmes wrote:

    In this case the issue is not ordering per se (which is what
    dependencies help with) but global visibility. After performing the
    first read, each thread must ensure that its second read will return
    what the other thread saw for the first read - hence a full dmb/sync
    between the reads, or, generalizing, a full dmb/sync after every
    volatile read.
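
    To make that concrete, here is a rough reader-side sketch using the
    JEP 171 Unsafe fences (untested; the class, the fields and the
    reflective plumbing are illustrative only - the fields are left plain
    so the fence placement is explicit rather than implied by volatile):

        import java.lang.reflect.Field;
        import sun.misc.Unsafe;

        // Illustrative IRIW reader. The explicit fullFence() marks where
        // the JIT must emit the full dmb/sync described above; with real
        // volatile fields a conforming VM inserts the equivalent barrier
        // itself.
        class IriwReader {
            static int foo, bar;   // written by two other threads

            private static final Unsafe U = unsafe();

            static int[] readBarThenFoo() {
                int r1 = bar;      // first read
                U.fullFence();     // full barrier: sync on Power, dmb on ARM
                int r2 = foo;      // second read now agrees with what the
                                   // other reader saw for its first read
                return new int[] { r1, r2 };
            }

            private static Unsafe unsafe() {
                try {
                    Field f = Unsafe.class.getDeclaredField("theUnsafe");
                    f.setAccessible(true);
                    return (Unsafe) f.get(null);
                } catch (ReflectiveOperationException e) {
                    throw new AssertionError(e);
                }
            }
        }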

    David
      -----Original Message-----
      From: Oleksandr Otenko [mailto:oleksandr.otenko at oracle.com]
      Sent: Wednesday, 10 December 2014 7:54 AM
      To: dholmes at ieee.org; Hans Boehm
      Cc: core-libs-dev; concurrency-interest at cs.oswego.edu
      Subject: Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics


      Yes, I do understand the reader needs barriers, too. I guess I was
      wondering more why the reader would need something stronger than what
      dependencies etc. could enforce. I guess I'll read what Martin
      forwarded first.

      Alex



      On 09/12/2014 21:37, David Holmes wrote:

        See my earlier response to Martin. The reader has to force a
        consistent view of memory - the writer can't, as the write escapes
        before it can issue the barrier.

        David
          -----Original Message-----
          From: concurrency-interest-bounces at cs.oswego.edu
          [mailto:concurrency-interest-bounces at cs.oswego.edu] On Behalf
          Of Oleksandr Otenko
          Sent: Wednesday, 10 December 2014 6:04 AM
          To: Hans Boehm; dholmes at ieee.org
          Cc: core-libs-dev; concurrency-interest at cs.oswego.edu
          Subject: Re: [concurrency-interest] RFR: 8065804: JEP 171: Clarifications/corrections for fence intrinsics


          On 26/11/2014 02:04, Hans Boehm wrote:

            To be concrete here, on Power, loads can normally be ordered by
            an address dependency or a lightweight fence (lwsync). However,
            neither is enough to prevent the questionable outcome for IRIW,
            since neither ensures that the stores in T1 and T2 will be made
            visible to other threads in a consistent order. That outcome can
            be prevented by using heavyweight fence (sync) instructions
            between the loads instead.


          Why would they need fences between loads instead of syncing the
          order of stores?


          Alex



            Peter Sewell's group concluded that to enforce correct volatile
            behavior on Power, you essentially need a heavyweight fence
            between every pair of volatile operations. That cannot be
            understood based on simple ordering constraints.


            As Stephan pointed out, there are similar issues on ARM, but
            they're less commonly encountered in a Java implementation. If
            you're lucky, you can get to the right implementation recipe by
            looking only at reordering, I think.




            On Tue, Nov 25, 2014 at 4:36 PM, David Holmes
<davidcholmes at aapt.net.au> wrote:

              Stephan Diestelhorst writes:
              >
              > David Holmes wrote:
              > > Stephan Diestelhorst writes:
              > > > On Tuesday, 25 November 2014 at 11:15:36, Hans Boehm wrote:
              > > > > I'm no hardware architect, but fundamentally it seems
              > > > > to me that
              > > > >
              > > > > load x
              > > > > acquire_fence
              > > > >
              > > > > imposes a much more stringent constraint than
              > > > >
              > > > > load_acquire x
              > > > >
              > > > > Consider the case in which the load from x is an L1
              > > > > hit, but a preceding load (from, say, y) is a
              > > > > long-latency miss. If we enforce ordering by just
              > > > > waiting for completion of prior operations, the former
              > > > > has to wait for the load from y to complete, while the
              > > > > latter doesn't. I find it hard to believe that this
              > > > > doesn't leave an appreciable amount of performance on
              > > > > the table, at least for some interesting
              > > > > microarchitectures.
              > > >
              > > > I agree, Hans, that this is a reasonable assumption.
              > > > load_acquire x does allow roach motel, whereas the
              > > > acquire fence does not.
              > > >
              > > > > In addition, for better or worse, fencing requirements
              > > > > on at least Power are actually driven as much by store
              > > > > atomicity issues as by the ordering issues discussed
              > > > > in the cookbook. This was not understood in 2005, and
              > > > > unfortunately doesn't seem to be amenable to the kind
              > > > > of straightforward explanation given in Doug's
              > > > > cookbook.
              > > >
              > > > Coming from a strongly ordered architecture to a weakly
              > > > ordered one myself, I also needed some mental adjustment
              > > > about store (multi-copy) atomicity. I can imagine others
              > > > will be unaware of this difference, too, even in 2014.
              > >
              > > Sorry, I'm missing the connection between fences and
              > > multi-copy atomicity.
              >
              > One example is the classic IRIW, with non-multi-copy-atomic
              > stores but ordered (say, through a dependency) loads, as in
              > the following example:
              >
              > Memory: foo = bar = 0
              > _T1_         _T2_         _T3_                              _T4_
              > st (foo),1   st (bar),1   ld r1,(bar)                       ld r3,(foo)
              >                           <addr dep / local "fence" here>   <addr dep>
              >                           ld r2,(foo)                       ld r4,(bar)
              >
              > You may observe r1 = 1, r2 = 0, r3 = 1, r4 = 0 on
              > non-multi-copy-atomic machines. On TSO boxes, this is not
              > possible. That means that the memory fence that prevents
              > such behaviour (DMB on ARM) needs to carry some additional
              > oomph to ensure multi-copy atomicity, or rather to prevent
              > you from observing the violation (which is the same thing).
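
              In Java terms the same litmus test looks roughly like the
              sketch below (names illustrative; with volatile fields the
              JMM forbids r1 == 1, r2 == 0, r3 == 1, r4 == 0, which is
              exactly why the readers need the heavyweight barrier on
              Power/ARM):

                  // IRIW litmus sketch. A single run proves nothing; a
                  // harness such as jcstress is needed to hunt for the
                  // forbidden outcome.
                  class Iriw {
                      static volatile int foo = 0;
                      static volatile int bar = 0;
                      static int r1, r2, r3, r4;

                      static void t1() { foo = 1; }
                      static void t2() { bar = 1; }
                      static void t3() { r1 = bar; r2 = foo; }
                      static void t4() { r3 = foo; r4 = bar; }

                      public static void main(String[] args)
                              throws InterruptedException {
                          Thread[] ts = { new Thread(Iriw::t1),
                                          new Thread(Iriw::t2),
                                          new Thread(Iriw::t3),
                                          new Thread(Iriw::t4) };
                          for (Thread t : ts) t.start();
                          for (Thread t : ts) t.join();
                          System.out.printf("r1=%d r2=%d r3=%d r4=%d%n",
                                            r1, r2, r3, r4);
                      }
                  }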


              I take it as given that any code for which you may have
              ordering constraints must first have basic atomicity
              properties for loads and stores. I would not expect any kind
              of fence to add multi-copy atomicity where there was none.

              David


              > Stephan
              >

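
Coming back to Hans's load x + acquire_fence versus load_acquire x
comparison quoted above, in Java terms the two shapes are roughly as
follows (a sketch, not tested; Unsafe.loadFence is the JEP 171 intrinsic,
while getAcquire only exists in the later JDK 9 VarHandle API, so the
second shape is an anachronism used purely for illustration):

    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.VarHandle;
    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    class AcquireShapes {
        static int x;   // the variable we want acquire semantics on
        static int y;   // unrelated; imagine a long-latency cache miss
        static int z;   // a later, independent load

        private static final Unsafe U = unsafe();
        private static final VarHandle X;
        static {
            try {
                X = MethodHandles.lookup().findStaticVarHandle(
                        AcquireShapes.class, "x", int.class);
            } catch (ReflectiveOperationException e) {
                throw new AssertionError(e);
            }
        }

        // Shape 1: plain loads plus an acquire fence. loadFence() orders
        // ALL preceding loads against everything that follows, so the
        // load of z below must also wait for the slow load of y.
        static int fenced() {
            int vy = y;        // long-latency miss, say; vy only marks
                               // program order and is otherwise unused
            int vx = x;        // L1 hit
            U.loadFence();
            return vx + z;
        }

        // Shape 2: a load-acquire of x alone. Only x is ordered against
        // what follows; the load of z need not wait for y. This is the
        // cheaper, roach-motel-friendly form Hans describes.
        static int acquired() {
            int vy = y;
            int vx = (int) X.getAcquire();
            return vx + z;
        }

        private static Unsafe unsafe() {
            try {
                Field f = Unsafe.class.getDeclaredField("theUnsafe");
                f.setAccessible(true);
                return (Unsafe) f.get(null);
            } catch (ReflectiveOperationException e) {
                throw new AssertionError(e);
            }
        }
    }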






_______________________________________________
Concurrency-interest mailing list
Concurrency-interest at cs.oswego.edu
http://cs.oswego.edu/mailman/listinfo/concurrency-interest







