[jmm-dev] Jmm revision status

Sat Jul 19 18:25:09 UTC 2014

In my mind, the main issue is that the ARM fence workaround isn't needed
for Java, for ARMv8, or for the other architectures that Olivier worries
about.

Although I'm sure there are implementations that will decide that
performance is more important than correctness in this area, I'd be
inclined to ignore those for this discussion. Otherwise we get into messy
issues of occurrence frequency that I think are really hard to get a
handle on.  As a wild guess, and to illustrate the problem, I conjecture
that the ARM erratum is actually more of a practical issue than if we
promised load->store ordering, and just declared violations to be processor
bugs.  I'm having a hard time coming up with programming idioms that rely
on non-dependent load->store ordering.  On the other hand, as Paul pointed
out in the original C++ discussions, unexpected load x -> load x reordering
causes all sorts of problems.  (I agree that CSE statistically helps,
especially in the egregious cases.  But I think it often won't suffice, due
to compilation unit boundaries, and because it's dangerous and generally
discouraged to merge atomic accesses across loop iterations.)
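
As a rough illustration of why unexpected load x -> load x reordering is a
problem, here is a minimal C++11 sketch.  The names x, count_backward_reads
and run_experiment are hypothetical, chosen only for this example.  A reader
loads the same atomic twice with memory_order_relaxed while a writer only
ever increments it.  C++ coherence says the second load can never observe an
older value than the first, so the count should be zero on a correct
implementation.  Java makes no such promise for ordinary (non-volatile)
fields.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical shared counter; the name follows the "load x -> load x" wording.
std::atomic<std::uint64_t> x{0};

// Counts iterations in which a later relaxed load of x observed an older
// value than an earlier relaxed load of x by the same thread.
std::uint64_t count_backward_reads(std::uint64_t iterations) {
    std::uint64_t violations = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        std::uint64_t first = x.load(std::memory_order_relaxed);
        std::uint64_t second = x.load(std::memory_order_relaxed);
        if (second < first) {
            ++violations;  // forbidden by C++11 coherence for a single atomic
        }
    }
    return violations;
}

// Runs one writer that only increments x, and one reader on the calling
// thread.  Returns the number of backward reads the reader saw.
std::uint64_t run_experiment(std::uint64_t iterations) {
    assert(iterations > 0 && "need at least one pair of loads");
    x.store(0, std::memory_order_relaxed);
    std::atomic<bool> stop{false};
    std::thread writer([&stop] {
        while (!stop.load(std::memory_order_relaxed)) {
            x.fetch_add(1, std::memory_order_relaxed);
        }
    });
    std::uint64_t violations = count_backward_reads(iterations);
    stop.store(true, std::memory_order_relaxed);
    writer.join();
    return violations;
}
```

Calling run_experiment(100000) should return 0 on a conforming
implementation.  A nonzero result would point at the kind of same-location
read-read reordering described in the erratum, or at a compiler problem.
This is only a stress test under the stated assumptions; a zero result does
not prove the absence of the bug.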

Hans

On Sat, Jul 19, 2014 at 12:30 AM, Peter Sewell <Peter.Sewell at cl.cam.ac.uk>
wrote:

> On 18 July 2014 07:19, Hans Boehm <boehm at acm.org> wrote:
> >
> >> On Thu, Jul 17, 2014 at 10:43 PM, Peter Sewell <Peter.Sewell at cl.cam.ac.uk>
> >> wrote:
> >> >
> >> > On 18 July 2014 00:57, Hans Boehm <boehm at acm.org> wrote:
> >> > > A few other updates:  Brian and I had a paper in MSPC 14
> >> > > (http://dl.acm.org/citation.cfm?id=2618134) that mostly summarizes the
> >> > > out-of-thin-air issues and solutions based on prohibiting load->store
> >> > > reordering.  I would argue that those are still the most practical
> >> > > solutions we currently have.
> >> > >
> >> > > One of my colleagues at Google points out that my earlier fear that
> >> > > bogus branches needed to enforce load->store ordering would tie up
> >> > > branch prediction resources should be unfounded.  It should be easy to
> >> > > arrange for these branches to be statically predicted correctly, in
> >> > > which case it appears that no prediction resources are used.
> >> >
> >> > > I think we still need real measurements of the cost
> >> >
> >> > Agreed for the last point.  For C I'm a bit skeptical; for Java I
> >> > wouldn't like to even guess.
> >
> > For C, if we look only at existing implementations, it seems that the only
> > cost is prohibiting some compiler transformations on relaxed operations,
> > and the cost on 64-bit ARMv8.  The former seems trivial; I suspect most
> > compilers don't reorder atomic accesses anyway.
>
> For the optimisation cost, I think Francesco et al. are starting to
> see optimisations involving atomics, but I agree that for relaxed
> there shouldn't (in principle) be much cost from forbidding them.
>
> > For Java, I agree.
> >
> >> >
> >> > > , which, at least for Java, I would expect to greatly depend on the
> >> > > cleverness of the compiler in delaying branches and avoiding
> >> > > unnecessary ones.
> >> > >
> >> > > Torvald Riegel and Paul McKenney are trying to turn C++11/C11
> >> > > memory_order_consume into something useful, and have been running into
> >> > > some of the same problems with definition of dependencies as we have
> >> > > here.
> >> >
> >> > There's also a bit of a question right now about "fake" data and
> >> > control dependency preservation on ARM; hopefully that will become
> >> > clear soon.
> >>
> >> > Although at most marginally relevant for Java, we also became aware of
> >> > an ARM erratum
> >> >
> >> > (http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf,
> >> > perhaps discovered by some of the other participants here?)
> >>
> >> (y)
> >>
> >> >, that seems to
> >> > effectively reduce the cost of prohibiting load->store reordering on
> >> > ARMv7 for C++ memory_order_relaxed to zero.  Apparently a substantial
> >> > fraction of ARMv7 cores have a hardware erratum that requires a fence for
> >> > memory_order_relaxed loads anyway.  Otherwise loads from the same
> >> > location may be reordered, which is disallowed for C++
> >> > memory_order_relaxed, but allowed for Java.  Thus any object code that is
> >> > intended to correctly support memory_order_relaxed on these processors
> >> > should already prohibit load->store reordering as a side-effect.  For C
> >> > and C++, I expect that realistically applies to all 32-bit ARM code.
> >> > Unfortunately, the required workaround seems appreciably more expensive
> >> > than what we would need to just enforce load->store ordering, since it
> >> > needs an actual fence.
> >>
> >> I do wonder how widely that workaround is actually deployed - any data?
> >
> >
> > I suspect it's not.  But I think our task is to look at performance in a
> > currently hypothetical world where implementations are actually correct in
> > this respect, and where we no longer see random memory-model induced
> > failures and attribute them to alpha particles, or whatever.
>
> yes, but we also need a reasonable path towards that world.  If ARM
> compiler implementations are not going to take the cost of that
> workaround (eg because the coherence problem is sufficiently rare in
> practice - btw, amusingly, it seemed as if compiler optimisations like
> CSE might actually ameliorate the problem), then it still becomes
> difficult to argue that they should add load->store fencing for
> C-relaxed or Java-nonvolatile reads.
>
> > I think we're
> > gradually moving towards that hypothetical world, but we're not that close,
> > yet.  (I would be surprised if there were any real large systems for which
> > this ARM bug is the most common cause of memory-model-related failures.)
> >
> > Hans
> >
> >>
> >> > As mentioned, this does not directly change the Java situation.  It also
> >> > does not affect 64-bit executables intended to run on ARMv8.
> >>
> >> Indeed
> >> best,
> >> Peter
> >>
> >>
> >> > Hans
> >> >
> >>
> >
>