[jmm-dev] Jmm revision status

Sat Jul 19 07:30:31 UTC 2014

On 18 July 2014 07:19, Hans Boehm <boehm at acm.org> wrote:
>
>> On Thu, Jul 17, 2014 at 10:43 PM, Peter Sewell <Peter.Sewell at cl.cam.ac.uk>
>> wrote:
>> >
>> > On 18 July 2014 00:57, Hans Boehm <boehm at acm.org> wrote:
>> > > A few other updates:  Brian and I had a paper in MSPC 14
>> > > (http://dl.acm.org/citation.cfm?id=2618134) that mostly summarizes the
>> > > out-of-thin-air issues and solutions based on prohibiting load->store
>> > > reordering.  I would argue that those are still the most practical
>> > > solutions we currently have.
>> > >
>> > > One of my colleagues at Google points out that my earlier fear that
>> > > bogus branches needed to enforce load->store ordering would tie up
>> > > branch prediction resources should be unfounded.  It should be easy to
>> > > arrange for these branches to be statically predicted correctly, in
>> > > which case it appears that no prediction resources are used.
>> >
>> > > I think we still need real measurements of the cost
>> >
>> > Agreed for the last point.  For C I'm a bit skeptical; for Java I
>> > wouldn't like to even guess.
>
> For C, if we look only at existing implementations, it seems that the only
> cost is prohibiting some compiler transformations on relaxed operations, and
> the cost on 64-bit ARMv8.  The former seems trivial; I suspect most
> compilers don't reorder atomic accesses anyway.

For the optimisation cost, I think Francesco et al. are starting to
see optimisations involving atomics, but I agree that for relaxed
accesses there shouldn't (in principle) be much cost from forbidding them.
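
For readers following along, the kind of transformation at issue can be sketched with the usual load-buffering shape. This is only an illustrative sketch, not code from the thread: the helper names run_lb_once and count_lb_both_one are made up here. C++11 memory_order_relaxed currently allows both loads to return 1, which requires a load to be reordered after a later store in at least one thread; the proposals discussed in this thread would forbid that outcome.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical helper (not from the thread): one run of the load-buffering
// shape using only memory_order_relaxed accesses.
std::pair<int, int> run_lb_once() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread t1([&] {
        r1 = x.load(std::memory_order_relaxed);  // load ...
        y.store(1, std::memory_order_relaxed);   // ... then an independent store
    });
    std::thread t2([&] {
        r2 = y.load(std::memory_order_relaxed);
        x.store(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return {r1, r2};
}

// Counts the runs in which both loads observed the other thread's store,
// i.e. the outcome that needs load->store reordering in some thread.
int count_lb_both_one(int rounds) {
    assert(rounds >= 0);
    int hits = 0;
    for (int i = 0; i < rounds; ++i) {
        std::pair<int, int> r = run_lb_once();
        if (r.first == 1 && r.second == 1) ++hits;
    }
    return hits;
}
```

Calling count_lb_both_one(1000) gives a rough count on one machine and one compiler. Thread start-up cost makes the race window small, so a count of zero says little about what the implementation is permitted to do.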

> For Java, I agree.
>
>> >
>> > > , which, at least for
>> > > Java, I would expect to greatly depend on the cleverness of the
>> > > compiler in delaying branches and avoiding unnecessary ones.
>> > >
>> > > Torvald Riegel and Paul McKenney are trying to turn C++11/C11
>> > > memory_order_consume into something useful, and have been running into
>> > > some of the same problems with definition of dependencies as we have
>> > > here.
>> >
>> > There's also a bit of a question right now about "fake" data and
>> > control dependency preservation on ARM; hopefully that will become
>> > clear soon.
>>
>> > Although at most marginally relevant for Java, we also became aware of
>> > an ARM erratum
>> > (http://infocenter.arm.com/help/topic/com.arm.doc.uan0004a/UAN0004A_a9_read_read.pdf,
>> > perhaps discovered by some of the other participants here?)
>>
>> (y)
>>
>> >, that seems to
>> > effectively reduce the cost of prohibiting load->store reordering on
>> > ARMv7 for C++ memory_order_relaxed to zero.  Apparently a substantial
>> > fraction of ARMv7 cores have a hardware erratum that requires a fence
>> > for memory_order_relaxed loads anyway.  Otherwise loads from the same
>> > location may be reordered, which is disallowed for C++
>> > memory_order_relaxed, but allowed for Java.  Thus any object code that
>> > is intended to correctly support memory_order_relaxed on these
>> > processors should already prohibit load->store reordering as a
>> > side-effect.  For C and C++, I expect that realistically applies to all
>> > 32-bit ARM code.  Unfortunately, the required workaround seems
>> > appreciably more expensive than what we would need to just enforce
>> > load->store ordering, since it needs an actual fence.
>>
>> I do wonder how widely that workaround is actually deployed - any data?
>
>
> I suspect it's not.  But I think our task is to look at performance in a
> currently hypothetical world where implementations are actually correct in
> this respect, and where we no longer see random memory-model induced
> failures and attribute them to alpha particles, or whatever.

Yes, but we also need a reasonable path towards that world.  If ARM
compiler implementations are not going to take the cost of that
workaround (e.g. because the coherence problem is sufficiently rare in
practice; btw, amusingly, it seemed as if compiler optimisations like
CSE might actually ameliorate the problem), then it still becomes
difficult to argue that they should add load->store fencing for
C-relaxed or Java-nonvolatile reads.
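
As a concrete reference point for the coherence problem, here is a minimal sketch of the read-read shape (the helper names are made up for illustration and are not from the erratum note). Two relaxed loads of the same location in one thread must not see the new value and then the old one under C++11. As noted in the quoted text, Java permits this for ordinary non-volatile fields.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: returns true if the reader saw the new value first
// and the old value second, which C++11 read-read coherence forbids.
bool corr_violation_once() {
    std::atomic<int> x{0};
    int r1 = -1, r2 = -1;

    std::thread writer([&] { x.store(1, std::memory_order_relaxed); });
    std::thread reader([&] {
        r1 = x.load(std::memory_order_relaxed);  // first load of x
        r2 = x.load(std::memory_order_relaxed);  // second load, same location
    });
    writer.join();
    reader.join();
    return r1 == 1 && r2 == 0;
}

// Counts forbidden outcomes over a number of runs; a conforming
// implementation should always return 0.
int count_corr_violations(int rounds) {
    assert(rounds >= 0);
    int violations = 0;
    for (int i = 0; i < rounds; ++i) {
        if (corr_violation_once()) ++violations;
    }
    return violations;
}
```

A non-zero result from count_corr_violations would indicate an implementation (hardware plus compiler) that does not provide C++ relaxed semantics. This is a sketch rather than a stress test, so a zero result on a given machine does not show that the workaround is in place.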

> I think we're
> gradually moving towards that hypothetical world, but we're not that close,
> yet.  (I would be surprised if there were any real large systems for which
> this ARM bug is the most common cause of memory-model-related failures.)
>
> Hans
>
>>
>> > As mentioned, this does not directly change the Java situation.  It also
>> > does not affect 64-bit executables intended to run on ARMv8.
>>
>> Indeed
>> best,
>> Peter
>>
>>
>> > Hans
>> >
>>
>