[jmm-dev] Store completion query - general and ARM

Thu Nov 10 09:54:00 UTC 2016

If you want a more rigorous and concrete model that explains this, you
might want to look at:
http://www.cl.cam.ac.uk/~pes20/popl16-armv8/top.pdf

The associated tool: www.cl.cam.ac.uk/~pes20/AArch64/
lets one run arbitrary model-allowed executions of tests:
- select AArch64 test Tutorial/SB+dmb.sys
- under Options, select "a larger set of transitions which we proved can be
taken eagerly"
- click Run, then it shows the initial state of the model with the initial
transitions highlighted in green
- take all the thread-local transitions of each thread (five each)
- now you can see each thread's write, dmb, and read request in the
"flowing model" abstract storage subsystem
- in this model, the dmb sys keeps the write and read request in order as
they flow down to memory, so no interleaving of the possible model
transitions can break the Dekker's algorithm property.

In this example, it happens that the read requests also can't be satisfied
from writes that haven't hit main memory, but in general they can be
satisfied earlier.

For contrast, if you try the SB test without dmb, you'll many more possible
executions.

This flowing model is actually a bit more microarchitectural than one would
like for an architectural spec, as it exposes the abstract interconnect
topology.  The POP model, also provided by that tool, abstracts from the
topology.  Both are principally due to Shaked Flur, cc'd.

best,
Peter

On 10 November 2016 at 09:31, Andrew Haley <aph at redhat.com> wrote:

> On 10/11/16 09:20, David Holmes wrote:
> > On 10/11/2016 7:07 PM, Andrew Haley wrote:
> >> On 10/11/16 00:06, David Holmes wrote:
> >>> Does any part of the JMM require actual visibility/completion of
> >>> volatile stores or is it only order that is defined (with an
> assumptions
> >>> that all stores will complete in a finite time)?
> >>
> >> Ordering is really all that we've got: all that memory fences can do
> >> is ensure that visibility of loads and stores is ordered in some way.
> >
> > If we establish some global order of loads and stores, yes. That can in
> > turn require that a store become visible prior to a given load.
>
> I agree.
>
> >>> In relation to ARM specifically, Dekker style algorithms require
> >>> visibility/completion of the store before the subsequent load, yet the
> >>> example in "A Tutorial Introduction to the ARM and POWER Relaxed Memory
> >>> Models" shows the use of DMB, not DSB.
> >>
> >> DMB is fine for that.  Dekker doesn't need a store to be forced out of
> >> the caches, only that the store be made visible to other processors
> >> before any operations later in program order.
> >
> > Again it is far from obvious to me that DMB causes the store to be
> > visible before any operations later in program order. I find the Group A
> > / Group B formulation (and even the definition of "observe") to be quite
> > obscure and hard to map to actual code behaviour.
>
> Indeed.  The real problem is that ARM are trying to describe the
> memory model in an abstract way that does not overly constrain
> implementations.  But a DMB really is sufficient to ensure that prior
> stores are visible.  (Mind you, we don't need DMB to get sequentially-
> consistent behaviour that's enough for Java volatiles.)
>
> >>> Yet AFAICS DMB says nothing about completion whereas DSB does. ??
> >>> (To be honest I find the Group A/B description of DMB properties
> >>> extremely hard to actually interpret wrt code like Dekker.)
> >>
> >> DSB is only really needed if there are multiple caches of the same
> >> address, i.e. Icache and Dcache: it's necessary to force a store out
> >> into main memory in order to refresh he Icache.
> >
> > I thought only ISB had an effect relative to instructions/i-cache ??
>
> It does: you need DSB to ensure the visibility of the data cleaned
> from the Dcache, then ISB to synchronize the fetched instruction
> stream.
>
> Andrew.
>
>
>