From dl at cs.oswego.edu Fri Feb 7 09:45:25 2014 From: dl at cs.oswego.edu (Doug Lea) Date: Fri, 07 Feb 2014 12:45:25 -0500 Subject: [jmm-dev] Now playing on the OpenJDK jmm-dev list Message-ID: <52F51BB5.6010100@cs.oswego.edu> Here's the third, and I hope last, Intro post for this effort. The goal of the OpenJDK Memory Model Update project (http://openjdk.java.net/projects/jmm/) is to provide an updated Java Memory Model, as described in JEP188 (http://openjdk.java.net/jeps/188). Probably, most results will be posted on the OpenJDK Wiki (https://wiki.openjdk.java.net/display/jmm/Main), for use in updating the JLS and as a reference for other JEPs producing associated software. This mailing list is intended for developing an updated JMM, not for usage questions etc. about the JMM (for which concurrency-interest at cs.oswego.edu and/or javamemorymodel-discussion at cs.umd.edu may be more appropriate). We welcome participation by concurrency experts in formal specification, hardware and software engineering, and software development tools. Even though I'm listed as project lead, I hope never to have any special role in development processes except for cat-herding efforts to help direct attention to issues. The idea of proposing JEP188 started with some informal exchanges that grew to an unmanageable CC list upon proposal, so shifted to a temporary mailing list while we waited for approval. There has already been a fair amount of traffic (summarized below). Those on the pre-jmm9 list should be auto-subscribed and receiving this as the first jmm-dev post. To view pre-jmm9 mailing list exchanges, see the mailman archives (http://cs.oswego.edu/pipermail/pre-jmm9/), and for previous CCed exchanges, see the gzipped mbox archive (http://gee.cs.oswego.edu/dl/papers/preprejmm9.mbox.gz). Here's a brief summary of in-progress efforts (updated from 3 weeks ago). Many of them are only loosely coupled with each other, so I've been encouraging people to explore them concurrently. 1. 
Objectives. These are not yet written all in one place, but I think we are heading to rough agreement about core issues: safety (including disallowing OOTA (out-of-thin-air) reads), security (including information flow via indirection), global properties (including SC (sequential consistency) for DRF (data race free) programs solely using locks, and sometimes other cases), and expressiveness (enabling finer-grained ordering control required to implement known shared-memory concurrent algorithms). 2. Formalisms. Several people/groups are contemplating different approaches. This is still in early stages, but I'm optimistic about prospects for something really good to emerge. 3. Mappings. How do models translate into compiler and processor rules or actions? Or to JLS specs? 4. Experimentation, including: (1) Do compilers (mainly, optimizers) and processors do what we think/hope? (2) What are performance impacts of proposed mappings? 5. Initialization. Pending lots of details and checks, we might have settled on a simpler path for this that amounts to ensuring release fences at the ends of constructors in a way that introduces no (or at most little) additional performance impact. 6. Expression of ordering constraints. There seems to be no substantive disagreement with the idea of supplying C/C++11-like methods offering manual ordering control via a combination of enhanced l-value operations (".volatile") and fence methods. Many details needed. 7. Implementation guidance. We have already seen cases where exploring alternatives has led to some possible improvements in JVMs. Probably much more to come, sometimes in the form of contributing patches. 8. User/usage validation. Do the results of this effort help developers? We have a lot of known usages and complaints built up over the years to draw on before needing to invite more. 9. Consequences. Not really started yet: Can we propose tools, annotations, test suites, etc.? Also user guidance docs. 
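Item 6's "fence methods" can be made a little more concrete with a sketch. No such Java API existed at the time of this post; the sketch below borrows the static fence methods that later shipped on java.lang.invoke.VarHandle (JDK 9) as a plausible analogue of what was being contemplated, so treat it as an illustration of the idea, not the API under design. The class and field names are invented.

```java
import java.lang.invoke.VarHandle;

public class FencePublish {
    static int data;
    static boolean ready;

    // Writer: a release fence keeps the store to `data` from being
    // reordered after the store to `ready`.
    static void publish() {
        data = 42;
        VarHandle.releaseFence();
        ready = true;
    }

    // Reader: an acquire fence after observing `ready` keeps the
    // subsequent read of `data` from being reordered before it.
    static int consume() {
        if (ready) {
            VarHandle.acquireFence();
            return data;
        }
        return -1;
    }

    public static void main(String[] args) {
        publish();
        System.out.println(consume());  // single-threaded demo
    }
}
```

In a real two-thread run the fences are what make the 42 in `data` visible whenever `ready` is seen as true; run single-threaded, as here, the program simply prints 42.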
(Aside: The Android cross-language "SMP guide" might be a good model for some audiences (http://developer.android.com/training/articles/smp.html). -Doug From hansboehm at yahoo.com Fri Feb 7 23:11:52 2014 From: hansboehm at yahoo.com (Hans Boehm) Date: Fri, 7 Feb 2014 23:11:52 -0800 (PST) Subject: [jmm-dev] [pre-jmm9] Expressing ordering constraints In-Reply-To: <20140207184626.GU4250@linux.vnet.ibm.com> Message-ID: <1391843512.43684.YahooMailBasic@web122205.mail.ne1.yahoo.com> I'm not sure where the proposal to add br;isync came from. I think BMM would require that, but that seems more drastic than what I've seen proposed here. Even the N3710 ld->st ordering proposal is believed to require at most the branch without the isync. (And, if adopted, I would expect that branch to be replaced by another instruction that doesn't consume branch prediction slots in a few years.) Hans -------------------------------------------- On Fri, 2/7/14, Paul E. McKenney wrote: Subject: Re: [pre-jmm9] Expressing ordering constraints To: "Doug Lea"
Cc: "pre-jmm9 at cs.oswego.edu" , "Boehm, Hans" Date: Friday, February 7, 2014, 10:46 AM On Sat, Feb 01, 2014 at 10:16:34AM -0500, Doug Lea wrote: > On 01/30/2014 01:30 AM, Boehm, Hans wrote: > > >You mean that you would prefer intentionally racy but unordered accesses > >have the same semantics as data not-intended-to-be-racy accesses? ... The > >feeling on the C++ committee, particularly on Paul's part, IIRC, was that we > >did want coherence for the intentionally racy accesses, because its absence > >was just too horrible to deal with. > > This might be one of my rare disagreements with Paul. In this case, > simplicity of rules seems worth the extra agony for people who can > figure out how to cope if they need. Without distinguishing these, > the parts of the JMM that most programmers would ever need to deal > with might look something like the following. It's not quite > as simple as, say, BMM, but seems to be on the right track: It is not just me. The possibility of adding compare-branch-isb/isync to C11 relaxed loads came up on the Linux kernel mailing list today, and the reaction of one prominent maintainer was, and I quote, "sounds like someone took a big toke from the bong again." One of the ARM64 maintainers took a somewhat less colorful but equally negative position, questioning why additional otherwise-unnecessary instructions were being contemplated to solve what he termed a compiler problem. The Linux-kernel discussion was of course C11 rather than Java, but nevertheless, just saying... 
From dl at cs.oswego.edu Sat Feb 8 06:17:44 2014 From: dl at cs.oswego.edu (Doug Lea) Date: Sat, 08 Feb 2014 09:17:44 -0500 Subject: [jmm-dev] [pre-jmm9] Expressing ordering constraints In-Reply-To: <1391843512.43684.YahooMailBasic@web122205.mail.ne1.yahoo.com> References: <1391843512.43684.YahooMailBasic@web122205.mail.ne1.yahoo.com> Message-ID: <52F63C88.7040800@cs.oswego.edu> On 02/08/2014 02:11 AM, Hans Boehm wrote: > I'm not sure where the proposal to add br;isync came from. Me neither. Backing up ... Because it would be vastly nicer in Java to equate semantics for ordinary accesses to non-volatiles and relaxed accesses to volatiles, I've been trying to further diagnose the basis for the C/C++11 distinction and subsequent issues, to see if we can avoid them. Hans's N3710 (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3710.html) includes some discussions, but I'm still missing some context because this was introduced after I stopped paying close attention to C/C++11 MM development. As Hans mentioned, in C/C++ atomics are required to be coherent even if accessed in relaxed mode. Coherence requires, among other things, that you don't see values "running backwards in time", which might otherwise occur in reorderings such as { r1 = x; r2 = x; }. When you mix this with current C/C++11 OOTA loopholes you encounter inexplicable anomalies. (See examples in N3710 and others discovered by Brian Demsky.) Ignoring the OOTA issue (which I suspect we address in some other way), the decision about requiring coherence regardless of access mode seems suspicious. Is there a killer example of an otherwise unprogrammable algorithm? An otherwise unprovable property? If not, my current sense is "Shrug; if you want ordering, ask for it". At most we could support more ordering control methods. (It looks like we will already be adding something like dependentStoreFence(ref) to those in C/C++, and there's no reason not to contemplate others.) 
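Doug's { r1 = x; r2 = x; } example can be written out as a small sketch of what per-variable coherence forbids. This is a single-threaded illustration, not a litmus test that can actually exhibit the reordering; the class and field names are invented.

```java
public class Coherence {
    static int x;  // plain field; imagine a racing writer storing x = 1

    // Two back-to-back reads of the same variable. Per-variable coherence
    // says the second read may not observe an OLDER value than the first:
    // under a race with a writer, seeing (r1 == 1, r2 == 0) would be time
    // "running backwards". A compiler that reorders the two loads -- legal
    // for plain/relaxed accesses without a coherence rule -- could produce
    // exactly that outcome.
    static int[] readTwice() {
        int r1 = x;
        int r2 = x;
        return new int[] { r1, r2 };
    }

    public static void main(String[] args) {
        x = 1;  // stand-in for the racing writer's store
        int[] r = readTwice();
        System.out.println(r[0] + "," + r[1]);
    }
}
```

Run sequentially, as here, both reads of course see the same value; the question in the thread is whether the model should rule out the backwards outcome even for racy relaxed accesses.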
-Doug From dl at cs.oswego.edu Sun Feb 9 07:31:09 2014 From: dl at cs.oswego.edu (Doug Lea) Date: Sun, 09 Feb 2014 10:31:09 -0500 Subject: [jmm-dev] final reads/writes Message-ID: <52F79F3D.5050802@cs.oswego.edu> Continuing my quest to introduce issues early and often... Assuming that we go ahead with the idea of ensuring a release fence upon construction (normally free, because piggybacked with those required anyway for object headers), rather than only in the presence of final fields, do we need to say anything special about final fields? I can't quite rule it out. Thoughts welcome. Background: Optimizers like to remove unnecessary reads (and computations based on them). It seems that any plausible memory model will allow cases based on the idea that if you can identify a readsFrom source for a value, and you've already read it, then no additional ordering constraints could force you to re-read, so don't. In a more ideal world, "final" would allow a more aggressive version: If you've (implicitly) identified ANY readsFrom source, that's good enough, because there is only one. Unfortunately, "final" doesn't strictly mean this in JVMs -- there are cheats sometimes allowing further updates to final variables. And in practice JVMs are conservative enough to allow those cheats to work, despite some wording in the JSR133 JMM allowing them not to work. Additionally, JDK8 hotspot introduced the @Stable annotation that in essence says: if the value is nonnull, then it is the final written value. And similar cases arise in which there may be bookkeeping to track "Freeze after writing" status (https://www.cs.indiana.edu/cgi-bin/techreports/TRNNN.cgi?trnum=TR710), and a possible JDK9 proposal for explicitly "frozen" arrays. The question at hand is: Does a memory model itself need to say anything explicitly about any of these? 
-Doug From ludwig.mark at siemens.com Sun Feb 9 07:33:37 2014 From: ludwig.mark at siemens.com (Ludwig, Mark) Date: Sun, 9 Feb 2014 15:33:37 +0000 Subject: [jmm-dev] Atomic references Message-ID: Greetings, This is probably in the category of "picking a nit," but I have yet to find any statement that shared Java variables that are object references are atomic. (An example of a shared variable is a class static declaration.) That is, I have the impression that a shared variable that is a reference to an object (these days, a 4- or 8-byte pointer at the hardware level for 32- or 64-bit architecture, respectively) is naturally atomic, that if I have two or more asynchronous Java threads assigning an object reference to a shared variable, or assigning null to a shared variable, that any other thread reading that reference will consistently read either null or one of the object references assigned along the way. We believe this is true because of the need for such atomicity within the hardware, so it naturally provides this in the machine instructions that store and retrieve addresses (between registers and main memory). This assumption might only hold for properly-aligned values in main memory, but we assume that Java provides this at the machine level, naturally, too. ("Properly aligned" means that any address referring to a 4-byte address in memory has zeroes in the last two bits, and an 8-byte reference has zeroes in the last three bits.) To be clear, I am /not/ talking about an 8-byte /anything/ on a 32-bit architecture. I am also aware that, without synchronization, there is a timing window among the threads about what they read (that any thread may read out-of-date data for an indeterminate period of time according to the hardware caching architecture). My point is not about timing, per se, but about self-consistency among the bytes comprising a reference in a shared variable. 
We use this assumption heavily in a server application, and have yet to ever hear of any trouble or concern around this, but cannot find anything specifying this behavior. While all you distinguished folks are updating the JMM, I thought you might cover this. OTOH, if it /is/ specified, I'd appreciate a pointer to the language specifying it. For background - and in case I haven't used terms above that precisely mean what I intend: We use this technique for letting threads allocate a Singleton, with an idempotent construction sequence, that is accessed at very high frequency, without using any synchronization. (Each thread looks for null and if it is, constructs the Singleton and assigns the reference.) This makes sense to us when the code to construct the Singleton is cheap enough, and we have strictly limited the Singleton to final fields in order to use the existing JMM guarantee that when construction finishes, it's safe to let any thread pick up a reference to the object and use it asynchronously. I use the label "very high frequency" when accesses to the Singleton occur within each thread perhaps thousands of times per second on a fast-enough machine. We believe it's cheaper to let every thread (at the worst case) construct the Singleton, and let the garbage collector take care of cleaning up the duplicates (if any), than it would be to synchronize around the reference for the life of the application. The server application runs for indeterminate periods of time ... easily months, depending on scheduled down-time. It creates threads for client actions. (We write business software in use by an unknown number of customers.) At large customer sites, easily millions of threads go through this code while the application is running. Such sites have large processor complexes. We know one customer has 64 processors in a large server machine. 
The fact that threads are created to service client requests also means that the number of competing threads during application start-up is limited to a fairly small number. I would be surprised if any customer could manage to get even ten (10) client actions running concurrently that might all construct the Singleton. At the peak of the working day, once there are thousands of users sending requests to the server, it's reasonable to expect that every processor is reading the reference to the Singleton perhaps thousands of times per second. (We know from scalability testing that synchronizing around such a reference imposes a noticeable bottleneck.) Thanks! Mark Ludwig Lifecycle Coll Product Lifecycle Management Siemens Industry Sector Siemens Product Lifecycle Management Software Inc. 5939 Rice Creek Parkway Shoreview, MN 55126 United States Tel. :+1 (651) 855-6140 Fax :+1 (651) 855-6280 ludwig.mark at siemens.com www.siemens.com/plm From dl at cs.oswego.edu Sun Feb 9 11:57:36 2014 From: dl at cs.oswego.edu (Doug Lea) Date: Sun, 09 Feb 2014 14:57:36 -0500 Subject: [jmm-dev] Atomic references In-Reply-To: References: Message-ID: <52F7DDB0.8040503@cs.oswego.edu> On 02/09/2014 10:33 AM, Ludwig, Mark wrote: > This is probably in the category of "picking a nit," but I have yet to find > any statement that shared Java variables that are object references are > atomic. Yes, they must be. The statement is hiding in JLS sec 17.7 "Writes to and reads of references are always atomic, regardless of whether they are implemented as 32-bit or 64-bit values." http://docs.oracle.com/javase/specs/jls/se7/html/jls-17.html#jls-17.7 (Curiously, there is no direct statement that ints, shorts, chars, or bytes are atomic. This probably needs fixing.) 
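The idiom Mark describes, together with the JLS 17.7 reference-atomicity guarantee Doug cites, can be sketched roughly as follows. All names are invented; this is an illustration of the racy single-check pattern under the stated assumptions (idempotent, cheap construction; all-final fields), not production code.

```java
public class Registry {
    // All fields final, so JSR-133 final-field semantics guarantee that any
    // thread reading a non-null `instance` sees a fully constructed object.
    private final String name;
    private final int limit;

    private Registry(String name, int limit) {
        this.name = name;
        this.limit = limit;
    }

    // Plain (non-volatile) shared reference. Reads and writes of references
    // are atomic (JLS 17.7), so a racing reader sees null or a complete
    // reference -- never a torn one. Because construction is idempotent,
    // two threads may both build an instance; the loser's copy simply
    // becomes garbage, avoiding any synchronization on the hot read path.
    private static Registry instance;

    static Registry getInstance() {
        Registry r = instance;              // racy read, but atomic
        if (r == null) {
            r = new Registry("default", 100);
            instance = r;                   // racy publish; benign here
        }
        return r;
    }

    public static void main(String[] args) {
        System.out.println(getInstance().limit);
    }
}
```

Note the pattern is only safe under exactly the conditions Mark lists: drop the final fields or make construction non-idempotent, and the race stops being benign.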
-Doug From jeremymanson at google.com Sun Feb 9 16:34:32 2014 From: jeremymanson at google.com (Jeremy Manson) Date: Sun, 9 Feb 2014 16:34:32 -0800 Subject: [jmm-dev] final reads/writes In-Reply-To: <52F79F3D.5050802@cs.oswego.edu> References: <52F79F3D.5050802@cs.oswego.edu> Message-ID: So, to reiterate my previous postings (too): I used to believe that a memory model should provide extra latitude for optimizations in these cases, but now I'm perfectly happy with not saying anything. We do need to talk about the safety guarantees, and anything that VM developers can do given those safety guarantees is fine. My mind was changed when we tried hoisting final loads in Hotspot, and got a percent or so speedup, but threw some code into infinite loops. I think it is just too confusing for users if they change the value of a field, but subsequent loads - regular, same-thread loads - don't see the changed value. Compilers should only do a loop invariant hoist if they can prove that the value being hoisted is loop invariant. Now, if you can prove that the value doesn't change, then you can certainly do the optimization. But in that case, the final annotation is a hint, not a proof. I'm not sure that the JMM needs to say anything about that. Jeremy On Sun, Feb 9, 2014 at 7:31 AM, Doug Lea
wrote: > Continuing my quest to introduce issues early and often... > > Assuming that we go ahead with the idea of ensuring a release fence > upon construction (normally free, because piggybacked with those > required anyway for object headers), rather than only in the presence > of final fields, do we need to say anything special about final fields? > > I can't quite rule it out. Thoughts welcome. > > Background: Optimizers like to remove unnecessary reads > (and computations based on them). It seems that any plausible > memory model will allow cases based on the idea that if you can > identify a readsFrom source for a value, and you've already > read it, then no additional ordering constraints could > force you to re-read, so don't. > > In a more ideal world, "final" would allow a more aggressive > version: If you've (implicitly) identified ANY readsFrom source, > that's good enough, because there is only one. Unfortunately, "final" > doesn't strictly mean this in JVMs -- there are cheats > sometimes allowing further updates to final variables. And in > practice JVMs are conservative enough to allow those cheats > to work, despite some wording in the JSR133 JMM allowing them > not to work. > > Additionally, JDK8 hotspot introduced the @Stable annotation > that in essence says: if the value is nonnull, then it is the final > written value. And similar cases arise in which there may be > bookkeeping to track "Freeze after writing" status > (https://www.cs.indiana.edu/cgi-bin/techreports/TRNNN.cgi?trnum=TR710), > and a possible JDK9 proposal for explicitly "frozen" arrays. > > The question at hand is: Does a memory model itself need to say > anything explicitly about any of these? 
> > -Doug > > From aleksey.shipilev at oracle.com Mon Feb 10 11:18:33 2014 From: aleksey.shipilev at oracle.com (Aleksey Shipilev) Date: Mon, 10 Feb 2014 23:18:33 +0400 Subject: [jmm-dev] Enforcing access atomicity (benchmarks) Message-ID: <52F92609.5070407@oracle.com> Hi there, Here you go, the early estimates for enforcing access atomicity: http://shipilev.net/blog/2014/all-accesses-are-atomic/ Go straight to "Conclusion" for TL;DR summary. In short, in 2014, most platforms are already able to pull off 64-bit accesses, and hence it seems redundant to keep the 64-bit exception in the spec. Thanks, -Aleksey. From dl at cs.oswego.edu Fri Feb 14 05:49:36 2014 From: dl at cs.oswego.edu (Doug Lea) Date: Fri, 14 Feb 2014 08:49:36 -0500 Subject: [jmm-dev] Enforcing access atomicity (benchmarks) In-Reply-To: <52F92609.5070407@oracle.com> References: <52F92609.5070407@oracle.com> Message-ID: <52FE1EF0.30006@cs.oswego.edu> On 02/10/2014 02:18 PM, Aleksey Shipilev wrote: > Here you go, the early estimates for enforcing access atomicity: > http://shipilev.net/blog/2014/all-accesses-are-atomic/ > > Go straight to "Conclusion" for TL;DR summary. In short, in 2014, most > platforms are already able to pull off 64-bit accesses Thanks! 
This prodded me to further investigate a few issues: Are there ANY platforms that do or could otherwise support JVMs and for which there is no reasonable way to conform? The answer depends on how far you want to stretch "reasonable" (worst case, JVMs could insert locks), but some 32-bit versions of PPC and MIPS seem problematic. Also, it might be the case that floating-point (double) on ARM (even ARMv8) requires special handling. The answer also depends on what you mean by "JVM". Java "ME" (Micro Edition) specs have not kept pace with the "SE" specs that we've implicitly been targeting. Most but probably not all problematic cases are only relevant for ME anyway. Backing up, the main reason for contemplating this is spec simplification. Getting rid of non-obvious rules and special cases one by one may eventually result in a model/spec that overcomes the "you are not expected to understand this" reputation of the JMM among developers. An argument for not simplifying is that programs shouldn't have any races where non-atomicity would be observable anyway. It's a pretty good argument, although not very convincing to some developers writing code for monitoring and profiling, as well as some numerical heuristics. They often could care less about race-freedom so long as they arrive at empirically acceptable approximations of reality. And in any case, the presence of potential non-atomicity causing reads of a long or double to rarely take crazy/wild values only on uncommon platforms is not a very nice way to alert people of problems. Another argument for not simplifying is that (as Brian mentioned) we expect JDK9 to support wider value types of some sort; surely including those for which no processor guarantees atomicity. So there will always be atomicity disclaimers of some kind somewhere. Across these concerns, it seems that resolving this issue is mostly a policy decision. I welcome any more compelling arguments on either side than I listed above. 
Without them, this might not become settled until (much) later when canvassing broader community input. -Doug From dl at cs.oswego.edu Sun Feb 16 15:00:30 2014 From: dl at cs.oswego.edu (Doug Lea) Date: Sun, 16 Feb 2014 18:00:30 -0500 Subject: [jmm-dev] stores Message-ID: <5301430E.9010009@cs.oswego.edu> Memory models can generate a fair amount of excitement. See Linus Torvalds's post on the linux kernel list: https://lkml.org/lkml/2014/2/14/492 and follow-ups with Paul McKenney. (Condolences!) I don't think this introduces anything new with respect to JMM9 discussions so far though. In general, speculative stores and out-of-thin-air reads break basic safety properties. Although there still might be some related open cases about inserted stores, including "redundant" ones. As in: if (x != 0) x = 0; ==> x = 0; ? -Doug From david.holmes at oracle.com Sun Feb 16 16:47:06 2014 From: david.holmes at oracle.com (David Holmes) Date: Mon, 17 Feb 2014 10:47:06 +1000 Subject: [jmm-dev] Enforcing access atomicity (benchmarks) In-Reply-To: <52FE1EF0.30006@cs.oswego.edu> References: <52F92609.5070407@oracle.com> <52FE1EF0.30006@cs.oswego.edu> Message-ID: <53015C0A.204@oracle.com> On 14/02/2014 11:49 PM, Doug Lea wrote: > On 02/10/2014 02:18 PM, Aleksey Shipilev wrote: >> Here you go, the early estimates for enforcing access atomicity: >> http://shipilev.net/blog/2014/all-accesses-are-atomic/ >> >> Go straight to "Conclusion" for TL;DR summary. In short, in 2014, most >> platforms are already able to pull off 64-bit accesses > > Thanks! This prodded me to further investigate a few issues: > > Are there ANY platforms that do or could otherwise support JVMs > and for which there is no reasonable way to conform? > > The answer depends on how far you want to stretch "reasonable" > (worst case, JVMs could insert locks), but some 32bit versions > of PPC and MIPS seem problematic. Also, it might be the case that > floating-point (double) on ARM (even ARMv8) requires special handling. 
Not just double but also float. The ARMv8 spec not only doesn't provide single-copy atomicity for 64-bit FP values but it even rolls back the 32-bit guarantees, for FP, to only providing byte-level single-copy atomicity! So in theory a float/double can be loaded/stored one byte at a time! Perhaps our ARM folk could comment on this as we've been used to getting 32-bit atomic accesses on 32-bit platforms for an awful long time now. > The answer also depends on what you mean by "JVM". > Java "ME" (M for Mobile) specs have not kept pace with > the "SE" specs that we've implicitly been targeting. > Most but probably not all problematic cases are only > relevant for ME anyway. SE Embedded is impacted by this. > Backing up, the main reason for contemplating this is spec > simplification. Getting rid of non-obvious rules and special > cases one by one may eventually result in a model/spec that > overcomes the "you are not expected to understand this" > reputation of the JMM among developers. I think the non-atomic treatment of long/double unless volatile is so isolated in the memory-model, and so stand-alone and simple, that removing it would be imperceptible in the overall complexity of the JMM. David ------ > An argument for not simplifying is that programs shouldn't > have any races where non-atomicity would be observable anyway. > It's a pretty good argument, although not very convincing > to some developers writing code for monitoring and profiling, > as well as some numerical heuristics. They often could care less > about race-freedom so long as they arrive at empirically > acceptable approximations of reality. And in any case, > the presence of potential non-atomicity causing reads of a > long or double to rarely take crazy/wild values only > on uncommon platforms is not a very nice way to alert people > of problems. 
> > Another argument for not simplifying is that (as Brian mentioned) > we expect JDK9 to support wider value types of some sort; > surely including those for which no processor guarantees > atomicity. So there will always be atomicity disclaimers of > some kind somewhere. > > Across these concerns, it seems that resolving this issue is > mostly a policy decision. I welcome any more compelling > arguments on either side than I listed above. Without > them, this might not become settled until (much) later when > canvassing broader community input. > > -Doug > From aleksey.shipilev at oracle.com Mon Feb 17 00:56:56 2014 From: aleksey.shipilev at oracle.com (Aleksey Shipilev) Date: Mon, 17 Feb 2014 12:56:56 +0400 Subject: [jmm-dev] Enforcing access atomicity (benchmarks) In-Reply-To: <52FE1EF0.30006@cs.oswego.edu> References: <52F92609.5070407@oracle.com> <52FE1EF0.30006@cs.oswego.edu> Message-ID: <5301CED8.6040706@oracle.com> On 02/14/2014 05:49 PM, Doug Lea wrote: > On 02/10/2014 02:18 PM, Aleksey Shipilev wrote: >> Here you go, the early estimates for enforcing access atomicity: >> http://shipilev.net/blog/2014/all-accesses-are-atomic/ >> >> Go straight to "Conclusion" for TL;DR summary. In short, in 2014, most >> platforms are already able to pull off 64-bit accesses > > Are there ANY platforms that do or could otherwise support JVMs > and for which there is no reasonable way to conform? > > The answer depends on how far you want to stretch "reasonable" > (worst case, JVMs could insert locks), but some 32bit versions > of PPC and MIPS seem problematic. Also, it might be the case that > floating-point (double) on ARM (even ARMv8) requires special handling. I would *really* like feedback from the ARM folks on this, because the performance experiments need properly functional and correct access code for all the platforms. While we haven't found empirical evidence that the code we have now is broken, that might just be luck. 
> An argument for not simplifying is that programs shouldn't > have any races where non-atomicity would be observable anyway. > It's a pretty good argument, although not very convincing > to some developers writing code for monitoring and profiling, > as well as some numerical heuristics. They often could care less > about race-freedom so long as they arrive at empirically > acceptable approximations of reality. And in any case, > the presence of potential non-atomicity causing reads of a > long or double to rarely take crazy/wild values only > on uncommon platforms is not a very nice way to alert people > of problems. +1. Having written some sophisticated high-performance code in Java, I am excited about access atomicity guarantees when dealing with eventually-consistent code. I would be more relaxed about the long/double exception for ordinary loads/stores as long as there is a way to achieve *only* access atomicity, without burdening myself with the memory semantics around volatiles. That's one of the things my post was trying to showcase: the add-on of volatile semantics significantly increases the costs compared to "just" the atomic access. > Another argument for not simplifying is that (as Brian mentioned) > we expect JDK9 to support wider value types of some sort; > surely including those for which no processor guarantees > atomicity. So there will always be atomicity disclaimers of > some kind somewhere. I think there is a slight bias in the way we ask the question. We call it "drop the access atomicity exception", while we really should discuss "requiring access atomicity for 64-bit types as well". The argument about value types surely fits the former discussion, because why drop the exception if we are about to reintroduce it? The latter is more interesting: where do we draw the line between which accesses are atomic and which are not? -Aleksey. 
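For readers unfamiliar with the 64-bit exception being debated in this subthread: JLS 17.7 permits a non-volatile long or double write to be treated as two separate 32-bit writes. The sketch below simulates, with plain arithmetic, the kind of never-written "crazy/wild" value a racing reader could then observe; it does not itself perform a racy access, and the helper name is invented.

```java
public class Tearing {
    // Simulate the non-atomic treatment JLS 17.7 permits for non-volatile
    // long/double: a 64-bit write may be split into two 32-bit halves, so
    // a racing reader can observe one new half combined with one old half.
    static long tornRead(long oldVal, long newVal) {
        long hiNew = newVal & 0xFFFFFFFF00000000L; // new upper half already written
        long loOld = oldVal & 0x00000000FFFFFFFFL; // old lower half still visible
        return hiNew | loOld;
    }

    public static void main(String[] args) {
        long before = 0L;
        long after  = -1L;  // all 64 bits set
        // The torn result is a value that was never written by any thread:
        System.out.println(Long.toHexString(tornRead(before, after)));
    }
}
```

The printed value, ffffffff00000000, is neither 0 nor -1: exactly the out-of-thin-air-looking garbage that the exception allows and that dropping it would rule out.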
From dl at cs.oswego.edu Mon Feb 17 06:00:51 2014 From: dl at cs.oswego.edu (Doug Lea) Date: Mon, 17 Feb 2014 09:00:51 -0500 Subject: [jmm-dev] Enforcing access atomicity (benchmarks) In-Reply-To: <53015C0A.204@oracle.com> References: <52F92609.5070407@oracle.com> <52FE1EF0.30006@cs.oswego.edu> <53015C0A.204@oracle.com> Message-ID: <53021613.6020409@cs.oswego.edu> On 02/16/2014 07:47 PM, David Holmes wrote: > > I think the non-atomic treatment of long/double unless volatile is so isolated > in the memory-model, and so stand-alone and simple, that removing it would be > imperceptible in the overall complexity of the JMM. > True in the JSR133 JMM, but for JMM9, I think we'd like to equate non-volatile access with relaxed mode of volatile. To preserve this, we need more modes. Maybe we do anyway. Stealing a term from clang (http://llvm.org/docs/Atomics.html), we could use "monotonic", which combines atomicity with not allowing time to run backwards. Here's how this might fit into the One Page Memory Model (which includes a big cheat for now) ... 1. A program consists of one or more .class files containing bytecodes, typically translated from a source language. A program starts by accessing an object constructed in accord with a given .class file. 2. Any read (i.e., a get* bytecode) returns a value written (i.e., a put* bytecode) by some thread, as constrained by rules below, or in the absence of well-founded constraints, zero (0/0.0/false/null). All accesses are guaranteed atomic except for those of long or double variables that are either non-volatile or accessed in Relaxed mode. 3. The order of (get* put*) bytecodes accessing ordinary variables or access invocations in Relaxed mode for volatile variables imposes no ordering constraints on execution except for the following: a. Indirect read ordering is always preserved; i.e., any such read is equivalent to getDependent(). b. 
Field assignments within constructors are always ordered before subsequent program-order assignments of references to the constructed objects; i.e., each such store is equivalent to field.setDependent(constructedObject). c. Data-race-free programs using monitor locks are sequentially consistent. [The big cheat for now.] 4. Explicit ordering control is available using volatiles, fence methods, and .volatile expressions, as follows, illustrated using the .volatile form: * Monotonic mode. v.getMonotonic() and v.setMonotonic(x): given a: v.getMonotonic() and b: v.getMonotonic(), with a before b in bytecode order, it is not the case that b is ordered before a in any execution. [Etc.] * Acquire/Release mode. v.getAcquire() and v.setRelease(x) [explain...] * Indirection dependent mode. v.getDependent(ref) and v.setDependent(x, ref) [explain as scoped acquire/release...] * Sequential mode. v.getSequential() and v.setSequential(x) [explain...] 5. Other misc: Thread.start etc. From Peter.Sewell at cl.cam.ac.uk Mon Feb 17 08:45:59 2014 From: Peter.Sewell at cl.cam.ac.uk (Peter Sewell) Date: Mon, 17 Feb 2014 16:45:59 +0000 Subject: [jmm-dev] thin-air summary Message-ID: Dear all, Mark Batty and I have written a short note trying to summarise the thin-air problem as crisply as we can: http://www.cl.cam.ac.uk/~pes20/cpp/notes42.html Comments welcome, of course. We've also been thinking here about possible approaches; hopefully we'll have another note about that in a few days. 
Peter From david.holmes at oracle.com Mon Feb 17 20:14:12 2014 From: david.holmes at oracle.com (David Holmes) Date: Tue, 18 Feb 2014 14:14:12 +1000 Subject: [jmm-dev] thin-air summary In-Reply-To: References: Message-ID: <5302DE14.8080709@oracle.com> Hi Peter, On 18/02/2014 2:45 AM, Peter Sewell wrote: > Dear all, > > Mark Batty and I have written a short note trying to summarise the > thin-air problem as crisply as we can: > > http://www.cl.cam.ac.uk/~pes20/cpp/notes42.html > > Comments welcome, of course. We've also been thinking here about > possible approaches; hopefully we'll have another note about that in a > few days. I'm a lay-person when it comes to the formalities of all this but given: 4 Example LB+ctrl+data+ctrl-double (language must allow) r1 = x; // reads 42 if (r1 == 42) y = r1; --------------------------- r2 = y; // reads 42 if (r2 == 42) x = 42; else x = 42; the compiler optimization would elide the conditional and simply do the store, so this then reduces to: r1 = x; // reads 42 if (r1 == 42) y = r1; --------------------------- r2 = y; // reads 42 x = 42; which, as stated, now matches "3 Example LB+ctrl+data+po". But in that case I don't understand how you can say that for "5 Example LB+ctrl+data+ctrl-single" "the candidate execution that we want to forbid here is identical to the execution of the previous example that we have to allow" - as 5 has a conditional and 4 no longer does, hence they are no longer the same? Further/similarly, it would seem based on these examples (and I realize that there may well be other examples that show otherwise) that the straw-man of prohibiting the (rf+dep) cycle would hold if you first reduced the code to its "minimal" form, i.e. once 4 is reduced to 3 there is no cycle. Of course I may have just shifted the problem into one of being able to define what a "minimal" form is. 
:) David ----- > Peter > From Peter.Sewell at cl.cam.ac.uk Tue Feb 18 00:32:51 2014 From: Peter.Sewell at cl.cam.ac.uk (Peter Sewell) Date: Tue, 18 Feb 2014 08:32:51 +0000 Subject: [jmm-dev] thin-air summary In-Reply-To: <5302DE14.8080709@oracle.com> References: <5302DE14.8080709@oracle.com> Message-ID: On 18 February 2014 04:14, David Holmes wrote: > Hi Peter, > > > On 18/02/2014 2:45 AM, Peter Sewell wrote: >> >> Dear all, >> >> Mark Batty and I have written a short note trying to summarise the >> thin-air problem as crisply as we can: >> >> http://www.cl.cam.ac.uk/~pes20/cpp/notes42.html >> >> Comments welcome, of course. We've also been thinking here about >> possible approaches; hopefully we'll have another note about that in a >> few days. > > > I'm a lay-person when it comes to the formalities of all this but given: > > 4 Example LB+ctrl+data+ctrl-double (language must allow) > > r1 = x; // reads 42 > if (r1 == 42) > y = r1; > --------------------------- > r2 = y; // reads 42 > if (r2 == 42) > x = 42; > else > x = 42; > > the compiler optimization would elide the conditional and simply do the > store, so this then reduces to: > > r1 = x; // reads 42 > if (r1 == 42) > y = r1; > --------------------------- > r2 = y; // reads 42 > x = 42; > > which, as stated, now matches "3 Example LB+ctrl+data+po". But in that case > I don't understand how you can say that for "5 Example > LB+ctrl+data+ctrl-single" "the candidate execution that we want to forbid > here is identical to the execution of the previous example that we have to > allow" - as 5 has a conditional and 4 no longer does, hence they are no > longer the same ? What we're showing here is that "shows that thin-air executions cannot be forbidden by any per-candidate-execution condition **using the C/C++11 notion of candidate executions**". That notion (summarised in Section 2.5 of the Batty et al. 
POPL11 paper ) is not a trace of machine instructions, but instead a set of memory access events together with various relations over them. Branches don't appear explicitly, and those relations don't include control dependency. > Further/similarly, it would seem based on these examples (and I realize that > there may well be other examples that show otherwise) that the straw-man of > prohibiting the (rf+dep) cycle would hold if you first reduced the code to > its "minimal" form ie once 4 is reduced to 3 there is no cycle. > > Of course I may have just shifted the problem into one of being able to > define what a "minimal" form is. :) indeed... Peter From dl at cs.oswego.edu Wed Feb 19 05:54:00 2014 From: dl at cs.oswego.edu (Doug Lea) Date: Wed, 19 Feb 2014 08:54:00 -0500 Subject: [jmm-dev] Enhanced Volatiles Message-ID: <5304B778.3040104@cs.oswego.edu> Just as an FYI, I submitted the JEP pasted below, that includes a few small updates reflecting feedback on openjdk lists. ... Title: Enhanced Volatiles Author: Doug Lea Organization: SUNY Oswego Created: 2014/01/06 Type: Feature State: Draft Exposure: Open Component: core/libs core/lang vm/rt Scope: JDK Discussion: core-libs-dev at openjdk.java.net compiler-dev at openjdk.java.net hotspot-dev at openjdk.java.net Start: 2014/Q1 Depends: JEP-188 Effort: M Duration: L Template: 1.0 Reviewed-by: Endorsed-by: Brian Goetz Summary ------- This JEP results in a means for programmers to invoke the equivalents of java.util.concurrent.atomic methods on object fields. Motivation ---------- As concurrent and parallel programming in Java continue to expand, programmers are increasingly frustrated by not being able to use Java constructions for arranging atomic or ordered operations for the fields of individual classes; for example atomically incrementing a "count" field. 
Until now the only ways to achieve these effects were to use a stand-alone AtomicInteger (adding both space overhead and additional concurrency issues to manage indirection) or, in some situations, to use atomic FieldUpdaters (often encountering more overhead than the operation itself), or to use JVM Unsafe intrinsics. Because intrinsics are preferable on performance grounds, their use has been increasingly common, to the detriment of safety and portability. Without this JEP, these problems are expected to become worse as atomic APIs expand to cover additional access consistency policies (aligned with the recent C++11 memory model) as part of Java Memory Model revisions. Description ----------- The target solution requires a syntax enhancement, a few library enhancements, and compiler support. We model the extended operations on volatile integers via an interface VolatileInt, that also captures the functionality of AtomicInteger (which will also be updated to reflect Java Memory Model revisions as part of this JEP). A tentative version is below. Similar interfaces are needed for other primitive and reference types. We then enable access to corresponding methods for fields using the ".volatile" prefix. For example: class Usage { volatile int count; int incrementCount() { return count.volatile.incrementAndGet(); } } The ".volatile" syntax is slightly unusual, but we are confident that it is syntactically unambiguous and semantically specifiable. New syntax is required to avoid ambiguities with existing usages, especially for volatile references -- invocations of methods on the reference versus the referent would be indistinguishable. The ".volatile" prefix introduces a scope for operations on these "L-values", not their retrieved contents. However, just using the prefix itself without a method invocation (as in "count.volatile;") would be meaningless and illegal. We also expect to allow volatile operations on array elements in addition to fields. 
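For comparison with the proposed `count.volatile.incrementAndGet()` form, here is the Usage example written with today's (real) java.util.concurrent.atomic.AtomicIntegerFieldUpdater API that the JEP text mentions -- a sketch of the idiom the ".volatile" syntax would replace, including the reflective setup and per-call receiver checks that account for its overhead.

```java
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// The current field-updater idiom the proposed ".volatile" syntax would
// replace. The updater is created reflectively once per class; the
// target field must be a volatile int accessible to the caller, and
// every call pays a dynamic receiver/access check -- part of the
// overhead the JEP text mentions.
public class Usage {
    volatile int count;

    private static final AtomicIntegerFieldUpdater<Usage> COUNT =
        AtomicIntegerFieldUpdater.newUpdater(Usage.class, "count");

    // Under the JEP this would be: return count.volatile.incrementAndGet();
    public int incrementCount() {
        return COUNT.incrementAndGet(this);
    }
}
```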
Enforcement of semantic restrictions (for example attempted usages for "final" fields) will require compiler support. The main task is to translate these calls into corresponding JVM intrinsics. The most likely option is for the source compiler to use method handles. This and other techniques are known to suffice, but are subject to further exploration. Minor enhancements to intrinsics and a few additional JDK library methods may also be needed. Here is a tentative VolatileInt interface. Those for other types are similar. The final released versions will surely differ, subject to the results of JEP-188. interface VolatileInt { int get(); int getRelaxed(); int getAcquire(); int getSequential(); void set(int x); void setRelaxed(int x); void setRelease(int x); void setSequential(int x); int getAndSet(int x); boolean compareAndSet(int e, int x); boolean compareAndSetAcquire(int e, int x); boolean compareAndSetRelease(int e, int x); boolean weakCompareAndSet(int e, int x); boolean weakCompareAndSetAcquire(int e, int x); boolean weakCompareAndSetRelease(int e, int x); int getAndAdd(int x); int addAndGet(int x); int getAndIncrement(); int incrementAndGet(); int getAndDecrement(); int decrementAndGet(); } This proposal focuses on the control of atomicity and ordering for single variables. We expect the resulting specifications to be amenable for extension in natural ways for additional primitive-like value types, if they are ever defined for Java. However, it is not a general-purpose transaction mechanism for controlling accesses and updates to multiple variables. Alternative forms for expressing and implementing such constructions may be explored in the course of this JEP, and may be the subject of further JEPs. Alternatives ------------ We considered instead introducing new forms of "value type" that support volatile operations. However, this would be inconsistent with properties of other types, and would also require more effort for programmers to use. 
We also considered expanding reliance on java.util.concurrent.atomic FieldUpdaters, but their dynamic overhead and usage limitations make them unsuitable. Several other alternatives (including those based on field references) have been raised and dismissed as unworkable on syntactic, efficiency, and/or usability grounds over the many years that these issues have been discussed. Risks and Assumptions --------------------- We are confident of feasibility. However, we expect that it will require more experimentation to arrive at compilation techniques that result in efficient enough implementation for routine use in the performance-critical contexts where these constructs are most often needed. The use of method handles may be impacted by and may impact JVM method handle support. Impact ------ A large number of usages in java.util.concurrent (and a few elsewhere in JDK) could be simplified and updated to use this support. From ajeffrey at bell-labs.com Wed Feb 19 12:07:02 2014 From: ajeffrey at bell-labs.com (Alan Jeffrey) Date: Wed, 19 Feb 2014 14:07:02 -0600 Subject: [jmm-dev] LTL specification of relaxed memory Message-ID: <53050EE6.3050507@bell-labs.com> Hi everyone, I've been messing around with proving the DRF theorem for a Mazurkiewicz trace model of relaxed memory. I'm pretty close to convincing the Agda theorem prover that this is true; most of the time has been spent coming up with good definitions. I think the definitions are in a state worth sharing... Recall that a Mazurkiewicz trace model consists of an alphabet Sigma, with a binary "independence" relation I. This induces an equivalence ~ on Sigma^* given as the smallest congruence containing: ab ~ ba (for any (a,b) in I) We can define a variant of past time Linear Temporal Logic, whose semantics is given as subsets of Sigma^*. 
The usual operators of LTL are:

  epsilon not in (prev phi)
  sa in (prev phi) whenever s in phi
  epsilon in (wprev phi)
  sa in (wprev phi) whenever s in phi
  (always phi) = (phi and wprev(always phi))
  (sometime phi) = (phi or prev(sometime phi))
  (phi since psi) = (psi or (phi and prev(phi since psi)))

The interesting new operator is a permutation operator:

  s in (permute phi) whenever t in phi for some s ~ t

From permute we can define a "previous state up to permutation" as:

  (pprev phi) = exists(a) (a and permute((not a) since (a and prev(phi))))

Unpacking this: sa in (pprev phi) whenever sa ~ t a u for some t and u, where a does not occur in u and t is in phi. LTL can be used to specify the relaxed memory model we're interested in (I think). Making use of two new binary relations on Sigma:

  C thought of as "read-write conflict"
  J thought of as "read-write justification"

the canonical example being:

  (W x=v, W x=w) in C
  (R x=v, W x=w) in C
  (W x=v, R x=w) in C
  (W x=v, R x=v) in J

The LTL spec for sequential consistency is:

  start = wprev false
  justified(a) = start or (not(C(a)) since J(a))
  sconsistent = always forall(a) (a implies prev(justified(a)))

Unpacking this...
* start is only true on the empty trace epsilon.
* justified(a) is true if either a is the initial action or we can find a past action b which justifies a, and there is no action c between a and b in conflict with a.
* sconsistent is true if every action is preceded by a justifier.

After all this, the LTL spec for relaxed consistency is:

  rconsistent = always forall(a) (a implies pprev(justified(a)))

that is, the only difference between sequential consistency and relaxed consistency is whether we use "prev" (previous state) or "pprev" (previous state up to permutation). 
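As a sanity check, the sconsistent condition can be mechanised for straight-line read/write traces. The sketch below uses an ad-hoc Java encoding and one possible reading of the relations above (a read conflicts only with writes of a different value to the same variable, and only reads require justification); it rejects the canonical relaxed-memory trace s while accepting an equivalent permutation of it.

```java
import java.util.List;

// A sketch (invented encoding, one reading of C and J) mechanising the
// sconsistent condition for read/write actions: every read must be
// preceded by a justifying write of the same value to the same
// variable, with no conflicting write to that variable in between --
// the backwards reading of (not(C(a)) since J(a)).
public class SConsistent {
    record Act(int thread, char kind, String loc, int val) {} // kind: 'R' or 'W'

    // One reading of C: a read conflicts with writes of a different
    // value to the same variable.
    static boolean conflicts(Act read, Act c) {
        return c.kind() == 'W' && c.loc().equals(read.loc()) && c.val() != read.val();
    }

    // J: a write justifies a read of the same variable and value.
    static boolean justifies(Act w, Act read) {
        return w.kind() == 'W' && w.loc().equals(read.loc()) && w.val() == read.val();
    }

    public static boolean sconsistent(List<Act> trace) {
        for (int i = 0; i < trace.size(); i++) {
            Act a = trace.get(i);
            if (a.kind() != 'R') continue;
            boolean justified = false;
            for (int j = i - 1; j >= 0; j--) {   // scan backwards from the read
                Act b = trace.get(j);
                if (justifies(b, a)) { justified = true; break; }
                if (conflicts(a, b)) break;      // conflict found before any justifier
            }
            if (!justified) return false;
        }
        return true;
    }

    // The canonical relaxed-memory trace s, and a permutation of it
    // (equivalent under independence) that is sequentially consistent.
    public static final List<Act> s = List.of(
        new Act(1, 'W', "x", 0), new Act(1, 'W', "x", 1), new Act(1, 'W', "y", 1),
        new Act(2, 'R', "y", 1), new Act(2, 'R', "x", 0));
    public static final List<Act> sPermuted = List.of(
        new Act(1, 'W', "x", 0), new Act(2, 'R', "x", 0), new Act(1, 'W', "x", 1),
        new Act(1, 'W', "y", 1), new Act(2, 'R', "y", 1));
}
```

On s, the read (2: R x=0) finds the conflicting (1: W x=1) before its justifier (1: W x=0), so the check fails; on the permuted trace every read is immediately justified.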
For example, the canonical trace for relaxed memory is: s = (1: W x=0) (1: W x=1) (1: W y=1) (2: R y=1) (2: R x=0) We can double-check that s not in sconsistent (since the justifier for the action (2: R x=0) is (1: W x=0) but there is an intervening action (1: W x=1) which is in conflict with (2: R x=0)). On the other hand s is in rconsistent, since we have: s ~ (1: W x=0) (2: R x=0) (1: W x=1) (1: W y=1) (2: R y=1) and so (1: W x=0) can act as the justifier up to permutation. Just to finish off the problem spec, we can define the DRF property as: datarace = sometime exists(a) (a and pprev(C(a))) drf = sconsistent implies not datarace so the problem is to find conditions on P such that if [P implies drf] and [P implies rconsistent] then [P implies sconsistent]. Alan. From dl at cs.oswego.edu Wed Feb 19 16:56:35 2014 From: dl at cs.oswego.edu (Doug Lea) Date: Wed, 19 Feb 2014 19:56:35 -0500 Subject: [jmm-dev] LTL specification of relaxed memory In-Reply-To: <53050EE6.3050507@bell-labs.com> References: <53050EE6.3050507@bell-labs.com> Message-ID: <530552C3.5010008@cs.oswego.edu> On 02/19/2014 03:07 PM, Alan Jeffrey wrote: > datarace = sometime exists(a) (a and pprev(C(a))) > drf = sconsistent implies not datarace > > so the problem is to find conditions on P such that if [P implies drf] and [P > implies rconsistent] then [P implies sconsistent]. > Where, as a first step, the conditions amount to some representation of lock-based access? -Doug From ajeffrey at bell-labs.com Thu Feb 20 07:34:40 2014 From: ajeffrey at bell-labs.com (Alan Jeffrey) Date: Thu, 20 Feb 2014 09:34:40 -0600 Subject: [jmm-dev] LTL specification of relaxed memory In-Reply-To: <530552C3.5010008@cs.oswego.edu> References: <53050EE6.3050507@bell-labs.com> <530552C3.5010008@cs.oswego.edu> Message-ID: <53062090.9020505@bell-labs.com> Locking, volatiles, etc. should be treated by the independence and justification relations. 
For example a simple model of locks would be something like: (m: (un)lock p) I (n: (un)lock q) when m != n and p != q (m: (un)lock p) I (n: R/W x=v) when m != n (m: lock p) J (n: unlock p) (m: unlock p) J (n: lock p) (init) J (n: lock p) I'm hoping that there's a separation of concerns here, where the DRF theorem can be proved for any suitable I, J and C, and that a variety of memory models can be investigated by varying I, J and C. A. On 02/19/2014 06:56 PM, Doug Lea wrote: > On 02/19/2014 03:07 PM, Alan Jeffrey wrote: > >> datarace = sometime exists(a) (a and pprev(C(a))) >> drf = sconsistent implies not datarace >> >> so the problem is find conditions on P such that if [P implies drf] >> and [P >> implies rconsistent] then [P implies sconsistent]. >> > > Where, as a first step, the conditions amount to some representation > of lock-based access? > > -Doug > > From luc.maranget at inria.fr Thu Feb 20 08:50:00 2014 From: luc.maranget at inria.fr (Luc Maranget) Date: Thu, 20 Feb 2014 17:50:00 +0100 Subject: [jmm-dev] thin-air summary In-Reply-To: References: Message-ID: <20140220165000.GA29036@yquem.inria.fr> Dear all, We have extended our litmus testing infrastructure so as to handle C11 (small) programs. 
I have just run Mark and Peter's examples on one ARM system (DragonBoard, running some old Android) with an experimental gcc cross-compiler with -O2 (arm-linux-gnueabi-gcc (GCC) 4.9.0 20140213 (experimental)). We observe exactly the results predicted by Peter in his note on the first five examples (we cannot handle the sixth example yet):

                         |Kind  | APQ8060
-------------------------|------|--------------
LB                       |Allow | Ok, 332/100M
LB+datas                 |Forbid| Ok, 0/100M
LB+ctrl+data+po          |Allow | Ok, 2/100M
LB+ctrl+data+ctrl-double |Allow | Ok, 6/100M
LB+ctrl+data+ctrl-single |Forbid| Ok, 0/100M

--Luc > Dear all, > > Mark Batty and I have written a short note trying to summarise the > thin-air problem as crisply as we can: > > http://www.cl.cam.ac.uk/~pes20/cpp/notes42.html > > Comments welcome, of course. We've also been thinking here about > possible approaches; hopefully we'll have another note about that in a > few days. > > Peter -- Luc From ajeffrey at bell-labs.com Thu Feb 20 15:17:33 2014 From: ajeffrey at bell-labs.com (Alan Jeffrey) Date: Thu, 20 Feb 2014 17:17:33 -0600 Subject: [jmm-dev] LTL specification of relaxed memory In-Reply-To: <53062090.9020505@bell-labs.com> References: <53050EE6.3050507@bell-labs.com> <530552C3.5010008@cs.oswego.edu> <53062090.9020505@bell-labs.com> Message-ID: <53068D0D.1040401@bell-labs.com> As promised, I now have a proof of the DRF theorem using the LTL formulation of DRF and relaxed consistency. The proof has gone through the Agda proof checker. I needed an auxiliary definition, of "compatible action". Define: b is compatible with a whenever (a I c) implies (b I c) for any c and (a C c) implies (b C c) for any c Note that if b is compatible with a and sa has a data race, then sb has a data race. 
The requirements on I, J and C are pretty tame: * I is symmetric and irreflexive * C is symmetric * if b in J(a) and c in C(a) then c in C(b) The result is that for any set of traces S which satisfies the following conditions: * S is prefix-closed (that is if sa in S then s in S) * S is justification-enabled (that is if sa in S then sb in S for some b compatible with a and b is justified by s) * S is DRF (that is any sequentially consistent s has no data race) * S is relaxed consistent we have: * S is sequentially consistent Note that there is no notion of commitment or having to use multiple executions for justification. The next step is to check this definition against the torture test... A. On 02/20/2014 09:34 AM, Alan Jeffrey wrote: > Locking, volatiles, etc. should be treated by the independence and > justification relations. For example a simple model of locks would be > something like: > > (m: (un)lock p) I (n: (un)lock q) when m != n and p != q > (m: (un)lock p) I (n: R/W x=v) when m != n > (m: lock p) J (n: unlock p) > (m: unlock p) J (n: lock p) > (init) J (n: lock p) > > I'm hoping that there's a separation of concerns here, where the DRF > theorem can be proved for any suitable I, J and C, and that a variety of > memory models can be investigated by varying I, J and C. > > A. > > On 02/19/2014 06:56 PM, Doug Lea wrote: >> On 02/19/2014 03:07 PM, Alan Jeffrey wrote: >> >>> datarace = sometime exists(a) (a and pprev(C(a))) >>> drf = sconsistent implies not datarace >>> >>> so the problem is find conditions on P such that if [P implies drf] >>> and [P >>> implies rconsistent] then [P implies sconsistent]. >>> >> >> Where, as a first step, the conditions amount to some representation >> of lock-based access? 
>> >> -Doug >> >> From dl at cs.oswego.edu Sat Feb 22 07:59:21 2014 From: dl at cs.oswego.edu (Doug Lea) Date: Sat, 22 Feb 2014 10:59:21 -0500 Subject: [jmm-dev] Sequential Consistency Message-ID: <5308C959.80502@cs.oswego.edu> Another in the continuing series of issues to contemplate: There's a tension between those who believe that all "correct" programs are provably sequentially consistent versus those who consider sequential consistency as a goal only of lock-based programs; not necessarily of those using lock-free techniques and/or are components of distributed systems. (see for example Herlihy & Shavit's "The Art of Multiprocessor Programming" http://store.elsevier.com/The-Art-of-Multiprocessor-Programming/Maurice-Herlihy/isbn-9780080569581/) No one disagrees about the need for a memory model guaranteeing that DRF lock-based programs are sequentially consistent. Other cases may be less clear cut. For the most famous example: Can a program using non-lock-based techniques (for example, using Java volatile loads/stores) be "correct" if it fails some variant of the IRIW test? Is IRIW conformance an unnecessary action-at-a-distance by-product of SC, or does it play some intrinsically useful role in assuring correctness? IRIW is not the only example of a case in which SC imposes conditions that some programmers in some contexts seem not to care about. But it is most famous because it so clearly impacts the nature and cost of mappings (for various modes of load, store, and CAS) on some existing processors as well as potential mappings on future processors. I won't yet try to summarize different positions and rationales, but for now just invite further discussion. -Doug PS: As a reminder, here's IRIW. 
Given global x, y: Thread 1: x = 1; Thread 2: y = 1; Thread 3: r1 = x; r2 = y; // sees r1 == 1, r2 == 0 Thread 4: r3 = y; r4 = x; // sees r3 == 1, r4 == 0 From jeremymanson at google.com Sat Feb 22 11:58:01 2014 From: jeremymanson at google.com (Jeremy Manson) Date: Sat, 22 Feb 2014 11:58:01 -0800 Subject: [jmm-dev] Sequential Consistency In-Reply-To: <5308C959.80502@cs.oswego.edu> References: <5308C959.80502@cs.oswego.edu> Message-ID: On Sat, Feb 22, 2014 at 7:59 AM, Doug Lea
wrote: > > Another in the continuing series of issues to contemplate: > > There's a tension between those who believe that all "correct" > programs are provably sequentially consistent versus those who > consider sequential consistency as a goal only of lock-based programs; > not necessarily of those using lock-free techniques and/or are > components of distributed systems. (see for example Herlihy & Shavit's > "The Art of Multiprocessor Programming" > http://store.elsevier.com/The-Art-of-Multiprocessor- > Programming/Maurice-Herlihy/isbn-9780080569581/) > > Who falls into the first category? A "correct" program is one where the behavior matches the spec, and if that can be done with non-SC behavior (which it often can), then the conversation is over. I think the major limiting factor for volatiles and atomics supporting SC (which is how I read what you are asking) is whether it can be done reasonably (i.e., with acceptable performance) on the target platforms. If it can, then for everyone's sanity (and in keeping with the desire for Java to have somewhat accessible semantics for stuff like this), it makes sense to specify them as being SC. If it can't, then (IMO) the IRIW-alike idioms are few and far between enough that it makes no sense to try to decrease everyone's performance to support SC for them. Jeremy From john.r.rose at oracle.com Sat Feb 22 14:55:11 2014 From: john.r.rose at oracle.com (John Rose) Date: Sat, 22 Feb 2014 14:55:11 -0800 Subject: [jmm-dev] Sequential Consistency In-Reply-To: <5308C959.80502@cs.oswego.edu> References: <5308C959.80502@cs.oswego.edu> Message-ID: <326E34A2-0606-4100-BA43-42A3DAF4700E@oracle.com> On Feb 22, 2014, at 7:59 AM, Doug Lea
wrote: > IRIW conformance an unnecessary action-at-a-distance > by-product of SC, or does it play some intrinsically useful role in > assuring correctness It sounds like you are asking someone to speak up for the usefulness of SC when the two bits of test state (global x, y) are at an arbitrarily large "distance". Though I am not well-read on this stuff, I'll venture related questions that seem relevant to the JMM and that appealing idea of distance. If x and y are related because they represent something coherent, SC could act as a fail-safe after everybody loses track of their relation. It would be best not to lose track, though. What are the cases where there is a sufficiently small "distance" that programmers would want SC? One example would be an actor (variables used only by one thread). Another would be an object under a mutex. Or the two variables ("globals") are not under a mutex but are in the same object (cache line?) being racily used. Or the variables are related more tenuously but all the threads agree in a single safely published access path to both. Is there an idea of "distance" or locality for the JMM that would be useful to programmers? Would it be useful to provide programmers SC within a singly-rooted connected subgraph of Java heap nodes? Can we define such subgraphs in a way that is not sensitive to mutations in the connecting references? Can we use ideas of reference variables which are immutable (time invariant) or have some monotonicity (set-once)? Or does the research demonstrate IRIW-like anomalies for all sorts of "distances"? -- 
John From aleksey.shipilev at oracle.com Sun Feb 23 01:06:42 2014 From: aleksey.shipilev at oracle.com (Aleksey Shipilev) Date: Sun, 23 Feb 2014 13:06:42 +0400 Subject: [jmm-dev] Sequential Consistency In-Reply-To: <5308C959.80502@cs.oswego.edu> References: <5308C959.80502@cs.oswego.edu> Message-ID: <5309BA22.9090900@oracle.com> On 02/22/2014 07:59 PM, Doug Lea wrote: > Other cases may be less clear cut. For the most famous example: Can a > program using non-lock-based techniques (for example, using Java > volatile loads/stores) be "correct" if it fails some variant of the IRIW > test? Is IRIW conformance an unnecessary action-at-a-distance > by-product of SC, or does it play some intrinsically useful role in > assuring correctness? IMO, we are on thin ice here. The absence of counter-examples showing how non-SC behaviors for IRIW-like constructions demolish correctness at larger scale does not mean we won't find such a case after the spec solidifies. In other words, absence of evidence is not evidence of absence. I, for one, would not like to wake up to another double-checked-locking-like calamity because we allowed a particular sneaky behavior in the name of performance. And yes, being the performance guy, I still think strong correctness wins over performance ten times over. The relaxations are welcome, but only in a few very constrained places, where you are able to relatively easily fix/rewrite the bad usages or even provide stronger ad-hoc semantics. In other words, the things you allow in a library (e.g. Linux RCU) are not the things you want to burn into a language spec. > IRIW is not the only example of a case in which SC imposes conditions > that some programmers in some contexts seem not to care about. But > it is most famous because it so clearly impacts the nature and cost of > mappings (for various modes of load, store, and CAS) on some existing > processors as well as potential mappings on future processors. 
Being the language guy, I think the hardware not being able to provide the sane SC primitives should pay up the costs. The hardware which makes it relatively easy to implement the non-tricky language memory model should be in the sweet spot. -Aleksey. From dl at cs.oswego.edu Sun Feb 23 05:59:41 2014 From: dl at cs.oswego.edu (Doug Lea) Date: Sun, 23 Feb 2014 08:59:41 -0500 Subject: [jmm-dev] Sequential Consistency In-Reply-To: <5308C959.80502@cs.oswego.edu> References: <5308C959.80502@cs.oswego.edu> Message-ID: <5309FECD.3000502@cs.oswego.edu> On 02/22/2014 10:59 AM, Doug Lea wrote: > I won't yet try to summarize different positions and rationales, > but for now just invite further discussion. That was too cowardly. Here's a shot at summarizing some of the historical context. > PS: As a reminder, here's IRIW. Given global x, y: > Thread 1: x = 1; > Thread 2: y = 1; > Thread 3: r1 = x; r2 = y; // sees r1 == 1, r2 == 0 > Thread 4: r3 = y; r4 = x; // sees r3 == 1, r4 == 0 (This outcome is not allowed by SC.) The IRIW example is a fun one in part because it is not especially intuitive. Some people do not at first think that it is a result forced by SC. I occasionally present this in courses, and most students' first reaction is that you should use a common lock in all threads if you want to ensure agreement about order of x and y here. The fact that you don't need to strikes some (but by no means all) people as a magical/spooky property of SC. This example (and variants of it) was also among those first driving research into more efficient distributed multicast protocols in the late 80's/early 90's (when I first encountered consistency policies and protocols). Maintaining this property of SC is much more expensive in a distributed setting than other consistency policies that are sufficient to implement most distributed algorithms. 
SC normally requires blocking on O(#hosts) round-trips per message in the absence of failure, and heavy (and fallible) failure-recovery mechanics. Other policies, including "causal broadcast" (guaranteeing only transitivity of read-write happens-before in producer-consumer chains) usually don't need to wait out all the round-trips (but still require buffering). While the situation is a little better for multiprocessor/multicore designers, it is not surprising that they occasionally propose (as did AMD and then Intel five years or so ago) schemes that are by default weaker (but still with full-SC modes). Arguments for not giving in to the whinings of implementors include those claiming that uniform SC requirements enable better tools, simpler proofs of correctness, more understandable models, and the reduction of counterintuitive orderings. And that no single "natural" property has emerged to replace it, despite a fair amount of trying. -Doug From boehm at acm.org Sun Feb 23 22:52:42 2014 From: boehm at acm.org (Hans Boehm) Date: Sun, 23 Feb 2014 22:52:42 -0800 Subject: [jmm-dev] Sequential Consistency In-Reply-To: <5309FECD.3000502@cs.oswego.edu> References: <5308C959.80502@cs.oswego.edu> <5309FECD.3000502@cs.oswego.edu> Message-ID: I think it's that last comment here that needs to be emphasized: We don't really have a viable candidate property to replace SC, that's anywhere near as easy to reason about and provides significant performance advantages. Several people, including Doug, looked hard for such things when we were talking about C++. As far as I can tell, everyone intuitively wants to reason about thread behavior in terms of interleaving thread actions, possibly after allowing some reorderings within threads. IRIW seems inherently incompatible with that, which might be a partial explanation of why it's difficult to reason directly with consistency properties that allow it. Hans On Sun, Feb 23, 2014 at 5:59 AM, Doug Lea
wrote: > On 02/22/2014 10:59 AM, Doug Lea wrote: > >> I won't yet try to summarize different positions and rationales, >> but for now just invite further discussion. >> > > That was too cowardly. Here's a shot at summarizing some of the > historical context. > > PS: As a reminder, here's IRIW. Given global x, y: >> Thread 1: x = 1; >> Thread 2: y = 1; >> Thread 3: r1 = x; r2 = y; // sees r1 == 1, r2 == 0 >> Thread 4: r3 = y; r4 = x; // sees r3 == 1, r4 == 0 >> > > (This outcome is not allowed by SC.) > > The IRIW example is a fun one in part because it is not especially > intuitive. Some people do not at first think that it is a result > forced by SC. I occasionally present this in courses, and most > students' first reaction is that you should use a common lock in all > threads if you want to ensure agreement about order of x and y > here. The fact that you don't need to strikes some (but by no means > all) people as a magical/spooky property of SC. > > This example (and variants of it) was also among those first driving > research into more efficient distributed multicast protocols in the > late 80's/early 90's (when I first encountered consistency policies > and protocols). Maintaining this property of SC is much more > expensive in a distributed setting than other consistency policies > that are sufficient to implement most distributed algorithms. SC > normally requires blocking on O(#hosts) round-trips per message in the > absence of failure, and heavy (and fallible) failure-recovery > mechanics. Other policies, including "causal broadcast" (guaranteeing > only transitivity of read-write happens-before in producer-consumer > chains) usually don't need to wait out all the round-trips (but still > require buffering). 
While the situation is a little better for > multiprocessor/multicore designers, it is not surprising that they > occasionally propose (as did AMD and then Intel five years or so ago) > schemes that are by default weaker (but still with full-SC modes). > > Arguments for not giving in to the whinings of implementors include > those claiming that uniform SC requirements enable better tools, > simpler proofs of correctness, more understandable models, and the > reduction of counterintuitive orderings. And that no single "natural" > property has emerged to replace it, despite a fair amount of trying. > > -Doug > > From dl at cs.oswego.edu Mon Feb 24 05:00:08 2014 From: dl at cs.oswego.edu (Doug Lea) Date: Mon, 24 Feb 2014 08:00:08 -0500 Subject: [jmm-dev] Sequential Consistency In-Reply-To: References: <5308C959.80502@cs.oswego.edu> <5309FECD.3000502@cs.oswego.edu> Message-ID: <530B4258.4030308@cs.oswego.edu> On 02/24/2014 01:52 AM, Hans Boehm wrote: > I think it's that last comment here that needs to be emphasized: We don't really > have a viable candidate property to replace SC, that's anywhere near as easy to > reason about and provides significant performance advantages. Several people, > including Doug, looked hard for such things when we were talking about C++. > Yes (plus similar explorations for X10, and distributed consistency). We are pretty sure that there is no good substitute for requiring SC for lock-based programs. I think the main issue at hand is how far SC applies. We cannot require SC for all uses of mode-based/fenced/volatile accesses, because some sets of usages clearly are not SC. The audience of people using them seem happy to rely only on specs of ordering constraints. So it may suffice to just leave it at that. Although people do need to know which usages are emergently SC, so that they can for example build locks, which may require some special care in specification. 
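[Editorial sketch, not part of the original thread.] The remark above about building locks out of modeful/fenced accesses has a standard litmus-test face: Dekker-style mutual exclusion depends on store-to-load ordering, i.e. on forbidding the store-buffering (SB) outcome r1 == r2 == 0, which SC forbids but weaker orderings permit. A toy enumerator (the name `sb_outcomes_under_sc` and the encoding are illustrative, not from the thread) that walks every SC interleaving and collects the outcomes:

```python
from itertools import permutations

def sb_outcomes_under_sc():
    """Enumerate all SC interleavings of the store-buffering litmus:
       T1: x = 1; r1 = y      T2: y = 1; r2 = x   (x = y = 0 initially)
    and return the set of (r1, r2) outcomes."""
    progs = {1: [("W", "x"), ("R", "y")],
             2: [("W", "y"), ("R", "x")]}
    outcomes = set()
    # Each distinct permutation of thread tags is one SC total order;
    # each thread's two actions are consumed in program order.
    for order in set(permutations([1, 1, 2, 2])):
        mem = {"x": 0, "y": 0}
        regs, idx = {}, {1: 0, 2: 0}
        for t in order:
            kind, var = progs[t][idx[t]]
            idx[t] += 1
            if kind == "W":
                mem[var] = 1
            else:
                regs[t] = mem[var]  # read the current value of var
        outcomes.add((regs[1], regs[2]))
    return outcomes
```

Under SC the result is {(0, 1), (1, 0), (1, 1)}: the (0, 0) outcome, both threads slipping past each other into a Dekker critical section, never appears. Hardware store buffers do allow (0, 0) for plain stores and loads, which is exactly why lock implementations need SC-strength (or explicitly fenced) accesses at this point.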
This is just a slightly different perspective on similar issues and decisions in C/C++11. Among the differences is that we have "legacy" mode-less default volatile load/store, for which it is not clear that requiring uniform SC guarantees (versus only for get/set Sequential) would be doing anyone a favor. And not clear that it wouldn't. While I'm at it... > On Sun, Feb 23, 2014 at 5:59 AM, Doug Lea
> wrote: > > The IRIW example is a fun one in part because it is not especially > intuitive. Some people do not at first think that it is a result > forced by SC. I occasionally present this in courses, and most > students' first reaction is that you should use a common lock in all > threads if you want to ensure agreement about order of x and y > here. The fact that you don't need to strikes some (but by no means > all) people as a magical/spooky property of SC. A caveat: When I've done this in courses, there's usually some student who tries to exploit this to avoid locks/sync in some programming project. But never correctly -- the example does not seem to generalize in any useful way. In fact, I have never seen a program where SC-IRIW matters, so arguably, most people are better off not even knowing about it :-) -Doug From ajeffrey at bell-labs.com Mon Feb 24 09:19:49 2014 From: ajeffrey at bell-labs.com (Alan Jeffrey) Date: Mon, 24 Feb 2014 11:19:49 -0600 Subject: [jmm-dev] Sequential Consistency In-Reply-To: References: <5308C959.80502@cs.oswego.edu> <5309FECD.3000502@cs.oswego.edu> Message-ID: <530B7F35.2070502@bell-labs.com> The LTL formulation of relaxed consistency does validate IRIW. The interesting trace is:

(1: W x=1) (2: W y=1) (3: R x=1) (3: R y=0) (4: R y=1) (4: R x=0)

The reason why this trace is relaxed consistent is that each action can be justified by a different permutation of the actions before it. In particular, the action (4: R x=0) can be justified by the permutation:

(2: W y=1) (3: R x=1) (3: R y=0) (4: R y=1) (4: R x=0) (1: W x=1)

and the action (3: R y=0) can be justified by the permutation:

(1: W x=1) (3: R x=1) (3: R y=0) (2: W y=1)

So there are models based on interleaved actions and reorderings that validate IRIW, but crucially different reorderings are used to justify different read actions. I'm not going to try to claim that LTL with permutations is as easy to reason about as SC though! A. 
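[Editorial sketch, not part of the original thread.] The justification condition in Alan's message can be made concrete with a small checker. This sketches only the per-read condition — a read is justified by a permutation of actions if, scanning the permutation, the most recent prior write to the read's variable (or the initial value 0 when there is none) matches the value read — and is not the full LTL formulation; the `justifies` helper and the tuple encoding are illustrative inventions:

```python
def justifies(perm, read):
    """True if `read` -- a tuple (thread, kind, var, val) with kind "R" --
    sees, in the total order `perm`, the latest prior write to its
    variable (initial value 0 if no write precedes it)."""
    _, kind, var, val = read
    assert kind == "R"
    latest = 0  # every variable starts at 0
    for act in perm:
        if act == read:
            return latest == val
        if act[1] == "W" and act[2] == var:
            latest = act[3]
    return False  # `read` does not occur in `perm`

# The six IRIW actions from the trace in the message above.
Wx  = (1, "W", "x", 1); Wy  = (2, "W", "y", 1)
R3x = (3, "R", "x", 1); R3y = (3, "R", "y", 0)
R4y = (4, "R", "y", 1); R4x = (4, "R", "x", 0)
```

Checking the two permutations cited in the message: moving (1: W x=1) to the end justifies (4: R x=0), and moving (2: W y=1) after thread 3's reads justifies (3: R y=0) — while no single order does both, which is the "crucially different reorderings" point:

```python
assert justifies([Wy, R3x, R3y, R4y, R4x, Wx], R4x)
assert justifies([Wx, R3x, R3y, Wy], R3y)
assert not justifies([Wx, Wy, R3x, R3y, R4y, R4x], R4x)
```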
On 02/24/2014 12:52 AM, Hans Boehm wrote: > I think it's that last comment here that needs to be emphasized: We don't > really have a viable candidate property to replace SC, that's anywhere near > as easy to reason about and provides significant performance advantages. > Several people, including Doug, looked hard for such things when we were > talking about C++. > > As far as I can tell, everyone intuitively wants to reason about thread > behavior in terms of interleaving thread actions, possibly after allowing > some reorderings within threads. IRIW seems inherently incompatible with > that, which might be a partial explanation of why it's difficult to reason > directly with consistency properties that allow it. > > Hans > > > On Sun, Feb 23, 2014 at 5:59 AM, Doug Lea
wrote: > >> On 02/22/2014 10:59 AM, Doug Lea wrote: >> >>> I won't yet try to summarize different positions and rationales, >>> but for now just invite further discussion. >>> >> >> That was too cowardly. Here's a shot at summarizing some of the >> historical context. >> >> PS: As a reminder, here's IRIW. Given global x, y: >>> Thread 1: x = 1; >>> Thread 2: y = 1; >>> Thread 3: r1 = x; r2 = y; // sees r1 == 1, r2 == 0 >>> Thread 4: r3 = y; r4 = x; // sees r3 == 1, r4 == 0 >>> >> >> (This outcome is not allowed by SC.) >> >> The IRIW example is a fun one in part because it is not especially >> intuitive. Some people do not at first think that it is a result >> forced by SC. I occasionally present this in courses, and most >> students' first reaction is that you should use a common lock in all >> threads if you want to ensure agreement about order of x and y >> here. The fact that you don't need to strikes some (but by no means >> all) people as a magical/spooky property of SC. >> >> This example (and variants of it) was also among those first driving >> research into more efficient distributed multicast protocols in the >> late 80's/early 90's (when I first encountered consistency policies >> and protocols). Maintaining this property of SC is much more >> expensive in a distributed setting than other consistency policies >> that are sufficient to implement most distributed algorithms. SC >> normally requires blocking on O(#hosts) round-trips per message in the >> absence of failure, and heavy (and fallible) failure-recovery >> mechanics. Other policies, including "causal broadcast" (guaranteeing >> only transitivity of read-write happens-before in producer-consumer >> chains) usually don't need to wait out all the round-trips (but still >> require buffering). 
While the situation is a little better for >> multiprocessor/multicore designers, it is not surprising that they >> occasionally propose (as did AMD and then Intel five years or so ago) >> schemes that are by default weaker (but still with full-SC modes). >> >> Arguments for not giving in to the whinings of implementors include >> those claiming that uniform SC requirements enable better tools, >> simpler proofs of correctness, more understandable models, and the >> reduction of counterintuitive orderings. And that no single "natural" >> property has emerged to replace it, despite a fair amount of trying. >> >> -Doug >> >> From paulmck at linux.vnet.ibm.com Mon Feb 24 14:44:25 2014 From: paulmck at linux.vnet.ibm.com (Paul E. McKenney) Date: Mon, 24 Feb 2014 14:44:25 -0800 Subject: [jmm-dev] Sequential Consistency In-Reply-To: References: <5308C959.80502@cs.oswego.edu> Message-ID: <20140224224425.GD8264@linux.vnet.ibm.com> On Sat, Feb 22, 2014 at 11:58:01AM -0800, Jeremy Manson wrote: > On Sat, Feb 22, 2014 at 7:59 AM, Doug Lea
wrote: > > > > > Another in the continuing series of issues to contemplate: > > > > There's a tension between those who believe that all "correct" > > programs are provably sequentially consistent versus those who > > consider sequential consistency as a goal only of lock-based programs; > > not necessarily of those using lock-free techniques and/or are > > components of distributed systems. (see for example Herlihy & Shavit's > > "The Art of Multiprocessor Programming" > > http://store.elsevier.com/The-Art-of-Multiprocessor- > > Programming/Maurice-Herlihy/isbn-9780080569581/) > > > > > Who falls into the first category? A "correct" program is one where the > behavior matches the spec, and if that can be done with non-SC behavior > (which it often can), then the conversation is over. Hear, hear! ;-) > I think the major limiting factor for volatiles and atomics supporting SC > (which is how I read what you are asking) is whether it can be done > reasonably (i.e., with acceptable performance) on the target platforms. If > it can, then for everyone's sanity (and in keeping with the desire for Java > to have somewhat accessible semantics for stuff like this), it makes sense > to specify them as being SC. If it can't, then (IMO) the IRIW-alike idioms > are few and far between enough that it makes no sense to try to decrease > everyone's performance to support SC for them. This is my experience as well -- I have seen very few actual algorithms that relied on SC. Thanx, Paul From paulmck at linux.vnet.ibm.com Mon Feb 24 14:54:21 2014 From: paulmck at linux.vnet.ibm.com (Paul E. 
McKenney) Date: Mon, 24 Feb 2014 14:54:21 -0800 Subject: [jmm-dev] Sequential Consistency In-Reply-To: <5309BA22.9090900@oracle.com> References: <5308C959.80502@cs.oswego.edu> <5309BA22.9090900@oracle.com> Message-ID: <20140224225421.GE8264@linux.vnet.ibm.com> On Sun, Feb 23, 2014 at 01:06:42PM +0400, Aleksey Shipilev wrote: > On 02/22/2014 07:59 PM, Doug Lea wrote: > > Other cases may be less clear cut. For the most famous example: Can a > > program using non-lock-based techniques (for example, using Java > > volatile loads/stores) be "correct" if it fails some variant of the IRIW > > test? Is IRIW conformance an unnecessary action-at-a-distance > > by-product of SC, or does it play some intrinsically useful role in > > assuring correctness? > > IMO, we are on thin ice here. The absence of counter-examples of how > non-SC behaviors for IRIW-like constructions demolish correctness at a > larger scale does not mean we wouldn't find a case where it breaks > badly in the future, when the spec solidifies. In other words, absence of > evidence is not evidence of absence. > > I, for one, would not like to wake up to another > double-checked-locking-like calamity because we allowed a particular > sneaky behavior in the name of performance. And yes, being the > performance guy, I still think strong correctness wins over performance > ten times over. > > The relaxations are welcome, but only in a few very constrained places, > where you are able to relatively easily fix/rewrite the bad usages or even > provide stronger ad-hoc semantics. In other words, the things you allow > in a library (e.g. Linux RCU) are not the things you want to burn into a > language spec. Hmmm... On the one hand, use of SC is no substitute for carefully designed APIs that are easy to use. Some of my ugliest bugs in my Linux-kernel work would not be helped by SC -- they involved very conservative fully locked code. 
On the other hand, if you are using non-SC primitives, then you had better have a really carefully designed heavily stress-tested API. A proof of correctness wouldn't hurt either. ;-) > > IRIW is not the only example of a case in which SC imposes conditions > > that some programmers in some contexts seem not to care about. But > > it is most famous because it so clearly impacts the nature and cost of > > mappings (for various modes of load, store, and CAS) on some existing > > processors as well as potential mappings on future processors. > > Being the language guy, I think the hardware not being able to provide > the sane SC primitives should pay up the costs. The hardware which makes > it relatively easy to implement the non-tricky language memory model > should be in the sweet spot. All hardware I know of has a non-trivial penalty for its SC primitives, so there is a place for non-SC algorithms. Thanx, Paul From paulmck at linux.vnet.ibm.com Mon Feb 24 14:58:30 2014 From: paulmck at linux.vnet.ibm.com (Paul E. McKenney) Date: Mon, 24 Feb 2014 14:58:30 -0800 Subject: [jmm-dev] Sequential Consistency In-Reply-To: <530B4258.4030308@cs.oswego.edu> References: <5308C959.80502@cs.oswego.edu> <5309FECD.3000502@cs.oswego.edu> <530B4258.4030308@cs.oswego.edu> Message-ID: <20140224225830.GF8264@linux.vnet.ibm.com> On Mon, Feb 24, 2014 at 08:00:08AM -0500, Doug Lea wrote: > On 02/24/2014 01:52 AM, Hans Boehm wrote: > >I think it's that last comment here that needs to be emphasized: We don't really > >have a viable candidate property to replace SC, that's anywhere near as easy to > >reason about and provides significant performance advantages. Several people, > >including Doug, looked hard for such things when we were talking about C++. > > Yes (plus similar explorations for X10, and distributed consistency). > We are pretty sure that there is no good substitute for requiring > SC for lock-based programs. I think the main issue at hand is > how far SC applies. 
We cannot require SC for all uses of > mode-based/fenced/volatile accesses, because some sets of > usages clearly are not SC. The audience of people using them > seem happy to rely only on specs of ordering constraints. > So it may suffice to just leave it at that. Although people > do need to know which usages are emergently SC, so that they > can for example build locks, which may require some special > care in specification. > > This is just a slightly different perspective on similar issues > and decisions in C/C++11. Among the differences is that we > have "legacy" mode-less default volatile load/store, for > which it is not clear that requiring uniform SC guarantees > (versus only for get/set Sequential) would be doing anyone > a favor. And not clear that it wouldn't. Even I have come to grudgingly accept that SC is a reasonable default. But I definitely would not want to give up weaker modes. Something about needing my code to perform and scale well. ;-) > While I'm at it... > > >On Sun, Feb 23, 2014 at 5:59 AM, Doug Lea
>> wrote: > > > > The IRIW example is a fun one in part because it is not especially > > intuitive. Some people do not at first think that it is a result > > forced by SC. I occasionally present this in courses, and most > > students' first reaction is that you should use a common lock in all > > threads if you want to ensure agreement about order of x and y > > here. The fact that you don't need to strikes some (but by no means > > all) people as a magical/spooky property of SC. > > A caveat: When I've done this in courses, there's usually > some student who tries to exploit this to avoid locks/sync > in some programming project. But never correctly -- the example > does not seem to generalize in any useful way. In fact, I have > never seen a program where SC-IRIW matters, so arguably, > most people are better off not even knowing about it :-) I heard a rumor that some work-stealing task scheduler relied on SC-IRIW, but never have been able to track it down. Even if someone does track it down, I would argue that it is the exception that proves the rule. ;-) Thanx, Paul From paulmck at linux.vnet.ibm.com Mon Feb 24 16:20:20 2014 From: paulmck at linux.vnet.ibm.com (Paul E. McKenney) Date: Mon, 24 Feb 2014 16:20:20 -0800 Subject: [jmm-dev] stores In-Reply-To: <5301430E.9010009@cs.oswego.edu> References: <5301430E.9010009@cs.oswego.edu> Message-ID: <20140225002020.GI8264@linux.vnet.ibm.com> On Sun, Feb 16, 2014 at 06:00:30PM -0500, Doug Lea wrote: > > Memory models can generate a fair amount of excitement. > See Linus Torvalds's post on the linux kernel list: > https://lkml.org/lkml/2014/2/14/492 > and follow-ups with Paul McKenney. (Condolences!) Heh! The fun continues... > I don't think this introduces anything new with respect > to JMM9 discussions so far though. In general, speculative > stores and out-of-thin-air reads break basic safety properties. > Although there still might be some related open cases > about inserted stores, including "redundant" ones. 
As in: if (x != 0) x = 0; ==> x = 0; ? One possibly interesting thing from later in the LKML discussion, though I am not sure that it maps into the Java final-field model -- you guys can be the judge of that. I will present it in C just for definiteness.

T1: p = &nondefault_gp;
    p->a = 42;
    atomic_store_release(&gp, p);

T2: p = atomic_load_explicit(&gp, memory_order_consume);
    if (p != &default_gp) {
        do_something_with(p);
        return;
    }
    r1 = p->a; /* At this point, the compiler knows p == &default_gp. */

If this particular execution has only &default_gp and &nondefault_gp as values for gp, are we guaranteed that r1==42? It would be given the current wording in the C11 and C++11 standards. Assuming that this example even makes sense in the context of Java final fields... Thanx, Paul From dl at cs.oswego.edu Tue Feb 25 04:26:45 2014 From: dl at cs.oswego.edu (Doug Lea) Date: Tue, 25 Feb 2014 07:26:45 -0500 Subject: [jmm-dev] Sequential Consistency In-Reply-To: <20140224224425.GD8264@linux.vnet.ibm.com> References: <5308C959.80502@cs.oswego.edu> <20140224224425.GD8264@linux.vnet.ibm.com> Message-ID: <530C8C05.5070008@cs.oswego.edu> On 02/24/2014 05:44 PM, Paul E. McKenney wrote: > This is my experience as well -- I have seen very few actual algorithms > that relied on SC. This seems to be the attitude of almost all developers of non-lock-based algorithms: Explicit ordering constraints are critical, but program-wide SC is not. Which is nearly opposite to almost every developer's view of lock-based programs: any ordering is OK so long as SC is maintained. One place these different views meet up is when creating locks out of non-blocking primitives. So there must be guaranteed ways of achieving SC using modeful/fenced accesses. Beyond that, the problem seems underconstrained. I'm not sure that litmus-test-style examples will suffice to provide an answer. 
When you are not dealing with locks, it seems that for every odd consequence of some non-SC rule, you can find an equally odd one for an SC-based rule. For example, Ali Sezgin (who is on this list) has written up some especially bizarre sequentially consistent examples in: Sezgin, Ali, and Ganesh Gopalakrishnan. "On the definition of sequential consistency." Information processing letters 2005. http://www.cs.utah.edu/formal_verification/publications/june2013update/dblp/2005/2/j23.pdf -Doug From paulmck at linux.vnet.ibm.com Tue Feb 25 10:53:08 2014 From: paulmck at linux.vnet.ibm.com (Paul E. McKenney) Date: Tue, 25 Feb 2014 10:53:08 -0800 Subject: [jmm-dev] Sequential Consistency In-Reply-To: <530C8C05.5070008@cs.oswego.edu> References: <5308C959.80502@cs.oswego.edu> <20140224224425.GD8264@linux.vnet.ibm.com> <530C8C05.5070008@cs.oswego.edu> Message-ID: <20140225185307.GV8264@linux.vnet.ibm.com> On Tue, Feb 25, 2014 at 07:26:45AM -0500, Doug Lea wrote: > On 02/24/2014 05:44 PM, Paul E. McKenney wrote: > >This is my experience as well -- I have seen very few actual algorithms > >that relied on SC. > > This seems to be the attitude of almost all developers of > non-lock-based algorithms: Explicit ordering constraints are > critical, but program-wide SC is not. Which is nearly opposite > to almost every developer's view of lock-based programs: any ordering > is OK so long as SC is maintained. I am quite capable of maintaining both viewpoints internally. If I am using locks, I want the benefits of locking. When I am not using locks, I don't want to be forced to wear the locking straightjacket. ;-) > One place these different views meet up is when creating locks > out of non-blocking primitives. So there must be guaranteed > ways of achieving SC using modeful/fenced accesses. Yep. > Beyond that, the problem seems underconstrained. > > I'm not sure that litmus-test-style examples will suffice > to provide an answer. 
When you are not dealing with locks, > it seems that for every odd consequence of some non-SC rule, > you can find an equally odd one for an SC-based rule. > For example, Ali Sezgin (who is on this list) has written up some > especially bizarre sequentially consistent examples in: > Sezgin, Ali, and Ganesh Gopalakrishnan. "On the definition of > sequential consistency." Information processing letters 2005. > http://www.cs.utah.edu/formal_verification/publications/june2013update/dblp/2005/2/j23.pdf I had not seen this one before! Classic!!! ;-) Thanx, Paul
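[Editorial sketch, not part of the original thread.] The one claim threaded through all of these messages — that SC forbids the IRIW outcome r1==1, r2==0, r3==1, r4==0 — is small enough to verify by brute force: enumerate every interleaving of the six IRIW actions that respects per-thread program order and collect the outcomes. The function name `iriw_outcomes_under_sc` and the encoding are illustrative:

```python
from itertools import permutations

def iriw_outcomes_under_sc():
    """Enumerate every SC interleaving of IRIW (x = y = 0 initially):
       T1: x = 1            T2: y = 1
       T3: r1 = x; r2 = y   T4: r3 = y; r4 = x
    and return the set of (r1, r2, r3, r4) outcomes."""
    progs = {1: [("W", "x", None)],
             2: [("W", "y", None)],
             3: [("R", "x", "r1"), ("R", "y", "r2")],
             4: [("R", "y", "r3"), ("R", "x", "r4")]}
    outcomes = set()
    # Each distinct permutation of thread tags is one SC total order;
    # per-thread program order is preserved by consuming actions in sequence.
    for order in set(permutations([1, 2, 3, 3, 4, 4])):
        mem = {"x": 0, "y": 0}
        regs = {}
        idx = {t: 0 for t in progs}
        for t in order:
            kind, var, reg = progs[t][idx[t]]
            idx[t] += 1
            if kind == "W":
                mem[var] = 1
            else:
                regs[reg] = mem[var]  # read the current value of var
        outcomes.add((regs["r1"], regs["r2"], regs["r3"], regs["r4"]))
    return outcomes
```

Of the sixteen conceivable result tuples, exactly one — (1, 0, 1, 0), the two readers disagreeing about the order of the writes — never appears under SC; everything else, including (1, 1, 1, 1) and (0, 0, 0, 0), does.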