From david.holmes at oracle.com  Thu Nov 10 00:06:34 2016
From: david.holmes at oracle.com (David Holmes)
Date: Thu, 10 Nov 2016 10:06:34 +1000
Subject: [jmm-dev] Store completion query - general and ARM
Message-ID: <1cd582da-0211-8bc3-7c61-09bf6706e93c@oracle.com>

Does any part of the JMM require actual visibility/completion of 
volatile stores or is it only order that is defined (with an assumption 
that all stores will complete in a finite time)?

In relation to ARM specifically, Dekker style algorithms require 
visibility/completion of the store before the subsequent load, yet the 
example in "A Tutorial Introduction to the ARM and POWER Relaxed Memory 
Models" shows the use of DMB, not DSB. Yet AFAICS DMB says nothing about 
completion whereas DSB does. ?? (To be honest I find the Group A/B 
description of DMB properties extremely hard to actually interpret wrt 
code like Dekker.)

Thanks,
David

From aph at redhat.com  Thu Nov 10 09:07:15 2016
From: aph at redhat.com (Andrew Haley)
Date: Thu, 10 Nov 2016 09:07:15 +0000
Subject: [jmm-dev] Store completion query - general and ARM
In-Reply-To: <1cd582da-0211-8bc3-7c61-09bf6706e93c@oracle.com>
References: <1cd582da-0211-8bc3-7c61-09bf6706e93c@oracle.com>
Message-ID: <9211c13c-e227-1545-f87c-893c1d6fff90@redhat.com>

On 10/11/16 00:06, David Holmes wrote:
> Does any part of the JMM require actual visibility/completion of 
> volatile stores or is it only order that is defined (with an assumption 
> that all stores will complete in a finite time)?

Ordering is really all that we've got: all that memory fences can do
is ensure that visibility of loads and stores is ordered in some way.

> In relation to ARM specifically, Dekker style algorithms require 
> visibility/completion of the store before the subsequent load, yet the 
> example in "A Tutorial Introduction to the ARM and POWER Relaxed Memory 
> Models" shows the use of DMB, not DSB.

DMB is fine for that.  Dekker doesn't need a store to be forced out of
the caches, only that the store be made visible to other processors
before any operations later in program order.
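
A minimal Java sketch of the store-buffering (Dekker-style) shape under
discussion, assuming both flags are declared volatile.  The class and
method names are hypothetical.  It only relies on the JMM guarantee that
volatile accesses are sequentially consistent, so the outcome
r1 == 0 && r2 == 0 is forbidden; it says nothing about which barrier
instructions a particular JVM emits on ARM.

```java
final class StoreBufferingSketch {
    // volatile accesses are sequentially consistent under the JMM
    static volatile int x, y;
    // plain fields: written by the workers, read only after join()
    static int r1, r2;

    // Runs the litmus test repeatedly and counts forbidden outcomes.
    static int countForbidden(int iterations) {
        int forbidden = 0;
        for (int i = 0; i < iterations; i++) {
            x = 0;
            y = 0;
            Thread t1 = new Thread(() -> { x = 1; r1 = y; });
            Thread t2 = new Thread(() -> { y = 1; r2 = x; });
            t1.start();
            t2.start();
            try {
                t1.join();
                t2.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
            // Each thread stores its own flag, then loads the other one.
            // With volatile flags at least one thread must see the store.
            if (r1 == 0 && r2 == 0) {
                forbidden++;
            }
        }
        return forbidden;
    }

    public static void main(String[] args) {
        System.out.println("forbidden outcomes: " + countForbidden(2000));
    }
}
```

If the volatile modifier is dropped, the both-zero outcome becomes legal
under the JMM, whether or not a given run happens to show it.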

> Yet AFAICS DMB says nothing about completion whereas DSB does. ??
> (To be honest I find the Group A/B description of DMB properties
> extremely hard to actually interpret wrt code like Dekker.)

DSB is only really needed if there are multiple caches of the same
address, i.e. Icache and Dcache: it's necessary to force a store out
into main memory in order to refresh the Icache.

Andrew.

From david.holmes at oracle.com  Thu Nov 10 09:20:01 2016
From: david.holmes at oracle.com (David Holmes)
Date: Thu, 10 Nov 2016 19:20:01 +1000
Subject: [jmm-dev] Store completion query - general and ARM
In-Reply-To: <9211c13c-e227-1545-f87c-893c1d6fff90@redhat.com>
References: <1cd582da-0211-8bc3-7c61-09bf6706e93c@oracle.com>
	<9211c13c-e227-1545-f87c-893c1d6fff90@redhat.com>
Message-ID: <93af1932-b873-c9e7-d1b5-0832eb5605b9@oracle.com>

On 10/11/2016 7:07 PM, Andrew Haley wrote:
> On 10/11/16 00:06, David Holmes wrote:
>> Does any part of the JMM require actual visibility/completion of
>> volatile stores or is it only order that is defined (with an assumption
>> that all stores will complete in a finite time)?
>
> Ordering is really all that we've got: all that memory fences can do
> is ensure that visibility of loads and stores is ordered in some way.

If we establish some global order of loads and stores, yes. That can in 
turn require that a store become visible prior to a given load.

>> In relation to ARM specifically, Dekker style algorithms require
>> visibility/completion of the store before the subsequent load, yet the
>> example in "A Tutorial Introduction to the ARM and POWER Relaxed Memory
>> Models" shows the use of DMB, not DSB.
>
> DMB is fine for that.  Dekker doesn't need a store to be forced out of
> the caches, only that the store be made visible to other processors
> before any operations later in program order.

Again it is far from obvious to me that DMB causes the store to be 
visible before any operations later in program order. I find the Group A 
/ Group B formulation (and even the definition of "observe") to be quite 
obscure and hard to map to actual code behaviour.

>> Yet AFAICS DMB says nothing about completion whereas DSB does. ??
>> (To be honest I find the Group A/B description of DMB properties
>> extremely hard to actually interpret wrt code like Dekker.)
>
> DSB is only really needed if there are multiple caches of the same
> address, i.e. Icache and Dcache: it's necessary to force a store out
> into main memory in order to refresh the Icache.

I thought only ISB had an effect relative to instructions/i-cache ??

Thanks,
David

> Andrew.
>

From aph at redhat.com  Thu Nov 10 09:31:09 2016
From: aph at redhat.com (Andrew Haley)
Date: Thu, 10 Nov 2016 09:31:09 +0000
Subject: [jmm-dev] Store completion query - general and ARM
In-Reply-To: <93af1932-b873-c9e7-d1b5-0832eb5605b9@oracle.com>
References: <1cd582da-0211-8bc3-7c61-09bf6706e93c@oracle.com>
	<9211c13c-e227-1545-f87c-893c1d6fff90@redhat.com>
	<93af1932-b873-c9e7-d1b5-0832eb5605b9@oracle.com>
Message-ID: <65def6c7-d17b-51f6-0493-115152b55f2c@redhat.com>

On 10/11/16 09:20, David Holmes wrote:
> On 10/11/2016 7:07 PM, Andrew Haley wrote:
>> On 10/11/16 00:06, David Holmes wrote:
>>> Does any part of the JMM require actual visibility/completion of
>>> volatile stores or is it only order that is defined (with an assumption
>>> that all stores will complete in a finite time)?
>>
>> Ordering is really all that we've got: all that memory fences can do
>> is ensure that visibility of loads and stores is ordered in some way.
> 
> If we establish some global order of loads and stores, yes. That can in 
> turn require that a store become visible prior to a given load.

I agree.

>>> In relation to ARM specifically, Dekker style algorithms require
>>> visibility/completion of the store before the subsequent load, yet the
>>> example in "A Tutorial Introduction to the ARM and POWER Relaxed Memory
>>> Models" shows the use of DMB, not DSB.
>>
>> DMB is fine for that.  Dekker doesn't need a store to be forced out of
>> the caches, only that the store be made visible to other processors
>> before any operations later in program order.
> 
> Again it is far from obvious to me that DMB causes the store to be 
> visible before any operations later in program order. I find the Group A 
> / Group B formulation (and even the definition of "observe") to be quite 
> obscure and hard to map to actual code behaviour.

Indeed.  The real problem is that ARM are trying to describe the
memory model in an abstract way that does not overly constrain
implementations.  But a DMB really is sufficient to ensure that prior
stores are visible.  (Mind you, we don't need DMB to get sequentially-
consistent behaviour that's enough for Java volatiles.)

>>> Yet AFAICS DMB says nothing about completion whereas DSB does. ??
>>> (To be honest I find the Group A/B description of DMB properties
>>> extremely hard to actually interpret wrt code like Dekker.)
>>
>> DSB is only really needed if there are multiple caches of the same
>> address, i.e. Icache and Dcache: it's necessary to force a store out
>> into main memory in order to refresh the Icache.
> 
> I thought only ISB had an effect relative to instructions/i-cache ??

It does: you need DSB to ensure the visibility of the data cleaned
from the Dcache, then ISB to synchronize the fetched instruction
stream.

Andrew.


From Peter.Sewell at cl.cam.ac.uk  Thu Nov 10 09:54:00 2016
From: Peter.Sewell at cl.cam.ac.uk (Peter Sewell)
Date: Thu, 10 Nov 2016 09:54:00 +0000
Subject: [jmm-dev] Store completion query - general and ARM
In-Reply-To: <65def6c7-d17b-51f6-0493-115152b55f2c@redhat.com>
References: <1cd582da-0211-8bc3-7c61-09bf6706e93c@oracle.com>
	<9211c13c-e227-1545-f87c-893c1d6fff90@redhat.com>
	<93af1932-b873-c9e7-d1b5-0832eb5605b9@oracle.com>
	<65def6c7-d17b-51f6-0493-115152b55f2c@redhat.com>
Message-ID: <CAHWkzRQCaok5ny_b7p4f2CDSRfpqoqYDR9XXooyjNDcLR3K3Xw@mail.gmail.com>

If you want a more rigorous and concrete model that explains this, you
might want to look at:
http://www.cl.cam.ac.uk/~pes20/popl16-armv8/top.pdf

The associated tool: www.cl.cam.ac.uk/~pes20/AArch64/
lets one run arbitrary model-allowed executions of tests:
- select AArch64 test Tutorial/SB+dmb.sys
- under Options, select "a larger set of transitions which we proved can be
taken eagerly"
- click Run, then it shows the initial state of the model with the initial
transitions highlighted in green
- take all the thread-local transitions of each thread (five each)
- now you can see each thread's write, dmb, and read request in the
"flowing model" abstract storage subsystem
- in this model, the dmb sys keeps the write and read request in order as
they flow down to memory, so no interleaving of the possible model
transitions can break the Dekker's algorithm property.

In this example, it happens that the read requests also can't be satisfied
from writes that haven't hit main memory, but in general they can be
satisfied earlier.

For contrast, if you try the SB test without dmb, you'll see many more
possible executions.

This flowing model is actually a bit more microarchitectural than one would
like for an architectural spec, as it exposes the abstract interconnect
topology.  The POP model, also provided by that tool, abstracts from the
topology.  Both are principally due to Shaked Flur, cc'd.

best,
Peter



From david.holmes at oracle.com  Thu Nov 10 20:49:45 2016
From: david.holmes at oracle.com (David Holmes)
Date: Fri, 11 Nov 2016 06:49:45 +1000
Subject: [jmm-dev] Store completion query - general and ARM
In-Reply-To: <CAHWkzRQCaok5ny_b7p4f2CDSRfpqoqYDR9XXooyjNDcLR3K3Xw@mail.gmail.com>
References: <1cd582da-0211-8bc3-7c61-09bf6706e93c@oracle.com>
	<9211c13c-e227-1545-f87c-893c1d6fff90@redhat.com>
	<93af1932-b873-c9e7-d1b5-0832eb5605b9@oracle.com>
	<65def6c7-d17b-51f6-0493-115152b55f2c@redhat.com>
	<CAHWkzRQCaok5ny_b7p4f2CDSRfpqoqYDR9XXooyjNDcLR3K3Xw@mail.gmail.com>
Message-ID: <589e836f-0c87-d64c-2196-0ca4d65e2968@oracle.com>

On 10/11/2016 7:54 PM, Peter Sewell wrote:
> If you want a more rigorous and concrete model that explains this, you
> might want to look at:
> http://www.cl.cam.ac.uk/~pes20/popl16-armv8/top.pdf

Thanks Peter, I will take a look at this.

David


From david.holmes at oracle.com  Thu Nov 10 20:52:51 2016
From: david.holmes at oracle.com (David Holmes)
Date: Fri, 11 Nov 2016 06:52:51 +1000
Subject: [jmm-dev] Store completion query - general and ARM
In-Reply-To: <65def6c7-d17b-51f6-0493-115152b55f2c@redhat.com>
References: <1cd582da-0211-8bc3-7c61-09bf6706e93c@oracle.com>
	<9211c13c-e227-1545-f87c-893c1d6fff90@redhat.com>
	<93af1932-b873-c9e7-d1b5-0832eb5605b9@oracle.com>
	<65def6c7-d17b-51f6-0493-115152b55f2c@redhat.com>
Message-ID: <76c7fbfd-da6a-9a9c-28ca-c1824c6c030f@oracle.com>


On 10/11/2016 7:31 PM, Andrew Haley wrote:
> On 10/11/16 09:20, David Holmes wrote:
>> On 10/11/2016 7:07 PM, Andrew Haley wrote:
>>> On 10/11/16 00:06, David Holmes wrote:
>>>> Does any part of the JMM require actual visibility/completion of
>>>> volatile stores or is it only order that is defined (with an assumption
>>>> that all stores will complete in a finite time)?
>>>
>>> Ordering is really all that we've got: all that memory fences can do
>>> is ensure that visibility of loads and stores is ordered in some way.
>>
>> If we establish some global order of loads and stores, yes. That can in
>> turn require that a store become visible prior to a given load.
>
> I agree.
>
>>>> In relation to ARM specifically, Dekker style algorithms require
>>>> visibility/completion of the store before the subsequent load, yet the
>>>> example in "A Tutorial Introduction to the ARM and POWER Relaxed Memory
>>>> Models" shows the use of DMB, not DSB.
>>>
>>> DMB is fine for that.  Dekker doesn't need a store to be forced out of
>>> the caches, only that the store be made visible to other processors
>>> before any operations later in program order.
>>
>> Again it is far from obvious to me that DMB causes the store to be
>> visible before any operations later in program order. I find the Group A
>> / Group B formulation (and even the definition of "observe") to be quite
>> obscure and hard to map to actual code behaviour.
>
> Indeed.  The real problem is that ARM are trying to describe the
> memory model in an abstract way that does not overly constrain

I just wish they had included the word "complete" or "visible" in that 
abstract description. :)

> implementations.  But a DMB really is sufficient to ensure that prior
> stores are visible.  (Mind you, we don't need DMB to get sequentially-
> consistent behaviour that's enough for Java volatiles.)

I was going to ask how that can be true, but then saw this in the paper 
Peter referenced:

"According to the ARM ARM, store-release is multicopy-
atomic when observed by load-acquires, a strong property
that conventional release-acquire semantics does not imply. Furthermore,
despite their names, these instructions are intended to be
used to implement the C11 sequentially consistent load and store."

That is new information to me, and somewhat surprising.

Thanks,
David


From aph at redhat.com  Tue Nov 15 11:43:27 2016
From: aph at redhat.com (Andrew Haley)
Date: Tue, 15 Nov 2016 11:43:27 +0000
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
Message-ID: <a8b7096e-530f-65fe-3607-28b5f7982c67@redhat.com>

http://g.oswego.edu/dl/jmm/cookbook.html says:

... the special final-field rule requiring a StoreStore barrier in
      x.finalField = v; StoreStore; sharedRef = x;

but http://www.hboehm.info/c++mm/no_write_fences.html says:

... it is also generally unsafe to restrict the release ordering
   constraint in thread 1 to only stores. To see this, consider what
   happens if the initialization of x also reads x

I am convinced by Hans Boehm's argument in the second reference, and I
believe that using only a StoreStore fence is too fragile unless you
disallow some optimizations.

Thread 1:

class X {
      int a;

      X() {
          a = 0;
          a++;
      }
}

void publish() {
    X x = new X();
}

Thread 2:

    x.a = 42;

This is safe enough at the Java level, but inlining of constructors at
the machine level means that it's hard to guarantee without a LoadStore
at the end of the constructor.  On AArch64 at least we have address
dependency ordering from a load to a memory op based on it, which is
adequate in this case, I think.

I'd prefer to simply have an adjudication that we need a release
barrier at the end of a constructor, but mostly I'd like some sort of
decision.

Thanks,

Andrew.

From aph at redhat.com  Tue Nov 15 13:53:22 2016
From: aph at redhat.com (Andrew Haley)
Date: Tue, 15 Nov 2016 13:53:22 +0000
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To: <a8b7096e-530f-65fe-3607-28b5f7982c67@redhat.com>
References: <a8b7096e-530f-65fe-3607-28b5f7982c67@redhat.com>
Message-ID: <6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>

It's been pointed out to me that my example doesn't have a final
field!  It would perhaps have been better not to provide an example,
so rather than muddy the water any further I'll let the question
stand.

Andrew.

From boehm at acm.org  Tue Nov 15 18:44:01 2016
From: boehm at acm.org (Hans Boehm)
Date: Tue, 15 Nov 2016 10:44:01 -0800
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To: <6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>
References: <a8b7096e-530f-65fe-3607-28b5f7982c67@redhat.com>
	<6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>
Message-ID: <CAPUmR1bBgs0ZChThN5cUqbAcCET+8s9UbV4PeULcQ6ZtBu-Vnw@mail.gmail.com>

I think this is actually OK for final fields, since no other thread can
write them, and hence reads in the constructor can't really see a write by
another thread.

I continue to believe that we should not generalize the final field
behavior to non-final fields, at least not without generalizing the
constructor barrier to also include LoadStore. Which I think means we're
kind of in agreement. If we did so, and programmers took advantage of that,
it would also mean that constructor() { non_final_field = 0; assert
non_final_field == 0; } could reasonably fail, which seems bad.

Generalizing final field memory ordering to non-final fields also has
optimization consequences on the reader side that we're still struggling
with for C++.

For example, on any flavor of ARM or Power, in

tmp = x;
...
tmp2 = y;
if (tmp == tmp2) {
    tmp3 = tmp2.a;
}

the last assignment can no longer be replaced by tmp3 = tmp.a, because that
wouldn't preserve ordering between the load of y and that of a. (I suspect
that such a replacement can be beneficial if the branch can be correctly
predicted, since tmp may be available earlier.)
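
The two forms of the load can be written side by side in Java.  This is
only a sketch with hypothetical names (DependencySketch, Node, viaTmp,
viaTmp2).  Both methods return the same value in any single-threaded
run.  The difference is that viaTmp2 carries an address dependency from
the load of y to the load of a, whereas viaTmp depends only on the
earlier load of x.

```java
final class DependencySketch {
    static final class Node {
        int a; // non-final, as in the generalized case being discussed
        Node(int a) { this.a = a; }
    }

    static Node x, y; // shared references, read racily

    // Original form: the load of a is address-dependent on the load of y.
    static int viaTmp2() {
        Node tmp = x;
        Node tmp2 = y;
        if (tmp == tmp2 && tmp2 != null) {
            return tmp2.a;
        }
        return -1;
    }

    // Rewritten form: the load of a now depends only on the load of x,
    // so hardware may perform it before the load of y completes.
    static int viaTmp() {
        Node tmp = x;
        Node tmp2 = y;
        if (tmp == tmp2 && tmp != null) {
            return tmp.a;
        }
        return -1;
    }
}
```

The null checks are an addition for the sketch so that it runs without a
NullPointerException when nothing has been published yet.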

Presumably similar rules already apply to final field optimization.  I have
no idea whether existing Java compilers actually make such distinctions.


From paulmck at linux.vnet.ibm.com  Tue Nov 15 19:12:48 2016
From: paulmck at linux.vnet.ibm.com (Paul E. McKenney)
Date: Tue, 15 Nov 2016 11:12:48 -0800
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To: <CAPUmR1bBgs0ZChThN5cUqbAcCET+8s9UbV4PeULcQ6ZtBu-Vnw@mail.gmail.com>
References: <a8b7096e-530f-65fe-3607-28b5f7982c67@redhat.com>
	<6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>
	<CAPUmR1bBgs0ZChThN5cUqbAcCET+8s9UbV4PeULcQ6ZtBu-Vnw@mail.gmail.com>
Message-ID: <20161115191248.GA3612@linux.vnet.ibm.com>

For whatever it might be worth, we made a similar change in the Linux
kernel some time back.  The rcu_assign_pointer() macro used to contain
a store-store fence, but was upgraded to a store-release of the new
pointer value about 3 years ago in the 3.15 release.
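
A rough Java analog of publishing a pointer with a store-release, as a
sketch only: it is not the kernel macro, and the names
ReleasePublishSketch, Config, publish and readSum are hypothetical.  It
assumes Java 9 or later, where AtomicReference provides setRelease and
getAcquire.

```java
import java.util.concurrent.atomic.AtomicReference;

final class ReleasePublishSketch {
    static final class Config {
        int a; // plain fields, initialized before publication
        int b;
    }

    static final AtomicReference<Config> current = new AtomicReference<>();

    // Initialize the object, then publish the reference with release
    // semantics: earlier loads and stores are ordered before the store.
    static void publish(int a, int b) {
        Config c = new Config();
        c.a = a;
        c.b = b;
        current.setRelease(c);
    }

    // Read the reference with acquire semantics, then read the fields.
    static int readSum() {
        Config c = current.getAcquire();
        return (c == null) ? -1 : c.a + c.b;
    }
}
```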

							Thanx, Paul



From dl at cs.oswego.edu  Tue Nov 15 20:19:22 2016
From: dl at cs.oswego.edu (Doug Lea)
Date: Tue, 15 Nov 2016 15:19:22 -0500
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To: <a8b7096e-530f-65fe-3607-28b5f7982c67@redhat.com>
References: <a8b7096e-530f-65fe-3607-28b5f7982c67@redhat.com>
Message-ID: <e6fa46e5-bdc7-9196-8675-413d9b2488ca@cs.oswego.edu>

On 11/15/2016 06:43 AM, Andrew Haley wrote:
> http://g.oswego.edu/dl/jmm/cookbook.html says:
>
> ... the special final-field rule requiring a StoreStore barrier in
>       x.finalField = v; StoreStore; sharedRef = x;

Note that the fence can be placed any time after the write of the final
field and before return from the constructor. In practice, all JVMs I
know place a fence immediately before return if any field is final,
covering this requirement in a simple way.  This includes odd cases like
programs that assign a final field twice in a constructor, which isn't illegal.
(Most people think it ought to be illegal, but too late for that.)

>
> but http://www.hboehm.info/c++mm/no_write_fences.html says:
>
> ... it is also generally unsafe to restrict the release ordering
>    constraint in thread 1 to only stores. To see this, consider what
>    happens if the initialization of x also reads x

As Andrew mentioned, this discussion is not about analogs of
final fields, but instead about cases where fields can be (re)-written
by consumers. As Hans stated in the subsequent section of that
document (and agreed to by others in a few brief exchanges about this on 
this list in 2014), "In this case, it appears to be safe ...".
Which is not to say devoid of all possible surprises.
But it is surely sufficient with respect to the basic
security issues that are the primary motivation for
special rules for final fields.
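
The base guarantee can be sketched in Java as follows.  The names
(FinalFieldPublicationSketch, Point, countTornReads) are hypothetical.
The reference is published through a deliberate data race, and the
sketch relies only on the JMM final-field rule: a reader that sees the
reference, and did not obtain it from a leaked "this", must see the
final fields as set in the constructor.

```java
final class FinalFieldPublicationSketch {
    static final class Point {
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    static Point shared; // deliberately non-volatile: publication is racy

    // Counts reads that saw the reference but not the constructed values.
    static int countTornReads(int attempts) {
        shared = null;
        int[] torn = new int[1];
        Thread reader = new Thread(() -> {
            for (int i = 0; i < attempts; i++) {
                Point p = shared; // racy read, may still be null
                if (p != null && (p.x != 1 || p.y != 2)) {
                    torn[0]++;
                }
            }
        });
        Thread writer = new Thread(() -> shared = new Point(1, 2));
        reader.start();
        writer.start();
        try {
            reader.join();
            writer.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return torn[0];
    }
}
```

The reader may never observe a non-null reference at all; the rule only
constrains what it sees if it does.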

[...omitted unrelated example...]

>
> I'd prefer to simply have an adjudication that we need a release
> barrier at the end of a constructor, but mostly I'd like some sort of
> decision.
>

If processors intrinsically performed a releaseFence whenever asked for
just a storeStoreFence, it might be defensible to simplify rules to
just say release here.  But this isn't so on ARM and possibly others.
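
The two candidate fences can be written out by hand with the static
fence methods on java.lang.invoke.VarHandle (Java 9 and later).  This is
a sketch with hypothetical class names, using non-final fields so that
the fence has to be explicit.  It shows where the fence sits, not what
any particular JVM emits for final fields.

```java
import java.lang.invoke.VarHandle;

final class ConstructorFenceSketch {
    static final class WithStoreStore {
        int value; // non-final on purpose
        WithStoreStore(int v) {
            value = v;
            // Stores before the fence are not reordered with stores
            // after it; earlier loads are not constrained.
            VarHandle.storeStoreFence();
        }
    }

    static final class WithRelease {
        int value; // non-final on purpose
        WithRelease(int v) {
            value = v;
            // Loads and stores before the fence are not reordered with
            // stores after it (StoreStore plus LoadStore).
            VarHandle.releaseFence();
        }
    }
}
```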

-Doug


From dl at cs.oswego.edu  Wed Nov 16 12:56:29 2016
From: dl at cs.oswego.edu (Doug Lea)
Date: Wed, 16 Nov 2016 07:56:29 -0500
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To: <CAPUmR1bBgs0ZChThN5cUqbAcCET+8s9UbV4PeULcQ6ZtBu-Vnw@mail.gmail.com>
References: <a8b7096e-530f-65fe-3607-28b5f7982c67@redhat.com>
	<6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>
	<CAPUmR1bBgs0ZChThN5cUqbAcCET+8s9UbV4PeULcQ6ZtBu-Vnw@mail.gmail.com>
Message-ID: <a1f8dca0-e9df-8229-fa0c-dff74b11f5ed@cs.oswego.edu>

On 11/15/2016 01:44 PM, Hans Boehm wrote:

> Generalizing final field memory ordering to non-final fields also has
> optimization consequences on the reader side that we're still struggling
> with for C++.
>
> For example, on any flavor of ARM or Power, in
>
> tmp = x;
> ...
> tmp2 = y;
> if (tmp == tmp2) {
>     tmp3 = tmp2.a;
> }
>
> the last assignment can no longer be replaced by tmp3 = tmp.a, because that
> wouldn't preserve ordering between the load of y and that of a. (I suspect
> that such a replacement can be beneficial if the branch can be correctly
> predicted, since tmp may be available earlier.)
>
> Presumably similar rules already apply to final field optimization.

If Tmp.a is final, both the tmp and tmp2 reads are possible only
after tmp.a is (finally) set, so the optimization is OK.
(This requires that there be no address speculation for "new" objects.
Otherwise all sorts of Java security properties would be broken.)

-Doug


From email at pitr.ch  Thu Nov 17 23:41:01 2016
From: email at pitr.ch (Petr Chalupa)
Date: Fri, 18 Nov 2016 00:41:01 +0100
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To: <a1f8dca0-e9df-8229-fa0c-dff74b11f5ed@cs.oswego.edu>
References: <a8b7096e-530f-65fe-3607-28b5f7982c67@redhat.com>
	<6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>
	<CAPUmR1bBgs0ZChThN5cUqbAcCET+8s9UbV4PeULcQ6ZtBu-Vnw@mail.gmail.com>
	<a1f8dca0-e9df-8229-fa0c-dff74b11f5ed@cs.oswego.edu>
Message-ID: <CAGv9LnPhTEC2mHUWE-mRM7-htZgLr4qqvntQkY=oBgUwzGkxRQ@mail.gmail.com>

Hello,

If there is only a StoreStore barrier at the end of a constructor, then
the following code concerns me:

// Thread 1:

class X {
      static X instance;
      final int a;
      int b;

      X() {
          a = 0;
          a++;
          b = 10;
          a += b; // could read 42?
      }
}

void publish() {
    X.instance = new X();
}

// Thread 2:
X.instance.b = 42;

Could the read of b in the constructor see 42? If it can, a StoreLoad
might be required as well.
Could you confirm this, or explain where my reasoning is wrong? Thanks.

Best regards,
Petr Chalupa


From dl at cs.oswego.edu  Fri Nov 18 00:12:22 2016
From: dl at cs.oswego.edu (Doug Lea)
Date: Thu, 17 Nov 2016 19:12:22 -0500
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To: <CAGv9LnPhTEC2mHUWE-mRM7-htZgLr4qqvntQkY=oBgUwzGkxRQ@mail.gmail.com>
References: <a8b7096e-530f-65fe-3607-28b5f7982c67@redhat.com>
	<6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>
	<CAPUmR1bBgs0ZChThN5cUqbAcCET+8s9UbV4PeULcQ6ZtBu-Vnw@mail.gmail.com>
	<a1f8dca0-e9df-8229-fa0c-dff74b11f5ed@cs.oswego.edu>
	<CAGv9LnPhTEC2mHUWE-mRM7-htZgLr4qqvntQkY=oBgUwzGkxRQ@mail.gmail.com>
Message-ID: <f178c06e-ac27-c920-dc99-aaec3f54523b@cs.oswego.edu>

On 11/17/2016 06:41 PM, Petr Chalupa wrote:
> Hello,
>
> If there is only StoreStore barrier at the end of a constructor then
> following code concerns me:

There are several ill-advised things people can do in constructors that
cause the base final field guarantee to be useless. Most famously,
publishing "this" before assigning the field.

static C global;
class C {
   final int a;
   C (int a) { global = this; this.a = a; }
}

And as your example shows, initializing a final with the result of
a computation reading a non-final field is also a bad idea.
There are probably others too, all of which one hopes any concurrent
programmer can see are too crazy to do. (And which good tools would
help point out.)
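
A minimal sketch of the contrast, assuming a hypothetical Registry.global
holder and class names (Leaky, Safe) that are not from the thread. Leaky
repeats the pattern above; Safe assigns the final field first and lets the
reference escape only after the constructor has returned:

```java
// Hypothetical names throughout; this is an illustration, not code from the thread.
final class Registry {
    // Stands in for any globally reachable, non-volatile location.
    static Object global;
}

final class Leaky {
    final int a;

    Leaky(int a) {
        Registry.global = this; // "this" escapes before the final field is assigned
        this.a = a;             // a racing reader of Registry.global may still see a == 0
    }
}

final class Safe {
    final int a;

    private Safe(int a) {
        this.a = a;             // nothing escapes during construction
    }

    // The reference is stored only after the constructor has returned.
    static Safe createAndPublish(int a) {
        Safe s = new Safe(a);
        Registry.global = s;
        return s;
    }
}

final class PublicationDemo {
    public static void main(String[] args) {
        Safe s = Safe.createAndPublish(7);
        System.out.println(s.a);
    }
}
```

Only the Safe shape keeps the final field guarantee meaningful for a reader
that picks the reference up through a data race.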

-Doug



From email at pitr.ch  Sun Nov 20 21:37:20 2016
From: email at pitr.ch (Petr Chalupa)
Date: Sun, 20 Nov 2016 22:37:20 +0100
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To: <f178c06e-ac27-c920-dc99-aaec3f54523b@cs.oswego.edu>
References: <a8b7096e-530f-65fe-3607-28b5f7982c67@redhat.com>
	<6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>
	<CAPUmR1bBgs0ZChThN5cUqbAcCET+8s9UbV4PeULcQ6ZtBu-Vnw@mail.gmail.com>
	<a1f8dca0-e9df-8229-fa0c-dff74b11f5ed@cs.oswego.edu>
	<CAGv9LnPhTEC2mHUWE-mRM7-htZgLr4qqvntQkY=oBgUwzGkxRQ@mail.gmail.com>
	<f178c06e-ac27-c920-dc99-aaec3f54523b@cs.oswego.edu>
Message-ID: <CAGv9LnO1UP6DVcXMDiNKMiPvertHH=HM4n4nG9F-y2Da1StghA@mail.gmail.com>

Thanks, I did not realise this is actually ill-advised. The read of b is
racy so it should be no surprise that the value based on it in the final
field can differ.

However, I remembered http://www.hboehm.info/c++mm/why_undef.html and thought
of a way to change the example so that it may be problematic again.

Consider a constructor with the following body:

final T a;
T b;

X() {
    T local = computeAValue();
    b = local;
    doMoreOtherThings();
    // b never modified, it is equal to local
    a = local; // line A
}

Could a compiler decide to optimise line A into a read of the same value from
b (introducing a racy read) rather than from the local variable, in order to
save space? What am I missing? What prevents a compiler from doing an
optimisation like that?

Best regards,
Petr Chalupa


From boehm at acm.org  Sun Nov 20 23:36:54 2016
From: boehm at acm.org (Hans Boehm)
Date: Sun, 20 Nov 2016 15:36:54 -0800
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To: <a1f8dca0-e9df-8229-fa0c-dff74b11f5ed@cs.oswego.edu>
References: <a8b7096e-530f-65fe-3607-28b5f7982c67@redhat.com>
	<6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>
	<CAPUmR1bBgs0ZChThN5cUqbAcCET+8s9UbV4PeULcQ6ZtBu-Vnw@mail.gmail.com>
	<a1f8dca0-e9df-8229-fa0c-dff74b11f5ed@cs.oswego.edu>
Message-ID: <CAPUmR1Z9CDGdHYhMz0wz7_L9+rqFRBaZS2rJdZYAJnwXe5trQw@mail.gmail.com>

On Wed, Nov 16, 2016 at 4:56 AM, Doug Lea <dl at cs.oswego.edu> wrote:

> On 11/15/2016 01:44 PM, Hans Boehm wrote:
>
>> Generalizing final field memory ordering to non-final fields also has
>> optimization consequences on the reader side that we're still struggling
>> with for C++.
>>
>> For example, on any flavor of ARM or Power, in
>>
>> tmp = x;
>> ...
>> tmp2 = y;
>> if (tmp == tmp2) {
>>     tmp3 = tmp2.a;
>> }
>>
>> the last assignment can no longer be replaced by tmp3 = tmp.a, because
>> that
>> wouldn't preserve ordering between the load of y and that of a. (I suspect
>> that such a replacement can be beneficial if the branch can be correctly
>> predicted, since tmp may be available earlier.)
>>
>> Presumably similar rules already apply to final field optimization.
>>
>
> If Tmp.a is final, both the tmp and tmp2 reads are possible only
> after tmp.a is (finally) set, so the optimization is OK.
> (This requires that there be no address speculation for "new" objects.
> Otherwise all sorts of Java security properties would be broken.)
>
> Is that correct?

Consider the case in which x is written before the constructor setting a
finishes, i.e. before the freeze action/fence, and y is set after the
constructor finishes.  I don't see how the transformation ensures that (in
the absence of a null pointer exception) the read of a still sees the
initialized value. (Recall that there is no longer an address dependency
from the load of y to the load of a after the transformation, though there
was before.)  But it looks to me like 17.5.1 says that the read of a should
see the initialized value, though I'm not positive about my reading.  And I
have a vague recollection that Jeremy's original proposal may have allowed
the read of a to see zero at this point?
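
One way to write that scenario down in Java, as a sketch. The names Tmp, x, y
and a follow the earlier fragment; the writer()/reader() methods and the early
"x = this" inside the constructor are assumptions made for illustration:

```java
// Illustrative only: x is written before the constructor finishes,
// y is written after it has returned.
final class Tmp {
    static Tmp x;
    static Tmp y;

    final int a;

    Tmp(int a) {
        x = this;      // early publication, before the freeze at constructor exit
        this.a = a;
    }

    // Writer thread: y is assigned only after the constructor has returned.
    static void writer() {
        y = new Tmp(1);
    }

    // Reader thread: the source reads a through tmp2, the value loaded from y.
    // The transformation under discussion would read it through tmp instead.
    static int reader() {
        Tmp tmp = x;
        Tmp tmp2 = y;
        if (tmp != null && tmp == tmp2) {
            return tmp2.a;
        }
        return -1;     // references differ or nothing has been published yet
    }
}
```

Run single-threaded this always returns the initialized value; the question is
only about what a racing reader may observe once tmp2.a is replaced by tmp.a.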

Hans

From boehm at acm.org  Sun Nov 20 23:40:21 2016
From: boehm at acm.org (Hans Boehm)
Date: Sun, 20 Nov 2016 15:40:21 -0800
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To: <CAGv9LnO1UP6DVcXMDiNKMiPvertHH=HM4n4nG9F-y2Da1StghA@mail.gmail.com>
References: <a8b7096e-530f-65fe-3607-28b5f7982c67@redhat.com>
	<6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>
	<CAPUmR1bBgs0ZChThN5cUqbAcCET+8s9UbV4PeULcQ6ZtBu-Vnw@mail.gmail.com>
	<a1f8dca0-e9df-8229-fa0c-dff74b11f5ed@cs.oswego.edu>
	<CAGv9LnPhTEC2mHUWE-mRM7-htZgLr4qqvntQkY=oBgUwzGkxRQ@mail.gmail.com>
	<f178c06e-ac27-c920-dc99-aaec3f54523b@cs.oswego.edu>
	<CAGv9LnO1UP6DVcXMDiNKMiPvertHH=HM4n4nG9F-y2Da1StghA@mail.gmail.com>
Message-ID: <CAPUmR1aQO1f2fQxzMpoWEEZNnkBWpfbXU13qREMsCoPQd+rCmA@mail.gmail.com>

Java optimizers are not generally allowed to reread a globally visible
field when the original code didn't. This is yet another reason for that
restriction.

This is different from C and C++.
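
To make the restriction concrete, here is a sketch with hypothetical class
names. Original is the constructor as written; Reread is what the proposed
transformation would amount to if it were expressed in source, with a heap
read that the original program never performed:

```java
// Hypothetical names; Object stands in for the type T used in the question.
final class Original {
    final Object a;
    Object b;

    Original(Object value) {
        Object local = value;
        b = local;
        a = local;   // uses the local; no read of the field b
    }
}

// Source-level picture of the rewrite a Java optimizer is generally
// not allowed to perform on Original.
final class Reread {
    final Object a;
    Object b;

    Reread(Object value) {
        b = value;
        a = b;       // introduces a read of a field that the original code never read
    }
}
```

A programmer may of course write Reread by hand; the point is only that a
compiler may not turn Original into it.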


From dl at cs.oswego.edu  Mon Nov 21 13:24:34 2016
From: dl at cs.oswego.edu (Doug Lea)
Date: Mon, 21 Nov 2016 08:24:34 -0500
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To: <CAPUmR1Z9CDGdHYhMz0wz7_L9+rqFRBaZS2rJdZYAJnwXe5trQw@mail.gmail.com>
References: <a8b7096e-530f-65fe-3607-28b5f7982c67@redhat.com>
	<6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>
	<CAPUmR1bBgs0ZChThN5cUqbAcCET+8s9UbV4PeULcQ6ZtBu-Vnw@mail.gmail.com>
	<a1f8dca0-e9df-8229-fa0c-dff74b11f5ed@cs.oswego.edu>
	<CAPUmR1Z9CDGdHYhMz0wz7_L9+rqFRBaZS2rJdZYAJnwXe5trQw@mail.gmail.com>
Message-ID: <ef00786d-b43e-9d33-1690-bf5d0b1eaa77@cs.oswego.edu>

On 11/20/2016 06:36 PM, Hans Boehm wrote:
> On Wed, Nov 16, 2016 at 4:56 AM, Doug Lea <dl at cs.oswego.edu
> <mailto:dl at cs.oswego.edu>> wrote:
>
>     On 11/15/2016 01:44 PM, Hans Boehm wrote:
>
>         Generalizing final field memory ordering to non-final fields
>         also has
>         optimization consequences on the reader side that we're still
>         struggling
>         with for C++.
>
>         For example, on any flavor of ARM or Power, in
>
>         tmp = x;
>         ...
>         tmp2 = y;
>         if (tmp == tmp2) {
>             tmp3 = tmp2.a;
>         }
>
>         the last assignment can no longer be replaced by tmp3 = tmp.a,
>         because that
>         wouldn't preserve ordering between the load of y and that of a.
>         (I suspect
>         that such a replacement can be beneficial if the branch can be
>         correctly
>         predicted, since tmp may be available earlier.)
>
>         Presumably similar rules already apply to final field optimization.
>
>
>     If Tmp.a is final, both the tmp and tmp2 reads are possible only
>     after tmp.a is (finally) set, so the optimization is OK.
>     (This requires that there be no address speculation for "new" objects.
>     Otherwise all sorts of Java security properties would be broken.)
>
> Is that correct?

I think so, modulo the usual "we can't guarantee miracles" disclaimers...

>
> Consider the case in which x is written before the constructor setting a
> finishes, i.e. before the freeze action/fence, and y is set after the
> constructor finishes.

Meaning that the constructor published this as x before returning.

>   But it looks to me like 17.5.1 says that the
> read of a should see the initialized value, though I'm not positive
> about my reading.  And I have a vague recollection that Jeremy's
> original proposal may have allowed the read of a to see zero at this point?
>

In any case, I'm not sure we can/should decode JSR133 specs that we
know need fixing. For now, it seems that the most useful guarantee we
can make is the operational spec that any class with a final field
contains a storeStoreFence before/upon constructor return. As with
other VarHandle documentation, this is sometimes not enough, but the
best we have at the moment.

-Doug


From aph at redhat.com  Mon Nov 21 15:32:12 2016
From: aph at redhat.com (Andrew Haley)
Date: Mon, 21 Nov 2016 15:32:12 +0000
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To: <ef00786d-b43e-9d33-1690-bf5d0b1eaa77@cs.oswego.edu>
References: <a8b7096e-530f-65fe-3607-28b5f7982c67@redhat.com>
	<6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>
	<CAPUmR1bBgs0ZChThN5cUqbAcCET+8s9UbV4PeULcQ6ZtBu-Vnw@mail.gmail.com>
	<a1f8dca0-e9df-8229-fa0c-dff74b11f5ed@cs.oswego.edu>
	<CAPUmR1Z9CDGdHYhMz0wz7_L9+rqFRBaZS2rJdZYAJnwXe5trQw@mail.gmail.com>
	<ef00786d-b43e-9d33-1690-bf5d0b1eaa77@cs.oswego.edu>
Message-ID: <e5ff1b64-7967-db42-ff59-976ed7756eff@redhat.com>

On 21/11/16 13:24, Doug Lea wrote:
> In any case, I'm not sure we can/should decode JSR133 specs that we
> know need fixing. For now, it seems that the most useful guarantee we
> can make is the operational spec that any class with a final field
> contains a storeStoreFence before/upon constructor return. As with
> other VarHandle documentation, this is sometimes not enough, but the
> best we have at the moment.

We're working on this right now in Graal.  If an object does not
escape, is it legitimate to remove the StoreStore fence as well?  I
think it is, but it means that we have to treat

class X1 {
    final int x;

    X1() {
    }
}

and

class X1 {
    int x;

    X1() {
        VarHandle.storeStoreFence();
    }
}

differently.

Andrew.

From dl at cs.oswego.edu  Mon Nov 21 15:50:52 2016
From: dl at cs.oswego.edu (Doug Lea)
Date: Mon, 21 Nov 2016 10:50:52 -0500
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To: <e5ff1b64-7967-db42-ff59-976ed7756eff@redhat.com>
References: <a8b7096e-530f-65fe-3607-28b5f7982c67@redhat.com>
	<6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>
	<CAPUmR1bBgs0ZChThN5cUqbAcCET+8s9UbV4PeULcQ6ZtBu-Vnw@mail.gmail.com>
	<a1f8dca0-e9df-8229-fa0c-dff74b11f5ed@cs.oswego.edu>
	<CAPUmR1Z9CDGdHYhMz0wz7_L9+rqFRBaZS2rJdZYAJnwXe5trQw@mail.gmail.com>
	<ef00786d-b43e-9d33-1690-bf5d0b1eaa77@cs.oswego.edu>
	<e5ff1b64-7967-db42-ff59-976ed7756eff@redhat.com>
Message-ID: <f7202977-552e-2b76-c0f5-5af584da4b05@cs.oswego.edu>

On 11/21/2016 10:32 AM, Andrew Haley wrote:

> We're working on this right now in Graal.  If an object does not
> escape, is it legitimate to remove the StoreStore fence as well?  I

See the cookbook section "Removing barriers" that covers
some of these cases. (http://gee.cs.oswego.edu/dl/jmm/cookbook.html)

In general, don't just "remove" fences, instead move-and-merge them:
move them until they hit another that absorbs them (the same or
stronger).  In many but not all cases, this does have the same effect
as just removing them.
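
A rough sketch of the "absorb" step, under a deliberately simplified model
that is an assumption of this example and not taken from the cookbook: a fence
is represented as the set of orderings it enforces, and one fence absorbs
another when it enforces at least the same orderings. The names Order,
FenceMerge and mergeAdjacent are hypothetical:

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;

// The four basic orderings a fence can enforce in this toy model.
enum Order { LOAD_LOAD, LOAD_STORE, STORE_STORE, STORE_LOAD }

final class FenceMerge {
    // A fence absorbs another if it is the same or stronger.
    static boolean absorbs(EnumSet<Order> stronger, EnumSet<Order> weaker) {
        return stronger.containsAll(weaker);
    }

    // Input: fences that have already been moved next to each other,
    // with no memory access left between them.
    static List<EnumSet<Order>> mergeAdjacent(List<EnumSet<Order>> run) {
        List<EnumSet<Order>> kept = new ArrayList<>();
        for (EnumSet<Order> fence : run) {
            boolean absorbed = false;
            for (EnumSet<Order> k : kept) {
                if (absorbs(k, fence)) {
                    absorbed = true;   // an existing fence already covers this one
                    break;
                }
            }
            if (absorbed) {
                continue;
            }
            kept.removeIf(k -> absorbs(fence, k)); // the new fence covers weaker ones
            kept.add(fence);
        }
        return kept;
    }

    public static void main(String[] args) {
        EnumSet<Order> storeStore = EnumSet.of(Order.STORE_STORE);
        EnumSet<Order> full = EnumSet.allOf(Order.class);
        System.out.println(mergeAdjacent(List.of(storeStore, full, storeStore)));
    }
}
```

Fences that do not cover one another (say, a StoreStore next to a LoadLoad)
are both kept, which is the case where merging differs from simply removing.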


-Doug


From boehm at acm.org  Tue Nov 22 05:21:16 2016
From: boehm at acm.org (Hans Boehm)
Date: Mon, 21 Nov 2016 21:21:16 -0800
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To: <f7202977-552e-2b76-c0f5-5af584da4b05@cs.oswego.edu>
References: <a8b7096e-530f-65fe-3607-28b5f7982c67@redhat.com>
	<6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>
	<CAPUmR1bBgs0ZChThN5cUqbAcCET+8s9UbV4PeULcQ6ZtBu-Vnw@mail.gmail.com>
	<a1f8dca0-e9df-8229-fa0c-dff74b11f5ed@cs.oswego.edu>
	<CAPUmR1Z9CDGdHYhMz0wz7_L9+rqFRBaZS2rJdZYAJnwXe5trQw@mail.gmail.com>
	<ef00786d-b43e-9d33-1690-bf5d0b1eaa77@cs.oswego.edu>
	<e5ff1b64-7967-db42-ff59-976ed7756eff@redhat.com>
	<f7202977-552e-2b76-c0f5-5af584da4b05@cs.oswego.edu>
Message-ID: <CAPUmR1b9E_bYSG1VssYTLRJeQPQBVUEiP9NMzg-SKMZrdW3n3w@mail.gmail.com>

I think in general the intent is that ordering constraints associated with
operations on thread-local objects should be safe to eliminate. Clearly
explicit fences do not have the same property. I think that it is OK to
remove constructor fences associated with objects whose final fields are
not accessed by another thread. Clearly an explicit fence in a constructor
is entirely different. Just as fences that would otherwise be associated
with thread-local volatiles or monitors can be removed, but explicit fences
can at best be combined, as Doug suggests.
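
A small sketch of the two cases, with hypothetical class and method names. It
uses VarHandle.storeStoreFence() from java.lang.invoke (Java 9 and later); the
comments describe what an optimizer may do under the reading above, not what
any particular JIT actually does:

```java
import java.lang.invoke.VarHandle;

// Final field: the constructor fence is implied by the language.
final class ImplicitFence {
    final int x;

    ImplicitFence(int x) {
        this.x = x;
    }
}

// Non-final field with a fence the programmer wrote explicitly.
final class ExplicitFence {
    int x;

    ExplicitFence(int x) {
        this.x = x;
        VarHandle.storeStoreFence();
    }
}

final class EscapeDemo {
    // The object never leaves this method, so no other thread can read its
    // final field; the implied constructor fence may be eliminated.
    static int localOnly(int v) {
        ImplicitFence p = new ImplicitFence(v);
        return p.x + 1;
    }

    // The object is just as local, but the explicit fence is a request by the
    // programmer; it can at best be merged with a neighbouring fence.
    static int localOnlyExplicit(int v) {
        ExplicitFence p = new ExplicitFence(v);
        return p.x + 1;
    }
}
```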
