From david.holmes at oracle.com Thu Nov 10 00:06:34 2016
From: david.holmes at oracle.com (David Holmes)
Date: Thu, 10 Nov 2016 10:06:34 +1000
Subject: [jmm-dev] Store completion query - general and ARM
Message-ID: <1cd582da-0211-8bc3-7c61-09bf6706e93c@oracle.com>

Does any part of the JMM require actual visibility/completion of volatile stores, or is it only order that is defined (with an assumption that all stores will complete in a finite time)?

In relation to ARM specifically, Dekker-style algorithms require visibility/completion of the store before the subsequent load, yet the example in "A Tutorial Introduction to the ARM and POWER Relaxed Memory Models" shows the use of DMB, not DSB. Yet AFAICS DMB says nothing about completion whereas DSB does. ??

(To be honest I find the Group A/B description of DMB properties extremely hard to actually interpret wrt code like Dekker.)

Thanks,
David

From aph at redhat.com Thu Nov 10 09:07:15 2016
From: aph at redhat.com (Andrew Haley)
Date: Thu, 10 Nov 2016 09:07:15 +0000
Subject: [jmm-dev] Store completion query - general and ARM
In-Reply-To: <1cd582da-0211-8bc3-7c61-09bf6706e93c@oracle.com>
References: <1cd582da-0211-8bc3-7c61-09bf6706e93c@oracle.com>
Message-ID: <9211c13c-e227-1545-f87c-893c1d6fff90@redhat.com>

On 10/11/16 00:06, David Holmes wrote:
> Does any part of the JMM require actual visibility/completion of
> volatile stores or is it only order that is defined (with an assumption
> that all stores will complete in a finite time)?

Ordering is really all that we've got: all that memory fences can do is ensure that visibility of loads and stores is ordered in some way.

> In relation to ARM specifically, Dekker style algorithms require
> visibility/completion of the store before the subsequent load, yet the
> example in "A Tutorial Introduction to the ARM and POWER Relaxed Memory
> Models" shows the use of DMB, not DSB.

DMB is fine for that. Dekker doesn't need a store to be forced out of the caches, only that the store be made visible to other processors before any operations later in program order.

> Yet AFAICS DMB says nothing about completion whereas DSB does. ??
> (To be honest I find the Group A/B description of DMB properties
> extremely hard to actually interpret wrt code like Dekker.)

DSB is only really needed if there are multiple caches of the same address, i.e. Icache and Dcache: it's necessary to force a store out into main memory in order to refresh the Icache.

Andrew.

From david.holmes at oracle.com Thu Nov 10 09:20:01 2016
From: david.holmes at oracle.com (David Holmes)
Date: Thu, 10 Nov 2016 19:20:01 +1000
Subject: [jmm-dev] Store completion query - general and ARM
In-Reply-To: <9211c13c-e227-1545-f87c-893c1d6fff90@redhat.com>
References: <1cd582da-0211-8bc3-7c61-09bf6706e93c@oracle.com> <9211c13c-e227-1545-f87c-893c1d6fff90@redhat.com>
Message-ID: <93af1932-b873-c9e7-d1b5-0832eb5605b9@oracle.com>

On 10/11/2016 7:07 PM, Andrew Haley wrote:
> On 10/11/16 00:06, David Holmes wrote:
>> Does any part of the JMM require actual visibility/completion of
>> volatile stores or is it only order that is defined (with an assumption
>> that all stores will complete in a finite time)?
>
> Ordering is really all that we've got: all that memory fences can do
> is ensure that visibility of loads and stores is ordered in some way.

If we establish some global order of loads and stores, yes. That can in turn require that a store become visible prior to a given load.

>> In relation to ARM specifically, Dekker style algorithms require
>> visibility/completion of the store before the subsequent load, yet the
>> example in "A Tutorial Introduction to the ARM and POWER Relaxed Memory
>> Models" shows the use of DMB, not DSB.
>
> DMB is fine for that. Dekker doesn't need a store to be forced out of
> the caches, only that the store be made visible to other processors
> before any operations later in program order.

Again it is far from obvious to me that DMB causes the store to be visible before any operations later in program order. I find the Group A / Group B formulation (and even the definition of "observe") to be quite obscure and hard to map to actual code behaviour.

>> Yet AFAICS DMB says nothing about completion whereas DSB does. ??
>> (To be honest I find the Group A/B description of DMB properties
>> extremely hard to actually interpret wrt code like Dekker.)
>
> DSB is only really needed if there are multiple caches of the same
> address, i.e. Icache and Dcache: it's necessary to force a store out
> into main memory in order to refresh the Icache.

I thought only ISB had an effect relative to instructions/i-cache ??

Thanks,
David

From aph at redhat.com Thu Nov 10 09:31:09 2016
From: aph at redhat.com (Andrew Haley)
Date: Thu, 10 Nov 2016 09:31:09 +0000
Subject: [jmm-dev] Store completion query - general and ARM
In-Reply-To: <93af1932-b873-c9e7-d1b5-0832eb5605b9@oracle.com>
References: <1cd582da-0211-8bc3-7c61-09bf6706e93c@oracle.com> <9211c13c-e227-1545-f87c-893c1d6fff90@redhat.com> <93af1932-b873-c9e7-d1b5-0832eb5605b9@oracle.com>
Message-ID: <65def6c7-d17b-51f6-0493-115152b55f2c@redhat.com>

On 10/11/16 09:20, David Holmes wrote:
> On 10/11/2016 7:07 PM, Andrew Haley wrote:
>> On 10/11/16 00:06, David Holmes wrote:
>>> Does any part of the JMM require actual visibility/completion of
>>> volatile stores or is it only order that is defined (with an assumption
>>> that all stores will complete in a finite time)?
>>
>> Ordering is really all that we've got: all that memory fences can do
>> is ensure that visibility of loads and stores is ordered in some way.
>
> If we establish some global order of loads and stores, yes. That can in
> turn require that a store become visible prior to a given load.

I agree.

>>> In relation to ARM specifically, Dekker style algorithms require
>>> visibility/completion of the store before the subsequent load, yet the
>>> example in "A Tutorial Introduction to the ARM and POWER Relaxed Memory
>>> Models" shows the use of DMB, not DSB.
>>
>> DMB is fine for that. Dekker doesn't need a store to be forced out of
>> the caches, only that the store be made visible to other processors
>> before any operations later in program order.
>
> Again it is far from obvious to me that DMB causes the store to be
> visible before any operations later in program order. I find the Group A
> / Group B formulation (and even the definition of "observe") to be quite
> obscure and hard to map to actual code behaviour.

Indeed. The real problem is that ARM are trying to describe the memory model in an abstract way that does not overly constrain implementations. But a DMB really is sufficient to ensure that prior stores are visible. (Mind you, we don't need DMB to get sequentially-consistent behaviour that's enough for Java volatiles.)

>>> Yet AFAICS DMB says nothing about completion whereas DSB does. ??
>>> (To be honest I find the Group A/B description of DMB properties
>>> extremely hard to actually interpret wrt code like Dekker.)
>>
>> DSB is only really needed if there are multiple caches of the same
>> address, i.e. Icache and Dcache: it's necessary to force a store out
>> into main memory in order to refresh the Icache.
>
> I thought only ISB had an effect relative to instructions/i-cache ??

It does: you need DSB to ensure the visibility of the data cleaned from the Dcache, then ISB to synchronize the fetched instruction stream.

Andrew.
From Peter.Sewell at cl.cam.ac.uk Thu Nov 10 09:54:00 2016
From: Peter.Sewell at cl.cam.ac.uk (Peter Sewell)
Date: Thu, 10 Nov 2016 09:54:00 +0000
Subject: [jmm-dev] Store completion query - general and ARM
In-Reply-To: <65def6c7-d17b-51f6-0493-115152b55f2c@redhat.com>
References: <1cd582da-0211-8bc3-7c61-09bf6706e93c@oracle.com> <9211c13c-e227-1545-f87c-893c1d6fff90@redhat.com> <93af1932-b873-c9e7-d1b5-0832eb5605b9@oracle.com> <65def6c7-d17b-51f6-0493-115152b55f2c@redhat.com>
Message-ID:

If you want a more rigorous and concrete model that explains this, you might want to look at:
http://www.cl.cam.ac.uk/~pes20/popl16-armv8/top.pdf

The associated tool: www.cl.cam.ac.uk/~pes20/AArch64/

lets one run arbitrary model-allowed executions of tests:
- select AArch64 test Tutorial/SB+dmb.sys
- under Options, select "a larger set of transitions which we proved can be taken eagerly"
- click Run, then it shows the initial state of the model with the initial transitions highlighted in green
- take all the thread-local transitions of each thread (five each)
- now you can see each thread's write, dmb, and read request in the "flowing model" abstract storage subsystem
- in this model, the dmb sys keeps the write and read request in order as they flow down to memory, so no interleaving of the possible model transitions can break the Dekker's algorithm property.

In this example, it happens that the read requests also can't be satisfied from writes that haven't hit main memory, but in general they can be satisfied earlier.

For contrast, if you try the SB test without dmb, you'll see many more possible executions.

This flowing model is actually a bit more microarchitectural than one would like for an architectural spec, as it exposes the abstract interconnect topology. The POP model, also provided by that tool, abstracts from the topology. Both are principally due to Shaked Flur, cc'd.

best,
Peter

From david.holmes at oracle.com Thu Nov 10 20:49:45 2016
From: david.holmes at oracle.com (David Holmes)
Date: Fri, 11 Nov 2016 06:49:45 +1000
Subject: [jmm-dev] Store completion query - general and ARM
In-Reply-To:
References: <1cd582da-0211-8bc3-7c61-09bf6706e93c@oracle.com> <9211c13c-e227-1545-f87c-893c1d6fff90@redhat.com> <93af1932-b873-c9e7-d1b5-0832eb5605b9@oracle.com> <65def6c7-d17b-51f6-0493-115152b55f2c@redhat.com>
Message-ID: <589e836f-0c87-d64c-2196-0ca4d65e2968@oracle.com>

On 10/11/2016 7:54 PM, Peter Sewell wrote:
> If you want a more rigorous and concrete model that explains this, you
> might want to look at:
> http://www.cl.cam.ac.uk/~pes20/popl16-armv8/top.pdf

Thanks Peter, I will take a look at this.

David
From david.holmes at oracle.com Thu Nov 10 20:52:51 2016
From: david.holmes at oracle.com (David Holmes)
Date: Fri, 11 Nov 2016 06:52:51 +1000
Subject: [jmm-dev] Store completion query - general and ARM
In-Reply-To: <65def6c7-d17b-51f6-0493-115152b55f2c@redhat.com>
References: <1cd582da-0211-8bc3-7c61-09bf6706e93c@oracle.com> <9211c13c-e227-1545-f87c-893c1d6fff90@redhat.com> <93af1932-b873-c9e7-d1b5-0832eb5605b9@oracle.com> <65def6c7-d17b-51f6-0493-115152b55f2c@redhat.com>
Message-ID: <76c7fbfd-da6a-9a9c-28ca-c1824c6c030f@oracle.com>

On 10/11/2016 7:31 PM, Andrew Haley wrote:
> Indeed. The real problem is that ARM are trying to describe the
> memory model in an abstract way that does not overly constrain

I just wish they had included the word "complete" or "visible" in that abstract description. :)

> implementations. But a DMB really is sufficient to ensure that prior
> stores are visible. (Mind you, we don't need DMB to get
> sequentially-consistent behaviour that's enough for Java volatiles.)

I was going to ask how that can be true, but then saw this in the paper Peter referenced:

"According to the ARM ARM, store-release is multicopy-atomic when observed by load-acquires, a strong property that conventional release-acquire semantics does not imply. Furthermore, despite their names, these instructions are intended to be used to implement the C11 sequentially consistent load and store."

That is new information to me, and somewhat surprising.

Thanks,
David

From aph at redhat.com Tue Nov 15 11:43:27 2016
From: aph at redhat.com (Andrew Haley)
Date: Tue, 15 Nov 2016 11:43:27 +0000
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
Message-ID:

http://g.oswego.edu/dl/jmm/cookbook.html says:

  ... the special final-field rule requiring a StoreStore barrier in
  x.finalField = v; StoreStore; sharedRef = x;

but http://www.hboehm.info/c++mm/no_write_fences.html says:

  ... it is also generally unsafe to restrict the release ordering
  constraint in thread 1 to only stores. To see this, consider what
  happens if the initialization of x also reads x

I am convinced by Hans Boehm's argument in the second reference, and I believe that using only a StoreStore fence is too fragile unless you disallow some optimizations.

Thread 1:

class X {
  int a;
  X() {
    a = 0;
    a++;
  }
}

void publish() {
  X x = new X();
}

Thread 2:

x.a = 42;

This is safe enough at the Java level, but inlining of constructors at the machine level means that it's hard to guarantee without a LoadStore at the end of the constructor. On AArch64 at least we have address dependency ordering from a load to a memory op based on it, which is adequate in this case, I think.

I'd prefer to simply have an adjudication that we need a release barrier at the end of a constructor, but mostly I'd like some sort of decision.

Thanks,
Andrew.

From aph at redhat.com Tue Nov 15 13:53:22 2016
From: aph at redhat.com (Andrew Haley)
Date: Tue, 15 Nov 2016 13:53:22 +0000
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To:
References:
Message-ID: <6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>

It's been pointed out to me that my example doesn't have a final field! It would perhaps have been better not to provide an example, so rather than muddy the water any further I'll let the question stand.

Andrew.

From boehm at acm.org Tue Nov 15 18:44:01 2016
From: boehm at acm.org (Hans Boehm)
Date: Tue, 15 Nov 2016 10:44:01 -0800
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To: <6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>
References: <6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>
Message-ID:

I think this is actually OK for final fields, since no other thread can write them, and hence reads in the constructor can't really see a write by another thread.
I continue to believe that we should not generalize the final field behavior to non-final fields, at least not without generalizing the constructor barrier to also include LoadStore. Which I think means we're kind of in agreement. If we did so, and programmers took advantage of that, it would also mean that

constructor() { non_final_field = 0; assert non_final_field == 0; }

could reasonably fail, which seems bad.

Generalizing final field memory ordering to non-final fields also has optimization consequences on the reader side that we're still struggling with for C++.

For example, on any flavor of ARM or Power, in

tmp = x;
...
tmp2 = y;
if (tmp == tmp2) {
  tmp3 = tmp2.a;
}

the last assignment can no longer be replaced by tmp3 = tmp.a, because that wouldn't preserve ordering between the load of y and that of a. (I suspect that such a replacement can be beneficial if the branch can be correctly predicted, since tmp may be available earlier.)

Presumably similar rules already apply to final field optimization. I have no idea whether existing Java compilers actually make such distinctions.

From paulmck at linux.vnet.ibm.com Tue Nov 15 19:12:48 2016
From: paulmck at linux.vnet.ibm.com (Paul E. McKenney)
Date: Tue, 15 Nov 2016 11:12:48 -0800
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To:
References: <6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>
Message-ID: <20161115191248.GA3612@linux.vnet.ibm.com>

For whatever it might be worth, we made a similar change in the Linux kernel some time back. The rcu_assign_pointer() macro used to contain a store-store fence, but was upgraded to a store-release of the new pointer value about 3 years ago in the 3.15 release.

Thanx, Paul
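The store-store versus store-release distinction in this sub-thread maps onto the fence methods that java.lang.invoke.VarHandle has offered since Java 9. The sketch below is an illustration only: the ReleasePublish class and its methods are hypothetical, and it does not show how any JVM implements constructors or final fields. It publishes an object with a non-final field whose constructor both writes and reads that field, and places releaseFence() (LoadStore plus StoreStore) before the publishing store, in the spirit of the store-release Paul describes for rcu_assign_pointer(). The reference itself is accessed only in opaque mode, so all ordering comes from the two explicit fences.

```java
import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.AtomicReference;

public class ReleasePublish {
    static final class Box {
        int value;            // deliberately non-final

        Box(int v) {
            value = v;
            value++;          // the constructor reads the field as well as writing it
        }
    }

    // Accessed with setOpaque/getOpaque only: no ordering by itself.
    static final AtomicReference<Box> shared = new AtomicReference<>();

    static void publish(int v) {
        Box b = new Box(v);
        // Loads and stores before this fence are not reordered with stores after it.
        VarHandle.releaseFence();
        shared.setOpaque(b);
    }

    static int consume() {
        Box b;
        while ((b = shared.getOpaque()) == null) {
            Thread.onSpinWait();
        }
        // Loads before this fence are not reordered with loads and stores after it.
        VarHandle.acquireFence();
        return b.value;
    }

    /** Starts a reader, publishes a Box built from 41, and returns what the reader saw. */
    static int demo() {
        shared.setOpaque(null);
        int[] seen = new int[1];
        Thread reader = new Thread(() -> seen[0] = consume());
        reader.start();
        publish(41);
        try {
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(demo());   // prints 42
    }
}
```

Replacing releaseFence() with VarHandle.storeStoreFence() gives the weaker StoreStore-only variant that the cookbook describes for final fields; that fence no longer orders the constructor's loads before the publishing store, which is the gap the messages above are concerned with.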
From dl at cs.oswego.edu Tue Nov 15 20:19:22 2016
From: dl at cs.oswego.edu (Doug Lea)
Date: Tue, 15 Nov 2016 15:19:22 -0500
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To:
References:
Message-ID:

On 11/15/2016 06:43 AM, Andrew Haley wrote:
> http://g.oswego.edu/dl/jmm/cookbook.html says:
>
> ... the special final-field rule requiring a StoreStore barrier in
> x.finalField = v; StoreStore; sharedRef = x;

Note that the fence can be placed any time after the write of the final field and before return from the constructor. In practice, all JVMs I know place a fence immediately before return if any field is final, covering this requirement in a simple way. Including odd cases like programs that assign a final field twice in a constructor, which isn't illegal. (Most people think it ought to be illegal, but too late for that.)

> but http://www.hboehm.info/c++mm/no_write_fences.html says:
>
> ... it is also generally unsafe to restrict the release ordering
> constraint in thread 1 to only stores. To see this, consider what
> happens if the initialization of x also reads x

As Andrew mentioned, this discussion is not about analogs of final fields, but instead about cases where fields can be (re)-written by consumers. As Hans stated in the subsequent section of that document (and agreed to by others in a few brief exchanges about this on this list in 2014), "In this case, it appears to be safe ...". Which is not to say devoid of all possible surprises. But it is surely sufficient with respect to the basic security issues that are the primary motivation for special rules for final fields.

[...omitted unrelated example...]

> I'd prefer to simply have an adjudication that we need a release
> barrier at the end of a constructor, but mostly I'd like some sort of
> decision.

If processors intrinsically performed a releaseFence whenever asked for just a storeStoreFence, it might be defensible to simplify rules to just say release here. But this isn't so on ARM and possibly others.

-Doug

From dl at cs.oswego.edu Wed Nov 16 12:56:29 2016
From: dl at cs.oswego.edu (Doug Lea)
Date: Wed, 16 Nov 2016 07:56:29 -0500
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To:
References: <6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>
Message-ID:

On 11/15/2016 01:44 PM, Hans Boehm wrote:
> Generalizing final field memory ordering to non-final fields also has
> optimization consequences on the reader side that we're still struggling
> with for C++.
>
> For example, on any flavor of ARM or Power, in
>
> tmp = x;
> ...
> tmp2 = y;
> if (tmp == tmp2) {
>   tmp3 = tmp2.a;
> }
>
> the last assignment can no longer be replaced by tmp3 = tmp.a, because that
> wouldn't preserve ordering between the load of y and that of a. (I suspect
> that such a replacement can be beneficial if the branch can be correctly
> predicted, since tmp may be available earlier.)
>
> Presumably similar rules already apply to final field optimization.

If Tmp.a is final, both the tmp and tmp2 reads are possible only after tmp.a is (finally) set, so the optimization is OK. (This requires that there be no address speculation for "new" objects. Otherwise all sorts of Java security properties would be broken.)

-Doug

From email at pitr.ch Thu Nov 17 23:41:01 2016
From: email at pitr.ch (Petr Chalupa)
Date: Fri, 18 Nov 2016 00:41:01 +0100
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To:
References: <6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>
Message-ID:

Hello,

If there is only a StoreStore barrier at the end of a constructor, then the following code concerns me:

// Thread 1:
class X {
  static X instance;
  final int a;
  int b;

  X() {
    a = 0;
    a++;
    b = 10;
    a += b; // could read 42?
  }
}

void publish() {
  X.instance = new X();
}

// Thread 2:
X.instance.b = 42;

Could the read of b in the constructor see 42? If it can, a StoreLoad might be required as well. Could you confirm, or explain where my thinking went wrong? Thanks.
Best regards,
Petr Chalupa

From dl at cs.oswego.edu Fri Nov 18 00:12:22 2016
From: dl at cs.oswego.edu (Doug Lea)
Date: Thu, 17 Nov 2016 19:12:22 -0500
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To:
References: <6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>
Message-ID:

On 11/17/2016 06:41 PM, Petr Chalupa wrote:
> Hello,
>
> If there is only StoreStore barrier at the end of a constructor then
> following code concerns me:

There are several ill-advised things people can do in constructors that cause the base final field guarantee to be useless. Most famously, publishing "this" before assigning the field.

static C global;
class C {
  final int a;
  C (int a) {
    global = this;
    this.a = a;
  }
}

And as your example shows, initializing a final with the result of a computation reading a non-final field is also a bad idea. There are probably others too, all of which one hopes any concurrent programmer can see are too crazy to do. (And which good tools would help point out.)
-Doug

From email at pitr.ch Sun Nov 20 21:37:20 2016
From: email at pitr.ch (Petr Chalupa)
Date: Sun, 20 Nov 2016 22:37:20 +0100
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To:
References: <6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>
Message-ID:

Thanks, I did not realise this is actually ill-advised. The read of b is racy, so it should be no surprise that the value based on it in the final field can differ.

However, I've remembered http://www.hboehm.info/c++mm/why_undef.html and had a thought about how to change the example so that it may be problematic again. In the constructor with a body as follows

final T a;
T b;

X() {
  T local = computeAValue();
  b = local;
  doMoreOtherThings(); // b never modified, it is equal to local
  a = local; // line A
}

could a compiler decide to optimise line A to a read of the same value from b (introducing the racy read) instead of from the local variable, to save space? What am I missing? What prevents a compiler from doing an optimisation like that?
Best regards,
Petr Chalupa

From boehm at acm.org Sun Nov 20 23:36:54 2016
From: boehm at acm.org (Hans Boehm)
Date: Sun, 20 Nov 2016 15:36:54 -0800
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To:
References: <6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>
Message-ID:

On Wed, Nov 16, 2016 at 4:56 AM, Doug Lea
wrote:

> On 11/15/2016 01:44 PM, Hans Boehm wrote:
>
>> Generalizing final field memory ordering to non-final fields also has
>> optimization consequences on the reader side that we're still struggling
>> with for C++.
>>
>> For example, on any flavor of ARM or Power, in
>>
>> tmp = x;
>> ...
>> tmp2 = y;
>> if (tmp == tmp2) {
>>     tmp3 = tmp2.a;
>> }
>>
>> the last assignment can no longer be replaced by tmp3 = tmp.a, because that
>> wouldn't preserve ordering between the load of y and that of a. (I suspect
>> that such a replacement can be beneficial if the branch can be correctly
>> predicted, since tmp may be available earlier.)
>>
>> Presumably similar rules already apply to final field optimization.
>
> If Tmp.a is final, both the tmp and tmp2 reads are possible only
> after tmp.a is (finally) set, so the optimization is OK.
> (This requires that there be no address speculation for "new" objects.
> Otherwise all sorts of Java security properties would be broken.)

Is that correct?

Consider the case in which x is written before the constructor setting a
finishes, i.e. before the freeze action/fence, and y is set after the
constructor finishes. I don't see how the transformation ensures that (in
the absence of a null pointer exception) the read of a still sees the
initialized value. (Recall that there is no longer an address dependency
from the load of y to the load of a after the transformation, though there
was before.) But it looks to me like 17.5.1 says that the read of a should
see the initialized value, though I'm not positive about my reading. And I
have a vague recollection that Jeremy's original proposal may have allowed
the read of a to see zero at this point?
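The scenario above can be written out as a small sketch. This is only an illustration under assumptions: the class Obj and the methods publish, readViaTmp2, and readViaTmp are hypothetical names, and running it on one thread says nothing about what a racing reader may observe, which is the open question.

```java
// Hypothetical sketch of the scenario; class and method names are illustrative only.
final class Obj {
    static Obj x;   // written inside the constructor, before the freeze
    static Obj y;   // written after the constructor has returned

    final int a;

    Obj(int a) {
        x = this;    // early escape: the reference is stored before 'a' is frozen
        this.a = a;
    }

    // Writer side: y is set only after construction has completed.
    static void publish() {
        y = new Obj(42);
    }

    // Reader as originally written: 'a' is loaded through tmp2.
    static int readViaTmp2() {
        Obj tmp = x;
        Obj tmp2 = y;
        if (tmp != null && tmp == tmp2) {
            return tmp2.a;   // address-dependent on the load of y
        }
        return -1;
    }

    // Reader after the transformation: 'a' is loaded through tmp.
    static int readViaTmp() {
        Obj tmp = x;
        Obj tmp2 = y;
        if (tmp != null && tmp == tmp2) {
            return tmp.a;    // depends only on the load of x
        }
        return -1;
    }
}
```

On a single thread both readers return the same value after publish(). The two differ only in which load the read of a depends on, and that dependency is what the transformation removes.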
Hans

From boehm at acm.org Sun Nov 20 23:40:21 2016
From: boehm at acm.org (Hans Boehm)
Date: Sun, 20 Nov 2016 15:40:21 -0800
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To:
References: <6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>
Message-ID:

Java optimizers are not generally allowed to reread a globally visible field
when the original code didn't. This is yet another reason for that
restriction. This is different from C and C++.

On Sun, Nov 20, 2016 at 1:37 PM, Petr Chalupa wrote:

> Thanks, I did not realise this is actually ill-advised. The read of b is
> racy so it should be no surprise that the value based on it in the final
> field can differ.
>
> However I've remembered http://www.hboehm.info/c++mm/why_undef.html and got
> a thought how to change the example to be maybe problematic again:
>
> In the constructor with a body as follows
>
> final T a;
> T b;
>
> X() {
>   T local = computeAValue();
>   b = local;
>   doMoreOtherThings();
>   // b never modified, it is equal to local
>   a = local; // line A
> }
>
> could a compiler decide to optimise line A to a read of the same value from
> b (introducing the racy read) instead of local variable to save space? What
> am I missing, what prevents compiler to do optimisation like that?
>
> Best regards,
> Petr Chalupa

From dl at cs.oswego.edu Mon Nov 21 13:24:34 2016
From: dl at cs.oswego.edu (Doug Lea)
Date: Mon, 21 Nov 2016 08:24:34 -0500
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To:
References: <6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>
Message-ID:

On 11/20/2016 06:36 PM, Hans Boehm wrote:
> On Wed, Nov 16, 2016 at 4:56 AM, Doug Lea
> wrote:
>
> On 11/15/2016 01:44 PM, Hans Boehm wrote:
>
> Generalizing final field memory ordering to non-final fields also has
> optimization consequences on the reader side that we're still struggling
> with for C++.
>
> For example, on any flavor of ARM or Power, in
>
> tmp = x;
> ...
> tmp2 = y;
> if (tmp == tmp2) {
>     tmp3 = tmp2.a;
> }
>
> the last assignment can no longer be replaced by tmp3 = tmp.a, because that
> wouldn't preserve ordering between the load of y and that of a. (I suspect
> that such a replacement can be beneficial if the branch can be correctly
> predicted, since tmp may be available earlier.)
>
> Presumably similar rules already apply to final field optimization.
>
> If Tmp.a is final, both the tmp and tmp2 reads are possible only
> after tmp.a is (finally) set, so the optimization is OK.
> (This requires that there be no address speculation for "new" objects.
> Otherwise all sorts of Java security properties would be broken.)
>
> Is that correct?

I think so, modulo the usual "we can't guarantee miracles" disclaimers...

> Consider the case in which x is written before the constructor setting a
> finishes, i.e. before the freeze action/fence, and y is set after the
> constructor finishes.

Meaning that the constructor published this as x before returning.

> But it looks to me like 17.5.1 says that the
> read of a should see the initialized value, though I'm not positive
> about my reading. And I have a vague recollection that Jeremy's
> original proposal may have allowed the read of a to see zero at this point?

In any case, I'm not sure we can/should decode JSR133 specs that we
know need fixing. For now, it seems that the most useful guarantee we
can make is the operational spec that any class with a final field
contains a storeStoreFence before/upon constructor return. As with
other VarHandle documentation, this is sometimes not enough, but the
best we have at the moment.
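To make the operational reading concrete, here is a minimal sketch, assuming Java 9 or later for java.lang.invoke.VarHandle. The class names FinalBox and FencedBox are hypothetical. The second class spells out by hand the fence that the operational spec attributes to the first; it is not a claim about what any particular JVM emits.

```java
import java.lang.invoke.VarHandle;

// Hypothetical classes for illustration; neither appears in the thread.

// A class with a final field: under the operational spec, construction
// behaves as if a storeStoreFence were placed before the constructor returns.
final class FinalBox {
    final int value;

    FinalBox(int value) {
        this.value = value;
    }
}

// The same shape with a plain field and the fence written explicitly.
final class FencedBox {
    int value;   // deliberately not final

    FencedBox(int value) {
        this.value = value;
        // Orders the store above before any store that follows the fence,
        // such as a later store that publishes a reference to this object.
        VarHandle.storeStoreFence();
    }
}
```

The explicit fence only orders stores on the writer side, so this sketch carries the same caveat as the message: it is sometimes not enough.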
-Doug

From aph at redhat.com Mon Nov 21 15:32:12 2016
From: aph at redhat.com (Andrew Haley)
Date: Mon, 21 Nov 2016 15:32:12 +0000
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To:
References: <6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>
Message-ID:

On 21/11/16 13:24, Doug Lea wrote:
> In any case, I'm not sure we can/should decode JSR133 specs that we
> know need fixing. For now, it seems that the most useful guarantee we
> can make is the operational spec that any class with a final field
> contains a storeStoreFence before/upon constructor return. As with
> other VarHandle documentation, this is sometimes not enough, but the
> best we have at the moment.

We're working on this right now in Graal. If an object does not
escape, is it legitimate to remove the StoreStore fence as well? I
think it is, but it means that we have to treat

class X1 {
  final int x;
  X1() { }
}

and

class X1 {
  int x;
  X1() { VarHandle.storeStoreFence(); }
}

differently.

Andrew.

From dl at cs.oswego.edu Mon Nov 21 15:50:52 2016
From: dl at cs.oswego.edu (Doug Lea)
Date: Mon, 21 Nov 2016 10:50:52 -0500
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To:
References: <6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>
Message-ID:

On 11/21/2016 10:32 AM, Andrew Haley wrote:
> We're working on this right now in Graal. If an object does not
> escape, is it legitimate to remove the StoreStore fence as well? I

See the cookbook section "Removing barriers" that covers
some of these cases. (http://gee.cs.oswego.edu/dl/jmm/cookbook.html)

In general, don't just "remove" fences, instead move-and-merge them:
move them until they hit another that absorbs them (the same or
stronger). In many but not all cases, this does have the same effect
as just removing them.
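As a rough picture of move-and-merge, the sketch below performs the transformation by hand at the source level. It assumes Java 9 or later; the names Pair, FenceMerging, buildNaive, and buildMerged are hypothetical, and a compiler would do this on its intermediate representation rather than on source code.

```java
import java.lang.invoke.VarHandle;

// Hypothetical illustration of moving and merging fences by hand.
final class Pair {
    int first;
    int second;
}

final class FenceMerging {
    // One fence per initialized object, as a naive translation might emit.
    static Pair[] buildNaive() {
        Pair p = new Pair();
        p.first = 1;
        p.second = 2;
        VarHandle.storeStoreFence();   // fence belonging to p

        Pair q = new Pair();
        q.first = 3;
        q.second = 4;
        VarHandle.storeStoreFence();   // fence belonging to q

        return new Pair[] { p, q };
    }

    // p's fence has been moved down until it meets q's fence, which absorbs it.
    // Assumption: p is not made reachable by another thread before that point.
    static Pair[] buildMerged() {
        Pair p = new Pair();
        p.first = 1;
        p.second = 2;

        Pair q = new Pair();
        q.first = 3;
        q.second = 4;
        VarHandle.storeStoreFence();   // single fence now covers both p and q

        return new Pair[] { p, q };
    }
}
```

In this sketch merging happens to look like plain removal of the first fence. Had p been stored to a shared location between the two initializations, the first fence could not have been moved past that store.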
-Doug

From boehm at acm.org Tue Nov 22 05:21:16 2016
From: boehm at acm.org (Hans Boehm)
Date: Mon, 21 Nov 2016 21:21:16 -0800
Subject: [jmm-dev] The JSR-133 Cookbook and final fields
In-Reply-To:
References: <6c9d4554-e4ba-f2a5-cf44-c3d33782674e@redhat.com>
Message-ID:

I think in general the intent is that ordering constraints associated with
operations on thread-local objects should be safe to eliminate. Clearly
explicit fences do not have the same property.

I think that it is OK to remove constructor fences associated with objects
whose final fields are not accessed by another thread. Clearly an explicit
fence in a constructor is entirely different. The same holds for
thread-local volatiles and monitors: the fences that would otherwise be
associated with them can be removed, but explicit fences can at best be
combined, as Doug suggests.