RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes

David Holmes david.holmes at oracle.com
Thu Dec 19 20:57:58 PST 2013


Sorry for the delay, it takes a while to catch up after two weeks 
vacation :) Next vacation (ie next two weeks) I'll continue to check emails.

On 2/12/2013 6:33 PM, Lindenmaier, Goetz wrote:
> Hi,
>
> ok, I understand the tests are wrong.  It's good this issue is settled.
> Thanks Aleksey and Andreas for going into the details of the proof!
>
> About our change:  David, the causality is the other way round.
> The change is about IRIW.
> 1. To pass IRIW, we must use sync instructions before loads.

This is the part I still have some question marks over as the 
implications are not nice for performance on non-TSO platforms. But I'm 
no further along in processing that paper I'm afraid.

> 2. If we do syncs before loads, we don't need to do them after stores.
> 3. If we don't do them after stores, we fail the volatile constructor tests.
> 4. So finally we added them again at the end of the constructor after stores
>       to pass the volatile constructor tests.

So we can at least undo #4 now we have established those tests were not 
required to pass.

> We originally passed the constructor tests because the ppc memory order
> instructions are not as find-granular as the
> operations in the IR.  MemBarVolatile is specified as StoreLoad.  The only instruction
> on PPC that does StoreLoad is sync. But sync also does StoreStore, therefore the
> MemBarVolatile after the store fixes the constructor tests.  The proper representation
> of the fix in the IR would be adding a MemBarStoreStore.  But now it's pointless
> anyways.
>
>> I'm not happy with the ifdef approach but I won't block it.
> I'd be happy to add a property
>    OrderAccess::cpu_is_multiple_copy_atomic()

A compile-time guard (ifdef) would be better than a runtime one I think 
- similar to the SUPPORTS_NATIVE_CX8 optimization (something semantic 
based not architecture based) as that will allows for turning this 
on/off for any architecture for testing purposes.

Thanks,
David

> or the like to guard the customization.  I'd like that much better. Or also
>     OrderAccess::needs_support_iriw_ordering()
>     VM_Version::needs_support_iriw_ordering()
>
>
> Best regards,
>    Goetz.
>
>
>
>
>
>
>
> -----Original Message-----
> From: David Holmes [mailto:david.holmes at oracle.com]
> Sent: Donnerstag, 28. November 2013 00:34
> To: Lindenmaier, Goetz
> Cc: 'hotspot-dev at openjdk.java.net'; 'ppc-aix-port-dev at openjdk.java.net'
> Subject: Re: RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes
>
> TL;DR version:
>
> Discussion on the c-i list has now confirmed that a constructor-barrier
> for volatiles is not required as part of the JMM specification. It *may*
> be required in an implementation that doesn't pre-zero memory to ensure
> you can't see uninitialized fields. So the tests for this are invalid
> and this part of the patch is not needed in general (ppc64 may need it
> due to other factors).
>
> Re: "multiple copy atomicity" - first thanks for correcting the term :)
> Second thanks for the reference to that paper! For reference:
>
> "The memory system (perhaps involving a hierarchy of buffers and a
> complex interconnect) does not guarantee that a write becomes visible to
> all other hardware threads at the same time point; these architectures
> are not multiple-copy atomic."
>
> This is the visibility issue that I referred to and affects both ARM and
> PPC. But of course it is normally handled by using suitable barriers
> after the stores that need to be visible. I think the crux of the
> current issue is what you wrote below:
>
>   > The fixes for the constructor issue are only needed because we
>   > remove the sync instruction from behind stores (parse3.cpp:320)
>   > and place it before loads.
>
> I hadn't grasped this part. Obviously if you fail to do the sync after
> the store then you have to do something around the loads to get the same
> results! I still don't know what lead you to the conclusion that the
> only way to fix the IRIW issue was to put the fence before the load -
> maybe when I get the chance to read that paper in full it will be clearer.
>
> So ... the basic problem is that the current structure in the VM has
> hard-wired one choice of how to get the right semantics for volatile
> variables. You now want to customize that but not all the requisite
> hooks are present. It would be better if volatile_load and
> volatile_store were factored out so that they could be implemented as
> desired per-platform. Alternatively there could be pre- and post- hooks
> that could then be customized per platform. Otherwise you need
> platform-specific ifdef's to handle it as per your patch.
>
> I'm not happy with the ifdef approach but I won't block it. I think this
> is an area where a lot of clean up is needed in the VM. The barrier
> abstractions are a confused mess in my opinion.
>
> Thanks,
> David
> -----
>
> On 28/11/2013 3:15 AM, Lindenmaier, Goetz wrote:
>> Hi,
>>
>> I updated the webrev to fix the issues mentioned by Vladimir:
>> http://cr.openjdk.java.net/~goetz/webrevs/8029101-0-raw/
>>
>> I did not yet add the
>>      OrderAccess::needs_support_iriw_ordering()
>>      VM_Version::needs_support_iriw_ordering()
>> or
>>     OrderAccess::cpu_is_multiple_copy_atomic()
>> to reduce #defined, as I got no further comment on that.
>>
>>
>> WRT to the validity of the tests and the interpretation of the JMM
>> I feel not in the position to contribute substantially.
>>
>> But we would like to pass the torture test suite as we consider
>> this a substantial task in implementing a PPC port.  Also we think
>> both tests show behavior a programmer would expect.  It's bad if
>> Java code runs fine on the more common x86 platform, and then
>> fails on ppc.  This will always first be blamed on the VM.
>>
>> The fixes for the constructor issue are only needed because we
>> remove the sync instruction from behind stores (parse3.cpp:320)
>> and place it before loads.  Then there is no sync between volatile store
>> and publishing the object.  So we add it again in this one case
>> (volatile store in constructor).
>>
>>
>> @David
>>>> Sure.  There also is no solution as you require for the taskqueue problem yet,
>>>> and that's being discussed now for almost a year.
>>> It may have started a year ago but work on it has hardly been continuous.
>> That's not true, we did a lot of investigation and testing on this issue.
>> And we came up with a solution we consider the best possible. If you
>> have objections, you should at least give the draft of a better solution,
>> we would volunteer to implement and test it.
>> Similarly, we invested time in fixing the concurrency torture issues.
>>
>> @David
>>> What is "multiple-read-atomicity"? I'm not familiar with the term and
>>> can't find any reference to it.
>> We learned about this reading "A Tutorial Introduction to the ARM and
>> POWER Relaxed Memory Models" by Luc Maranget, Susmit Sarkar and
>> Peter Sewell, which is cited in "Correct and Efficient Work-Stealing for
>> Weak Memory Models" by Nhat Minh Lê, Antoniu Pop, Albert Cohen
>> and Francesco Zappa Nardelli (PPoPP `13) when analysing the taskqueue problem.
>> http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
>>
>> I was wrong in one thing, it's called multiple copy atomicity, I used 'read'
>> instead.  Sorry for that.  (I also fixed that in the method name above).
>>
>> Best regards and thanks for all your involvements,
>>     Goetz.
>>
>>
>>
>> -----Original Message-----
>> From: David Holmes [mailto:david.holmes at oracle.com]
>> Sent: Mittwoch, 27. November 2013 12:53
>> To: Lindenmaier, Goetz
>> Cc: 'hotspot-dev at openjdk.java.net'; 'ppc-aix-port-dev at openjdk.java.net'
>> Subject: Re: RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes
>>
>> Hi Goetz,
>>
>> On 26/11/2013 10:51 PM, Lindenmaier, Goetz wrote:
>>> Hi David,
>>>
>>> -- Volatile in constuctor
>>>> AFAIK we have not seen those tests fail due to a
>>>> missing constructor barrier.
>>> We see them on PPC64. Our test machines have typically 8-32 processors
>>> and are Power 5-7.  But see also Aleksey's mail.  (Thanks Aleksey!)
>>
>> And see follow ups - the tests are invalid.
>>
>>> -- IRIW issue
>>>> I can not possibly answer to the necessary level of detail with a few
>>>> moments thought.
>>> Sure.  There also is no solution as you require for the taskqueue problem yet,
>>> and that's being discussed now for almost a year.
>>
>> It may have started a year ago but work on it has hardly been continuous.
>>
>>>> You are implying there is a problem here that will
>>>> impact numerous platforms (unless you can tell me why ppc is so different?)
>>> No, only PPC does not have 'multiple-read-atomicity'.  Therefore I contributed a
>>> solution with the #defines, and that's correct for all, but not nice, I admit.
>>> (I don't really know about ARM, though).
>>> So if I can write down a nicer solution testing for methods that are evaluated
>>> by the C-compiler I'm happy.
>>>
>>> The problem is not that IRIW is not handled by the JMM, the problem
>>> is that
>>>      store
>>>      sync
>>> does not assure multiple-read-atomicity,
>>> only
>>>      sync
>>>      load
>>> does so on PPC.  And you require multiple-read-atomicity to
>>> pass that test.
>>
>> What is "multiple-read-atomicity"? I'm not familiar with the term and
>> can't find any reference to it.
>>
>> Thanks,
>> David
>>
>>      The JMM is fine. And
>>>      store
>>>      MemBarVolatile
>>> is fine on x86, sparc etc. as there exist assembler instructions that
>>> do what is required.
>>>
>>> So if you are off soon, please let's come to a solution that
>>> might be improvable in the way it's implemented, but that
>>> allows us to implement a correct PPC64 port.
>>>
>>> Best regards,
>>>      Goetz.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: David Holmes [mailto:david.holmes at oracle.com]
>>> Sent: Tuesday, November 26, 2013 1:11 PM
>>> To: Lindenmaier, Goetz
>>> Cc: 'Vladimir Kozlov'; 'Vitaly Davidovich'; 'hotspot-dev at openjdk.java.net'; 'ppc-aix-port-dev at openjdk.java.net'
>>> Subject: Re: RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes
>>>
>>> Hi Goetz,
>>>
>>> On 26/11/2013 9:22 PM, Lindenmaier, Goetz wrote:
>>>> Hi everybody,
>>>>
>>>> thanks a lot for the detailed reviews!
>>>> I'll try to answer to all in one mail.
>>>>
>>>>> Volatile fields written in constructor aren't guaranteed by JMM to occur before the reference is assigned;
>>>> We don't think it's correct if we omit the barrier after initializing
>>>> a volatile field.  Previously, we discussed this with Aleksey Shipilev
>>>> and Doug Lea, and they agreed.
>>>> Also, concurrency torture tests
>>>>        LongVolatileTest
>>>>        AtomicIntegerInitialValueTest
>>>> will fail.
>>>> (In addition, observing 0 instead of the inital value of a volatile field would be
>>>> very counter-intuitive for Java programmers, especially in AtomicInteger.)
>>>
>>> The affects of unsafe publication are always surprising - volatiles do
>>> not add anything special here. AFAIK there is nothing in the JMM that
>>> requires the constructor barrier - discussions with Doug and Aleksey
>>> notwithstanding. AFAIK we have not seen those tests fail due to a
>>> missing constructor barrier.
>>>
>>>>> proposed for PPC64 is to make volatile reads extremely heavyweight
>>>> Yes, it costs measurable performance.  But else it is wrong.  We don't
>>>> see a way to implement this cheaper.
>>>>
>>>>> - these algorithms should be expressed using the correct OrderAccess operations
>>>> Basically, I agree on this.  But you also have to take into account
>>>> that due to the different memory ordering instructions on different platforms
>>>> just implementing something empty is not sufficient.
>>>> An example:
>>>>       MemBarRelease    // means LoadStore, StoreStore barrier
>>>>       MemBarVolatile    // means  StoreLoad barrier
>>>> If these are consecutively in the code, sparc code looks like this:
>>>>       MemBarRelease  --> membar(Assembler::LoadStore | Assembler::StoreStore)
>>>>       MemBarVolatile --> membar(Assembler::StoreLoad)
>>>> Just doing what is required.
>>>> On Power, we get suboptimal code, as there are no comparable,
>>>> fine grained operations:
>>>>        MemBarRelease --> lwsync   // Doing LoadStore, StoreStore, LoadLoad
>>>>       MemBarVolatile --> sync    // // Doing LoadStore, StoreStore, LoadLoad, StoreLoad
>>>> obviously, the lwsync is superfluous.  Thus, as PPC operations are more (too) powerful,
>>>> I need an additional optimization that removes the lwsync.  I can not implement
>>>> MemBarRelease empty, as it is also used independently.
>>>>
>>>> Back to the IRIW problem.  I think here we have a comparable issue.
>>>> Doing the MemBarVolatile or the OrderAccess::fence() before the read
>>>> is inefficient on platforms that have multiple-read-atomicity.
>>>>
>>>> I would propose to guard the code by
>>>> VM_Version::cpu_is_multiple_read_atomic() or even better
>>>> OrderAccess::cpu_is_multiple_read_atomic()
>>>> Else, David, how would you propose to implement this platform independent?
>>>> (Maybe we can also use above method in taskqueue.hpp.)
>>>
>>> I can not possibly answer to the necessary level of detail with a few
>>> moments thought. You are implying there is a problem here that will
>>> impact numerous platforms (unless you can tell me why ppc is so
>>> different?) and I can not take that on face value at the moment. The
>>> only reason I can see IRIW not being handled by the JMM requirements for
>>> volatile accesses is if there are global visibility issues that are not
>>> addressed - but even then I would expect heavy barriers at the store
>>> would deal with that, not at the load. (This situation reminds me of the
>>> need for read-barriers on Alpha architecture due to the use of software
>>> cache-coherency rather than hardware cache-coherency - but we don't have
>>> that on ppc!)
>>>
>>> Sorry - There is no quick resolution here and in a couple of days I will
>>> be heading out on vacation for two weeks.
>>>
>>> David
>>> -----
>>>
>>>> Best regards,
>>>>       Goetz.
>>>>
>>>> -- Other ports:
>>>> The IRIW issue requires at least 3 processors to be relevant, so it might
>>>> not happen on small machines.  But I can use PPC_ONLY instead
>>>> of PPC64_ONLY if you request so (and if we don't get rid of them).
>>>>
>>>> -- MemBarStoreStore after initialization
>>>> I agree we should not change it in the ppc port.  If you wish, I can
>>>> prepare an extra webrev for hotspot-comp.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: David Holmes [mailto:david.holmes at oracle.com]
>>>> Sent: Tuesday, November 26, 2013 2:49 AM
>>>> To: Vladimir Kozlov
>>>> Cc: Lindenmaier, Goetz; 'hotspot-dev at openjdk.java.net'; 'ppc-aix-port-dev at openjdk.java.net'
>>>> Subject: Re: RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes
>>>>
>>>> Okay this is my second attempt at answering this in a reasonable way :)
>>>>
>>>> On 26/11/2013 10:51 AM, Vladimir Kozlov wrote:
>>>>> I have to ask David to do correctness evaluation.
>>>>
>>>>      From what I understand what we see here is an attempt to fix an
>>>> existing issue with the implementation of volatiles so that the IRIW
>>>> problem is addressed. The solution proposed for PPC64 is to make
>>>> volatile reads extremely heavyweight by adding a fence() when doing the
>>>> load.
>>>>
>>>> Now if this was purely handled in ppc64 source code then I would be
>>>> happy to let them do whatever they like (surely this kills performance
>>>> though!). But I do not agree with the changes to the shared code that
>>>> allow this solution to be implemented - even with PPC64_ONLY this is
>>>> polluting the shared code. My concern is similar to what I said with the
>>>> taskQueue changes - these algorithms should be expressed using the
>>>> correct OrderAccess operations to guarantee the desired properties
>>>> independent of architecture. If such a "barrier" is not needed on a
>>>> given architecture then the implementation in OrderAccess should reduce
>>>> to a no-op.
>>>>
>>>> And as Vitaly points out the constructor barriers are not needed under
>>>> the JMM.
>>>>
>>>>> I am fine with suggested changes because you did not change our current
>>>>> code for our platforms (please, do not change do_exits() now).
>>>>> But may be it should be done using more general query which is set
>>>>> depending on platform:
>>>>>
>>>>> OrderAccess::needs_support_iriw_ordering()
>>>>>
>>>>> or similar to what we use now:
>>>>>
>>>>> VM_Version::needs_support_iriw_ordering()
>>>>
>>>> Every platform has to support IRIW this is simply part of the Java
>>>> Memory Model, there should not be any need to call this out explicitly
>>>> like this.
>>>>
>>>> Is there some subtlety of the hardware I am missing here? Are there
>>>> visibility issues beyond the ordering constraints that the JMM defines?
>>>>>      From what I understand our ppc port is also affected. David?
>>>>
>>>> We can not discuss that on an OpenJDK mailing list - sorry.
>>>>
>>>> David
>>>> -----
>>>>
>>>>> In library_call.cpp can you add {}? New comment should be inside  else {}.
>>>>>
>>>>> I think you should make _wrote_volatile field not ppc64 specific which
>>>>> will be set to 'true' only on ppc64. Then you will not need PPC64_ONLY()
>>>>> except in do_put_xxx() where it is set to true. Too many #ifdefs.
>>>>>
>>>>> In do_put_xxx() can you combine your changes:
>>>>>
>>>>>         if (is_vol) {
>>>>>           // See comment in do_get_xxx().
>>>>> #ifndef PPC64
>>>>>           insert_mem_bar(Op_MemBarVolatile); // Use fat membar
>>>>> #else
>>>>>           if (is_field) {
>>>>>             // Add MemBarRelease for constructors which write volatile field
>>>>> (PPC64).
>>>>>             set_wrote_volatile(true);
>>>>>           }
>>>>> #endif
>>>>>         }
>>>>>
>>>>> Thanks,
>>>>> Vladimir
>>>>>
>>>>> On 11/25/13 8:16 AM, Lindenmaier, Goetz wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I preprared a webrev with fixes for PPC for the VolatileIRIWTest of
>>>>>> the torture test suite:
>>>>>> http://cr.openjdk.java.net/~goetz/webrevs/8029101-0-raw/
>>>>>>
>>>>>> Example:
>>>>>> volatile x=0, y=0
>>>>>>        __________    __________    __________    __________
>>>>>> | Thread 0 |  | Thread 1 |  | Thread 2 |  | Thread 3 |
>>>>>>
>>>>>>         write(x=1)    read(x)       write(y=1)    read(y)
>>>>>>                       read(y)                     read(x)
>>>>>>
>>>>>> Disallowed:     x=1, y=0                    y=1, x=0
>>>>>>
>>>>>>
>>>>>> Solution: This example requires multiple-copy-atomicity. This is only
>>>>>> assured by the sync instruction and if it is executed in the threads
>>>>>> doing the loads. Thus we implement volatile read as sync-load-acquire
>>>>>> and omit the sync/MemBarVolatile after the volatile store.
>>>>>> MemBarVolatile happens to be implemented by sync.
>>>>>> We fix this in C2 and the cpp interpreter.
>>>>>>
>>>>>> This addresses a similar issue as fix "8012144: multiple SIGSEGVs
>>>>>> fails on staxf" for taskqueue.hpp.
>>>>>>
>>>>>> Further this change contains a fix that assures that volatile fields
>>>>>> written in constructors are visible before the reference gets
>>>>>> published.
>>>>>>
>>>>>>
>>>>>> Looking at the code, we found a MemBarRelease that to us, seems too
>>>>>> strong.
>>>>>> We think in parse1.cpp do_exits() a MemBarStoreStore should suffice.
>>>>>> What do you think?
>>>>>>
>>>>>> Please review and test this change.
>>>>>>
>>>>>> Best regards,
>>>>>>        Goetz.
>>>>>>


More information about the ppc-aix-port-dev mailing list