RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes

Lindenmaier, Goetz goetz.lindenmaier at sap.com
Fri Dec 20 04:47:10 PST 2013


Hi David,

> So we can at least undo #4 now we have established those tests were not 
> required to pass.
We would prefer if we could keep this in.  We want to avoid that it's
blamed on the VM if java programs are failing on PPC after they worked
on x86.  To clearly mark it as overfulfilling the spec I would guard it by 
a flag as proposed.  But if you insist I will remove it.  Also, this part is 
not that performance relevant.

> A compile-time guard (ifdef) would be better than a runtime one I think
I added a compile-time guard in this new webrev:
http://cr.openjdk.java.net/~goetz/webrevs/8029101-1-raw/
I've chosen CPU_NOT_MULTIPLE_COPY_ATOMIC.  This introduces 
several double negations I don't like, (#ifNdef CPU_NOT_MULTIPLE_COPY_ATOMIC) 
but this way I only have to change the ppc platform.

Best regards,
  Goetz

P.S.:  I will also be available over the Christmas period.

-----Original Message-----
From: David Holmes [mailto:david.holmes at oracle.com] 
Sent: Freitag, 20. Dezember 2013 05:58
To: Lindenmaier, Goetz
Cc: 'hotspot-dev at openjdk.java.net'; 'ppc-aix-port-dev at openjdk.java.net'
Subject: Re: RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes

Sorry for the delay, it takes a while to catch up after two weeks 
vacation :) Next vacation (ie next two weeks) I'll continue to check emails.

On 2/12/2013 6:33 PM, Lindenmaier, Goetz wrote:
> Hi,
>
> ok, I understand the tests are wrong.  It's good this issue is settled.
> Thanks Aleksey and Andreas for going into the details of the proof!
>
> About our change:  David, the causality is the other way round.
> The change is about IRIW.
> 1. To pass IRIW, we must use sync instructions before loads.

This is the part I still have some question marks over as the 
implications are not nice for performance on non-TSO platforms. But I'm 
no further along in processing that paper I'm afraid.

> 2. If we do syncs before loads, we don't need to do them after stores.
> 3. If we don't do them after stores, we fail the volatile constructor tests.
> 4. So finally we added them again at the end of the constructor after stores
>       to pass the volatile constructor tests.

So we can at least undo #4 now we have established those tests were not 
required to pass.

> We originally passed the constructor tests because the ppc memory order
> instructions are not as find-granular as the
> operations in the IR.  MemBarVolatile is specified as StoreLoad.  The only instruction
> on PPC that does StoreLoad is sync. But sync also does StoreStore, therefore the
> MemBarVolatile after the store fixes the constructor tests.  The proper representation
> of the fix in the IR would be adding a MemBarStoreStore.  But now it's pointless
> anyways.
>
>> I'm not happy with the ifdef approach but I won't block it.
> I'd be happy to add a property
>    OrderAccess::cpu_is_multiple_copy_atomic()

A compile-time guard (ifdef) would be better than a runtime one I think 
- similar to the SUPPORTS_NATIVE_CX8 optimization (something semantic 
based not architecture based) as that will allows for turning this 
on/off for any architecture for testing purposes.

Thanks,
David

> or the like to guard the customization.  I'd like that much better. Or also
>     OrderAccess::needs_support_iriw_ordering()
>     VM_Version::needs_support_iriw_ordering()
>
>
> Best regards,
>    Goetz.
>
>
>
>
>
>
>
> -----Original Message-----
> From: David Holmes [mailto:david.holmes at oracle.com]
> Sent: Donnerstag, 28. November 2013 00:34
> To: Lindenmaier, Goetz
> Cc: 'hotspot-dev at openjdk.java.net'; 'ppc-aix-port-dev at openjdk.java.net'
> Subject: Re: RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes
>
> TL;DR version:
>
> Discussion on the c-i list has now confirmed that a constructor-barrier
> for volatiles is not required as part of the JMM specification. It *may*
> be required in an implementation that doesn't pre-zero memory to ensure
> you can't see uninitialized fields. So the tests for this are invalid
> and this part of the patch is not needed in general (ppc64 may need it
> due to other factors).
>
> Re: "multiple copy atomicity" - first thanks for correcting the term :)
> Second thanks for the reference to that paper! For reference:
>
> "The memory system (perhaps involving a hierarchy of buffers and a
> complex interconnect) does not guarantee that a write becomes visible to
> all other hardware threads at the same time point; these architectures
> are not multiple-copy atomic."
>
> This is the visibility issue that I referred to and affects both ARM and
> PPC. But of course it is normally handled by using suitable barriers
> after the stores that need to be visible. I think the crux of the
> current issue is what you wrote below:
>
>   > The fixes for the constructor issue are only needed because we
>   > remove the sync instruction from behind stores (parse3.cpp:320)
>   > and place it before loads.
>
> I hadn't grasped this part. Obviously if you fail to do the sync after
> the store then you have to do something around the loads to get the same
> results! I still don't know what lead you to the conclusion that the
> only way to fix the IRIW issue was to put the fence before the load -
> maybe when I get the chance to read that paper in full it will be clearer.
>
> So ... the basic problem is that the current structure in the VM has
> hard-wired one choice of how to get the right semantics for volatile
> variables. You now want to customize that but not all the requisite
> hooks are present. It would be better if volatile_load and
> volatile_store were factored out so that they could be implemented as
> desired per-platform. Alternatively there could be pre- and post- hooks
> that could then be customized per platform. Otherwise you need
> platform-specific ifdef's to handle it as per your patch.
>
> I'm not happy with the ifdef approach but I won't block it. I think this
> is an area where a lot of clean up is needed in the VM. The barrier
> abstractions are a confused mess in my opinion.
>
> Thanks,
> David
> -----
>
> On 28/11/2013 3:15 AM, Lindenmaier, Goetz wrote:
>> Hi,
>>
>> I updated the webrev to fix the issues mentioned by Vladimir:
>> http://cr.openjdk.java.net/~goetz/webrevs/8029101-0-raw/
>>
>> I did not yet add the
>>      OrderAccess::needs_support_iriw_ordering()
>>      VM_Version::needs_support_iriw_ordering()
>> or
>>     OrderAccess::cpu_is_multiple_copy_atomic()
>> to reduce #defined, as I got no further comment on that.
>>
>>
>> WRT to the validity of the tests and the interpretation of the JMM
>> I feel not in the position to contribute substantially.
>>
>> But we would like to pass the torture test suite as we consider
>> this a substantial task in implementing a PPC port.  Also we think
>> both tests show behavior a programmer would expect.  It's bad if
>> Java code runs fine on the more common x86 platform, and then
>> fails on ppc.  This will always first be blamed on the VM.
>>
>> The fixes for the constructor issue are only needed because we
>> remove the sync instruction from behind stores (parse3.cpp:320)
>> and place it before loads.  Then there is no sync between volatile store
>> and publishing the object.  So we add it again in this one case
>> (volatile store in constructor).
>>
>>
>> @David
>>>> Sure.  There also is no solution as you require for the taskqueue problem yet,
>>>> and that's being discussed now for almost a year.
>>> It may have started a year ago but work on it has hardly been continuous.
>> That's not true, we did a lot of investigation and testing on this issue.
>> And we came up with a solution we consider the best possible. If you
>> have objections, you should at least give the draft of a better solution,
>> we would volunteer to implement and test it.
>> Similarly, we invested time in fixing the concurrency torture issues.
>>
>> @David
>>> What is "multiple-read-atomicity"? I'm not familiar with the term and
>>> can't find any reference to it.
>> We learned about this reading "A Tutorial Introduction to the ARM and
>> POWER Relaxed Memory Models" by Luc Maranget, Susmit Sarkar and
>> Peter Sewell, which is cited in "Correct and Efficient Work-Stealing for
>> Weak Memory Models" by Nhat Minh Lê, Antoniu Pop, Albert Cohen
>> and Francesco Zappa Nardelli (PPoPP `13) when analysing the taskqueue problem.
>> http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
>>
>> I was wrong in one thing, it's called multiple copy atomicity, I used 'read'
>> instead.  Sorry for that.  (I also fixed that in the method name above).
>>
>> Best regards and thanks for all your involvements,
>>     Goetz.
>>
>>
>>
>> -----Original Message-----
>> From: David Holmes [mailto:david.holmes at oracle.com]
>> Sent: Mittwoch, 27. November 2013 12:53
>> To: Lindenmaier, Goetz
>> Cc: 'hotspot-dev at openjdk.java.net'; 'ppc-aix-port-dev at openjdk.java.net'
>> Subject: Re: RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes
>>
>> Hi Goetz,
>>
>> On 26/11/2013 10:51 PM, Lindenmaier, Goetz wrote:
>>> Hi David,
>>>
>>> -- Volatile in constuctor
>>>> AFAIK we have not seen those tests fail due to a
>>>> missing constructor barrier.
>>> We see them on PPC64. Our test machines have typically 8-32 processors
>>> and are Power 5-7.  But see also Aleksey's mail.  (Thanks Aleksey!)
>>
>> And see follow ups - the tests are invalid.
>>
>>> -- IRIW issue
>>>> I can not possibly answer to the necessary level of detail with a few
>>>> moments thought.
>>> Sure.  There also is no solution as you require for the taskqueue problem yet,
>>> and that's being discussed now for almost a year.
>>
>> It may have started a year ago but work on it has hardly been continuous.
>>
>>>> You are implying there is a problem here that will
>>>> impact numerous platforms (unless you can tell me why ppc is so different?)
>>> No, only PPC does not have 'multiple-read-atomicity'.  Therefore I contributed a
>>> solution with the #defines, and that's correct for all, but not nice, I admit.
>>> (I don't really know about ARM, though).
>>> So if I can write down a nicer solution testing for methods that are evaluated
>>> by the C-compiler I'm happy.
>>>
>>> The problem is not that IRIW is not handled by the JMM, the problem
>>> is that
>>>      store
>>>      sync
>>> does not assure multiple-read-atomicity,
>>> only
>>>      sync
>>>      load
>>> does so on PPC.  And you require multiple-read-atomicity to
>>> pass that test.
>>
>> What is "multiple-read-atomicity"? I'm not familiar with the term and
>> can't find any reference to it.
>>
>> Thanks,
>> David
>>
>>      The JMM is fine. And
>>>      store
>>>      MemBarVolatile
>>> is fine on x86, sparc etc. as there exist assembler instructions that
>>> do what is required.
>>>
>>> So if you are off soon, please let's come to a solution that
>>> might be improvable in the way it's implemented, but that
>>> allows us to implement a correct PPC64 port.
>>>
>>> Best regards,
>>>      Goetz.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: David Holmes [mailto:david.holmes at oracle.com]
>>> Sent: Tuesday, November 26, 2013 1:11 PM
>>> To: Lindenmaier, Goetz
>>> Cc: 'Vladimir Kozlov'; 'Vitaly Davidovich'; 'hotspot-dev at openjdk.java.net'; 'ppc-aix-port-dev at openjdk.java.net'
>>> Subject: Re: RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes
>>>
>>> Hi Goetz,
>>>
>>> On 26/11/2013 9:22 PM, Lindenmaier, Goetz wrote:
>>>> Hi everybody,
>>>>
>>>> thanks a lot for the detailed reviews!
>>>> I'll try to answer to all in one mail.
>>>>
>>>>> Volatile fields written in constructor aren't guaranteed by JMM to occur before the reference is assigned;
>>>> We don't think it's correct if we omit the barrier after initializing
>>>> a volatile field.  Previously, we discussed this with Aleksey Shipilev
>>>> and Doug Lea, and they agreed.
>>>> Also, concurrency torture tests
>>>>        LongVolatileTest
>>>>        AtomicIntegerInitialValueTest
>>>> will fail.
>>>> (In addition, observing 0 instead of the inital value of a volatile field would be
>>>> very counter-intuitive for Java programmers, especially in AtomicInteger.)
>>>
>>> The affects of unsafe publication are always surprising - volatiles do
>>> not add anything special here. AFAIK there is nothing in the JMM that
>>> requires the constructor barrier - discussions with Doug and Aleksey
>>> notwithstanding. AFAIK we have not seen those tests fail due to a
>>> missing constructor barrier.
>>>
>>>>> proposed for PPC64 is to make volatile reads extremely heavyweight
>>>> Yes, it costs measurable performance.  But else it is wrong.  We don't
>>>> see a way to implement this cheaper.
>>>>
>>>>> - these algorithms should be expressed using the correct OrderAccess operations
>>>> Basically, I agree on this.  But you also have to take into account
>>>> that due to the different memory ordering instructions on different platforms
>>>> just implementing something empty is not sufficient.
>>>> An example:
>>>>       MemBarRelease    // means LoadStore, StoreStore barrier
>>>>       MemBarVolatile    // means  StoreLoad barrier
>>>> If these are consecutively in the code, sparc code looks like this:
>>>>       MemBarRelease  --> membar(Assembler::LoadStore | Assembler::StoreStore)
>>>>       MemBarVolatile --> membar(Assembler::StoreLoad)
>>>> Just doing what is required.
>>>> On Power, we get suboptimal code, as there are no comparable,
>>>> fine grained operations:
>>>>        MemBarRelease --> lwsync   // Doing LoadStore, StoreStore, LoadLoad
>>>>       MemBarVolatile --> sync    // // Doing LoadStore, StoreStore, LoadLoad, StoreLoad
>>>> obviously, the lwsync is superfluous.  Thus, as PPC operations are more (too) powerful,
>>>> I need an additional optimization that removes the lwsync.  I can not implement
>>>> MemBarRelease empty, as it is also used independently.
>>>>
>>>> Back to the IRIW problem.  I think here we have a comparable issue.
>>>> Doing the MemBarVolatile or the OrderAccess::fence() before the read
>>>> is inefficient on platforms that have multiple-read-atomicity.
>>>>
>>>> I would propose to guard the code by
>>>> VM_Version::cpu_is_multiple_read_atomic() or even better
>>>> OrderAccess::cpu_is_multiple_read_atomic()
>>>> Else, David, how would you propose to implement this platform independent?
>>>> (Maybe we can also use above method in taskqueue.hpp.)
>>>
>>> I can not possibly answer to the necessary level of detail with a few
>>> moments thought. You are implying there is a problem here that will
>>> impact numerous platforms (unless you can tell me why ppc is so
>>> different?) and I can not take that on face value at the moment. The
>>> only reason I can see IRIW not being handled by the JMM requirements for
>>> volatile accesses is if there are global visibility issues that are not
>>> addressed - but even then I would expect heavy barriers at the store
>>> would deal with that, not at the load. (This situation reminds me of the
>>> need for read-barriers on Alpha architecture due to the use of software
>>> cache-coherency rather than hardware cache-coherency - but we don't have
>>> that on ppc!)
>>>
>>> Sorry - There is no quick resolution here and in a couple of days I will
>>> be heading out on vacation for two weeks.
>>>
>>> David
>>> -----
>>>
>>>> Best regards,
>>>>       Goetz.
>>>>
>>>> -- Other ports:
>>>> The IRIW issue requires at least 3 processors to be relevant, so it might
>>>> not happen on small machines.  But I can use PPC_ONLY instead
>>>> of PPC64_ONLY if you request so (and if we don't get rid of them).
>>>>
>>>> -- MemBarStoreStore after initialization
>>>> I agree we should not change it in the ppc port.  If you wish, I can
>>>> prepare an extra webrev for hotspot-comp.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: David Holmes [mailto:david.holmes at oracle.com]
>>>> Sent: Tuesday, November 26, 2013 2:49 AM
>>>> To: Vladimir Kozlov
>>>> Cc: Lindenmaier, Goetz; 'hotspot-dev at openjdk.java.net'; 'ppc-aix-port-dev at openjdk.java.net'
>>>> Subject: Re: RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes
>>>>
>>>> Okay this is my second attempt at answering this in a reasonable way :)
>>>>
>>>> On 26/11/2013 10:51 AM, Vladimir Kozlov wrote:
>>>>> I have to ask David to do correctness evaluation.
>>>>
>>>>      From what I understand what we see here is an attempt to fix an
>>>> existing issue with the implementation of volatiles so that the IRIW
>>>> problem is addressed. The solution proposed for PPC64 is to make
>>>> volatile reads extremely heavyweight by adding a fence() when doing the
>>>> load.
>>>>
>>>> Now if this was purely handled in ppc64 source code then I would be
>>>> happy to let them do whatever they like (surely this kills performance
>>>> though!). But I do not agree with the changes to the shared code that
>>>> allow this solution to be implemented - even with PPC64_ONLY this is
>>>> polluting the shared code. My concern is similar to what I said with the
>>>> taskQueue changes - these algorithms should be expressed using the
>>>> correct OrderAccess operations to guarantee the desired properties
>>>> independent of architecture. If such a "barrier" is not needed on a
>>>> given architecture then the implementation in OrderAccess should reduce
>>>> to a no-op.
>>>>
>>>> And as Vitaly points out the constructor barriers are not needed under
>>>> the JMM.
>>>>
>>>>> I am fine with suggested changes because you did not change our current
>>>>> code for our platforms (please, do not change do_exits() now).
>>>>> But may be it should be done using more general query which is set
>>>>> depending on platform:
>>>>>
>>>>> OrderAccess::needs_support_iriw_ordering()
>>>>>
>>>>> or similar to what we use now:
>>>>>
>>>>> VM_Version::needs_support_iriw_ordering()
>>>>
>>>> Every platform has to support IRIW this is simply part of the Java
>>>> Memory Model, there should not be any need to call this out explicitly
>>>> like this.
>>>>
>>>> Is there some subtlety of the hardware I am missing here? Are there
>>>> visibility issues beyond the ordering constraints that the JMM defines?
>>>>>      From what I understand our ppc port is also affected. David?
>>>>
>>>> We can not discuss that on an OpenJDK mailing list - sorry.
>>>>
>>>> David
>>>> -----
>>>>
>>>>> In library_call.cpp can you add {}? New comment should be inside  else {}.
>>>>>
>>>>> I think you should make _wrote_volatile field not ppc64 specific which
>>>>> will be set to 'true' only on ppc64. Then you will not need PPC64_ONLY()
>>>>> except in do_put_xxx() where it is set to true. Too many #ifdefs.
>>>>>
>>>>> In do_put_xxx() can you combine your changes:
>>>>>
>>>>>         if (is_vol) {
>>>>>           // See comment in do_get_xxx().
>>>>> #ifndef PPC64
>>>>>           insert_mem_bar(Op_MemBarVolatile); // Use fat membar
>>>>> #else
>>>>>           if (is_field) {
>>>>>             // Add MemBarRelease for constructors which write volatile field
>>>>> (PPC64).
>>>>>             set_wrote_volatile(true);
>>>>>           }
>>>>> #endif
>>>>>         }
>>>>>
>>>>> Thanks,
>>>>> Vladimir
>>>>>
>>>>> On 11/25/13 8:16 AM, Lindenmaier, Goetz wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I preprared a webrev with fixes for PPC for the VolatileIRIWTest of
>>>>>> the torture test suite:
>>>>>> http://cr.openjdk.java.net/~goetz/webrevs/8029101-0-raw/
>>>>>>
>>>>>> Example:
>>>>>> volatile x=0, y=0
>>>>>>        __________    __________    __________    __________
>>>>>> | Thread 0 |  | Thread 1 |  | Thread 2 |  | Thread 3 |
>>>>>>
>>>>>>         write(x=1)    read(x)       write(y=1)    read(y)
>>>>>>                       read(y)                     read(x)
>>>>>>
>>>>>> Disallowed:     x=1, y=0                    y=1, x=0
>>>>>>
>>>>>>
>>>>>> Solution: This example requires multiple-copy-atomicity. This is only
>>>>>> assured by the sync instruction and if it is executed in the threads
>>>>>> doing the loads. Thus we implement volatile read as sync-load-acquire
>>>>>> and omit the sync/MemBarVolatile after the volatile store.
>>>>>> MemBarVolatile happens to be implemented by sync.
>>>>>> We fix this in C2 and the cpp interpreter.
>>>>>>
>>>>>> This addresses a similar issue as fix "8012144: multiple SIGSEGVs
>>>>>> fails on staxf" for taskqueue.hpp.
>>>>>>
>>>>>> Further this change contains a fix that assures that volatile fields
>>>>>> written in constructors are visible before the reference gets
>>>>>> published.
>>>>>>
>>>>>>
>>>>>> Looking at the code, we found a MemBarRelease that to us, seems too
>>>>>> strong.
>>>>>> We think in parse1.cpp do_exits() a MemBarStoreStore should suffice.
>>>>>> What do you think?
>>>>>>
>>>>>> Please review and test this change.
>>>>>>
>>>>>> Best regards,
>>>>>>        Goetz.
>>>>>>


More information about the hotspot-dev mailing list