RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes
Vitaly Davidovich
vitalyd at gmail.com
Wed Nov 27 17:25:33 PST 2013
Just want to say (as I did on the c-i thread) that having to emit a fence
for non pre zeroing implementations would need to happen for normal fields
as well, so I don't even see how that part is at all relevant to volatiles.
Sent from my phone
On Nov 27, 2013 6:35 PM, "David Holmes" <david.holmes at oracle.com> wrote:
> TL;DR version:
>
> Discussion on the c-i list has now confirmed that a constructor-barrier
> for volatiles is not required as part of the JMM specification. It *may* be
> required in an implementation that doesn't pre-zero memory to ensure you
> can't see uninitialized fields. So the tests for this are invalid and this
> part of the patch is not needed in general (ppc64 may need it due to other
> factors).
>
> Re: "multiple copy atomicity" - first thanks for correcting the term :)
> Second thanks for the reference to that paper! For reference:
>
> "The memory system (perhaps involving a hierarchy of buffers and a complex
> interconnect) does not guarantee that a write becomes visible to all other
> hardware threads at the same time point; these architectures are not
> multiple-copy atomic."
>
> This is the visibility issue that I referred to and affects both ARM and
> PPC. But of course it is normally handled by using suitable barriers after
> the stores that need to be visible. I think the crux of the current issue
> is what you wrote below:
>
> > The fixes for the constructor issue are only needed because we
> > remove the sync instruction from behind stores (parse3.cpp:320)
> > and place it before loads.
>
> I hadn't grasped this part. Obviously if you fail to do the sync after the
> store then you have to do something around the loads to get the same
> results! I still don't know what lead you to the conclusion that the only
> way to fix the IRIW issue was to put the fence before the load - maybe when
> I get the chance to read that paper in full it will be clearer.
>
> So ... the basic problem is that the current structure in the VM has
> hard-wired one choice of how to get the right semantics for volatile
> variables. You now want to customize that but not all the requisite hooks
> are present. It would be better if volatile_load and volatile_store were
> factored out so that they could be implemented as desired per-platform.
> Alternatively there could be pre- and post- hooks that could then be
> customized per platform. Otherwise you need platform-specific ifdef's to
> handle it as per your patch.
>
> I'm not happy with the ifdef approach but I won't block it. I think this
> is an area where a lot of clean up is needed in the VM. The barrier
> abstractions are a confused mess in my opinion.
>
> Thanks,
> David
> -----
>
> On 28/11/2013 3:15 AM, Lindenmaier, Goetz wrote:
>
>> Hi,
>>
>> I updated the webrev to fix the issues mentioned by Vladimir:
>> http://cr.openjdk.java.net/~goetz/webrevs/8029101-0-raw/
>>
>> I did not yet add the
>> OrderAccess::needs_support_iriw_ordering()
>> VM_Version::needs_support_iriw_ordering()
>> or
>> OrderAccess::cpu_is_multiple_copy_atomic()
>> to reduce #defined, as I got no further comment on that.
>>
>>
>> WRT to the validity of the tests and the interpretation of the JMM
>> I feel not in the position to contribute substantially.
>>
>> But we would like to pass the torture test suite as we consider
>> this a substantial task in implementing a PPC port. Also we think
>> both tests show behavior a programmer would expect. It's bad if
>> Java code runs fine on the more common x86 platform, and then
>> fails on ppc. This will always first be blamed on the VM.
>>
>> The fixes for the constructor issue are only needed because we
>> remove the sync instruction from behind stores (parse3.cpp:320)
>> and place it before loads. Then there is no sync between volatile store
>> and publishing the object. So we add it again in this one case
>> (volatile store in constructor).
>>
>>
>> @David
>>
>>> Sure. There also is no solution as you require for the taskqueue
>>>> problem yet,
>>>> and that's being discussed now for almost a year.
>>>>
>>> It may have started a year ago but work on it has hardly been continuous.
>>>
>> That's not true, we did a lot of investigation and testing on this issue.
>> And we came up with a solution we consider the best possible. If you
>> have objections, you should at least give the draft of a better solution,
>> we would volunteer to implement and test it.
>> Similarly, we invested time in fixing the concurrency torture issues.
>>
>> @David
>>
>>> What is "multiple-read-atomicity"? I'm not familiar with the term and
>>> can't find any reference to it.
>>>
>> We learned about this reading "A Tutorial Introduction to the ARM and
>> POWER Relaxed Memory Models" by Luc Maranget, Susmit Sarkar and
>> Peter Sewell, which is cited in "Correct and Efficient Work-Stealing for
>> Weak Memory Models" by Nhat Minh Lê, Antoniu Pop, Albert Cohen
>> and Francesco Zappa Nardelli (PPoPP `13) when analysing the taskqueue
>> problem.
>> http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
>>
>> I was wrong in one thing, it's called multiple copy atomicity, I used
>> 'read'
>> instead. Sorry for that. (I also fixed that in the method name above).
>>
>> Best regards and thanks for all your involvements,
>> Goetz.
>>
>>
>>
>> -----Original Message-----
>> From: David Holmes [mailto:david.holmes at oracle.com]
>> Sent: Mittwoch, 27. November 2013 12:53
>> To: Lindenmaier, Goetz
>> Cc: 'hotspot-dev at openjdk.java.net'; 'ppc-aix-port-dev at openjdk.java.net'
>> Subject: Re: RFR(M): 8029101: PPC64 (part 211): ordering of Independent
>> Reads of Independent Writes
>>
>> Hi Goetz,
>>
>> On 26/11/2013 10:51 PM, Lindenmaier, Goetz wrote:
>>
>>> Hi David,
>>>
>>> -- Volatile in constuctor
>>>
>>>> AFAIK we have not seen those tests fail due to a
>>>> missing constructor barrier.
>>>>
>>> We see them on PPC64. Our test machines have typically 8-32 processors
>>> and are Power 5-7. But see also Aleksey's mail. (Thanks Aleksey!)
>>>
>>
>> And see follow ups - the tests are invalid.
>>
>> -- IRIW issue
>>>
>>>> I can not possibly answer to the necessary level of detail with a few
>>>> moments thought.
>>>>
>>> Sure. There also is no solution as you require for the taskqueue
>>> problem yet,
>>> and that's being discussed now for almost a year.
>>>
>>
>> It may have started a year ago but work on it has hardly been continuous.
>>
>> You are implying there is a problem here that will
>>>> impact numerous platforms (unless you can tell me why ppc is so
>>>> different?)
>>>>
>>> No, only PPC does not have 'multiple-read-atomicity'. Therefore I
>>> contributed a
>>> solution with the #defines, and that's correct for all, but not nice, I
>>> admit.
>>> (I don't really know about ARM, though).
>>> So if I can write down a nicer solution testing for methods that are
>>> evaluated
>>> by the C-compiler I'm happy.
>>>
>>> The problem is not that IRIW is not handled by the JMM, the problem
>>> is that
>>> store
>>> sync
>>> does not assure multiple-read-atomicity,
>>> only
>>> sync
>>> load
>>> does so on PPC. And you require multiple-read-atomicity to
>>> pass that test.
>>>
>>
>> What is "multiple-read-atomicity"? I'm not familiar with the term and
>> can't find any reference to it.
>>
>> Thanks,
>> David
>>
>> The JMM is fine. And
>>
>>> store
>>> MemBarVolatile
>>> is fine on x86, sparc etc. as there exist assembler instructions that
>>> do what is required.
>>>
>>> So if you are off soon, please let's come to a solution that
>>> might be improvable in the way it's implemented, but that
>>> allows us to implement a correct PPC64 port.
>>>
>>> Best regards,
>>> Goetz.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: David Holmes [mailto:david.holmes at oracle.com]
>>> Sent: Tuesday, November 26, 2013 1:11 PM
>>> To: Lindenmaier, Goetz
>>> Cc: 'Vladimir Kozlov'; 'Vitaly Davidovich'; '
>>> hotspot-dev at openjdk.java.net'; 'ppc-aix-port-dev at openjdk.java.net'
>>> Subject: Re: RFR(M): 8029101: PPC64 (part 211): ordering of Independent
>>> Reads of Independent Writes
>>>
>>> Hi Goetz,
>>>
>>> On 26/11/2013 9:22 PM, Lindenmaier, Goetz wrote:
>>>
>>>> Hi everybody,
>>>>
>>>> thanks a lot for the detailed reviews!
>>>> I'll try to answer to all in one mail.
>>>>
>>>> Volatile fields written in constructor aren't guaranteed by JMM to
>>>>> occur before the reference is assigned;
>>>>>
>>>> We don't think it's correct if we omit the barrier after initializing
>>>> a volatile field. Previously, we discussed this with Aleksey Shipilev
>>>> and Doug Lea, and they agreed.
>>>> Also, concurrency torture tests
>>>> LongVolatileTest
>>>> AtomicIntegerInitialValueTest
>>>> will fail.
>>>> (In addition, observing 0 instead of the inital value of a volatile
>>>> field would be
>>>> very counter-intuitive for Java programmers, especially in
>>>> AtomicInteger.)
>>>>
>>>
>>> The affects of unsafe publication are always surprising - volatiles do
>>> not add anything special here. AFAIK there is nothing in the JMM that
>>> requires the constructor barrier - discussions with Doug and Aleksey
>>> notwithstanding. AFAIK we have not seen those tests fail due to a
>>> missing constructor barrier.
>>>
>>> proposed for PPC64 is to make volatile reads extremely heavyweight
>>>>>
>>>> Yes, it costs measurable performance. But else it is wrong. We don't
>>>> see a way to implement this cheaper.
>>>>
>>>> - these algorithms should be expressed using the correct OrderAccess
>>>>> operations
>>>>>
>>>> Basically, I agree on this. But you also have to take into account
>>>> that due to the different memory ordering instructions on different
>>>> platforms
>>>> just implementing something empty is not sufficient.
>>>> An example:
>>>> MemBarRelease // means LoadStore, StoreStore barrier
>>>> MemBarVolatile // means StoreLoad barrier
>>>> If these are consecutively in the code, sparc code looks like this:
>>>> MemBarRelease --> membar(Assembler::LoadStore |
>>>> Assembler::StoreStore)
>>>> MemBarVolatile --> membar(Assembler::StoreLoad)
>>>> Just doing what is required.
>>>> On Power, we get suboptimal code, as there are no comparable,
>>>> fine grained operations:
>>>> MemBarRelease --> lwsync // Doing LoadStore, StoreStore,
>>>> LoadLoad
>>>> MemBarVolatile --> sync // // Doing LoadStore, StoreStore,
>>>> LoadLoad, StoreLoad
>>>> obviously, the lwsync is superfluous. Thus, as PPC operations are more
>>>> (too) powerful,
>>>> I need an additional optimization that removes the lwsync. I can not
>>>> implement
>>>> MemBarRelease empty, as it is also used independently.
>>>>
>>>> Back to the IRIW problem. I think here we have a comparable issue.
>>>> Doing the MemBarVolatile or the OrderAccess::fence() before the read
>>>> is inefficient on platforms that have multiple-read-atomicity.
>>>>
>>>> I would propose to guard the code by
>>>> VM_Version::cpu_is_multiple_read_atomic() or even better
>>>> OrderAccess::cpu_is_multiple_read_atomic()
>>>> Else, David, how would you propose to implement this platform
>>>> independent?
>>>> (Maybe we can also use above method in taskqueue.hpp.)
>>>>
>>>
>>> I can not possibly answer to the necessary level of detail with a few
>>> moments thought. You are implying there is a problem here that will
>>> impact numerous platforms (unless you can tell me why ppc is so
>>> different?) and I can not take that on face value at the moment. The
>>> only reason I can see IRIW not being handled by the JMM requirements for
>>> volatile accesses is if there are global visibility issues that are not
>>> addressed - but even then I would expect heavy barriers at the store
>>> would deal with that, not at the load. (This situation reminds me of the
>>> need for read-barriers on Alpha architecture due to the use of software
>>> cache-coherency rather than hardware cache-coherency - but we don't have
>>> that on ppc!)
>>>
>>> Sorry - There is no quick resolution here and in a couple of days I will
>>> be heading out on vacation for two weeks.
>>>
>>> David
>>> -----
>>>
>>> Best regards,
>>>> Goetz.
>>>>
>>>> -- Other ports:
>>>> The IRIW issue requires at least 3 processors to be relevant, so it
>>>> might
>>>> not happen on small machines. But I can use PPC_ONLY instead
>>>> of PPC64_ONLY if you request so (and if we don't get rid of them).
>>>>
>>>> -- MemBarStoreStore after initialization
>>>> I agree we should not change it in the ppc port. If you wish, I can
>>>> prepare an extra webrev for hotspot-comp.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: David Holmes [mailto:david.holmes at oracle.com]
>>>> Sent: Tuesday, November 26, 2013 2:49 AM
>>>> To: Vladimir Kozlov
>>>> Cc: Lindenmaier, Goetz; 'hotspot-dev at openjdk.java.net'; '
>>>> ppc-aix-port-dev at openjdk.java.net'
>>>> Subject: Re: RFR(M): 8029101: PPC64 (part 211): ordering of Independent
>>>> Reads of Independent Writes
>>>>
>>>> Okay this is my second attempt at answering this in a reasonable way :)
>>>>
>>>> On 26/11/2013 10:51 AM, Vladimir Kozlov wrote:
>>>>
>>>>> I have to ask David to do correctness evaluation.
>>>>>
>>>>
>>>> From what I understand what we see here is an attempt to fix an
>>>> existing issue with the implementation of volatiles so that the IRIW
>>>> problem is addressed. The solution proposed for PPC64 is to make
>>>> volatile reads extremely heavyweight by adding a fence() when doing the
>>>> load.
>>>>
>>>> Now if this was purely handled in ppc64 source code then I would be
>>>> happy to let them do whatever they like (surely this kills performance
>>>> though!). But I do not agree with the changes to the shared code that
>>>> allow this solution to be implemented - even with PPC64_ONLY this is
>>>> polluting the shared code. My concern is similar to what I said with the
>>>> taskQueue changes - these algorithms should be expressed using the
>>>> correct OrderAccess operations to guarantee the desired properties
>>>> independent of architecture. If such a "barrier" is not needed on a
>>>> given architecture then the implementation in OrderAccess should reduce
>>>> to a no-op.
>>>>
>>>> And as Vitaly points out the constructor barriers are not needed under
>>>> the JMM.
>>>>
>>>> I am fine with suggested changes because you did not change our current
>>>>> code for our platforms (please, do not change do_exits() now).
>>>>> But may be it should be done using more general query which is set
>>>>> depending on platform:
>>>>>
>>>>> OrderAccess::needs_support_iriw_ordering()
>>>>>
>>>>> or similar to what we use now:
>>>>>
>>>>> VM_Version::needs_support_iriw_ordering()
>>>>>
>>>>
>>>> Every platform has to support IRIW this is simply part of the Java
>>>> Memory Model, there should not be any need to call this out explicitly
>>>> like this.
>>>>
>>>> Is there some subtlety of the hardware I am missing here? Are there
>>>> visibility issues beyond the ordering constraints that the JMM defines?
>>>>
>>>>> From what I understand our ppc port is also affected. David?
>>>>>
>>>>
>>>> We can not discuss that on an OpenJDK mailing list - sorry.
>>>>
>>>> David
>>>> -----
>>>>
>>>> In library_call.cpp can you add {}? New comment should be inside else
>>>>> {}.
>>>>>
>>>>> I think you should make _wrote_volatile field not ppc64 specific which
>>>>> will be set to 'true' only on ppc64. Then you will not need
>>>>> PPC64_ONLY()
>>>>> except in do_put_xxx() where it is set to true. Too many #ifdefs.
>>>>>
>>>>> In do_put_xxx() can you combine your changes:
>>>>>
>>>>> if (is_vol) {
>>>>> // See comment in do_get_xxx().
>>>>> #ifndef PPC64
>>>>> insert_mem_bar(Op_MemBarVolatile); // Use fat membar
>>>>> #else
>>>>> if (is_field) {
>>>>> // Add MemBarRelease for constructors which write volatile
>>>>> field
>>>>> (PPC64).
>>>>> set_wrote_volatile(true);
>>>>> }
>>>>> #endif
>>>>> }
>>>>>
>>>>> Thanks,
>>>>> Vladimir
>>>>>
>>>>> On 11/25/13 8:16 AM, Lindenmaier, Goetz wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I preprared a webrev with fixes for PPC for the VolatileIRIWTest of
>>>>>> the torture test suite:
>>>>>> http://cr.openjdk.java.net/~goetz/webrevs/8029101-0-raw/
>>>>>>
>>>>>> Example:
>>>>>> volatile x=0, y=0
>>>>>> __________ __________ __________ __________
>>>>>> | Thread 0 | | Thread 1 | | Thread 2 | | Thread 3 |
>>>>>>
>>>>>> write(x=1) read(x) write(y=1) read(y)
>>>>>> read(y) read(x)
>>>>>>
>>>>>> Disallowed: x=1, y=0 y=1, x=0
>>>>>>
>>>>>>
>>>>>> Solution: This example requires multiple-copy-atomicity. This is only
>>>>>> assured by the sync instruction and if it is executed in the threads
>>>>>> doing the loads. Thus we implement volatile read as sync-load-acquire
>>>>>> and omit the sync/MemBarVolatile after the volatile store.
>>>>>> MemBarVolatile happens to be implemented by sync.
>>>>>> We fix this in C2 and the cpp interpreter.
>>>>>>
>>>>>> This addresses a similar issue as fix "8012144: multiple SIGSEGVs
>>>>>> fails on staxf" for taskqueue.hpp.
>>>>>>
>>>>>> Further this change contains a fix that assures that volatile fields
>>>>>> written in constructors are visible before the reference gets
>>>>>> published.
>>>>>>
>>>>>>
>>>>>> Looking at the code, we found a MemBarRelease that to us, seems too
>>>>>> strong.
>>>>>> We think in parse1.cpp do_exits() a MemBarStoreStore should suffice.
>>>>>> What do you think?
>>>>>>
>>>>>> Please review and test this change.
>>>>>>
>>>>>> Best regards,
>>>>>> Goetz.
>>>>>>
>>>>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/attachments/20131127/6c3b9fbe/attachment-0001.html
More information about the ppc-aix-port-dev
mailing list