RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes
Lindenmaier, Goetz
goetz.lindenmaier at sap.com
Mon Dec 2 00:33:16 PST 2013
Hi,
ok, I understand the tests are wrong. It's good this issue is settled.
Thanks Aleksey and Andreas for going into the details of the proof!
About our change: David, the causality is the other way round.
The change is about IRIW.
1. To pass IRIW, we must use sync instructions before loads.
2. If we do syncs before loads, we don't need to do them after stores.
3. If we don't do them after stores, we fail the volatile constructor tests.
4. So finally we added them again at the end of the constructor after stores
to pass the volatile constructor tests.
We originally passed the constructor tests because the ppc memory order
instructions are not as find-granular as the
operations in the IR. MemBarVolatile is specified as StoreLoad. The only instruction
on PPC that does StoreLoad is sync. But sync also does StoreStore, therefore the
MemBarVolatile after the store fixes the constructor tests. The proper representation
of the fix in the IR would be adding a MemBarStoreStore. But now it's pointless
anyways.
> I'm not happy with the ifdef approach but I won't block it.
I'd be happy to add a property
OrderAccess::cpu_is_multiple_copy_atomic()
or the like to guard the customization. I'd like that much better. Or also
OrderAccess::needs_support_iriw_ordering()
VM_Version::needs_support_iriw_ordering()
Best regards,
Goetz.
-----Original Message-----
From: David Holmes [mailto:david.holmes at oracle.com]
Sent: Donnerstag, 28. November 2013 00:34
To: Lindenmaier, Goetz
Cc: 'hotspot-dev at openjdk.java.net'; 'ppc-aix-port-dev at openjdk.java.net'
Subject: Re: RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes
TL;DR version:
Discussion on the c-i list has now confirmed that a constructor-barrier
for volatiles is not required as part of the JMM specification. It *may*
be required in an implementation that doesn't pre-zero memory to ensure
you can't see uninitialized fields. So the tests for this are invalid
and this part of the patch is not needed in general (ppc64 may need it
due to other factors).
Re: "multiple copy atomicity" - first thanks for correcting the term :)
Second thanks for the reference to that paper! For reference:
"The memory system (perhaps involving a hierarchy of buffers and a
complex interconnect) does not guarantee that a write becomes visible to
all other hardware threads at the same time point; these architectures
are not multiple-copy atomic."
This is the visibility issue that I referred to and affects both ARM and
PPC. But of course it is normally handled by using suitable barriers
after the stores that need to be visible. I think the crux of the
current issue is what you wrote below:
> The fixes for the constructor issue are only needed because we
> remove the sync instruction from behind stores (parse3.cpp:320)
> and place it before loads.
I hadn't grasped this part. Obviously if you fail to do the sync after
the store then you have to do something around the loads to get the same
results! I still don't know what lead you to the conclusion that the
only way to fix the IRIW issue was to put the fence before the load -
maybe when I get the chance to read that paper in full it will be clearer.
So ... the basic problem is that the current structure in the VM has
hard-wired one choice of how to get the right semantics for volatile
variables. You now want to customize that but not all the requisite
hooks are present. It would be better if volatile_load and
volatile_store were factored out so that they could be implemented as
desired per-platform. Alternatively there could be pre- and post- hooks
that could then be customized per platform. Otherwise you need
platform-specific ifdef's to handle it as per your patch.
I'm not happy with the ifdef approach but I won't block it. I think this
is an area where a lot of clean up is needed in the VM. The barrier
abstractions are a confused mess in my opinion.
Thanks,
David
-----
On 28/11/2013 3:15 AM, Lindenmaier, Goetz wrote:
> Hi,
>
> I updated the webrev to fix the issues mentioned by Vladimir:
> http://cr.openjdk.java.net/~goetz/webrevs/8029101-0-raw/
>
> I did not yet add the
> OrderAccess::needs_support_iriw_ordering()
> VM_Version::needs_support_iriw_ordering()
> or
> OrderAccess::cpu_is_multiple_copy_atomic()
> to reduce #defined, as I got no further comment on that.
>
>
> WRT to the validity of the tests and the interpretation of the JMM
> I feel not in the position to contribute substantially.
>
> But we would like to pass the torture test suite as we consider
> this a substantial task in implementing a PPC port. Also we think
> both tests show behavior a programmer would expect. It's bad if
> Java code runs fine on the more common x86 platform, and then
> fails on ppc. This will always first be blamed on the VM.
>
> The fixes for the constructor issue are only needed because we
> remove the sync instruction from behind stores (parse3.cpp:320)
> and place it before loads. Then there is no sync between volatile store
> and publishing the object. So we add it again in this one case
> (volatile store in constructor).
>
>
> @David
>>> Sure. There also is no solution as you require for the taskqueue problem yet,
>>> and that's being discussed now for almost a year.
>> It may have started a year ago but work on it has hardly been continuous.
> That's not true, we did a lot of investigation and testing on this issue.
> And we came up with a solution we consider the best possible. If you
> have objections, you should at least give the draft of a better solution,
> we would volunteer to implement and test it.
> Similarly, we invested time in fixing the concurrency torture issues.
>
> @David
>> What is "multiple-read-atomicity"? I'm not familiar with the term and
>> can't find any reference to it.
> We learned about this reading "A Tutorial Introduction to the ARM and
> POWER Relaxed Memory Models" by Luc Maranget, Susmit Sarkar and
> Peter Sewell, which is cited in "Correct and Efficient Work-Stealing for
> Weak Memory Models" by Nhat Minh Lê, Antoniu Pop, Albert Cohen
> and Francesco Zappa Nardelli (PPoPP `13) when analysing the taskqueue problem.
> http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
>
> I was wrong in one thing, it's called multiple copy atomicity, I used 'read'
> instead. Sorry for that. (I also fixed that in the method name above).
>
> Best regards and thanks for all your involvements,
> Goetz.
>
>
>
> -----Original Message-----
> From: David Holmes [mailto:david.holmes at oracle.com]
> Sent: Mittwoch, 27. November 2013 12:53
> To: Lindenmaier, Goetz
> Cc: 'hotspot-dev at openjdk.java.net'; 'ppc-aix-port-dev at openjdk.java.net'
> Subject: Re: RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes
>
> Hi Goetz,
>
> On 26/11/2013 10:51 PM, Lindenmaier, Goetz wrote:
>> Hi David,
>>
>> -- Volatile in constuctor
>>> AFAIK we have not seen those tests fail due to a
>>> missing constructor barrier.
>> We see them on PPC64. Our test machines have typically 8-32 processors
>> and are Power 5-7. But see also Aleksey's mail. (Thanks Aleksey!)
>
> And see follow ups - the tests are invalid.
>
>> -- IRIW issue
>>> I can not possibly answer to the necessary level of detail with a few
>>> moments thought.
>> Sure. There also is no solution as you require for the taskqueue problem yet,
>> and that's being discussed now for almost a year.
>
> It may have started a year ago but work on it has hardly been continuous.
>
>>> You are implying there is a problem here that will
>>> impact numerous platforms (unless you can tell me why ppc is so different?)
>> No, only PPC does not have 'multiple-read-atomicity'. Therefore I contributed a
>> solution with the #defines, and that's correct for all, but not nice, I admit.
>> (I don't really know about ARM, though).
>> So if I can write down a nicer solution testing for methods that are evaluated
>> by the C-compiler I'm happy.
>>
>> The problem is not that IRIW is not handled by the JMM, the problem
>> is that
>> store
>> sync
>> does not assure multiple-read-atomicity,
>> only
>> sync
>> load
>> does so on PPC. And you require multiple-read-atomicity to
>> pass that test.
>
> What is "multiple-read-atomicity"? I'm not familiar with the term and
> can't find any reference to it.
>
> Thanks,
> David
>
> The JMM is fine. And
>> store
>> MemBarVolatile
>> is fine on x86, sparc etc. as there exist assembler instructions that
>> do what is required.
>>
>> So if you are off soon, please let's come to a solution that
>> might be improvable in the way it's implemented, but that
>> allows us to implement a correct PPC64 port.
>>
>> Best regards,
>> Goetz.
>>
>>
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: David Holmes [mailto:david.holmes at oracle.com]
>> Sent: Tuesday, November 26, 2013 1:11 PM
>> To: Lindenmaier, Goetz
>> Cc: 'Vladimir Kozlov'; 'Vitaly Davidovich'; 'hotspot-dev at openjdk.java.net'; 'ppc-aix-port-dev at openjdk.java.net'
>> Subject: Re: RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes
>>
>> Hi Goetz,
>>
>> On 26/11/2013 9:22 PM, Lindenmaier, Goetz wrote:
>>> Hi everybody,
>>>
>>> thanks a lot for the detailed reviews!
>>> I'll try to answer to all in one mail.
>>>
>>>> Volatile fields written in constructor aren't guaranteed by JMM to occur before the reference is assigned;
>>> We don't think it's correct if we omit the barrier after initializing
>>> a volatile field. Previously, we discussed this with Aleksey Shipilev
>>> and Doug Lea, and they agreed.
>>> Also, concurrency torture tests
>>> LongVolatileTest
>>> AtomicIntegerInitialValueTest
>>> will fail.
>>> (In addition, observing 0 instead of the inital value of a volatile field would be
>>> very counter-intuitive for Java programmers, especially in AtomicInteger.)
>>
>> The affects of unsafe publication are always surprising - volatiles do
>> not add anything special here. AFAIK there is nothing in the JMM that
>> requires the constructor barrier - discussions with Doug and Aleksey
>> notwithstanding. AFAIK we have not seen those tests fail due to a
>> missing constructor barrier.
>>
>>>> proposed for PPC64 is to make volatile reads extremely heavyweight
>>> Yes, it costs measurable performance. But else it is wrong. We don't
>>> see a way to implement this cheaper.
>>>
>>>> - these algorithms should be expressed using the correct OrderAccess operations
>>> Basically, I agree on this. But you also have to take into account
>>> that due to the different memory ordering instructions on different platforms
>>> just implementing something empty is not sufficient.
>>> An example:
>>> MemBarRelease // means LoadStore, StoreStore barrier
>>> MemBarVolatile // means StoreLoad barrier
>>> If these are consecutively in the code, sparc code looks like this:
>>> MemBarRelease --> membar(Assembler::LoadStore | Assembler::StoreStore)
>>> MemBarVolatile --> membar(Assembler::StoreLoad)
>>> Just doing what is required.
>>> On Power, we get suboptimal code, as there are no comparable,
>>> fine grained operations:
>>> MemBarRelease --> lwsync // Doing LoadStore, StoreStore, LoadLoad
>>> MemBarVolatile --> sync // // Doing LoadStore, StoreStore, LoadLoad, StoreLoad
>>> obviously, the lwsync is superfluous. Thus, as PPC operations are more (too) powerful,
>>> I need an additional optimization that removes the lwsync. I can not implement
>>> MemBarRelease empty, as it is also used independently.
>>>
>>> Back to the IRIW problem. I think here we have a comparable issue.
>>> Doing the MemBarVolatile or the OrderAccess::fence() before the read
>>> is inefficient on platforms that have multiple-read-atomicity.
>>>
>>> I would propose to guard the code by
>>> VM_Version::cpu_is_multiple_read_atomic() or even better
>>> OrderAccess::cpu_is_multiple_read_atomic()
>>> Else, David, how would you propose to implement this platform independent?
>>> (Maybe we can also use above method in taskqueue.hpp.)
>>
>> I can not possibly answer to the necessary level of detail with a few
>> moments thought. You are implying there is a problem here that will
>> impact numerous platforms (unless you can tell me why ppc is so
>> different?) and I can not take that on face value at the moment. The
>> only reason I can see IRIW not being handled by the JMM requirements for
>> volatile accesses is if there are global visibility issues that are not
>> addressed - but even then I would expect heavy barriers at the store
>> would deal with that, not at the load. (This situation reminds me of the
>> need for read-barriers on Alpha architecture due to the use of software
>> cache-coherency rather than hardware cache-coherency - but we don't have
>> that on ppc!)
>>
>> Sorry - There is no quick resolution here and in a couple of days I will
>> be heading out on vacation for two weeks.
>>
>> David
>> -----
>>
>>> Best regards,
>>> Goetz.
>>>
>>> -- Other ports:
>>> The IRIW issue requires at least 3 processors to be relevant, so it might
>>> not happen on small machines. But I can use PPC_ONLY instead
>>> of PPC64_ONLY if you request so (and if we don't get rid of them).
>>>
>>> -- MemBarStoreStore after initialization
>>> I agree we should not change it in the ppc port. If you wish, I can
>>> prepare an extra webrev for hotspot-comp.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: David Holmes [mailto:david.holmes at oracle.com]
>>> Sent: Tuesday, November 26, 2013 2:49 AM
>>> To: Vladimir Kozlov
>>> Cc: Lindenmaier, Goetz; 'hotspot-dev at openjdk.java.net'; 'ppc-aix-port-dev at openjdk.java.net'
>>> Subject: Re: RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes
>>>
>>> Okay this is my second attempt at answering this in a reasonable way :)
>>>
>>> On 26/11/2013 10:51 AM, Vladimir Kozlov wrote:
>>>> I have to ask David to do correctness evaluation.
>>>
>>> From what I understand what we see here is an attempt to fix an
>>> existing issue with the implementation of volatiles so that the IRIW
>>> problem is addressed. The solution proposed for PPC64 is to make
>>> volatile reads extremely heavyweight by adding a fence() when doing the
>>> load.
>>>
>>> Now if this was purely handled in ppc64 source code then I would be
>>> happy to let them do whatever they like (surely this kills performance
>>> though!). But I do not agree with the changes to the shared code that
>>> allow this solution to be implemented - even with PPC64_ONLY this is
>>> polluting the shared code. My concern is similar to what I said with the
>>> taskQueue changes - these algorithms should be expressed using the
>>> correct OrderAccess operations to guarantee the desired properties
>>> independent of architecture. If such a "barrier" is not needed on a
>>> given architecture then the implementation in OrderAccess should reduce
>>> to a no-op.
>>>
>>> And as Vitaly points out the constructor barriers are not needed under
>>> the JMM.
>>>
>>>> I am fine with suggested changes because you did not change our current
>>>> code for our platforms (please, do not change do_exits() now).
>>>> But may be it should be done using more general query which is set
>>>> depending on platform:
>>>>
>>>> OrderAccess::needs_support_iriw_ordering()
>>>>
>>>> or similar to what we use now:
>>>>
>>>> VM_Version::needs_support_iriw_ordering()
>>>
>>> Every platform has to support IRIW this is simply part of the Java
>>> Memory Model, there should not be any need to call this out explicitly
>>> like this.
>>>
>>> Is there some subtlety of the hardware I am missing here? Are there
>>> visibility issues beyond the ordering constraints that the JMM defines?
>>>> From what I understand our ppc port is also affected. David?
>>>
>>> We can not discuss that on an OpenJDK mailing list - sorry.
>>>
>>> David
>>> -----
>>>
>>>> In library_call.cpp can you add {}? New comment should be inside else {}.
>>>>
>>>> I think you should make _wrote_volatile field not ppc64 specific which
>>>> will be set to 'true' only on ppc64. Then you will not need PPC64_ONLY()
>>>> except in do_put_xxx() where it is set to true. Too many #ifdefs.
>>>>
>>>> In do_put_xxx() can you combine your changes:
>>>>
>>>> if (is_vol) {
>>>> // See comment in do_get_xxx().
>>>> #ifndef PPC64
>>>> insert_mem_bar(Op_MemBarVolatile); // Use fat membar
>>>> #else
>>>> if (is_field) {
>>>> // Add MemBarRelease for constructors which write volatile field
>>>> (PPC64).
>>>> set_wrote_volatile(true);
>>>> }
>>>> #endif
>>>> }
>>>>
>>>> Thanks,
>>>> Vladimir
>>>>
>>>> On 11/25/13 8:16 AM, Lindenmaier, Goetz wrote:
>>>>> Hi,
>>>>>
>>>>> I preprared a webrev with fixes for PPC for the VolatileIRIWTest of
>>>>> the torture test suite:
>>>>> http://cr.openjdk.java.net/~goetz/webrevs/8029101-0-raw/
>>>>>
>>>>> Example:
>>>>> volatile x=0, y=0
>>>>> __________ __________ __________ __________
>>>>> | Thread 0 | | Thread 1 | | Thread 2 | | Thread 3 |
>>>>>
>>>>> write(x=1) read(x) write(y=1) read(y)
>>>>> read(y) read(x)
>>>>>
>>>>> Disallowed: x=1, y=0 y=1, x=0
>>>>>
>>>>>
>>>>> Solution: This example requires multiple-copy-atomicity. This is only
>>>>> assured by the sync instruction and if it is executed in the threads
>>>>> doing the loads. Thus we implement volatile read as sync-load-acquire
>>>>> and omit the sync/MemBarVolatile after the volatile store.
>>>>> MemBarVolatile happens to be implemented by sync.
>>>>> We fix this in C2 and the cpp interpreter.
>>>>>
>>>>> This addresses a similar issue as fix "8012144: multiple SIGSEGVs
>>>>> fails on staxf" for taskqueue.hpp.
>>>>>
>>>>> Further this change contains a fix that assures that volatile fields
>>>>> written in constructors are visible before the reference gets
>>>>> published.
>>>>>
>>>>>
>>>>> Looking at the code, we found a MemBarRelease that to us, seems too
>>>>> strong.
>>>>> We think in parse1.cpp do_exits() a MemBarStoreStore should suffice.
>>>>> What do you think?
>>>>>
>>>>> Please review and test this change.
>>>>>
>>>>> Best regards,
>>>>> Goetz.
>>>>>
More information about the ppc-aix-port-dev
mailing list