RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes

David Holmes david.holmes at oracle.com
Tue Jan 7 01:22:06 PST 2014


On 7/01/2014 7:10 PM, Lindenmaier, Goetz wrote:
> Hi David,
>
> happy new year for you too!
>
> The compiler uses the following operations for volatiles:
>
> MemBarRelease --> lwsync
> Store                                                       Load
> MemBarVolatile --> sync               MemBarAcquire --> lwsync
>
> With our change we get:
>
> MemBarRelease --> lwsync           MemBarVolatile --> sync
> Store                                                       Load
>                                                                    MemBarAcquire --> lwsync
>
> So the lwsync in your example is added by the MemBarRelease before the
> Store.

Ah I see. Thanks for clarifying. I need to mull on this a little more.

David

> Best regards,
>    Goetz.
>
>
> -----Original Message-----
> From: David Holmes [mailto:david.holmes at oracle.com]
> Sent: Dienstag, 7. Januar 2014 07:33
> To: Lindenmaier, Goetz
> Cc: 'hotspot-dev at openjdk.java.net'; 'ppc-aix-port-dev at openjdk.java.net'
> Subject: Re: RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes
>
> Hi Goetz,
>
> Happy New year! :)
>
> I'm backing up a step now that I have a better grip on the IRIW issue.
>
> On 2/12/2013 6:33 PM, Lindenmaier, Goetz wrote:
>> Hi,
>>
>> ok, I understand the tests are wrong.  It's good this issue is settled.
>> Thanks Aleksey and Andreas for going into the details of the proof!
>>
>> About our change:  David, the causality is the other way round.
>> The change is about IRIW.
>> 1. To pass IRIW, we must use sync instructions before loads.
>
> Okay I see this (though I think we need to kill IRIW as a requirement
> and I don't think IRIW should form part of any conformance test).
>
>> 2. If we do syncs before loads, we don't need to do them after stores.
>
> True for the IRIW case but to prevent local reordering of volatile
> stores don't you still need some form of "storestore" barrier?
>
> initially: x = 0; y = 0
> Thread 0:        Thread 1:
>     x = 1            r1 = x
>     y = 1            r2 = y
> ---------------------------
> Should be Forbidden: r1 = 0 ^ r2 = 1
>
> You can add the sync after each read but that doesn't prevent the stores
> in thread 0 from being locally re-ordered. So I think you still need at
> least lwsync on the writer side:
>
> initially: x = 0; y = 0
> Thread 0:        Thread 1:
>     x = 1            r1 = x
>     lwsync           sync
>     y = 1            r2 = y
>                      sync
> ---------------------------
> Forbidden: r1 = 0 ^ r2 = 1
>
> David
> -----
>
>> 3. If we don't do them after stores, we fail the volatile constructor tests.
>> 4. So finally we added them again at the end of the constructor after stores
>>        to pass the volatile constructor tests.
>>
>> We originally passed the constructor tests because the ppc memory order
>> instructions are not as find-granular as the
>> operations in the IR.  MemBarVolatile is specified as StoreLoad.  The only instruction
>> on PPC that does StoreLoad is sync. But sync also does StoreStore, therefore the
>> MemBarVolatile after the store fixes the constructor tests.  The proper representation
>> of the fix in the IR would be adding a MemBarStoreStore.  But now it's pointless
>> anyways.
>>
>>> I'm not happy with the ifdef approach but I won't block it.
>> I'd be happy to add a property
>>     OrderAccess::cpu_is_multiple_copy_atomic()
>> or the like to guard the customization.  I'd like that much better. Or also
>>      OrderAccess::needs_support_iriw_ordering()
>>      VM_Version::needs_support_iriw_ordering()
>>
>>
>> Best regards,
>>     Goetz.
>>
>>
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: David Holmes [mailto:david.holmes at oracle.com]
>> Sent: Donnerstag, 28. November 2013 00:34
>> To: Lindenmaier, Goetz
>> Cc: 'hotspot-dev at openjdk.java.net'; 'ppc-aix-port-dev at openjdk.java.net'
>> Subject: Re: RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes
>>
>> TL;DR version:
>>
>> Discussion on the c-i list has now confirmed that a constructor-barrier
>> for volatiles is not required as part of the JMM specification. It *may*
>> be required in an implementation that doesn't pre-zero memory to ensure
>> you can't see uninitialized fields. So the tests for this are invalid
>> and this part of the patch is not needed in general (ppc64 may need it
>> due to other factors).
>>
>> Re: "multiple copy atomicity" - first thanks for correcting the term :)
>> Second thanks for the reference to that paper! For reference:
>>
>> "The memory system (perhaps involving a hierarchy of buffers and a
>> complex interconnect) does not guarantee that a write becomes visible to
>> all other hardware threads at the same time point; these architectures
>> are not multiple-copy atomic."
>>
>> This is the visibility issue that I referred to and affects both ARM and
>> PPC. But of course it is normally handled by using suitable barriers
>> after the stores that need to be visible. I think the crux of the
>> current issue is what you wrote below:
>>
>>    > The fixes for the constructor issue are only needed because we
>>    > remove the sync instruction from behind stores (parse3.cpp:320)
>>    > and place it before loads.
>>
>> I hadn't grasped this part. Obviously if you fail to do the sync after
>> the store then you have to do something around the loads to get the same
>> results! I still don't know what lead you to the conclusion that the
>> only way to fix the IRIW issue was to put the fence before the load -
>> maybe when I get the chance to read that paper in full it will be clearer.
>>
>> So ... the basic problem is that the current structure in the VM has
>> hard-wired one choice of how to get the right semantics for volatile
>> variables. You now want to customize that but not all the requisite
>> hooks are present. It would be better if volatile_load and
>> volatile_store were factored out so that they could be implemented as
>> desired per-platform. Alternatively there could be pre- and post- hooks
>> that could then be customized per platform. Otherwise you need
>> platform-specific ifdef's to handle it as per your patch.
>>
>> I'm not happy with the ifdef approach but I won't block it. I think this
>> is an area where a lot of clean up is needed in the VM. The barrier
>> abstractions are a confused mess in my opinion.
>>
>> Thanks,
>> David
>> -----
>>
>> On 28/11/2013 3:15 AM, Lindenmaier, Goetz wrote:
>>> Hi,
>>>
>>> I updated the webrev to fix the issues mentioned by Vladimir:
>>> http://cr.openjdk.java.net/~goetz/webrevs/8029101-0-raw/
>>>
>>> I did not yet add the
>>>       OrderAccess::needs_support_iriw_ordering()
>>>       VM_Version::needs_support_iriw_ordering()
>>> or
>>>      OrderAccess::cpu_is_multiple_copy_atomic()
>>> to reduce #defined, as I got no further comment on that.
>>>
>>>
>>> WRT to the validity of the tests and the interpretation of the JMM
>>> I feel not in the position to contribute substantially.
>>>
>>> But we would like to pass the torture test suite as we consider
>>> this a substantial task in implementing a PPC port.  Also we think
>>> both tests show behavior a programmer would expect.  It's bad if
>>> Java code runs fine on the more common x86 platform, and then
>>> fails on ppc.  This will always first be blamed on the VM.
>>>
>>> The fixes for the constructor issue are only needed because we
>>> remove the sync instruction from behind stores (parse3.cpp:320)
>>> and place it before loads.  Then there is no sync between volatile store
>>> and publishing the object.  So we add it again in this one case
>>> (volatile store in constructor).
>>>
>>>
>>> @David
>>>>> Sure.  There also is no solution as you require for the taskqueue problem yet,
>>>>> and that's being discussed now for almost a year.
>>>> It may have started a year ago but work on it has hardly been continuous.
>>> That's not true, we did a lot of investigation and testing on this issue.
>>> And we came up with a solution we consider the best possible. If you
>>> have objections, you should at least give the draft of a better solution,
>>> we would volunteer to implement and test it.
>>> Similarly, we invested time in fixing the concurrency torture issues.
>>>
>>> @David
>>>> What is "multiple-read-atomicity"? I'm not familiar with the term and
>>>> can't find any reference to it.
>>> We learned about this reading "A Tutorial Introduction to the ARM and
>>> POWER Relaxed Memory Models" by Luc Maranget, Susmit Sarkar and
>>> Peter Sewell, which is cited in "Correct and Efficient Work-Stealing for
>>> Weak Memory Models" by Nhat Minh Lê, Antoniu Pop, Albert Cohen
>>> and Francesco Zappa Nardelli (PPoPP `13) when analysing the taskqueue problem.
>>> http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
>>>
>>> I was wrong in one thing, it's called multiple copy atomicity, I used 'read'
>>> instead.  Sorry for that.  (I also fixed that in the method name above).
>>>
>>> Best regards and thanks for all your involvements,
>>>      Goetz.
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: David Holmes [mailto:david.holmes at oracle.com]
>>> Sent: Mittwoch, 27. November 2013 12:53
>>> To: Lindenmaier, Goetz
>>> Cc: 'hotspot-dev at openjdk.java.net'; 'ppc-aix-port-dev at openjdk.java.net'
>>> Subject: Re: RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes
>>>
>>> Hi Goetz,
>>>
>>> On 26/11/2013 10:51 PM, Lindenmaier, Goetz wrote:
>>>> Hi David,
>>>>
>>>> -- Volatile in constuctor
>>>>> AFAIK we have not seen those tests fail due to a
>>>>> missing constructor barrier.
>>>> We see them on PPC64. Our test machines have typically 8-32 processors
>>>> and are Power 5-7.  But see also Aleksey's mail.  (Thanks Aleksey!)
>>>
>>> And see follow ups - the tests are invalid.
>>>
>>>> -- IRIW issue
>>>>> I can not possibly answer to the necessary level of detail with a few
>>>>> moments thought.
>>>> Sure.  There also is no solution as you require for the taskqueue problem yet,
>>>> and that's being discussed now for almost a year.
>>>
>>> It may have started a year ago but work on it has hardly been continuous.
>>>
>>>>> You are implying there is a problem here that will
>>>>> impact numerous platforms (unless you can tell me why ppc is so different?)
>>>> No, only PPC does not have 'multiple-read-atomicity'.  Therefore I contributed a
>>>> solution with the #defines, and that's correct for all, but not nice, I admit.
>>>> (I don't really know about ARM, though).
>>>> So if I can write down a nicer solution testing for methods that are evaluated
>>>> by the C-compiler I'm happy.
>>>>
>>>> The problem is not that IRIW is not handled by the JMM, the problem
>>>> is that
>>>>       store
>>>>       sync
>>>> does not assure multiple-read-atomicity,
>>>> only
>>>>       sync
>>>>       load
>>>> does so on PPC.  And you require multiple-read-atomicity to
>>>> pass that test.
>>>
>>> What is "multiple-read-atomicity"? I'm not familiar with the term and
>>> can't find any reference to it.
>>>
>>> Thanks,
>>> David
>>>
>>>       The JMM is fine. And
>>>>       store
>>>>       MemBarVolatile
>>>> is fine on x86, sparc etc. as there exist assembler instructions that
>>>> do what is required.
>>>>
>>>> So if you are off soon, please let's come to a solution that
>>>> might be improvable in the way it's implemented, but that
>>>> allows us to implement a correct PPC64 port.
>>>>
>>>> Best regards,
>>>>       Goetz.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: David Holmes [mailto:david.holmes at oracle.com]
>>>> Sent: Tuesday, November 26, 2013 1:11 PM
>>>> To: Lindenmaier, Goetz
>>>> Cc: 'Vladimir Kozlov'; 'Vitaly Davidovich'; 'hotspot-dev at openjdk.java.net'; 'ppc-aix-port-dev at openjdk.java.net'
>>>> Subject: Re: RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes
>>>>
>>>> Hi Goetz,
>>>>
>>>> On 26/11/2013 9:22 PM, Lindenmaier, Goetz wrote:
>>>>> Hi everybody,
>>>>>
>>>>> thanks a lot for the detailed reviews!
>>>>> I'll try to answer to all in one mail.
>>>>>
>>>>>> Volatile fields written in constructor aren't guaranteed by JMM to occur before the reference is assigned;
>>>>> We don't think it's correct if we omit the barrier after initializing
>>>>> a volatile field.  Previously, we discussed this with Aleksey Shipilev
>>>>> and Doug Lea, and they agreed.
>>>>> Also, concurrency torture tests
>>>>>         LongVolatileTest
>>>>>         AtomicIntegerInitialValueTest
>>>>> will fail.
>>>>> (In addition, observing 0 instead of the inital value of a volatile field would be
>>>>> very counter-intuitive for Java programmers, especially in AtomicInteger.)
>>>>
>>>> The affects of unsafe publication are always surprising - volatiles do
>>>> not add anything special here. AFAIK there is nothing in the JMM that
>>>> requires the constructor barrier - discussions with Doug and Aleksey
>>>> notwithstanding. AFAIK we have not seen those tests fail due to a
>>>> missing constructor barrier.
>>>>
>>>>>> proposed for PPC64 is to make volatile reads extremely heavyweight
>>>>> Yes, it costs measurable performance.  But else it is wrong.  We don't
>>>>> see a way to implement this cheaper.
>>>>>
>>>>>> - these algorithms should be expressed using the correct OrderAccess operations
>>>>> Basically, I agree on this.  But you also have to take into account
>>>>> that due to the different memory ordering instructions on different platforms
>>>>> just implementing something empty is not sufficient.
>>>>> An example:
>>>>>        MemBarRelease    // means LoadStore, StoreStore barrier
>>>>>        MemBarVolatile    // means  StoreLoad barrier
>>>>> If these are consecutively in the code, sparc code looks like this:
>>>>>        MemBarRelease  --> membar(Assembler::LoadStore | Assembler::StoreStore)
>>>>>        MemBarVolatile --> membar(Assembler::StoreLoad)
>>>>> Just doing what is required.
>>>>> On Power, we get suboptimal code, as there are no comparable,
>>>>> fine grained operations:
>>>>>         MemBarRelease --> lwsync   // Doing LoadStore, StoreStore, LoadLoad
>>>>>        MemBarVolatile --> sync    // // Doing LoadStore, StoreStore, LoadLoad, StoreLoad
>>>>> obviously, the lwsync is superfluous.  Thus, as PPC operations are more (too) powerful,
>>>>> I need an additional optimization that removes the lwsync.  I can not implement
>>>>> MemBarRelease empty, as it is also used independently.
>>>>>
>>>>> Back to the IRIW problem.  I think here we have a comparable issue.
>>>>> Doing the MemBarVolatile or the OrderAccess::fence() before the read
>>>>> is inefficient on platforms that have multiple-read-atomicity.
>>>>>
>>>>> I would propose to guard the code by
>>>>> VM_Version::cpu_is_multiple_read_atomic() or even better
>>>>> OrderAccess::cpu_is_multiple_read_atomic()
>>>>> Else, David, how would you propose to implement this platform independent?
>>>>> (Maybe we can also use above method in taskqueue.hpp.)
>>>>
>>>> I can not possibly answer to the necessary level of detail with a few
>>>> moments thought. You are implying there is a problem here that will
>>>> impact numerous platforms (unless you can tell me why ppc is so
>>>> different?) and I can not take that on face value at the moment. The
>>>> only reason I can see IRIW not being handled by the JMM requirements for
>>>> volatile accesses is if there are global visibility issues that are not
>>>> addressed - but even then I would expect heavy barriers at the store
>>>> would deal with that, not at the load. (This situation reminds me of the
>>>> need for read-barriers on Alpha architecture due to the use of software
>>>> cache-coherency rather than hardware cache-coherency - but we don't have
>>>> that on ppc!)
>>>>
>>>> Sorry - There is no quick resolution here and in a couple of days I will
>>>> be heading out on vacation for two weeks.
>>>>
>>>> David
>>>> -----
>>>>
>>>>> Best regards,
>>>>>        Goetz.
>>>>>
>>>>> -- Other ports:
>>>>> The IRIW issue requires at least 3 processors to be relevant, so it might
>>>>> not happen on small machines.  But I can use PPC_ONLY instead
>>>>> of PPC64_ONLY if you request so (and if we don't get rid of them).
>>>>>
>>>>> -- MemBarStoreStore after initialization
>>>>> I agree we should not change it in the ppc port.  If you wish, I can
>>>>> prepare an extra webrev for hotspot-comp.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: David Holmes [mailto:david.holmes at oracle.com]
>>>>> Sent: Tuesday, November 26, 2013 2:49 AM
>>>>> To: Vladimir Kozlov
>>>>> Cc: Lindenmaier, Goetz; 'hotspot-dev at openjdk.java.net'; 'ppc-aix-port-dev at openjdk.java.net'
>>>>> Subject: Re: RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes
>>>>>
>>>>> Okay this is my second attempt at answering this in a reasonable way :)
>>>>>
>>>>> On 26/11/2013 10:51 AM, Vladimir Kozlov wrote:
>>>>>> I have to ask David to do correctness evaluation.
>>>>>
>>>>>       From what I understand what we see here is an attempt to fix an
>>>>> existing issue with the implementation of volatiles so that the IRIW
>>>>> problem is addressed. The solution proposed for PPC64 is to make
>>>>> volatile reads extremely heavyweight by adding a fence() when doing the
>>>>> load.
>>>>>
>>>>> Now if this was purely handled in ppc64 source code then I would be
>>>>> happy to let them do whatever they like (surely this kills performance
>>>>> though!). But I do not agree with the changes to the shared code that
>>>>> allow this solution to be implemented - even with PPC64_ONLY this is
>>>>> polluting the shared code. My concern is similar to what I said with the
>>>>> taskQueue changes - these algorithms should be expressed using the
>>>>> correct OrderAccess operations to guarantee the desired properties
>>>>> independent of architecture. If such a "barrier" is not needed on a
>>>>> given architecture then the implementation in OrderAccess should reduce
>>>>> to a no-op.
>>>>>
>>>>> And as Vitaly points out the constructor barriers are not needed under
>>>>> the JMM.
>>>>>
>>>>>> I am fine with suggested changes because you did not change our current
>>>>>> code for our platforms (please, do not change do_exits() now).
>>>>>> But may be it should be done using more general query which is set
>>>>>> depending on platform:
>>>>>>
>>>>>> OrderAccess::needs_support_iriw_ordering()
>>>>>>
>>>>>> or similar to what we use now:
>>>>>>
>>>>>> VM_Version::needs_support_iriw_ordering()
>>>>>
>>>>> Every platform has to support IRIW this is simply part of the Java
>>>>> Memory Model, there should not be any need to call this out explicitly
>>>>> like this.
>>>>>
>>>>> Is there some subtlety of the hardware I am missing here? Are there
>>>>> visibility issues beyond the ordering constraints that the JMM defines?
>>>>>>       From what I understand our ppc port is also affected. David?
>>>>>
>>>>> We can not discuss that on an OpenJDK mailing list - sorry.
>>>>>
>>>>> David
>>>>> -----
>>>>>
>>>>>> In library_call.cpp can you add {}? New comment should be inside  else {}.
>>>>>>
>>>>>> I think you should make _wrote_volatile field not ppc64 specific which
>>>>>> will be set to 'true' only on ppc64. Then you will not need PPC64_ONLY()
>>>>>> except in do_put_xxx() where it is set to true. Too many #ifdefs.
>>>>>>
>>>>>> In do_put_xxx() can you combine your changes:
>>>>>>
>>>>>>          if (is_vol) {
>>>>>>            // See comment in do_get_xxx().
>>>>>> #ifndef PPC64
>>>>>>            insert_mem_bar(Op_MemBarVolatile); // Use fat membar
>>>>>> #else
>>>>>>            if (is_field) {
>>>>>>              // Add MemBarRelease for constructors which write volatile field
>>>>>> (PPC64).
>>>>>>              set_wrote_volatile(true);
>>>>>>            }
>>>>>> #endif
>>>>>>          }
>>>>>>
>>>>>> Thanks,
>>>>>> Vladimir
>>>>>>
>>>>>> On 11/25/13 8:16 AM, Lindenmaier, Goetz wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I preprared a webrev with fixes for PPC for the VolatileIRIWTest of
>>>>>>> the torture test suite:
>>>>>>> http://cr.openjdk.java.net/~goetz/webrevs/8029101-0-raw/
>>>>>>>
>>>>>>> Example:
>>>>>>> volatile x=0, y=0
>>>>>>>         __________    __________    __________    __________
>>>>>>> | Thread 0 |  | Thread 1 |  | Thread 2 |  | Thread 3 |
>>>>>>>
>>>>>>>          write(x=1)    read(x)       write(y=1)    read(y)
>>>>>>>                        read(y)                     read(x)
>>>>>>>
>>>>>>> Disallowed:     x=1, y=0                    y=1, x=0
>>>>>>>
>>>>>>>
>>>>>>> Solution: This example requires multiple-copy-atomicity. This is only
>>>>>>> assured by the sync instruction and if it is executed in the threads
>>>>>>> doing the loads. Thus we implement volatile read as sync-load-acquire
>>>>>>> and omit the sync/MemBarVolatile after the volatile store.
>>>>>>> MemBarVolatile happens to be implemented by sync.
>>>>>>> We fix this in C2 and the cpp interpreter.
>>>>>>>
>>>>>>> This addresses a similar issue as fix "8012144: multiple SIGSEGVs
>>>>>>> fails on staxf" for taskqueue.hpp.
>>>>>>>
>>>>>>> Further this change contains a fix that assures that volatile fields
>>>>>>> written in constructors are visible before the reference gets
>>>>>>> published.
>>>>>>>
>>>>>>>
>>>>>>> Looking at the code, we found a MemBarRelease that to us, seems too
>>>>>>> strong.
>>>>>>> We think in parse1.cpp do_exits() a MemBarStoreStore should suffice.
>>>>>>> What do you think?
>>>>>>>
>>>>>>> Please review and test this change.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>         Goetz.
>>>>>>>


More information about the hotspot-dev mailing list