RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes

Wed Nov 27 09:15:56 PST 2013

Hi,

I updated the webrev to fix the issues mentioned by Vladimir:
http://cr.openjdk.java.net/~goetz/webrevs/8029101-0-raw/

I did not yet add the 
   OrderAccess::needs_support_iriw_ordering()  
   VM_Version::needs_support_iriw_ordering()
or 
  OrderAccess::cpu_is_multiple_copy_atomic()
to reduce #defined, as I got no further comment on that.

WRT to the validity of the tests and the interpretation of the JMM
I feel not in the position to contribute substantially.

But we would like to pass the torture test suite as we consider 
this a substantial task in implementing a PPC port.  Also we think
both tests show behavior a programmer would expect.  It's bad if
Java code runs fine on the more common x86 platform, and then
fails on ppc.  This will always first be blamed on the VM.

The fixes for the constructor issue are only needed because we
remove the sync instruction from behind stores (parse3.cpp:320)
and place it before loads.  Then there is no sync between volatile store
and publishing the object.  So we add it again in this one case 
(volatile store in constructor).

@David
>> Sure.  There also is no solution as you require for the taskqueue problem yet,
>> and that's being discussed now for almost a year.
>It may have started a year ago but work on it has hardly been continuous.
That's not true, we did a lot of investigation and testing on this issue.
And we came up with a solution we consider the best possible. If you
have objections, you should at least give the draft of a better solution, 
we would volunteer to implement and test it.
Similarly, we invested time in fixing the concurrency torture issues.

@David
>What is "multiple-read-atomicity"? I'm not familiar with the term and 
>can't find any reference to it.
We learned about this reading "A Tutorial Introduction to the ARM and
POWER Relaxed Memory Models" by Luc Maranget, Susmit Sarkar and 
Peter Sewell, which is cited in "Correct and Efficient Work-Stealing for
Weak Memory Models" by Nhat Minh Lê, Antoniu Pop, Albert Cohen
and Francesco Zappa Nardelli (PPoPP `13) when analysing the taskqueue problem.
http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf

I was wrong in one thing, it's called multiple copy atomicity, I used 'read'
instead.  Sorry for that.  (I also fixed that in the method name above).

Best regards and thanks for all your involvements,
  Goetz.

-----Original Message-----
From: David Holmes [mailto:david.holmes at oracle.com] 
Sent: Mittwoch, 27. November 2013 12:53
To: Lindenmaier, Goetz
Cc: 'hotspot-dev at openjdk.java.net'; 'ppc-aix-port-dev at openjdk.java.net'
Subject: Re: RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes

Hi Goetz,

On 26/11/2013 10:51 PM, Lindenmaier, Goetz wrote:
> Hi David,
>
> -- Volatile in constuctor
>> AFAIK we have not seen those tests fail due to a
>> missing constructor barrier.
> We see them on PPC64. Our test machines have typically 8-32 processors
> and are Power 5-7.  But see also Aleksey's mail.  (Thanks Aleksey!)

And see follow ups - the tests are invalid.

> -- IRIW issue
>> I can not possibly answer to the necessary level of detail with a few
>> moments thought.
> Sure.  There also is no solution as you require for the taskqueue problem yet,
> and that's being discussed now for almost a year.

It may have started a year ago but work on it has hardly been continuous.

>> You are implying there is a problem here that will
>> impact numerous platforms (unless you can tell me why ppc is so different?)
> No, only PPC does not have 'multiple-read-atomicity'.  Therefore I contributed a
> solution with the #defines, and that's correct for all, but not nice, I admit.
> (I don't really know about ARM, though).
> So if I can write down a nicer solution testing for methods that are evaluated
> by the C-compiler I'm happy.
>
> The problem is not that IRIW is not handled by the JMM, the problem
> is that
>    store
>    sync
> does not assure multiple-read-atomicity,
> only
>    sync
>    load
> does so on PPC.  And you require multiple-read-atomicity to
> pass that test.

What is "multiple-read-atomicity"? I'm not familiar with the term and 
can't find any reference to it.

Thanks,
David

   The JMM is fine. And
>    store
>    MemBarVolatile
> is fine on x86, sparc etc. as there exist assembler instructions that
> do what is required.
>
> So if you are off soon, please let's come to a solution that
> might be improvable in the way it's implemented, but that
> allows us to implement a correct PPC64 port.
>
> Best regards,
>    Goetz.
>
>
>
>
>
>
>
> -----Original Message-----
> From: David Holmes [mailto:david.holmes at oracle.com]
> Sent: Tuesday, November 26, 2013 1:11 PM
> To: Lindenmaier, Goetz
> Cc: 'Vladimir Kozlov'; 'Vitaly Davidovich'; 'hotspot-dev at openjdk.java.net'; 'ppc-aix-port-dev at openjdk.java.net'
> Subject: Re: RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes
>
> Hi Goetz,
>
> On 26/11/2013 9:22 PM, Lindenmaier, Goetz wrote:
>> Hi everybody,
>>
>> thanks a lot for the detailed reviews!
>> I'll try to answer to all in one mail.
>>
>>> Volatile fields written in constructor aren't guaranteed by JMM to occur before the reference is assigned;
>> We don't think it's correct if we omit the barrier after initializing
>> a volatile field.  Previously, we discussed this with Aleksey Shipilev
>> and Doug Lea, and they agreed.
>> Also, concurrency torture tests
>>      LongVolatileTest
>>      AtomicIntegerInitialValueTest
>> will fail.
>> (In addition, observing 0 instead of the inital value of a volatile field would be
>> very counter-intuitive for Java programmers, especially in AtomicInteger.)
>
> The affects of unsafe publication are always surprising - volatiles do
> not add anything special here. AFAIK there is nothing in the JMM that
> requires the constructor barrier - discussions with Doug and Aleksey
> notwithstanding. AFAIK we have not seen those tests fail due to a
> missing constructor barrier.
>
>>> proposed for PPC64 is to make volatile reads extremely heavyweight
>> Yes, it costs measurable performance.  But else it is wrong.  We don't
>> see a way to implement this cheaper.
>>
>>> - these algorithms should be expressed using the correct OrderAccess operations
>> Basically, I agree on this.  But you also have to take into account
>> that due to the different memory ordering instructions on different platforms
>> just implementing something empty is not sufficient.
>> An example:
>>     MemBarRelease    // means LoadStore, StoreStore barrier
>>     MemBarVolatile    // means  StoreLoad barrier
>> If these are consecutively in the code, sparc code looks like this:
>>     MemBarRelease  --> membar(Assembler::LoadStore | Assembler::StoreStore)
>>     MemBarVolatile --> membar(Assembler::StoreLoad)
>> Just doing what is required.
>> On Power, we get suboptimal code, as there are no comparable,
>> fine grained operations:
>>      MemBarRelease --> lwsync   // Doing LoadStore, StoreStore, LoadLoad
>>     MemBarVolatile --> sync    // // Doing LoadStore, StoreStore, LoadLoad, StoreLoad
>> obviously, the lwsync is superfluous.  Thus, as PPC operations are more (too) powerful,
>> I need an additional optimization that removes the lwsync.  I can not implement
>> MemBarRelease empty, as it is also used independently.
>>
>> Back to the IRIW problem.  I think here we have a comparable issue.
>> Doing the MemBarVolatile or the OrderAccess::fence() before the read
>> is inefficient on platforms that have multiple-read-atomicity.
>>
>> I would propose to guard the code by
>> VM_Version::cpu_is_multiple_read_atomic() or even better
>> OrderAccess::cpu_is_multiple_read_atomic()
>> Else, David, how would you propose to implement this platform independent?
>> (Maybe we can also use above method in taskqueue.hpp.)
>
> I can not possibly answer to the necessary level of detail with a few
> moments thought. You are implying there is a problem here that will
> impact numerous platforms (unless you can tell me why ppc is so
> different?) and I can not take that on face value at the moment. The
> only reason I can see IRIW not being handled by the JMM requirements for
> volatile accesses is if there are global visibility issues that are not
> addressed - but even then I would expect heavy barriers at the store
> would deal with that, not at the load. (This situation reminds me of the
> need for read-barriers on Alpha architecture due to the use of software
> cache-coherency rather than hardware cache-coherency - but we don't have
> that on ppc!)
>
> Sorry - There is no quick resolution here and in a couple of days I will
> be heading out on vacation for two weeks.
>
> David
> -----
>
>> Best regards,
>>     Goetz.
>>
>> -- Other ports:
>> The IRIW issue requires at least 3 processors to be relevant, so it might
>> not happen on small machines.  But I can use PPC_ONLY instead
>> of PPC64_ONLY if you request so (and if we don't get rid of them).
>>
>> -- MemBarStoreStore after initialization
>> I agree we should not change it in the ppc port.  If you wish, I can
>> prepare an extra webrev for hotspot-comp.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: David Holmes [mailto:david.holmes at oracle.com]
>> Sent: Tuesday, November 26, 2013 2:49 AM
>> To: Vladimir Kozlov
>> Cc: Lindenmaier, Goetz; 'hotspot-dev at openjdk.java.net'; 'ppc-aix-port-dev at openjdk.java.net'
>> Subject: Re: RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes
>>
>> Okay this is my second attempt at answering this in a reasonable way :)
>>
>> On 26/11/2013 10:51 AM, Vladimir Kozlov wrote:
>>> I have to ask David to do correctness evaluation.
>>
>>    From what I understand what we see here is an attempt to fix an
>> existing issue with the implementation of volatiles so that the IRIW
>> problem is addressed. The solution proposed for PPC64 is to make
>> volatile reads extremely heavyweight by adding a fence() when doing the
>> load.
>>
>> Now if this was purely handled in ppc64 source code then I would be
>> happy to let them do whatever they like (surely this kills performance
>> though!). But I do not agree with the changes to the shared code that
>> allow this solution to be implemented - even with PPC64_ONLY this is
>> polluting the shared code. My concern is similar to what I said with the
>> taskQueue changes - these algorithms should be expressed using the
>> correct OrderAccess operations to guarantee the desired properties
>> independent of architecture. If such a "barrier" is not needed on a
>> given architecture then the implementation in OrderAccess should reduce
>> to a no-op.
>>
>> And as Vitaly points out the constructor barriers are not needed under
>> the JMM.
>>
>>> I am fine with suggested changes because you did not change our current
>>> code for our platforms (please, do not change do_exits() now).
>>> But may be it should be done using more general query which is set
>>> depending on platform:
>>>
>>> OrderAccess::needs_support_iriw_ordering()
>>>
>>> or similar to what we use now:
>>>
>>> VM_Version::needs_support_iriw_ordering()
>>
>> Every platform has to support IRIW this is simply part of the Java
>> Memory Model, there should not be any need to call this out explicitly
>> like this.
>>
>> Is there some subtlety of the hardware I am missing here? Are there
>> visibility issues beyond the ordering constraints that the JMM defines?
>>>    From what I understand our ppc port is also affected. David?
>>
>> We can not discuss that on an OpenJDK mailing list - sorry.
>>
>> David
>> -----
>>
>>> In library_call.cpp can you add {}? New comment should be inside  else {}.
>>>
>>> I think you should make _wrote_volatile field not ppc64 specific which
>>> will be set to 'true' only on ppc64. Then you will not need PPC64_ONLY()
>>> except in do_put_xxx() where it is set to true. Too many #ifdefs.
>>>
>>> In do_put_xxx() can you combine your changes:
>>>
>>>       if (is_vol) {
>>>         // See comment in do_get_xxx().
>>> #ifndef PPC64
>>>         insert_mem_bar(Op_MemBarVolatile); // Use fat membar
>>> #else
>>>         if (is_field) {
>>>           // Add MemBarRelease for constructors which write volatile field
>>> (PPC64).
>>>           set_wrote_volatile(true);
>>>         }
>>> #endif
>>>       }
>>>
>>> Thanks,
>>> Vladimir
>>>
>>> On 11/25/13 8:16 AM, Lindenmaier, Goetz wrote:
>>>> Hi,
>>>>
>>>> I preprared a webrev with fixes for PPC for the VolatileIRIWTest of
>>>> the torture test suite:
>>>> http://cr.openjdk.java.net/~goetz/webrevs/8029101-0-raw/
>>>>
>>>> Example:
>>>> volatile x=0, y=0
>>>>      __________    __________    __________    __________
>>>> | Thread 0 |  | Thread 1 |  | Thread 2 |  | Thread 3 |
>>>>
>>>>       write(x=1)    read(x)       write(y=1)    read(y)
>>>>                     read(y)                     read(x)
>>>>
>>>> Disallowed:     x=1, y=0                    y=1, x=0
>>>>
>>>>
>>>> Solution: This example requires multiple-copy-atomicity. This is only
>>>> assured by the sync instruction and if it is executed in the threads
>>>> doing the loads. Thus we implement volatile read as sync-load-acquire
>>>> and omit the sync/MemBarVolatile after the volatile store.
>>>> MemBarVolatile happens to be implemented by sync.
>>>> We fix this in C2 and the cpp interpreter.
>>>>
>>>> This addresses a similar issue as fix "8012144: multiple SIGSEGVs
>>>> fails on staxf" for taskqueue.hpp.
>>>>
>>>> Further this change contains a fix that assures that volatile fields
>>>> written in constructors are visible before the reference gets
>>>> published.
>>>>
>>>>
>>>> Looking at the code, we found a MemBarRelease that to us, seems too
>>>> strong.
>>>> We think in parse1.cpp do_exits() a MemBarStoreStore should suffice.
>>>> What do you think?
>>>>
>>>> Please review and test this change.
>>>>
>>>> Best regards,
>>>>      Goetz.
>>>>