RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes

Tue Nov 26 04:11:29 PST 2013

Hi Goetz,

On 26/11/2013 9:22 PM, Lindenmaier, Goetz wrote:
> Hi everybody,
>
> thanks a lot for the detailed reviews!
> I'll try to answer to all in one mail.
>
>> Volatile fields written in constructor aren't guaranteed by JMM to occur before the reference is assigned;
> We don't think it's correct if we omit the barrier after initializing
> a volatile field.  Previously, we discussed this with Aleksey Shipilev
> and Doug Lea, and they agreed.
> Also, concurrency torture tests
>     LongVolatileTest
>     AtomicIntegerInitialValueTest
> will fail.
> (In addition, observing 0 instead of the inital value of a volatile field would be
> very counter-intuitive for Java programmers, especially in AtomicInteger.)

The affects of unsafe publication are always surprising - volatiles do 
not add anything special here. AFAIK there is nothing in the JMM that 
requires the constructor barrier - discussions with Doug and Aleksey 
notwithstanding. AFAIK we have not seen those tests fail due to a 
missing constructor barrier.

>> proposed for PPC64 is to make volatile reads extremely heavyweight
> Yes, it costs measurable performance.  But else it is wrong.  We don't
> see a way to implement this cheaper.
>
>> - these algorithms should be expressed using the correct OrderAccess operations
> Basically, I agree on this.  But you also have to take into account
> that due to the different memory ordering instructions on different platforms
> just implementing something empty is not sufficient.
> An example:
>    MemBarRelease    // means LoadStore, StoreStore barrier
>    MemBarVolatile    // means  StoreLoad barrier
> If these are consecutively in the code, sparc code looks like this:
>    MemBarRelease  --> membar(Assembler::LoadStore | Assembler::StoreStore)
>    MemBarVolatile --> membar(Assembler::StoreLoad)
> Just doing what is required.
> On Power, we get suboptimal code, as there are no comparable,
> fine grained operations:
>     MemBarRelease --> lwsync   // Doing LoadStore, StoreStore, LoadLoad
>    MemBarVolatile --> sync    // // Doing LoadStore, StoreStore, LoadLoad, StoreLoad
> obviously, the lwsync is superfluous.  Thus, as PPC operations are more (too) powerful,
> I need an additional optimization that removes the lwsync.  I can not implement
> MemBarRelease empty, as it is also used independently.
>
> Back to the IRIW problem.  I think here we have a comparable issue.
> Doing the MemBarVolatile or the OrderAccess::fence() before the read
> is inefficient on platforms that have multiple-read-atomicity.
>
> I would propose to guard the code by
> VM_Version::cpu_is_multiple_read_atomic() or even better
> OrderAccess::cpu_is_multiple_read_atomic()
> Else, David, how would you propose to implement this platform independent?
> (Maybe we can also use above method in taskqueue.hpp.)

I can not possibly answer to the necessary level of detail with a few 
moments thought. You are implying there is a problem here that will 
impact numerous platforms (unless you can tell me why ppc is so 
different?) and I can not take that on face value at the moment. The 
only reason I can see IRIW not being handled by the JMM requirements for 
volatile accesses is if there are global visibility issues that are not 
addressed - but even then I would expect heavy barriers at the store 
would deal with that, not at the load. (This situation reminds me of the 
need for read-barriers on Alpha architecture due to the use of software 
cache-coherency rather than hardware cache-coherency - but we don't have 
that on ppc!)

Sorry - There is no quick resolution here and in a couple of days I will 
be heading out on vacation for two weeks.

David
-----

> Best regards,
>    Goetz.
>
> -- Other ports:
> The IRIW issue requires at least 3 processors to be relevant, so it might
> not happen on small machines.  But I can use PPC_ONLY instead
> of PPC64_ONLY if you request so (and if we don't get rid of them).
>
> -- MemBarStoreStore after initialization
> I agree we should not change it in the ppc port.  If you wish, I can
> prepare an extra webrev for hotspot-comp.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> -----Original Message-----
> From: David Holmes [mailto:david.holmes at oracle.com]
> Sent: Tuesday, November 26, 2013 2:49 AM
> To: Vladimir Kozlov
> Cc: Lindenmaier, Goetz; 'hotspot-dev at openjdk.java.net'; 'ppc-aix-port-dev at openjdk.java.net'
> Subject: Re: RFR(M): 8029101: PPC64 (part 211): ordering of Independent Reads of Independent Writes
>
> Okay this is my second attempt at answering this in a reasonable way :)
>
> On 26/11/2013 10:51 AM, Vladimir Kozlov wrote:
>> I have to ask David to do correctness evaluation.
>
>   From what I understand what we see here is an attempt to fix an
> existing issue with the implementation of volatiles so that the IRIW
> problem is addressed. The solution proposed for PPC64 is to make
> volatile reads extremely heavyweight by adding a fence() when doing the
> load.
>
> Now if this was purely handled in ppc64 source code then I would be
> happy to let them do whatever they like (surely this kills performance
> though!). But I do not agree with the changes to the shared code that
> allow this solution to be implemented - even with PPC64_ONLY this is
> polluting the shared code. My concern is similar to what I said with the
> taskQueue changes - these algorithms should be expressed using the
> correct OrderAccess operations to guarantee the desired properties
> independent of architecture. If such a "barrier" is not needed on a
> given architecture then the implementation in OrderAccess should reduce
> to a no-op.
>
> And as Vitaly points out the constructor barriers are not needed under
> the JMM.
>
>> I am fine with suggested changes because you did not change our current
>> code for our platforms (please, do not change do_exits() now).
>> But may be it should be done using more general query which is set
>> depending on platform:
>>
>> OrderAccess::needs_support_iriw_ordering()
>>
>> or similar to what we use now:
>>
>> VM_Version::needs_support_iriw_ordering()
>
> Every platform has to support IRIW this is simply part of the Java
> Memory Model, there should not be any need to call this out explicitly
> like this.
>
> Is there some subtlety of the hardware I am missing here? Are there
> visibility issues beyond the ordering constraints that the JMM defines?
>>   From what I understand our ppc port is also affected. David?
>
> We can not discuss that on an OpenJDK mailing list - sorry.
>
> David
> -----
>
>> In library_call.cpp can you add {}? New comment should be inside  else {}.
>>
>> I think you should make _wrote_volatile field not ppc64 specific which
>> will be set to 'true' only on ppc64. Then you will not need PPC64_ONLY()
>> except in do_put_xxx() where it is set to true. Too many #ifdefs.
>>
>> In do_put_xxx() can you combine your changes:
>>
>>      if (is_vol) {
>>        // See comment in do_get_xxx().
>> #ifndef PPC64
>>        insert_mem_bar(Op_MemBarVolatile); // Use fat membar
>> #else
>>        if (is_field) {
>>          // Add MemBarRelease for constructors which write volatile field
>> (PPC64).
>>          set_wrote_volatile(true);
>>        }
>> #endif
>>      }
>>
>> Thanks,
>> Vladimir
>>
>> On 11/25/13 8:16 AM, Lindenmaier, Goetz wrote:
>>> Hi,
>>>
>>> I preprared a webrev with fixes for PPC for the VolatileIRIWTest of
>>> the torture test suite:
>>> http://cr.openjdk.java.net/~goetz/webrevs/8029101-0-raw/
>>>
>>> Example:
>>> volatile x=0, y=0
>>>     __________    __________    __________    __________
>>> | Thread 0 |  | Thread 1 |  | Thread 2 |  | Thread 3 |
>>>
>>>      write(x=1)    read(x)       write(y=1)    read(y)
>>>                    read(y)                     read(x)
>>>
>>> Disallowed:     x=1, y=0                    y=1, x=0
>>>
>>>
>>> Solution: This example requires multiple-copy-atomicity. This is only
>>> assured by the sync instruction and if it is executed in the threads
>>> doing the loads. Thus we implement volatile read as sync-load-acquire
>>> and omit the sync/MemBarVolatile after the volatile store.
>>> MemBarVolatile happens to be implemented by sync.
>>> We fix this in C2 and the cpp interpreter.
>>>
>>> This addresses a similar issue as fix "8012144: multiple SIGSEGVs
>>> fails on staxf" for taskqueue.hpp.
>>>
>>> Further this change contains a fix that assures that volatile fields
>>> written in constructors are visible before the reference gets
>>> published.
>>>
>>>
>>> Looking at the code, we found a MemBarRelease that to us, seems too
>>> strong.
>>> We think in parse1.cpp do_exits() a MemBarStoreStore should suffice.
>>> What do you think?
>>>
>>> Please review and test this change.
>>>
>>> Best regards,
>>>     Goetz.
>>>