RFR(M): Memory ordering in taskqueue.hpp

Wed Mar 20 02:34:12 PDT 2013

Hi David,

as we understand, you can not do this in the existing  loadload and loadstore
methods, as also for the control-dependency you need the dependency to the
load that must be ordered.  I.e., if you implement it like this:
  OrderAccess::loadload()   { __asm__ __volatile__ ("cmp cr0 R0, R0; bne target; isync; label: target")  }
there is no dependency to a load to be ordered.

You could add new methods to OrderAccess as
 loadload_afterDependentBranch which does an isync, and may only be called
if the C-code does a branch depending on the volatile load, e.g.

   b = *(volatile int*)x;
   if (b) {
    OrderAccess::loadload_afterDependentBranch();
    ... further loads or stores which need ordering
  }

LoadStore similarly. 

Further, this instruction sequence orders only one load (load of x in the example) 
against all following ones, while we understood that the loadload() method must
order all preceding loads wrt. all succeeding ones.

By the way, cmp bne is more expensive than twi we use in load_acquire.

Best regards,
  Goetz and Martin

-----Original Message-----
From: David Holmes [mailto:david.holmes at oracle.com] 
Sent: Dienstag, 19. März 2013 21:14
To: Lindenmaier, Goetz
Cc: John Cuthbertson; Siebenborn, Axel; Vitaly Davidovich; David Chase; Simonis, Volker; hotspot-dev at openjdk.java.net
Subject: Re: RFR(M): Memory ordering in taskqueue.hpp

Hi Goetz,

On 19/03/2013 8:00 PM, Lindenmaier, Goetz wrote:
> Hi David,
>
> To implement dependent-load-+isync you need to know the result register of the load.
> With OrderAccess::loadload() this is not expressible. As you can see, we use it in
> load_acquire(), as there we can express the dependency to the register.

Sorry I meant to refer to an introduced control-dependency+isync

David

> Best regards,
>    Goetz.
>
> -----Original Message-----
> From: David Holmes [mailto:david.holmes at oracle.com]
> Sent: Dienstag, 19. März 2013 10:37
> To: Lindenmaier, Goetz
> Cc: John Cuthbertson; Siebenborn, Axel; Vitaly Davidovich; David Chase; Simonis, Volker; hotspot-dev at openjdk.java.net
> Subject: Re: RFR(M): Memory ordering in taskqueue.hpp
>
> Hi Goetz,
>
> On 19/03/2013 7:21 PM, Lindenmaier, Goetz wrote:
>> Hi John,
>>
>> Good to see there is work on this!
>> We'll be happy to test your change, but we are pretty sure it will fail
>> if you don't issue an isync instruction.
>> To properly compare our and your implementation we need to
>> know how you implemented OrderAccess, i.e., file
>> orderAccess_linux_ppc.inline.hpp.
>>
>> If you may not share this file from your closed source,
>> could you please compare it yourself to our implementation?
>> Ours is open ;)
>> http://hg.openjdk.java.net/ppc-aix-port/jdk7u/hotspot/file/39fdd72ec591/src/os_cpu/linux_ppc/vm/
>
> Looking at the JMM cookbook:
>
> http://g.oswego.edu/dl/jmm/cookbook.html
>
> and your implementation, I note that you use explicit fences (lwsync)
> for loadStore and loadLoad, where it is also possible to introduce a
> dependent-load+isync to get the same effect.
>
> David
> -----
>
>>> BTW - whatever patch is committed - I intend to make you the contributor.
>> That's nice, but please account this fix to Axel Siebenborn, he analyzed the bug,
>> dug out the paper and did the fix.  I only do the openJDK contribution.
>> Anyways, I'm off next week, so he will take over.
>>
>> Best regards,
>>     Goetz.
>>
>>
>>
>> -----Original Message-----
>> From: John Cuthbertson [mailto:john.cuthbertson at oracle.com]
>> Sent: Montag, 18. März 2013 18:56
>> To: Lindenmaier, Goetz
>> Cc: Vitaly Davidovich; David Chase; Simonis, Volker; David Holmes; hotspot-dev at openjdk.java.net
>> Subject: Re: RFR(M): Memory ordering in taskqueue.hpp
>>
>> Hi Geotz,
>>
>> I have a patch that I synthesized from yours that I gave to one our
>> engineers who is working on PowerPC. He was seeing crashes before
>> applying your patch and mine. With either patch applied, the crashes
>> disappear.
>>
>> The only real difference between our patches is that I only ordered
>> access to the _bottom field and left accesses to the _age field (and
>> it's sub fields) unordered. If I give you my patch do you think you
>> could test with it? It might take a day or two as I want to get the
>> fixes for the G1 marking bugs committed - the stability of G1 is my
>> highest priority.
>>
>> BTW - whatever patch is committed - I intend to make you the contributor.
>>
>> Cheers,
>>
>> JohnC
>>
>> On 3/18/2013 3:27 AM, Lindenmaier, Goetz wrote:
>>> Hi,
>>>
>>> I want to take up this issue again.
>>> I think we gave a good reason for the fence we added in the taskqueue.
>>> Do you have any remaining objections against this change?  We are
>>> happy to discuss or adapt the change.
>>> If not, could somebody please push this change to hotspot?
>>>
>>> http://cr.openjdk.java.net/~goetz/webrevs/8006971-2/
>>>
>>> Best regards,
>>>      Goetz.
>>>
>>>
>>> From: Vitaly Davidovich [mailto:vitalyd at gmail.com]
>>> Sent: Freitag, 8. März 2013 17:13
>>> To: Volker Simonis
>>> Cc: hotspot-dev at openjdk.java.net; David Holmes; Lindenmaier, Goetz; Simonis, Volker
>>> Subject: Re: RFR(M): Memory ordering in taskqueue.hpp
>>>
>>>
>>> I see, thanks for the explanation guys.  Is there no atomic/locked (in x86 sense) store instruction on power? Is that what you mean by multiple atomicity? Not familiar with power, so being ignorant :).
>>>
>>> Sent from my phone
>>> On Mar 8, 2013 11:08 AM, "Volker Simonis" <volker.simonis at gmail.com<mailto:volker.simonis at gmail.com>> wrote:
>>> On Fri, Mar 8, 2013 at 4:31 PM, Lindenmaier, Goetz
>>> <goetz.lindenmaier at sap.com<mailto:goetz.lindenmaier at sap.com>> wrote:
>>>> Hi Vitaly,
>>>>
>>>> the writer has a sync after the write to bottom (see pop_local)
>>>> but the store and the sync are not one atomic operation.  Stuff
>>>> can happen before the sync is executed.  Therefore the reader
>>>> must sync, too.
>>>>
>>> Exactly - you have three threads. The first one executes the write
>>> to bottom but not the following sync. The second thread reads this
>>> new value (by chance) and uses it but the third thread still sees the
>>> old value because there's no multipl-copy-atomicity on PPC. So we
>>> really need the sync before the read to assure thread three is not seeing
>>> an older value than thread two.
>>>
>>>> Anyways, in pop_global it should be cheaper than in pop_local.
>>>>
>>>> Best regards,
>>>>      Geotz.
>>>>
>>>> From: Vitaly Davidovich [mailto:vitalyd at gmail.com<mailto:vitalyd at gmail.com>]
>>>> Sent: Freitag, 8. März 2013 16:13
>>>> To: Lindenmaier, Goetz
>>>> Cc: hotspot-dev at openjdk.java.net<mailto:hotspot-dev at openjdk.java.net>; David Holmes; John Cuthbertson; Simonis, Volker
>>>> Subject: RE: RFR(M): Memory ordering in taskqueue.hpp
>>>>
>>>>
>>>> Hmm, if load fences on the reader aren't sufficient, seems like the problem is missing order/fence on the writer? If the write isn't visible to all processors immediately on power, shouldn't there be a fence in the write to ensure all CPUs see it? Putting the fence in the reader seems odd, but maybe that's just me.
>>>>
>>>> Thanks
>>>>
>>>> Sent from my phone
>>>> On Mar 8, 2013 10:03 AM, "Lindenmaier, Goetz" <goetz.lindenmaier at sap.com<mailto:goetz.lindenmaier at sap.com><mailto:goetz.lindenmaier at sap.com<mailto:goetz.lindenmaier at sap.com>>> wrote:
>>>> Hi Vitaly,
>>>>
>>>> No, the problem is not reordering.  The problem is that
>>>> _bottom, which is read after _age, might be older than _age because
>>>> another processor didn't write it back yet.  The fence (sync) makes the
>>>> current thread wait until it has the new _bottom.
>>>> On Power, a write is not visible to all other threads simultaneously
>>>> (no multipl-copy-atomicity).
>>>>
>>>> What you propose helps if you look at two processors, but not if
>>>> four threads on four processors are involved.
>>>>
>>>> Best regards,
>>>>      Goetz.
>>>>
>>>>
>>>> From: Vitaly Davidovich [mailto:vitalyd at gmail.com<mailto:vitalyd at gmail.com><mailto:vitalyd at gmail.com<mailto:vitalyd at gmail.com>>]
>>>> Sent: Freitag, 8. März 2013 15:29
>>>> To: Lindenmaier, Goetz
>>>> Cc: David Holmes; hotspot-dev at openjdk.java.net<mailto:hotspot-dev at openjdk.java.net><mailto:hotspot-dev at openjdk.java.net<mailto:hotspot-dev at openjdk.java.net>>; John Cuthbertson; Simonis, Volker
>>>> Subject: RE: RFR(M): Memory ordering in taskqueue.hpp
>>>>
>>>>
>>>> Hi Goetz,
>>>>
>>>> If I understand correctly the problem is that bottom is loaded before age via compiler or arch reordering.  But why is a full fence needed? Is it not sufficient to have just a load fence to load age and relaxed load (just compiler barrier) for bottom?
>
>>>>
>>>> Thanks
>>>>
>>>> Sent from my phone
>>>> On Mar 8, 2013 9:06 AM, "Lindenmaier, Goetz" <goetz.lindenmaier at sap.com<mailto:goetz.lindenmaier at sap.com><mailto:goetz.lindenmaier at sap.com<mailto:goetz.lindenmaier at sap.com>>> wrote:
>>>> Hi,
>>>>
>>>> we have been, and still are, doing research on this issue.
>>>> We want to keep you up to date on this, and propose on the
>>>> further proceeding.
>>>>
>>>> You asked explicit memory ordering operations and a rationale
>>>> why we added them.
>>>>
>>>> Axel found a paper describing the task queue algorithm and the
>>>> needed ordering on arm and power:
>>>> Correct and Efficient Work-Stealing for Weak Memory Models;
>>>> Lê, Pop, Cohen, Nardelli; PPoPP'13;
>>>> http://www.di.ens.fr/~zappa/readings/ppopp13.pdf
>>>>
>>>> According to this paper we need to add one fence and one load_acquire
>>>> to your implementation of the task queue.  You find this fence in this small
>>>> webrev:  http://cr.openjdk.java.net/~goetz/webrevs/8006971-2/
>>>>
>>>> With this fence, the algorithm works on Linux for our openjdk ppc
>>>> port, and also for our SAP JVM .
>>>>
>>>> Actually, the fence fixes a problem we discovered with the concurrency
>>>> torture test suite.  The problem occurs with four or more GC threads.
>>>> If three threads are stealing from the queue of the fourth, two of
>>>> them can pop the same element. Without a fence between the access
>>>> to age and bottom in pop_global(), bottom can be older than age.
>>>>
>>>>
>>>> Unfortunately, the OpenJDK implementation with the added fence
>>>> does not work on AIX.  Axel already detected one place where the xlC
>>>> compiler optimization removed load instruction that is required for
>>>> the correctness of the algorithm.  If we use our access routines with load_acquire
>>>> (see original webrev below) everything works, unclear why.
>>>>
>>>> Now, we think C++ might allow that this load is removed and xlC does
>>>> the correct, but unexpected thing.  On the other hand it might also be
>>>> a compiler bug.
>>>> We are currently discussing this with the IBM xlC developers.
>>>>
>>>> Best regards,
>>>>      Axel and Goetz.
>>>>
>>>>
>>>> PS: The webrev we proposed originally:
>>>> http://cr.openjdk.java.net/~goetz/webrevs/webrev-taskqueue/
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>