RFR(M): Memory ordering in taskqueue.hpp

Tue Mar 19 13:13:55 PDT 2013

Hi Goetz,

On 19/03/2013 8:00 PM, Lindenmaier, Goetz wrote:
> Hi David,
>
> To implement dependent-load-+isync you need to know the result register of the load.
> With OrderAccess::loadload() this is not expressible. As you can see, we use it in
> load_acquire(), as there we can express the dependency to the register.

Sorry I meant to refer to an introduced control-dependency+isync

David

> Best regards,
>    Goetz.
>
> -----Original Message-----
> From: David Holmes [mailto:david.holmes at oracle.com]
> Sent: Dienstag, 19. März 2013 10:37
> To: Lindenmaier, Goetz
> Cc: John Cuthbertson; Siebenborn, Axel; Vitaly Davidovich; David Chase; Simonis, Volker; hotspot-dev at openjdk.java.net
> Subject: Re: RFR(M): Memory ordering in taskqueue.hpp
>
> Hi Goetz,
>
> On 19/03/2013 7:21 PM, Lindenmaier, Goetz wrote:
>> Hi John,
>>
>> Good to see there is work on this!
>> We'll be happy to test your change, but we are pretty sure it will fail
>> if you don't issue an isync instruction.
>> To properly compare our and your implementation we need to
>> know how you implemented OrderAccess, i.e., file
>> orderAccess_linux_ppc.inline.hpp.
>>
>> If you may not share this file from your closed source,
>> could you please compare it yourself to our implementation?
>> Ours is open ;)
>> http://hg.openjdk.java.net/ppc-aix-port/jdk7u/hotspot/file/39fdd72ec591/src/os_cpu/linux_ppc/vm/
>
> Looking at the JMM cookbook:
>
> http://g.oswego.edu/dl/jmm/cookbook.html
>
> and your implementation, I note that you use explicit fences (lwsync)
> for loadStore and loadLoad, where it is also possible to introduce a
> dependent-load+isync to get the same effect.
>
> David
> -----
>
>>> BTW - whatever patch is committed - I intend to make you the contributor.
>> That's nice, but please account this fix to Axel Siebenborn, he analyzed the bug,
>> dug out the paper and did the fix.  I only do the openJDK contribution.
>> Anyways, I'm off next week, so he will take over.
>>
>> Best regards,
>>     Goetz.
>>
>>
>>
>> -----Original Message-----
>> From: John Cuthbertson [mailto:john.cuthbertson at oracle.com]
>> Sent: Montag, 18. März 2013 18:56
>> To: Lindenmaier, Goetz
>> Cc: Vitaly Davidovich; David Chase; Simonis, Volker; David Holmes; hotspot-dev at openjdk.java.net
>> Subject: Re: RFR(M): Memory ordering in taskqueue.hpp
>>
>> Hi Geotz,
>>
>> I have a patch that I synthesized from yours that I gave to one our
>> engineers who is working on PowerPC. He was seeing crashes before
>> applying your patch and mine. With either patch applied, the crashes
>> disappear.
>>
>> The only real difference between our patches is that I only ordered
>> access to the _bottom field and left accesses to the _age field (and
>> it's sub fields) unordered. If I give you my patch do you think you
>> could test with it? It might take a day or two as I want to get the
>> fixes for the G1 marking bugs committed - the stability of G1 is my
>> highest priority.
>>
>> BTW - whatever patch is committed - I intend to make you the contributor.
>>
>> Cheers,
>>
>> JohnC
>>
>> On 3/18/2013 3:27 AM, Lindenmaier, Goetz wrote:
>>> Hi,
>>>
>>> I want to take up this issue again.
>>> I think we gave a good reason for the fence we added in the taskqueue.
>>> Do you have any remaining objections against this change?  We are
>>> happy to discuss or adapt the change.
>>> If not, could somebody please push this change to hotspot?
>>>
>>> http://cr.openjdk.java.net/~goetz/webrevs/8006971-2/
>>>
>>> Best regards,
>>>      Goetz.
>>>
>>>
>>> From: Vitaly Davidovich [mailto:vitalyd at gmail.com]
>>> Sent: Freitag, 8. März 2013 17:13
>>> To: Volker Simonis
>>> Cc: hotspot-dev at openjdk.java.net; David Holmes; Lindenmaier, Goetz; Simonis, Volker
>>> Subject: Re: RFR(M): Memory ordering in taskqueue.hpp
>>>
>>>
>>> I see, thanks for the explanation guys.  Is there no atomic/locked (in x86 sense) store instruction on power? Is that what you mean by multiple atomicity? Not familiar with power, so being ignorant :).
>>>
>>> Sent from my phone
>>> On Mar 8, 2013 11:08 AM, "Volker Simonis" <volker.simonis at gmail.com<mailto:volker.simonis at gmail.com>> wrote:
>>> On Fri, Mar 8, 2013 at 4:31 PM, Lindenmaier, Goetz
>>> <goetz.lindenmaier at sap.com<mailto:goetz.lindenmaier at sap.com>> wrote:
>>>> Hi Vitaly,
>>>>
>>>> the writer has a sync after the write to bottom (see pop_local)
>>>> but the store and the sync are not one atomic operation.  Stuff
>>>> can happen before the sync is executed.  Therefore the reader
>>>> must sync, too.
>>>>
>>> Exactly - you have three threads. The first one executes the write
>>> to bottom but not the following sync. The second thread reads this
>>> new value (by chance) and uses it but the third thread still sees the
>>> old value because there's no multipl-copy-atomicity on PPC. So we
>>> really need the sync before the read to assure thread three is not seeing
>>> an older value than thread two.
>>>
>>>> Anyways, in pop_global it should be cheaper than in pop_local.
>>>>
>>>> Best regards,
>>>>      Geotz.
>>>>
>>>> From: Vitaly Davidovich [mailto:vitalyd at gmail.com<mailto:vitalyd at gmail.com>]
>>>> Sent: Freitag, 8. März 2013 16:13
>>>> To: Lindenmaier, Goetz
>>>> Cc: hotspot-dev at openjdk.java.net<mailto:hotspot-dev at openjdk.java.net>; David Holmes; John Cuthbertson; Simonis, Volker
>>>> Subject: RE: RFR(M): Memory ordering in taskqueue.hpp
>>>>
>>>>
>>>> Hmm, if load fences on the reader aren't sufficient, seems like the problem is missing order/fence on the writer? If the write isn't visible to all processors immediately on power, shouldn't there be a fence in the write to ensure all CPUs see it? Putting the fence in the reader seems odd, but maybe that's just me.
>>>>
>>>> Thanks
>>>>
>>>> Sent from my phone
>>>> On Mar 8, 2013 10:03 AM, "Lindenmaier, Goetz" <goetz.lindenmaier at sap.com<mailto:goetz.lindenmaier at sap.com><mailto:goetz.lindenmaier at sap.com<mailto:goetz.lindenmaier at sap.com>>> wrote:
>>>> Hi Vitaly,
>>>>
>>>> No, the problem is not reordering.  The problem is that
>>>> _bottom, which is read after _age, might be older than _age because
>>>> another processor didn't write it back yet.  The fence (sync) makes the
>>>> current thread wait until it has the new _bottom.
>>>> On Power, a write is not visible to all other threads simultaneously
>>>> (no multipl-copy-atomicity).
>>>>
>>>> What you propose helps if you look at two processors, but not if
>>>> four threads on four processors are involved.
>>>>
>>>> Best regards,
>>>>      Goetz.
>>>>
>>>>
>>>> From: Vitaly Davidovich [mailto:vitalyd at gmail.com<mailto:vitalyd at gmail.com><mailto:vitalyd at gmail.com<mailto:vitalyd at gmail.com>>]
>>>> Sent: Freitag, 8. März 2013 15:29
>>>> To: Lindenmaier, Goetz
>>>> Cc: David Holmes; hotspot-dev at openjdk.java.net<mailto:hotspot-dev at openjdk.java.net><mailto:hotspot-dev at openjdk.java.net<mailto:hotspot-dev at openjdk.java.net>>; John Cuthbertson; Simonis, Volker
>>>> Subject: RE: RFR(M): Memory ordering in taskqueue.hpp
>>>>
>>>>
>>>> Hi Goetz,
>>>>
>>>> If I understand correctly the problem is that bottom is loaded before age via compiler or arch reordering.  But why is a full fence needed? Is it not sufficient to have just a load fence to load age and relaxed load (just compiler barrier) for bottom?
>
>>>>
>>>> Thanks
>>>>
>>>> Sent from my phone
>>>> On Mar 8, 2013 9:06 AM, "Lindenmaier, Goetz" <goetz.lindenmaier at sap.com<mailto:goetz.lindenmaier at sap.com><mailto:goetz.lindenmaier at sap.com<mailto:goetz.lindenmaier at sap.com>>> wrote:
>>>> Hi,
>>>>
>>>> we have been, and still are, doing research on this issue.
>>>> We want to keep you up to date on this, and propose on the
>>>> further proceeding.
>>>>
>>>> You asked explicit memory ordering operations and a rationale
>>>> why we added them.
>>>>
>>>> Axel found a paper describing the task queue algorithm and the
>>>> needed ordering on arm and power:
>>>> Correct and Efficient Work-Stealing for Weak Memory Models;
>>>> Lê, Pop, Cohen, Nardelli; PPoPP'13;
>>>> http://www.di.ens.fr/~zappa/readings/ppopp13.pdf
>>>>
>>>> According to this paper we need to add one fence and one load_acquire
>>>> to your implementation of the task queue.  You find this fence in this small
>>>> webrev:  http://cr.openjdk.java.net/~goetz/webrevs/8006971-2/
>>>>
>>>> With this fence, the algorithm works on Linux for our openjdk ppc
>>>> port, and also for our SAP JVM .
>>>>
>>>> Actually, the fence fixes a problem we discovered with the concurrency
>>>> torture test suite.  The problem occurs with four or more GC threads.
>>>> If three threads are stealing from the queue of the fourth, two of
>>>> them can pop the same element. Without a fence between the access
>>>> to age and bottom in pop_global(), bottom can be older than age.
>>>>
>>>>
>>>> Unfortunately, the OpenJDK implementation with the added fence
>>>> does not work on AIX.  Axel already detected one place where the xlC
>>>> compiler optimization removed load instruction that is required for
>>>> the correctness of the algorithm.  If we use our access routines with load_acquire
>>>> (see original webrev below) everything works, unclear why.
>>>>
>>>> Now, we think C++ might allow that this load is removed and xlC does
>>>> the correct, but unexpected thing.  On the other hand it might also be
>>>> a compiler bug.
>>>> We are currently discussing this with the IBM xlC developers.
>>>>
>>>> Best regards,
>>>>      Axel and Goetz.
>>>>
>>>>
>>>> PS: The webrev we proposed originally:
>>>> http://cr.openjdk.java.net/~goetz/webrevs/webrev-taskqueue/
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>