RFR(M): Memory ordering in taskqueue.hpp
David Holmes
david.holmes at oracle.com
Tue Mar 19 02:37:25 PDT 2013
Hi Goetz,
On 19/03/2013 7:21 PM, Lindenmaier, Goetz wrote:
> Hi John,
>
> Good to see there is work on this!
> We'll be happy to test your change, but we are pretty sure it will fail
> if you don't issue an isync instruction.
> To properly compare our and your implementation we need to
> know how you implemented OrderAccess, i.e., file
> orderAccess_linux_ppc.inline.hpp.
>
> If you may not share this file from your closed source,
> could you please compare it yourself to our implementation?
> Ours is open ;)
> http://hg.openjdk.java.net/ppc-aix-port/jdk7u/hotspot/file/39fdd72ec591/src/os_cpu/linux_ppc/vm/
Looking at the JMM cookbook:
http://g.oswego.edu/dl/jmm/cookbook.html
and your implementation, I note that you use explicit fences (lwsync)
for loadStore and loadLoad, where it is also possible to introduce a
dependent-load+isync to get the same effect.
David
-----
>> BTW - whatever patch is committed - I intend to make you the contributor.
> That's nice, but please account this fix to Axel Siebenborn, he analyzed the bug,
> dug out the paper and did the fix. I only do the openJDK contribution.
> Anyways, I'm off next week, so he will take over.
>
> Best regards,
> Goetz.
>
>
>
> -----Original Message-----
> From: John Cuthbertson [mailto:john.cuthbertson at oracle.com]
> Sent: Montag, 18. März 2013 18:56
> To: Lindenmaier, Goetz
> Cc: Vitaly Davidovich; David Chase; Simonis, Volker; David Holmes; hotspot-dev at openjdk.java.net
> Subject: Re: RFR(M): Memory ordering in taskqueue.hpp
>
> Hi Geotz,
>
> I have a patch that I synthesized from yours that I gave to one our
> engineers who is working on PowerPC. He was seeing crashes before
> applying your patch and mine. With either patch applied, the crashes
> disappear.
>
> The only real difference between our patches is that I only ordered
> access to the _bottom field and left accesses to the _age field (and
> it's sub fields) unordered. If I give you my patch do you think you
> could test with it? It might take a day or two as I want to get the
> fixes for the G1 marking bugs committed - the stability of G1 is my
> highest priority.
>
> BTW - whatever patch is committed - I intend to make you the contributor.
>
> Cheers,
>
> JohnC
>
> On 3/18/2013 3:27 AM, Lindenmaier, Goetz wrote:
>> Hi,
>>
>> I want to take up this issue again.
>> I think we gave a good reason for the fence we added in the taskqueue.
>> Do you have any remaining objections against this change? We are
>> happy to discuss or adapt the change.
>> If not, could somebody please push this change to hotspot?
>>
>> http://cr.openjdk.java.net/~goetz/webrevs/8006971-2/
>>
>> Best regards,
>> Goetz.
>>
>>
>> From: Vitaly Davidovich [mailto:vitalyd at gmail.com]
>> Sent: Freitag, 8. März 2013 17:13
>> To: Volker Simonis
>> Cc: hotspot-dev at openjdk.java.net; David Holmes; Lindenmaier, Goetz; Simonis, Volker
>> Subject: Re: RFR(M): Memory ordering in taskqueue.hpp
>>
>>
>> I see, thanks for the explanation guys. Is there no atomic/locked (in x86 sense) store instruction on power? Is that what you mean by multiple atomicity? Not familiar with power, so being ignorant :).
>>
>> Sent from my phone
>> On Mar 8, 2013 11:08 AM, "Volker Simonis" <volker.simonis at gmail.com<mailto:volker.simonis at gmail.com>> wrote:
>> On Fri, Mar 8, 2013 at 4:31 PM, Lindenmaier, Goetz
>> <goetz.lindenmaier at sap.com<mailto:goetz.lindenmaier at sap.com>> wrote:
>>> Hi Vitaly,
>>>
>>> the writer has a sync after the write to bottom (see pop_local)
>>> but the store and the sync are not one atomic operation. Stuff
>>> can happen before the sync is executed. Therefore the reader
>>> must sync, too.
>>>
>> Exactly - you have three threads. The first one executes the write
>> to bottom but not the following sync. The second thread reads this
>> new value (by chance) and uses it but the third thread still sees the
>> old value because there's no multipl-copy-atomicity on PPC. So we
>> really need the sync before the read to assure thread three is not seeing
>> an older value than thread two.
>>
>>> Anyways, in pop_global it should be cheaper than in pop_local.
>>>
>>> Best regards,
>>> Geotz.
>>>
>>> From: Vitaly Davidovich [mailto:vitalyd at gmail.com<mailto:vitalyd at gmail.com>]
>>> Sent: Freitag, 8. März 2013 16:13
>>> To: Lindenmaier, Goetz
>>> Cc: hotspot-dev at openjdk.java.net<mailto:hotspot-dev at openjdk.java.net>; David Holmes; John Cuthbertson; Simonis, Volker
>>> Subject: RE: RFR(M): Memory ordering in taskqueue.hpp
>>>
>>>
>>> Hmm, if load fences on the reader aren't sufficient, seems like the problem is missing order/fence on the writer? If the write isn't visible to all processors immediately on power, shouldn't there be a fence in the write to ensure all CPUs see it? Putting the fence in the reader seems odd, but maybe that's just me.
>>>
>>> Thanks
>>>
>>> Sent from my phone
>>> On Mar 8, 2013 10:03 AM, "Lindenmaier, Goetz" <goetz.lindenmaier at sap.com<mailto:goetz.lindenmaier at sap.com><mailto:goetz.lindenmaier at sap.com<mailto:goetz.lindenmaier at sap.com>>> wrote:
>>> Hi Vitaly,
>>>
>>> No, the problem is not reordering. The problem is that
>>> _bottom, which is read after _age, might be older than _age because
>>> another processor didn't write it back yet. The fence (sync) makes the
>>> current thread wait until it has the new _bottom.
>>> On Power, a write is not visible to all other threads simultaneously
>>> (no multipl-copy-atomicity).
>>>
>>> What you propose helps if you look at two processors, but not if
>>> four threads on four processors are involved.
>>>
>>> Best regards,
>>> Goetz.
>>>
>>>
>>> From: Vitaly Davidovich [mailto:vitalyd at gmail.com<mailto:vitalyd at gmail.com><mailto:vitalyd at gmail.com<mailto:vitalyd at gmail.com>>]
>>> Sent: Freitag, 8. März 2013 15:29
>>> To: Lindenmaier, Goetz
>>> Cc: David Holmes; hotspot-dev at openjdk.java.net<mailto:hotspot-dev at openjdk.java.net><mailto:hotspot-dev at openjdk.java.net<mailto:hotspot-dev at openjdk.java.net>>; John Cuthbertson; Simonis, Volker
>>> Subject: RE: RFR(M): Memory ordering in taskqueue.hpp
>>>
>>>
>>> Hi Goetz,
>>>
>>> If I understand correctly the problem is that bottom is loaded before age via compiler or arch reordering. But why is a full fence needed? Is it not sufficient to have just a load fence to load age and relaxed load (just compiler barrier) for bottom?
>>>
>>> Thanks
>>>
>>> Sent from my phone
>>> On Mar 8, 2013 9:06 AM, "Lindenmaier, Goetz" <goetz.lindenmaier at sap.com<mailto:goetz.lindenmaier at sap.com><mailto:goetz.lindenmaier at sap.com<mailto:goetz.lindenmaier at sap.com>>> wrote:
>>> Hi,
>>>
>>> we have been, and still are, doing research on this issue.
>>> We want to keep you up to date on this, and propose on the
>>> further proceeding.
>>>
>>> You asked explicit memory ordering operations and a rationale
>>> why we added them.
>>>
>>> Axel found a paper describing the task queue algorithm and the
>>> needed ordering on arm and power:
>>> Correct and Efficient Work-Stealing for Weak Memory Models;
>>> Lê, Pop, Cohen, Nardelli; PPoPP'13;
>>> http://www.di.ens.fr/~zappa/readings/ppopp13.pdf
>>>
>>> According to this paper we need to add one fence and one load_acquire
>>> to your implementation of the task queue. You find this fence in this small
>>> webrev: http://cr.openjdk.java.net/~goetz/webrevs/8006971-2/
>>>
>>> With this fence, the algorithm works on Linux for our openjdk ppc
>>> port, and also for our SAP JVM .
>>>
>>> Actually, the fence fixes a problem we discovered with the concurrency
>>> torture test suite. The problem occurs with four or more GC threads.
>>> If three threads are stealing from the queue of the fourth, two of
>>> them can pop the same element. Without a fence between the access
>>> to age and bottom in pop_global(), bottom can be older than age.
>>>
>>>
>>> Unfortunately, the OpenJDK implementation with the added fence
>>> does not work on AIX. Axel already detected one place where the xlC
>>> compiler optimization removed load instruction that is required for
>>> the correctness of the algorithm. If we use our access routines with load_acquire
>>> (see original webrev below) everything works, unclear why.
>>>
>>> Now, we think C++ might allow that this load is removed and xlC does
>>> the correct, but unexpected thing. On the other hand it might also be
>>> a compiler bug.
>>> We are currently discussing this with the IBM xlC developers.
>>>
>>> Best regards,
>>> Axel and Goetz.
>>>
>>>
>>> PS: The webrev we proposed originally:
>>> http://cr.openjdk.java.net/~goetz/webrevs/webrev-taskqueue/
>>>
>>>
>>>
>>>
>>>
>>>
>
More information about the hotspot-dev
mailing list