RFR(M): Memory ordering in taskqueue.hpp
David Chase
david.r.chase at oracle.com
Mon Mar 11 05:24:18 PDT 2013
I should have been paying a little more attention to this, and will read ppopp13 carefully
later this morning, but in the original Chase-Lev algorithm we did have memory
barriers, but removed them because "that is not house style for SPAA" (!!!!!!!!!!!!!!!)
That paper includes one more memory barrier than we had anticipated; given that they
have real weak-consistency hardware to test on, and a proof to go with it, I would trust
their paper (mostly -- I would not presume to use fewer barriers). Except for the
growing/shrinking queue, Chase/Lev pretty carefully follows the original ABP design,
so it's also likely to apply in the case of ABP-derived queues that don't grow or shrink.
David
On 2013-03-11, at 4:39 AM, "Lindenmaier, Goetz" <goetz.lindenmaier at sap.com> wrote:
> Hi David,
>
> It's a prerequisite that thread two reads the new value of bottom.
> It it doesn't, there is no problem.
>
> Thread three will read the new value of age, because that is
> written with cmpxchg. And if thread three sees a
> bottom older than thread two that is inconsistent.
>
> Best regards,
> Goetz.
>
>
>
>
>
>
> -----Original Message-----
> From: David Holmes [mailto:david.holmes at oracle.com]
> Sent: Samstag, 9. März 2013 00:41
> To: Volker Simonis
> Cc: Lindenmaier, Goetz; Vitaly Davidovich; Simonis, Volker; hotspot-dev at openjdk.java.net
> Subject: Re: RFR(M): Memory ordering in taskqueue.hpp
>
> On 9/03/2013 2:08 AM, Volker Simonis wrote:
>> On Fri, Mar 8, 2013 at 4:31 PM, Lindenmaier, Goetz
>> <goetz.lindenmaier at sap.com> wrote:
>>> Hi Vitaly,
>>>
>>> the writer has a sync after the write to bottom (see pop_local)
>>> but the store and the sync are not one atomic operation. Stuff
>>> can happen before the sync is executed. Therefore the reader
>>> must sync, too.
>>>
>>
>> Exactly - you have three threads. The first one executes the write
>> to bottom but not the following sync. The second thread reads this
>> new value (by chance) and uses it but the third thread still sees the
>> old value because there's no multipl-copy-atomicity on PPC. So we
>> really need the sync before the read to assure thread three is not seeing
>> an older value than thread two.
>
> That doesn't quite make sense to me in isolation because thread two
> could read the old value just before thread one writes the new one
> anyway. I would think there would have to be a reordering issue as well,
> and that is what the sync would fix. (I haven't read the paper.)
>
> David
>
>>> Anyways, in pop_global it should be cheaper than in pop_local.
>>>
>>> Best regards,
>>> Geotz.
>>>
>>> From: Vitaly Davidovich [mailto:vitalyd at gmail.com]
>>> Sent: Freitag, 8. März 2013 16:13
>>> To: Lindenmaier, Goetz
>>> Cc: hotspot-dev at openjdk.java.net; David Holmes; John Cuthbertson; Simonis, Volker
>>> Subject: RE: RFR(M): Memory ordering in taskqueue.hpp
>>>
>>>
>>> Hmm, if load fences on the reader aren't sufficient, seems like the problem is missing order/fence on the writer? If the write isn't visible to all processors immediately on power, shouldn't there be a fence in the write to ensure all CPUs see it? Putting the fence in the reader seems odd, but maybe that's just me.
>>>
>>> Thanks
>>>
>>> Sent from my phone
>>> On Mar 8, 2013 10:03 AM, "Lindenmaier, Goetz" <goetz.lindenmaier at sap.com<mailto:goetz.lindenmaier at sap.com>> wrote:
>>> Hi Vitaly,
>>>
>>> No, the problem is not reordering. The problem is that
>>> _bottom, which is read after _age, might be older than _age because
>>> another processor didn't write it back yet. The fence (sync) makes the
>>> current thread wait until it has the new _bottom.
>>> On Power, a write is not visible to all other threads simultaneously
>>> (no multipl-copy-atomicity).
>>>
>>> What you propose helps if you look at two processors, but not if
>>> four threads on four processors are involved.
>>>
>>> Best regards,
>>> Goetz.
>>>
>>>
>>> From: Vitaly Davidovich [mailto:vitalyd at gmail.com<mailto:vitalyd at gmail.com>]
>>> Sent: Freitag, 8. März 2013 15:29
>>> To: Lindenmaier, Goetz
>>> Cc: David Holmes; hotspot-dev at openjdk.java.net<mailto:hotspot-dev at openjdk.java.net>; John Cuthbertson; Simonis, Volker
>>> Subject: RE: RFR(M): Memory ordering in taskqueue.hpp
>>>
>>>
>>> Hi Goetz,
>>>
>>> If I understand correctly the problem is that bottom is loaded before age via compiler or arch reordering. But why is a full fence needed? Is it not sufficient to have just a load fence to load age and relaxed load (just compiler barrier) for bottom?
>>>
>>> Thanks
>>>
>>> Sent from my phone
>>> On Mar 8, 2013 9:06 AM, "Lindenmaier, Goetz" <goetz.lindenmaier at sap.com<mailto:goetz.lindenmaier at sap.com>> wrote:
>>> Hi,
>>>
>>> we have been, and still are, doing research on this issue.
>>> We want to keep you up to date on this, and propose on the
>>> further proceeding.
>>>
>>> You asked explicit memory ordering operations and a rationale
>>> why we added them.
>>>
>>> Axel found a paper describing the task queue algorithm and the
>>> needed ordering on arm and power:
>>> Correct and Efficient Work-Stealing for Weak Memory Models;
>>> Lê, Pop, Cohen, Nardelli; PPoPP'13;
>>> http://www.di.ens.fr/~zappa/readings/ppopp13.pdf
>>>
>>> According to this paper we need to add one fence and one load_acquire
>>> to your implementation of the task queue. You find this fence in this small
>>> webrev: http://cr.openjdk.java.net/~goetz/webrevs/8006971-2/
>>>
>>> With this fence, the algorithm works on Linux for our openjdk ppc
>>> port, and also for our SAP JVM .
>>>
>>> Actually, the fence fixes a problem we discovered with the concurrency
>>> torture test suite. The problem occurs with four or more GC threads.
>>> If three threads are stealing from the queue of the fourth, two of
>>> them can pop the same element. Without a fence between the access
>>> to age and bottom in pop_global(), bottom can be older than age.
>>>
>>>
>>> Unfortunately, the OpenJDK implementation with the added fence
>>> does not work on AIX. Axel already detected one place where the xlC
>>> compiler optimization removed load instruction that is required for
>>> the correctness of the algorithm. If we use our access routines with load_acquire
>>> (see original webrev below) everything works, unclear why.
>>>
>>> Now, we think C++ might allow that this load is removed and xlC does
>>> the correct, but unexpected thing. On the other hand it might also be
>>> a compiler bug.
>>> We are currently discussing this with the IBM xlC developers.
>>>
>>> Best regards,
>>> Axel and Goetz.
>>>
>>>
>>> PS: The webrev we proposed originally:
>>> http://cr.openjdk.java.net/~goetz/webrevs/webrev-taskqueue/
>>>
>>>
>>>
>>>
>>>
>>>
More information about the hotspot-dev
mailing list