RFR(M): Memory ordering in taskqueue.hpp

Mon Mar 11 05:24:18 PDT 2013

I should have been paying a little more attention to this, and will read ppopp13 carefully
later this morning, but in the original Chase-Lev algorithm we did have memory
barriers, but removed them because "that is not house style for SPAA" (!!!!!!!!!!!!!!!)

That paper includes one more memory barrier than we had anticipated; given that they
have real weak-consistency hardware to test on, and a proof to go with it, I would trust
their paper (mostly -- I would not presume to use fewer barriers).  Except for the
growing/shrinking queue, Chase/Lev pretty carefully follows the original ABP design,
so it's also likely to apply in the case of ABP-derived queues that don't grow or shrink.

David

On 2013-03-11, at 4:39 AM, "Lindenmaier, Goetz" <goetz.lindenmaier at sap.com> wrote:

> Hi David,
> 
> It's a prerequisite that thread two reads the new value of bottom.
> It it doesn't, there is no problem.
> 
> Thread three will read the new value of age, because that is 
> written with cmpxchg.  And if thread three sees a 
> bottom older than thread two that is inconsistent.
> 
> Best regards,
>  Goetz.
> 
> 
> 
> 
> 
> 
> -----Original Message-----
> From: David Holmes [mailto:david.holmes at oracle.com] 
> Sent: Samstag, 9. März 2013 00:41
> To: Volker Simonis
> Cc: Lindenmaier, Goetz; Vitaly Davidovich; Simonis, Volker; hotspot-dev at openjdk.java.net
> Subject: Re: RFR(M): Memory ordering in taskqueue.hpp
> 
> On 9/03/2013 2:08 AM, Volker Simonis wrote:
>> On Fri, Mar 8, 2013 at 4:31 PM, Lindenmaier, Goetz
>> <goetz.lindenmaier at sap.com> wrote:
>>> Hi Vitaly,
>>> 
>>> the writer has a sync after the write to bottom (see pop_local)
>>> but the store and the sync are not one atomic operation.  Stuff
>>> can happen before the sync is executed.  Therefore the reader
>>> must sync, too.
>>> 
>> 
>> Exactly - you have three threads. The first one executes the write
>> to bottom but not the following sync. The second thread reads this
>> new value (by chance) and uses it but the third thread still sees the
>> old value because there's no multipl-copy-atomicity on PPC. So we
>> really need the sync before the read to assure thread three is not seeing
>> an older value than thread two.
> 
> That doesn't quite make sense to me in isolation because thread two 
> could read the old value just before thread one writes the new one 
> anyway. I would think there would have to be a reordering issue as well, 
> and that is what the sync would fix. (I haven't read the paper.)
> 
> David
> 
>>> Anyways, in pop_global it should be cheaper than in pop_local.
>>> 
>>> Best regards,
>>>   Geotz.
>>> 
>>> From: Vitaly Davidovich [mailto:vitalyd at gmail.com]
>>> Sent: Freitag, 8. März 2013 16:13
>>> To: Lindenmaier, Goetz
>>> Cc: hotspot-dev at openjdk.java.net; David Holmes; John Cuthbertson; Simonis, Volker
>>> Subject: RE: RFR(M): Memory ordering in taskqueue.hpp
>>> 
>>> 
>>> Hmm, if load fences on the reader aren't sufficient, seems like the problem is missing order/fence on the writer? If the write isn't visible to all processors immediately on power, shouldn't there be a fence in the write to ensure all CPUs see it? Putting the fence in the reader seems odd, but maybe that's just me.
>>> 
>>> Thanks
>>> 
>>> Sent from my phone
>>> On Mar 8, 2013 10:03 AM, "Lindenmaier, Goetz" <goetz.lindenmaier at sap.com<mailto:goetz.lindenmaier at sap.com>> wrote:
>>> Hi Vitaly,
>>> 
>>> No, the problem is not reordering.  The problem is that
>>> _bottom, which is read after _age, might be older than _age because
>>> another processor didn't write it back yet.  The fence (sync) makes the
>>> current thread wait until it has the new _bottom.
>>> On Power, a write is not visible to all other threads simultaneously
>>> (no multipl-copy-atomicity).
>>> 
>>> What you propose helps if you look at two processors, but not if
>>> four threads on four processors are involved.
>>> 
>>> Best regards,
>>>   Goetz.
>>> 
>>> 
>>> From: Vitaly Davidovich [mailto:vitalyd at gmail.com<mailto:vitalyd at gmail.com>]
>>> Sent: Freitag, 8. März 2013 15:29
>>> To: Lindenmaier, Goetz
>>> Cc: David Holmes; hotspot-dev at openjdk.java.net<mailto:hotspot-dev at openjdk.java.net>; John Cuthbertson; Simonis, Volker
>>> Subject: RE: RFR(M): Memory ordering in taskqueue.hpp
>>> 
>>> 
>>> Hi Goetz,
>>> 
>>> If I understand correctly the problem is that bottom is loaded before age via compiler or arch reordering.  But why is a full fence needed? Is it not sufficient to have just a load fence to load age and relaxed load (just compiler barrier) for bottom?
>>> 
>>> Thanks
>>> 
>>> Sent from my phone
>>> On Mar 8, 2013 9:06 AM, "Lindenmaier, Goetz" <goetz.lindenmaier at sap.com<mailto:goetz.lindenmaier at sap.com>> wrote:
>>> Hi,
>>> 
>>> we have been, and still are, doing research on this issue.
>>> We want to keep you up to date on this, and propose on the
>>> further proceeding.
>>> 
>>> You asked explicit memory ordering operations and a rationale
>>> why we added them.
>>> 
>>> Axel found a paper describing the task queue algorithm and the
>>> needed ordering on arm and power:
>>> Correct and Efficient Work-Stealing for Weak Memory Models;
>>> Lê, Pop, Cohen, Nardelli; PPoPP'13;
>>> http://www.di.ens.fr/~zappa/readings/ppopp13.pdf
>>> 
>>> According to this paper we need to add one fence and one load_acquire
>>> to your implementation of the task queue.  You find this fence in this small
>>> webrev:  http://cr.openjdk.java.net/~goetz/webrevs/8006971-2/
>>> 
>>> With this fence, the algorithm works on Linux for our openjdk ppc
>>> port, and also for our SAP JVM .
>>> 
>>> Actually, the fence fixes a problem we discovered with the concurrency
>>> torture test suite.  The problem occurs with four or more GC threads.
>>> If three threads are stealing from the queue of the fourth, two of
>>> them can pop the same element. Without a fence between the access
>>> to age and bottom in pop_global(), bottom can be older than age.
>>> 
>>> 
>>> Unfortunately, the OpenJDK implementation with the added fence
>>> does not work on AIX.  Axel already detected one place where the xlC
>>> compiler optimization removed load instruction that is required for
>>> the correctness of the algorithm.  If we use our access routines with load_acquire
>>> (see original webrev below) everything works, unclear why.
>>> 
>>> Now, we think C++ might allow that this load is removed and xlC does
>>> the correct, but unexpected thing.  On the other hand it might also be
>>> a compiler bug.
>>> We are currently discussing this with the IBM xlC developers.
>>> 
>>> Best regards,
>>>   Axel and Goetz.
>>> 
>>> 
>>> PS: The webrev we proposed originally:
>>> http://cr.openjdk.java.net/~goetz/webrevs/webrev-taskqueue/
>>> 
>>> 
>>> 
>>> 
>>> 
>>>