RFR(M): Memory ordering in taskqueue.hpp

Fri Mar 8 08:12:33 PST 2013

I see, thanks for the explanation guys.  Is there no atomic/locked (in x86
sense) store instruction on power? Is that what you mean by multiple
atomicity? Not familiar with power, so being ignorant :).

Sent from my phone
On Mar 8, 2013 11:08 AM, "Volker Simonis" <volker.simonis at gmail.com> wrote:

> On Fri, Mar 8, 2013 at 4:31 PM, Lindenmaier, Goetz
> <goetz.lindenmaier at sap.com> wrote:
> > Hi Vitaly,
> >
> > the writer has a sync after the write to bottom (see pop_local)
> > but the store and the sync are not one atomic operation.  Stuff
> > can happen before the sync is executed.  Therefore the reader
> > must sync, too.
> >
>
> Exactly - you have three threads. The first one executes the write
> to bottom but not the following sync. The second thread reads this
> new value (by chance) and uses it but the third thread still sees the
> old value because there's no multipl-copy-atomicity on PPC. So we
> really need the sync before the read to assure thread three is not seeing
> an older value than thread two.
>
> > Anyways, in pop_global it should be cheaper than in pop_local.
> >
> > Best regards,
> >   Geotz.
> >
> > From: Vitaly Davidovich [mailto:vitalyd at gmail.com]
> > Sent: Freitag, 8. März 2013 16:13
> > To: Lindenmaier, Goetz
> > Cc: hotspot-dev at openjdk.java.net; David Holmes; John Cuthbertson;
> Simonis, Volker
> > Subject: RE: RFR(M): Memory ordering in taskqueue.hpp
> >
> >
> > Hmm, if load fences on the reader aren't sufficient, seems like the
> problem is missing order/fence on the writer? If the write isn't visible to
> all processors immediately on power, shouldn't there be a fence in the
> write to ensure all CPUs see it? Putting the fence in the reader seems odd,
> but maybe that's just me.
> >
> > Thanks
> >
> > Sent from my phone
> > On Mar 8, 2013 10:03 AM, "Lindenmaier, Goetz" <goetz.lindenmaier at sap.com
> <mailto:goetz.lindenmaier at sap.com>> wrote:
> > Hi Vitaly,
> >
> > No, the problem is not reordering.  The problem is that
> > _bottom, which is read after _age, might be older than _age because
> > another processor didn't write it back yet.  The fence (sync) makes the
> > current thread wait until it has the new _bottom.
> > On Power, a write is not visible to all other threads simultaneously
> > (no multipl-copy-atomicity).
> >
> > What you propose helps if you look at two processors, but not if
> > four threads on four processors are involved.
> >
> > Best regards,
> >   Goetz.
> >
> >
> > From: Vitaly Davidovich [mailto:vitalyd at gmail.com<mailto:
> vitalyd at gmail.com>]
> > Sent: Freitag, 8. März 2013 15:29
> > To: Lindenmaier, Goetz
> > Cc: David Holmes; hotspot-dev at openjdk.java.net<mailto:
> hotspot-dev at openjdk.java.net>; John Cuthbertson; Simonis, Volker
> > Subject: RE: RFR(M): Memory ordering in taskqueue.hpp
> >
> >
> > Hi Goetz,
> >
> > If I understand correctly the problem is that bottom is loaded before
> age via compiler or arch reordering.  But why is a full fence needed? Is it
> not sufficient to have just a load fence to load age and relaxed load (just
> compiler barrier) for bottom?
> >
> > Thanks
> >
> > Sent from my phone
> > On Mar 8, 2013 9:06 AM, "Lindenmaier, Goetz" <goetz.lindenmaier at sap.com
> <mailto:goetz.lindenmaier at sap.com>> wrote:
> > Hi,
> >
> > we have been, and still are, doing research on this issue.
> > We want to keep you up to date on this, and propose on the
> > further proceeding.
> >
> > You asked explicit memory ordering operations and a rationale
> > why we added them.
> >
> > Axel found a paper describing the task queue algorithm and the
> > needed ordering on arm and power:
> > Correct and Efficient Work-Stealing for Weak Memory Models;
> > Lê, Pop, Cohen, Nardelli; PPoPP'13;
> > http://www.di.ens.fr/~zappa/readings/ppopp13.pdf
> >
> > According to this paper we need to add one fence and one load_acquire
> > to your implementation of the task queue.  You find this fence in this
> small
> > webrev:  http://cr.openjdk.java.net/~goetz/webrevs/8006971-2/
> >
> > With this fence, the algorithm works on Linux for our openjdk ppc
> > port, and also for our SAP JVM .
> >
> > Actually, the fence fixes a problem we discovered with the concurrency
> > torture test suite.  The problem occurs with four or more GC threads.
> > If three threads are stealing from the queue of the fourth, two of
> > them can pop the same element. Without a fence between the access
> > to age and bottom in pop_global(), bottom can be older than age.
> >
> >
> > Unfortunately, the OpenJDK implementation with the added fence
> > does not work on AIX.  Axel already detected one place where the xlC
> > compiler optimization removed load instruction that is required for
> > the correctness of the algorithm.  If we use our access routines with
> load_acquire
> > (see original webrev below) everything works, unclear why.
> >
> > Now, we think C++ might allow that this load is removed and xlC does
> > the correct, but unexpected thing.  On the other hand it might also be
> > a compiler bug.
> > We are currently discussing this with the IBM xlC developers.
> >
> > Best regards,
> >   Axel and Goetz.
> >
> >
> > PS: The webrev we proposed originally:
> > http://cr.openjdk.java.net/~goetz/webrevs/webrev-taskqueue/
> >
> >
> >
> >
> >
> >
>