RFR(M): Memory ordering in taskqueue.hpp
David Holmes
david.holmes at oracle.com
Thu Dec 20 18:46:15 PST 2012
On 21/12/2012 6:43 AM, Vitaly Davidovich wrote:
> As an aside, how come storestore is not defined as "asm volatile ("" : :
> : "memory");" on Linux x86/64?
We have an open RFE to improve things
http://bugs.sun.com/view_bug.do?bug_id=6975121
6975121 : OrderAccess should use compiler intrinsics where possible
Our problem has always been getting the right support from the official
compiler.
David
-----
> Thanks
>
> Sent from my phone
>
> On Dec 20, 2012 3:24 PM, "Lindenmaier, Goetz" <goetz.lindenmaier at sap.com
> <mailto:goetz.lindenmaier at sap.com>> wrote:
>
> Hi David,
>
> I understand that you want selected corrections, not the
> solution we chose that does it everywhere.
>
> But do you agree that our solution does not introduce any
> overhead on x86 and sparc?
>
>
>
> I'll try to explain why I think this is the case using the
> example you brought up:
>
> void set_empty() {
> _bottom = 0;
> _age.set(0);
> }
>
> will result in something like
>
> x86_64 our ppc our ia64
> ------ ------- --------
> movl 0, _bottom swz st4.rel
> movq 0, _age sd st8.rel
>
> which is correct on x86 and ia64, but wrong on power.
>
> void set_empty() {
> _bottom = 0;
> OrderAccess::storestore();
> _age.set(0);
> }
>
> will result in something like
>
> x86_64 our ppc our ia64
> ------ ------- --------
> movl 0, _bottom swz st4.rel
> movl 0, local_dummy lwsync mf
> movq 0, _age sd st8.rel
>
> which is correct, but inefficient on x86_64 (a bit) and ia64.
>
> while
>
> void set_empty() {
> _bottom = 0;
> OrderAccess::release_store_ptr((volatile intptr_t*) &_age, 0);
> }
>
> will result in
>
> movl 0, _bottom swz st4.rel
> movq 0, _age lwsync; sd st8.rel
>
> which is correct and efficient.
>
>
>
> By the way, if you have a PPC port, do you have a machine to build
> and run the openJDK PPC port?
>
> > (modulo OrderAccess is far from perfect and has some outstanding
> issues).
> What are the problems of the OrderAccess implementation?
>
> Best regards,
> Goetz.
>
>
>
>
>
>
> -----Original Message-----
> From: David Holmes [mailto:david.holmes at oracle.com
> <mailto:david.holmes at oracle.com>]
> Sent: Thursday, December 20, 2012 2:36 AM
> To: Lindenmaier, Goetz
> Cc: hotspot-dev at openjdk.java.net <mailto:hotspot-dev at openjdk.java.net>
> Subject: Re: RFR(M): Memory ordering in taskqueue.hpp
>
> On 20/12/2012 2:48 AM, Lindenmaier, Goetz wrote:
> > Hi David,
> >
> > first, thanks for the fast replies. Sorry it took me a day now!
>
> No problem. I can't guarantee response times either :)
>
> >> I didn't say that. I said these two need to be ordered and asked
> which
> >> other paired accesses to these variables need to be ordered.
> > Together with a colleague I looked at the code. To us it seems
> > very hard to identify places where access ordering is not needed.
> > Especially we think there will be places in the users of the
> > task queue that would need barriers to ensure ordering.
> > But we think, the algorithm should be self-contained, doing all the
> > required ordering itself.
>
> I agree the algorithm should be self-contained. My point is that in
> other places where we implement lock-free algorithm was put in the
> barriers/fences in the algorithmic code in the places needed we don't
> automatically wrap all of the variables accesses with code that
> introduces pre/post barriers (of whatever form).
>
> > Further, a barrier will spoil performance on x86 etc, because
> there it
> > issues a lock instruction, which is not needed.
>
> You issue the appropriate barrier/fence for the situation and they
> should be implemented in whatever form needed for the given platform
> (modulo OrderAccess is far from perfect and has some outstanding
> issues).
>
> >> I said that what you were doing was effectively treating the C++
> variables
> >> as if they were Java volatile variables. That's not the way we
> handle
> >> these issues in the VM we tend to place fences/barriers at the
> places in
> >> lock-free algorithms that need them, rather than wrap the
> variables in
> >> accessors that impose fences/barriers that are not always needed.
> >
> > As we understand, the C compilers on x86, sparc etc, enforce far more
> > than required to implement C++ volatile, basically treating the
> fields
> > like Java volatiles. Therefore your code is fine on your platforms.
>
> I think you are confusing things. The existing C/C++ compilers do very
> little in regard to "volatile". It is only the latest multi-threading
> variant of the C++ standard that even addresses this! Our C/C++
> compilers never issue any kind of memory barriers in response to a
> load/store of a volatile variable.
>
> > The C++ compilers on PPC optimize far more, maxing out
> > what the standard allows and the processer supports, leading to
> > immediate errors, but also to very subtle ones.
> >
> > Our change makes explicit what the C compilers (except for PPC) do
> > anyways, and what is required by the algorithm. (Maybe one could now
> > even omit the volatile keywords - but we dare not try.) Thus, we get
> > it right without any overhead on your platforms.
>
> This is getting things confused. The lock-free algorithms should be
> written for the most lenient of memory-models and make no assumptions
> about ordering. They should use the appropriate OrderAccess method where
> needed for algorithmic correctness.
>
> You proposal wraps the C variable in accessors that use OrderAccess
> operations on every access - in a similar way to how we have to handle
> Java volatile accesses. And as I have said that is not the style we use
> in hotspot's lock-free code. This has nothing to do with our compiler
> versus your compiler (though if you have one that issues memory-barriers
> directly then you'll get too much overhead.)
>
> >> Though we have not run into this particular issue. Did
> >> it arise in the context of a particular GC?
> > Issues are only on PPC, at least in parallelScavenge. I reproduced
> > some of the immediate errors, but I don't think this really helps
> with
> > the discussion. I will try G1 soon. Tell me if you want to know
> > more, then I'll try to come up with a wrap-up of our past fixes.
>
> We have a PPC product and as I said we have not encountered this
> problem, but we also run with limited GCs so may not have been
> exercising this code.
>
> Thanks,
> David
>
> > Best regards,
> > Goetz.
> >
> >
> > -----Original Message-----
> > From: David Holmes [mailto:david.holmes at oracle.com
> <mailto:david.holmes at oracle.com>]
> > Sent: Dienstag, 18. Dezember 2012 12:19
> > To: Lindenmaier, Goetz
> > Cc: hotspot-dev at openjdk.java.net
> <mailto:hotspot-dev at openjdk.java.net>
> > Subject: Re: RFR(M): Memory ordering in taskqueue.hpp
> >
> > Hi Goetz,
> >
> > On 18/12/2012 8:47 PM, Lindenmaier, Goetz wrote:
> >> Hi David,
> >>
> >> why do you think only these two accesses have to be ordered?
> >
> > I didn't say that. I said these two need to be ordered and asked
> which
> > other paired accesses to these variables need to be ordered.
> >
> >> As I understand, any two accesses in one thread to these fields
> have to be
> >> orderered and must be written to memory to become visible in the
> proper order
> >> to other threads.
> >
> > That depends on how they are used. It is the overall lock-free
> algorithm
> > that is important here. There may be places where the order of access
> > does not matter (and there may not be).
> >
> >> The PPC compiler really takes all advantages given by the
> >> C-Standard to optimize this, and the processor reorders heavily.
> >
> > Yes I understand. Though we have not run into this particular
> issue. Did
> > it arise in the context of a particular GC?
> >
> >> The specification of volatile in
> >> Age get() const volatile { return _data; }
> >> says that it must be read from memory, not that it can not be
> reordered.
> >
> > I didn't say anything about C/C++ volatile (which doesn't even say it
> > must be read from memory - except perhaps in latest MT C++
> standard?). I
> > said that what you were doing was effectively treating the C++
> variables
> > as if they were Java volatile variables. That's not the way we handle
> > these issues in the VM we tend to place fences/barriers at the
> places in
> > lock-free algorithms that need them, rather than wrap the
> variables in
> > accessors that impose fences/barriers that are not always needed.
> >
> > Cheers,
> > David
> >
> >> Also, doing
> >> OrderAccess::load_acquire((volatile
> idx_t*)&(_age._fields._top))
> >> is better wrt. to optimizations by C-compilers, as there first
> >> is the whole address computation, and then the cast to volatile.
> >> If we use
> >> _age.top()
> >> with
> >> idx_t top() const volatile { return
> OrderAccess::load_acquire((volatile idx_t*)&(_fields._top); }
> >> compilers don't optimize that well. (We saw this on hpia64.)
> >> Further, the access to, e.g., old_age.top() in pop_local_slow()
> would do
> >> the OrderAccess, which is useless overhead.
> >>
> >> Best regards,
> >> Goetz.
> >>
> >>
> >> -----Original Message-----
> >> From: David Holmes [mailto:david.holmes at oracle.com
> <mailto:david.holmes at oracle.com>]
> >> Sent: Dienstag, 18. Dezember 2012 01:24
> >> To: Lindenmaier, Goetz
> >> Cc: hotspot-dev at openjdk.java.net
> <mailto:hotspot-dev at openjdk.java.net>
> >> Subject: Re: RFR(M): Memory ordering in taskqueue.hpp
> >>
> >> Hi Goetz,
> >>
> >> On 17/12/2012 10:58 PM, Lindenmaier, Goetz wrote:
> >>> Hi,
> >>>
> >>> Once more, with webrev:
> >>> http://cr.openjdk.java.net/~goetz/webrevs/webrev-taskqueue/
> >>
> >> I can see that this function needs a storestore barrier:
> >>
> >> 237 void set_empty() {
> >> 238 _bottom = 0;
> >> 239 _age.set(0);
> >> 240 }
> >>
> >> to preserve ordering. Are there other functions to which we can
> >> constrain the loadload and storestore barriers rather than using the
> >> setters/getters the way you have defined? This is always about a
> pair
> >> (sometimes groups) of accesses so I'd rather deal with the pairs
> than
> >> treat each field as if it were a Java volatile.
> >>
> >> Thanks,
> >> David Holmes
> >>
> >>
> >>> Sorry for that.
> >>>
> >>> I would like to contribute fixes wrt. memory ordering in
> taskqueue.hpp.
> >>>
> >>> On Platfoms with weak memory ordering the taskqueue is not accessed
> >>> properly, as the accesses to the fields _bottom and _age are
> not ordered
> >>> correctly. Volatile is not sufficient to enforce this, because
> it depends on
> >>> what the compiler assumes to be necessary for volatile variables.
> >>>
> >>> The fix is to pull getter/setter routines from Age to
> TaskQueueSuper and use methods from
> >>> OrderAccess to access the fields in _age as well as _bottom.
> Further the code must always
> >>> use the getter/setter methods to access _bottom, _age or the
> subfields of _age.
> >>> On the other hand we can relax constraints for accesses to
> locals oldAge and newAge.
> >>>
> >>> The OrderAccess routines do simple load/stores on x86_64.
> >>>
> >>> I want to discuss this change here and would like very much to
> see it taking
> >>> it's way into OpenJDK to support ports on platforms with weak
> memory
> >>> ordering, and, in particular, our PPC port.
> >>>
> >>> Best regards,
> >>> Goetz.
> >>>
> >>>
>
More information about the hotspot-dev
mailing list