RFR(M): Memory ordering in taskqueue.hpp

Thu Dec 20 18:46:15 PST 2012

On 21/12/2012 6:43 AM, Vitaly Davidovich wrote:
> As an aside, how come storestore is not defined as "asm volatile ("" : :
> : "memory");" on Linux x86/64?

We have an open RFE to improve things

http://bugs.sun.com/view_bug.do?bug_id=6975121

6975121 : OrderAccess should use compiler intrinsics where possible

Our problem has always been getting the right support from the official 
compiler.

David
-----

> Thanks
>
> Sent from my phone
>
> On Dec 20, 2012 3:24 PM, "Lindenmaier, Goetz" <goetz.lindenmaier at sap.com
> <mailto:goetz.lindenmaier at sap.com>> wrote:
>
>     Hi David,
>
>     I understand that you want selected corrections, not the
>     solution we chose that does it everywhere.
>
>     But do you agree that our solution does not introduce any
>     overhead on x86 and sparc?
>
>
>
>     I'll try to explain why I think this is the case using the
>     example you brought up:
>
>     void set_empty() {
>        _bottom = 0;
>        _age.set(0);
>     }
>
>     will result in something like
>
>       x86_64                      our ppc                our ia64
>       ------                      -------                --------
>       movl 0, _bottom             swz                    st4.rel
>       movq 0, _age                sd                     st8.rel
>
>     which is correct on x86 and ia64, but wrong on power.
>
>     void set_empty() {
>        _bottom = 0;
>        OrderAccess::storestore();
>        _age.set(0);
>     }
>
>     will result in something like
>
>       x86_64                      our ppc                our ia64
>       ------                      -------                --------
>       movl 0, _bottom             swz                    st4.rel
>       movl 0, local_dummy         lwsync                 mf
>       movq 0, _age                sd                     st8.rel
>
>     which is correct, but inefficient on x86_64 (a bit) and ia64.
>
>     while
>
>     void set_empty() {
>        _bottom =  0;
>        OrderAccess::release_store_ptr((volatile intptr_t*) &_age, 0);
>     }
>
>     will result in
>
>       movl 0, _bottom             swz                    st4.rel
>       movq 0, _age                lwsync; sd             st8.rel
>
>     which is correct and efficient.
>
>
>
>     By the way, if you have a PPC port, do you have a machine to build
>     and run the openJDK PPC port?
>
>      > (modulo OrderAccess is far from perfect and has some outstanding
>     issues).
>     What are the problems of the OrderAccess implementation?
>
>     Best regards,
>        Goetz.
>
>
>
>
>
>
>     -----Original Message-----
>     From: David Holmes [mailto:david.holmes at oracle.com
>     <mailto:david.holmes at oracle.com>]
>     Sent: Thursday, December 20, 2012 2:36 AM
>     To: Lindenmaier, Goetz
>     Cc: hotspot-dev at openjdk.java.net <mailto:hotspot-dev at openjdk.java.net>
>     Subject: Re: RFR(M): Memory ordering in taskqueue.hpp
>
>     On 20/12/2012 2:48 AM, Lindenmaier, Goetz wrote:
>      > Hi David,
>      >
>      > first, thanks for the fast replies. Sorry it took me a day now!
>
>     No problem. I can't guarantee response times either :)
>
>      >> I didn't say that. I said these two need to be ordered and asked
>     which
>      >> other paired accesses to these variables need to be ordered.
>      > Together with a colleague I looked at the code. To us it seems
>      > very hard to identify places where access ordering is not needed.
>      > Especially we think there will be places in the users of the
>      > task queue that would need barriers to ensure ordering.
>      > But we think, the algorithm should be self-contained, doing all the
>      > required ordering itself.
>
>     I agree the algorithm should be self-contained. My point is that in
>     other places where we implement lock-free algorithm was put in the
>     barriers/fences in the algorithmic code in the places needed we don't
>     automatically wrap all of the variables accesses with code that
>     introduces pre/post barriers (of whatever form).
>
>      > Further, a barrier will spoil performance on x86 etc, because
>     there it
>      > issues a lock instruction, which is not needed.
>
>     You issue the appropriate barrier/fence for the situation and they
>     should be implemented in whatever form needed for the given platform
>     (modulo OrderAccess is far from perfect and has some outstanding
>     issues).
>
>      >> I said that what you were doing was effectively treating the C++
>     variables
>      >> as if they were Java volatile variables. That's not the way we
>     handle
>      >> these issues in the VM we tend to place fences/barriers at the
>     places in
>      >> lock-free algorithms that need them, rather than wrap the
>     variables in
>      >> accessors that impose fences/barriers that are not always needed.
>      >
>      > As we understand, the C compilers on x86, sparc etc, enforce far more
>      > than required to implement C++ volatile, basically treating the
>     fields
>      > like Java volatiles.  Therefore your code is fine on your platforms.
>
>     I think you are confusing things. The existing C/C++ compilers do very
>     little in regard to "volatile". It is only the latest multi-threading
>     variant of the C++ standard that even addresses this! Our C/C++
>     compilers never issue any kind of memory barriers in response to a
>     load/store of a volatile variable.
>
>      > The C++ compilers on PPC optimize far more, maxing out
>      > what the standard allows and the processer supports, leading to
>      > immediate errors, but also to very subtle ones.
>      >
>      > Our change makes explicit what the C compilers (except for PPC) do
>      > anyways, and what is required by the algorithm. (Maybe one could now
>      > even omit the volatile keywords - but we dare not try.) Thus, we get
>      > it right without any overhead on your platforms.
>
>     This is getting things confused. The lock-free algorithms should be
>     written for the most lenient of memory-models and make no assumptions
>     about ordering. They should use the appropriate OrderAccess method where
>     needed for algorithmic correctness.
>
>     You proposal wraps the C variable in accessors that use OrderAccess
>     operations on every access - in a similar way to how we have to handle
>     Java volatile accesses. And as I have said that is not the style we use
>     in hotspot's lock-free code. This has nothing to do with our compiler
>     versus your compiler (though if you have one that issues memory-barriers
>     directly then you'll get too much overhead.)
>
>      >> Though we have not run into this particular issue. Did
>      >> it arise in the context of a particular GC?
>      > Issues are only on PPC, at least in parallelScavenge. I reproduced
>      > some of the immediate errors, but I don't think this really helps
>     with
>      > the discussion.  I will try G1 soon.  Tell me if you want to know
>      > more, then I'll try to come up with a wrap-up of our past fixes.
>
>     We have a PPC product and as I said we have not encountered this
>     problem, but we also run with limited GCs so may not have been
>     exercising this code.
>
>     Thanks,
>     David
>
>      > Best regards,
>      >    Goetz.
>      >
>      >
>      > -----Original Message-----
>      > From: David Holmes [mailto:david.holmes at oracle.com
>     <mailto:david.holmes at oracle.com>]
>      > Sent: Dienstag, 18. Dezember 2012 12:19
>      > To: Lindenmaier, Goetz
>      > Cc: hotspot-dev at openjdk.java.net
>     <mailto:hotspot-dev at openjdk.java.net>
>      > Subject: Re: RFR(M): Memory ordering in taskqueue.hpp
>      >
>      > Hi Goetz,
>      >
>      > On 18/12/2012 8:47 PM, Lindenmaier, Goetz wrote:
>      >> Hi David,
>      >>
>      >> why do you think only these two accesses have to be ordered?
>      >
>      > I didn't say that. I said these two need to be ordered and asked
>     which
>      > other paired accesses to these variables need to be ordered.
>      >
>      >> As I understand, any two accesses in one thread to these fields
>     have to be
>      >> orderered and must be written to memory to become visible in the
>     proper order
>      >> to other threads.
>      >
>      > That depends on how they are used. It is the overall lock-free
>     algorithm
>      > that is important here. There may be places where the order of access
>      > does not matter (and there may not be).
>      >
>      >> The PPC compiler really takes all advantages given by the
>      >> C-Standard to optimize this, and the processor reorders heavily.
>      >
>      > Yes I understand. Though we have not run into this particular
>     issue. Did
>      > it arise in the context of a particular GC?
>      >
>      >> The specification of volatile in
>      >>          Age get() const volatile { return _data; }
>      >> says that it must be read from memory, not that it can not be
>     reordered.
>      >
>      > I didn't say anything about C/C++ volatile (which doesn't even say it
>      > must be read from memory - except perhaps in latest MT C++
>     standard?). I
>      > said that what you were doing was effectively treating the C++
>     variables
>      > as if they were Java volatile variables. That's not the way we handle
>      > these issues in the VM we tend to place fences/barriers at the
>     places in
>      > lock-free algorithms that need them, rather than wrap the
>     variables in
>      > accessors that impose fences/barriers that are not always needed.
>      >
>      > Cheers,
>      > David
>      >
>      >> Also, doing
>      >>         OrderAccess::load_acquire((volatile
>     idx_t*)&(_age._fields._top))
>      >> is better wrt. to optimizations by C-compilers, as there first
>      >> is the whole address computation, and then the cast to volatile.
>      >> If we use
>      >>         _age.top()
>      >> with
>      >>         idx_t top() const volatile { return
>     OrderAccess::load_acquire((volatile idx_t*)&(_fields._top); }
>      >> compilers don't optimize that well.  (We saw this on hpia64.)
>      >> Further, the access to, e.g., old_age.top() in pop_local_slow()
>     would do
>      >> the OrderAccess, which is useless overhead.
>      >>
>      >> Best regards,
>      >>     Goetz.
>      >>
>      >>
>      >> -----Original Message-----
>      >> From: David Holmes [mailto:david.holmes at oracle.com
>     <mailto:david.holmes at oracle.com>]
>      >> Sent: Dienstag, 18. Dezember 2012 01:24
>      >> To: Lindenmaier, Goetz
>      >> Cc: hotspot-dev at openjdk.java.net
>     <mailto:hotspot-dev at openjdk.java.net>
>      >> Subject: Re: RFR(M): Memory ordering in taskqueue.hpp
>      >>
>      >> Hi Goetz,
>      >>
>      >> On 17/12/2012 10:58 PM, Lindenmaier, Goetz wrote:
>      >>> Hi,
>      >>>
>      >>> Once more, with webrev:
>      >>> http://cr.openjdk.java.net/~goetz/webrevs/webrev-taskqueue/
>      >>
>      >> I can see that this function needs a storestore barrier:
>      >>
>      >>     237   void set_empty() {
>      >>     238     _bottom = 0;
>      >>     239     _age.set(0);
>      >>     240   }
>      >>
>      >> to preserve ordering. Are there other functions to which we can
>      >> constrain the loadload and storestore barriers rather than using the
>      >> setters/getters the way you have defined? This is always about a
>     pair
>      >> (sometimes groups) of accesses so I'd rather deal with the pairs
>     than
>      >> treat each field as if it were a Java volatile.
>      >>
>      >> Thanks,
>      >> David Holmes
>      >>
>      >>
>      >>> Sorry for that.
>      >>>
>      >>> I would like to contribute fixes wrt. memory ordering in
>     taskqueue.hpp.
>      >>>
>      >>> On Platfoms with weak memory ordering the taskqueue is not accessed
>      >>> properly, as the accesses to the fields _bottom and _age are
>     not ordered
>      >>> correctly. Volatile is not sufficient to enforce this, because
>     it depends on
>      >>> what the compiler assumes to be necessary for volatile variables.
>      >>>
>      >>> The fix is to pull getter/setter routines from Age to
>     TaskQueueSuper and use methods from
>      >>> OrderAccess to access the fields in _age as well as _bottom.
>     Further the code must always
>      >>> use the getter/setter methods to access _bottom, _age or the
>     subfields of _age.
>      >>> On the other hand we can relax constraints for accesses to
>     locals oldAge and newAge.
>      >>>
>      >>> The OrderAccess routines do simple load/stores on x86_64.
>      >>>
>      >>> I want to discuss this change here and would like very much to
>     see it taking
>      >>> it's way into OpenJDK to support ports on platforms with weak
>     memory
>      >>> ordering, and, in particular, our PPC port.
>      >>>
>      >>> Best regards,
>      >>>      Goetz.
>      >>>
>      >>>
>