RFR(M): Memory ordering in taskqueue.hpp

Fri Dec 21 06:55:50 PST 2012

I see, thanks.  While there are other fish to fry with OrderAccess, I think
using that inline asm for storestore should work just fine on the GCC
versions that you use (there's already other inline asm there for other
methods), at least until someone attempts to convert these to compiler
intrinsics.

Thanks

Sent from my phone
On Dec 20, 2012 9:46 PM, "David Holmes" <david.holmes at oracle.com> wrote:

> On 21/12/2012 6:43 AM, Vitaly Davidovich wrote:
>
>> As an aside, how come storestore is not defined as "asm volatile ("" : :
>> : "memory");" on Linux x86/64?
>>
>
> We have an open RFE to improve things
>
> http://bugs.sun.com/view_bug.**do?bug_id=6975121<http://bugs.sun.com/view_bug.do?bug_id=6975121>
>
> 6975121 : OrderAccess should use compiler intrinsics where possible
>
> Our problem has always been getting the right support from the official
> compiler.
>
> David
> -----
>
>
>  Thanks
>>
>> Sent from my phone
>>
>> On Dec 20, 2012 3:24 PM, "Lindenmaier, Goetz" <goetz.lindenmaier at sap.com
>> <mailto:goetz.lindenmaier at sap.**com <goetz.lindenmaier at sap.com>>> wrote:
>>
>>     Hi David,
>>
>>     I understand that you want selected corrections, not the
>>     solution we chose that does it everywhere.
>>
>>     But do you agree that our solution does not introduce any
>>     overhead on x86 and sparc?
>>
>>
>>
>>     I'll try to explain why I think this is the case using the
>>     example you brought up:
>>
>>     void set_empty() {
>>        _bottom = 0;
>>        _age.set(0);
>>     }
>>
>>     will result in something like
>>
>>       x86_64                      our ppc                our ia64
>>       ------                      -------                --------
>>       movl 0, _bottom             swz                    st4.rel
>>       movq 0, _age                sd                     st8.rel
>>
>>     which is correct on x86 and ia64, but wrong on power.
>>
>>     void set_empty() {
>>        _bottom = 0;
>>        OrderAccess::storestore();
>>        _age.set(0);
>>     }
>>
>>     will result in something like
>>
>>       x86_64                      our ppc                our ia64
>>       ------                      -------                --------
>>       movl 0, _bottom             swz                    st4.rel
>>       movl 0, local_dummy         lwsync                 mf
>>       movq 0, _age                sd                     st8.rel
>>
>>     which is correct, but inefficient on x86_64 (a bit) and ia64.
>>
>>     while
>>
>>     void set_empty() {
>>        _bottom =  0;
>>        OrderAccess::release_store_**ptr((volatile intptr_t*) &_age, 0);
>>     }
>>
>>     will result in
>>
>>       movl 0, _bottom             swz                    st4.rel
>>       movq 0, _age                lwsync; sd             st8.rel
>>
>>     which is correct and efficient.
>>
>>
>>
>>     By the way, if you have a PPC port, do you have a machine to build
>>     and run the openJDK PPC port?
>>
>>      > (modulo OrderAccess is far from perfect and has some outstanding
>>     issues).
>>     What are the problems of the OrderAccess implementation?
>>
>>     Best regards,
>>        Goetz.
>>
>>
>>
>>
>>
>>
>>     -----Original Message-----
>>     From: David Holmes [mailto:david.holmes at oracle.**com<david.holmes at oracle.com>
>>     <mailto:david.holmes at oracle.**com <david.holmes at oracle.com>>]
>>     Sent: Thursday, December 20, 2012 2:36 AM
>>     To: Lindenmaier, Goetz
>>     Cc: hotspot-dev at openjdk.java.net <mailto:hotspot-dev at openjdk.**
>> java.net <hotspot-dev at openjdk.java.net>>
>>     Subject: Re: RFR(M): Memory ordering in taskqueue.hpp
>>
>>     On 20/12/2012 2:48 AM, Lindenmaier, Goetz wrote:
>>      > Hi David,
>>      >
>>      > first, thanks for the fast replies. Sorry it took me a day now!
>>
>>     No problem. I can't guarantee response times either :)
>>
>>      >> I didn't say that. I said these two need to be ordered and asked
>>     which
>>      >> other paired accesses to these variables need to be ordered.
>>      > Together with a colleague I looked at the code. To us it seems
>>      > very hard to identify places where access ordering is not needed.
>>      > Especially we think there will be places in the users of the
>>      > task queue that would need barriers to ensure ordering.
>>      > But we think, the algorithm should be self-contained, doing all the
>>      > required ordering itself.
>>
>>     I agree the algorithm should be self-contained. My point is that in
>>     other places where we implement lock-free algorithm was put in the
>>     barriers/fences in the algorithmic code in the places needed we don't
>>     automatically wrap all of the variables accesses with code that
>>     introduces pre/post barriers (of whatever form).
>>
>>      > Further, a barrier will spoil performance on x86 etc, because
>>     there it
>>      > issues a lock instruction, which is not needed.
>>
>>     You issue the appropriate barrier/fence for the situation and they
>>     should be implemented in whatever form needed for the given platform
>>     (modulo OrderAccess is far from perfect and has some outstanding
>>     issues).
>>
>>      >> I said that what you were doing was effectively treating the C++
>>     variables
>>      >> as if they were Java volatile variables. That's not the way we
>>     handle
>>      >> these issues in the VM we tend to place fences/barriers at the
>>     places in
>>      >> lock-free algorithms that need them, rather than wrap the
>>     variables in
>>      >> accessors that impose fences/barriers that are not always needed.
>>      >
>>      > As we understand, the C compilers on x86, sparc etc, enforce far
>> more
>>      > than required to implement C++ volatile, basically treating the
>>     fields
>>      > like Java volatiles.  Therefore your code is fine on your
>> platforms.
>>
>>     I think you are confusing things. The existing C/C++ compilers do very
>>     little in regard to "volatile". It is only the latest multi-threading
>>     variant of the C++ standard that even addresses this! Our C/C++
>>     compilers never issue any kind of memory barriers in response to a
>>     load/store of a volatile variable.
>>
>>      > The C++ compilers on PPC optimize far more, maxing out
>>      > what the standard allows and the processer supports, leading to
>>      > immediate errors, but also to very subtle ones.
>>      >
>>      > Our change makes explicit what the C compilers (except for PPC) do
>>      > anyways, and what is required by the algorithm. (Maybe one could
>> now
>>      > even omit the volatile keywords - but we dare not try.) Thus, we
>> get
>>      > it right without any overhead on your platforms.
>>
>>     This is getting things confused. The lock-free algorithms should be
>>     written for the most lenient of memory-models and make no assumptions
>>     about ordering. They should use the appropriate OrderAccess method
>> where
>>     needed for algorithmic correctness.
>>
>>     You proposal wraps the C variable in accessors that use OrderAccess
>>     operations on every access - in a similar way to how we have to handle
>>     Java volatile accesses. And as I have said that is not the style we
>> use
>>     in hotspot's lock-free code. This has nothing to do with our compiler
>>     versus your compiler (though if you have one that issues
>> memory-barriers
>>     directly then you'll get too much overhead.)
>>
>>      >> Though we have not run into this particular issue. Did
>>      >> it arise in the context of a particular GC?
>>      > Issues are only on PPC, at least in parallelScavenge. I reproduced
>>      > some of the immediate errors, but I don't think this really helps
>>     with
>>      > the discussion.  I will try G1 soon.  Tell me if you want to know
>>      > more, then I'll try to come up with a wrap-up of our past fixes.
>>
>>     We have a PPC product and as I said we have not encountered this
>>     problem, but we also run with limited GCs so may not have been
>>     exercising this code.
>>
>>     Thanks,
>>     David
>>
>>      > Best regards,
>>      >    Goetz.
>>      >
>>      >
>>      > -----Original Message-----
>>      > From: David Holmes [mailto:david.holmes at oracle.**com<david.holmes at oracle.com>
>>     <mailto:david.holmes at oracle.**com <david.holmes at oracle.com>>]
>>      > Sent: Dienstag, 18. Dezember 2012 12:19
>>      > To: Lindenmaier, Goetz
>>      > Cc: hotspot-dev at openjdk.java.net
>>     <mailto:hotspot-dev at openjdk.**java.net <hotspot-dev at openjdk.java.net>
>> >
>>      > Subject: Re: RFR(M): Memory ordering in taskqueue.hpp
>>      >
>>      > Hi Goetz,
>>      >
>>      > On 18/12/2012 8:47 PM, Lindenmaier, Goetz wrote:
>>      >> Hi David,
>>      >>
>>      >> why do you think only these two accesses have to be ordered?
>>      >
>>      > I didn't say that. I said these two need to be ordered and asked
>>     which
>>      > other paired accesses to these variables need to be ordered.
>>      >
>>      >> As I understand, any two accesses in one thread to these fields
>>     have to be
>>      >> orderered and must be written to memory to become visible in the
>>     proper order
>>      >> to other threads.
>>      >
>>      > That depends on how they are used. It is the overall lock-free
>>     algorithm
>>      > that is important here. There may be places where the order of
>> access
>>      > does not matter (and there may not be).
>>      >
>>      >> The PPC compiler really takes all advantages given by the
>>      >> C-Standard to optimize this, and the processor reorders heavily.
>>      >
>>      > Yes I understand. Though we have not run into this particular
>>     issue. Did
>>      > it arise in the context of a particular GC?
>>      >
>>      >> The specification of volatile in
>>      >>          Age get() const volatile { return _data; }
>>      >> says that it must be read from memory, not that it can not be
>>     reordered.
>>      >
>>      > I didn't say anything about C/C++ volatile (which doesn't even say
>> it
>>      > must be read from memory - except perhaps in latest MT C++
>>     standard?). I
>>      > said that what you were doing was effectively treating the C++
>>     variables
>>      > as if they were Java volatile variables. That's not the way we
>> handle
>>      > these issues in the VM we tend to place fences/barriers at the
>>     places in
>>      > lock-free algorithms that need them, rather than wrap the
>>     variables in
>>      > accessors that impose fences/barriers that are not always needed.
>>      >
>>      > Cheers,
>>      > David
>>      >
>>      >> Also, doing
>>      >>         OrderAccess::load_acquire((**volatile
>>     idx_t*)&(_age._fields._top))
>>      >> is better wrt. to optimizations by C-compilers, as there first
>>      >> is the whole address computation, and then the cast to volatile.
>>      >> If we use
>>      >>         _age.top()
>>      >> with
>>      >>         idx_t top() const volatile { return
>>     OrderAccess::load_acquire((**volatile idx_t*)&(_fields._top); }
>>      >> compilers don't optimize that well.  (We saw this on hpia64.)
>>      >> Further, the access to, e.g., old_age.top() in pop_local_slow()
>>     would do
>>      >> the OrderAccess, which is useless overhead.
>>      >>
>>      >> Best regards,
>>      >>     Goetz.
>>      >>
>>      >>
>>      >> -----Original Message-----
>>      >> From: David Holmes [mailto:david.holmes at oracle.**com<david.holmes at oracle.com>
>>     <mailto:david.holmes at oracle.**com <david.holmes at oracle.com>>]
>>      >> Sent: Dienstag, 18. Dezember 2012 01:24
>>      >> To: Lindenmaier, Goetz
>>      >> Cc: hotspot-dev at openjdk.java.net
>>     <mailto:hotspot-dev at openjdk.**java.net <hotspot-dev at openjdk.java.net>
>> >
>>      >> Subject: Re: RFR(M): Memory ordering in taskqueue.hpp
>>      >>
>>      >> Hi Goetz,
>>      >>
>>      >> On 17/12/2012 10:58 PM, Lindenmaier, Goetz wrote:
>>      >>> Hi,
>>      >>>
>>      >>> Once more, with webrev:
>>      >>> http://cr.openjdk.java.net/~**goetz/webrevs/webrev-**taskqueue/<http://cr.openjdk.java.net/~goetz/webrevs/webrev-taskqueue/>
>>      >>
>>      >> I can see that this function needs a storestore barrier:
>>      >>
>>      >>     237   void set_empty() {
>>      >>     238     _bottom = 0;
>>      >>     239     _age.set(0);
>>      >>     240   }
>>      >>
>>      >> to preserve ordering. Are there other functions to which we can
>>      >> constrain the loadload and storestore barriers rather than using
>> the
>>      >> setters/getters the way you have defined? This is always about a
>>     pair
>>      >> (sometimes groups) of accesses so I'd rather deal with the pairs
>>     than
>>      >> treat each field as if it were a Java volatile.
>>      >>
>>      >> Thanks,
>>      >> David Holmes
>>      >>
>>      >>
>>      >>> Sorry for that.
>>      >>>
>>      >>> I would like to contribute fixes wrt. memory ordering in
>>     taskqueue.hpp.
>>      >>>
>>      >>> On Platfoms with weak memory ordering the taskqueue is not
>> accessed
>>      >>> properly, as the accesses to the fields _bottom and _age are
>>     not ordered
>>      >>> correctly. Volatile is not sufficient to enforce this, because
>>     it depends on
>>      >>> what the compiler assumes to be necessary for volatile variables.
>>      >>>
>>      >>> The fix is to pull getter/setter routines from Age to
>>     TaskQueueSuper and use methods from
>>      >>> OrderAccess to access the fields in _age as well as _bottom.
>>     Further the code must always
>>      >>> use the getter/setter methods to access _bottom, _age or the
>>     subfields of _age.
>>      >>> On the other hand we can relax constraints for accesses to
>>     locals oldAge and newAge.
>>      >>>
>>      >>> The OrderAccess routines do simple load/stores on x86_64.
>>      >>>
>>      >>> I want to discuss this change here and would like very much to
>>     see it taking
>>      >>> it's way into OpenJDK to support ports on platforms with weak
>>     memory
>>      >>> ordering, and, in particular, our PPC port.
>>      >>>
>>      >>> Best regards,
>>      >>>      Goetz.
>>      >>>
>>      >>>
>>
>>