RFR(M): Memory ordering in taskqueue.hpp
Lindenmaier, Goetz
goetz.lindenmaier at sap.com
Thu Dec 20 12:23:50 PST 2012
Hi David,
I understand that you want selected corrections, not the
solution we chose that does it everywhere.
But do you agree that our solution does not introduce any
overhead on x86 and sparc?
I'll try to explain why I think this is the case using the
example you brought up:
void set_empty() {
_bottom = 0;
_age.set(0);
}
will result in something like
x86_64 our ppc our ia64
------ ------- --------
movl 0, _bottom swz st4.rel
movq 0, _age sd st8.rel
which is correct on x86 and ia64, but wrong on power.
void set_empty() {
_bottom = 0;
OrderAccess::storestore();
_age.set(0);
}
will result in something like
x86_64 our ppc our ia64
------ ------- --------
movl 0, _bottom swz st4.rel
movl 0, local_dummy lwsync mf
movq 0, _age sd st8.rel
which is correct, but inefficient on x86_64 (a bit) and ia64.
while
void set_empty() {
_bottom = 0;
OrderAccess::release_store_ptr((volatile intptr_t*) &_age, 0);
}
will result in
movl 0, _bottom swz st4.rel
movq 0, _age lwsync; sd st8.rel
which is correct and efficient.
By the way, if you have a PPC port, do you have a machine to build
and run the openJDK PPC port?
> (modulo OrderAccess is far from perfect and has some outstanding issues).
What are the problems of the OrderAccess implementation?
Best regards,
Goetz.
-----Original Message-----
From: David Holmes [mailto:david.holmes at oracle.com]
Sent: Thursday, December 20, 2012 2:36 AM
To: Lindenmaier, Goetz
Cc: hotspot-dev at openjdk.java.net
Subject: Re: RFR(M): Memory ordering in taskqueue.hpp
On 20/12/2012 2:48 AM, Lindenmaier, Goetz wrote:
> Hi David,
>
> first, thanks for the fast replies. Sorry it took me a day now!
No problem. I can't guarantee response times either :)
>> I didn't say that. I said these two need to be ordered and asked which
>> other paired accesses to these variables need to be ordered.
> Together with a colleague I looked at the code. To us it seems
> very hard to identify places where access ordering is not needed.
> Especially we think there will be places in the users of the
> task queue that would need barriers to ensure ordering.
> But we think, the algorithm should be self-contained, doing all the
> required ordering itself.
I agree the algorithm should be self-contained. My point is that in
other places where we implement lock-free algorithm was put in the
barriers/fences in the algorithmic code in the places needed we don't
automatically wrap all of the variables accesses with code that
introduces pre/post barriers (of whatever form).
> Further, a barrier will spoil performance on x86 etc, because there it
> issues a lock instruction, which is not needed.
You issue the appropriate barrier/fence for the situation and they
should be implemented in whatever form needed for the given platform
(modulo OrderAccess is far from perfect and has some outstanding issues).
>> I said that what you were doing was effectively treating the C++ variables
>> as if they were Java volatile variables. That's not the way we handle
>> these issues in the VM we tend to place fences/barriers at the places in
>> lock-free algorithms that need them, rather than wrap the variables in
>> accessors that impose fences/barriers that are not always needed.
>
> As we understand, the C compilers on x86, sparc etc, enforce far more
> than required to implement C++ volatile, basically treating the fields
> like Java volatiles. Therefore your code is fine on your platforms.
I think you are confusing things. The existing C/C++ compilers do very
little in regard to "volatile". It is only the latest multi-threading
variant of the C++ standard that even addresses this! Our C/C++
compilers never issue any kind of memory barriers in response to a
load/store of a volatile variable.
> The C++ compilers on PPC optimize far more, maxing out
> what the standard allows and the processer supports, leading to
> immediate errors, but also to very subtle ones.
>
> Our change makes explicit what the C compilers (except for PPC) do
> anyways, and what is required by the algorithm. (Maybe one could now
> even omit the volatile keywords - but we dare not try.) Thus, we get
> it right without any overhead on your platforms.
This is getting things confused. The lock-free algorithms should be
written for the most lenient of memory-models and make no assumptions
about ordering. They should use the appropriate OrderAccess method where
needed for algorithmic correctness.
You proposal wraps the C variable in accessors that use OrderAccess
operations on every access - in a similar way to how we have to handle
Java volatile accesses. And as I have said that is not the style we use
in hotspot's lock-free code. This has nothing to do with our compiler
versus your compiler (though if you have one that issues memory-barriers
directly then you'll get too much overhead.)
>> Though we have not run into this particular issue. Did
>> it arise in the context of a particular GC?
> Issues are only on PPC, at least in parallelScavenge. I reproduced
> some of the immediate errors, but I don't think this really helps with
> the discussion. I will try G1 soon. Tell me if you want to know
> more, then I'll try to come up with a wrap-up of our past fixes.
We have a PPC product and as I said we have not encountered this
problem, but we also run with limited GCs so may not have been
exercising this code.
Thanks,
David
> Best regards,
> Goetz.
>
>
> -----Original Message-----
> From: David Holmes [mailto:david.holmes at oracle.com]
> Sent: Dienstag, 18. Dezember 2012 12:19
> To: Lindenmaier, Goetz
> Cc: hotspot-dev at openjdk.java.net
> Subject: Re: RFR(M): Memory ordering in taskqueue.hpp
>
> Hi Goetz,
>
> On 18/12/2012 8:47 PM, Lindenmaier, Goetz wrote:
>> Hi David,
>>
>> why do you think only these two accesses have to be ordered?
>
> I didn't say that. I said these two need to be ordered and asked which
> other paired accesses to these variables need to be ordered.
>
>> As I understand, any two accesses in one thread to these fields have to be
>> orderered and must be written to memory to become visible in the proper order
>> to other threads.
>
> That depends on how they are used. It is the overall lock-free algorithm
> that is important here. There may be places where the order of access
> does not matter (and there may not be).
>
>> The PPC compiler really takes all advantages given by the
>> C-Standard to optimize this, and the processor reorders heavily.
>
> Yes I understand. Though we have not run into this particular issue. Did
> it arise in the context of a particular GC?
>
>> The specification of volatile in
>> Age get() const volatile { return _data; }
>> says that it must be read from memory, not that it can not be reordered.
>
> I didn't say anything about C/C++ volatile (which doesn't even say it
> must be read from memory - except perhaps in latest MT C++ standard?). I
> said that what you were doing was effectively treating the C++ variables
> as if they were Java volatile variables. That's not the way we handle
> these issues in the VM we tend to place fences/barriers at the places in
> lock-free algorithms that need them, rather than wrap the variables in
> accessors that impose fences/barriers that are not always needed.
>
> Cheers,
> David
>
>> Also, doing
>> OrderAccess::load_acquire((volatile idx_t*)&(_age._fields._top))
>> is better wrt. to optimizations by C-compilers, as there first
>> is the whole address computation, and then the cast to volatile.
>> If we use
>> _age.top()
>> with
>> idx_t top() const volatile { return OrderAccess::load_acquire((volatile idx_t*)&(_fields._top); }
>> compilers don't optimize that well. (We saw this on hpia64.)
>> Further, the access to, e.g., old_age.top() in pop_local_slow() would do
>> the OrderAccess, which is useless overhead.
>>
>> Best regards,
>> Goetz.
>>
>>
>> -----Original Message-----
>> From: David Holmes [mailto:david.holmes at oracle.com]
>> Sent: Dienstag, 18. Dezember 2012 01:24
>> To: Lindenmaier, Goetz
>> Cc: hotspot-dev at openjdk.java.net
>> Subject: Re: RFR(M): Memory ordering in taskqueue.hpp
>>
>> Hi Goetz,
>>
>> On 17/12/2012 10:58 PM, Lindenmaier, Goetz wrote:
>>> Hi,
>>>
>>> Once more, with webrev:
>>> http://cr.openjdk.java.net/~goetz/webrevs/webrev-taskqueue/
>>
>> I can see that this function needs a storestore barrier:
>>
>> 237 void set_empty() {
>> 238 _bottom = 0;
>> 239 _age.set(0);
>> 240 }
>>
>> to preserve ordering. Are there other functions to which we can
>> constrain the loadload and storestore barriers rather than using the
>> setters/getters the way you have defined? This is always about a pair
>> (sometimes groups) of accesses so I'd rather deal with the pairs than
>> treat each field as if it were a Java volatile.
>>
>> Thanks,
>> David Holmes
>>
>>
>>> Sorry for that.
>>>
>>> I would like to contribute fixes wrt. memory ordering in taskqueue.hpp.
>>>
>>> On Platfoms with weak memory ordering the taskqueue is not accessed
>>> properly, as the accesses to the fields _bottom and _age are not ordered
>>> correctly. Volatile is not sufficient to enforce this, because it depends on
>>> what the compiler assumes to be necessary for volatile variables.
>>>
>>> The fix is to pull getter/setter routines from Age to TaskQueueSuper and use methods from
>>> OrderAccess to access the fields in _age as well as _bottom. Further the code must always
>>> use the getter/setter methods to access _bottom, _age or the subfields of _age.
>>> On the other hand we can relax constraints for accesses to locals oldAge and newAge.
>>>
>>> The OrderAccess routines do simple load/stores on x86_64.
>>>
>>> I want to discuss this change here and would like very much to see it taking
>>> it's way into OpenJDK to support ports on platforms with weak memory
>>> ordering, and, in particular, our PPC port.
>>>
>>> Best regards,
>>> Goetz.
>>>
>>>
More information about the hotspot-dev
mailing list