RFR(M): Memory ordering in taskqueue.hpp

Volker Simonis volker.simonis at gmail.com
Thu Dec 20 10:15:25 PST 2012


Hi David,

we've already had several of these discussions and you always mention
that either "you don't have these weak memory model problems although
you have a PPC port" or "you've already fixed all the weak memory
model related problems because you have a PPC port". Now as Goetz
mentioned,  we don't know very much about your PPC port except that it
is 32-bit only and that it is used in embedded scenarios. Is this
true? We don't have any expertise in the embedded world and we don't
know what concrete implementations of the Power architecture are
commonly used in that area but we would be happy to hear more about
it.

Our PPC port is 64-bit only and we support Power 5,6 and 7 running
Linux and AIX in server scenarios. Our VM usually runs on big servers
with at least 4 but often more (up to 32) cores and big heaps (>1GB up
to 32GB). Moreover we also support IA64 which is another weak memory
architecture running Linux, Windows and HPUX. Since our VM runs in
productive environments since years we've detected quite some memory
ordering related problems and we've always tried to fix them as
conservatively as possible with regard to shared changes. Because our
proprietary VM also supports the platforms supported by Oracle (i.e.
AMD64 and SPARC) and because many of our customers are productively
using them, we especially paid close attention that our changes for
weak memory architectures won't affect the performance on the existing
Oracle platforms at all and as far as our measurements show this isn't
the case.

You mentioned that our proposed changes don't conform to the style you
currently use in hotspot's lock-free code. That may be true, but that
doesn't necessarily means that that current lock-free code in hotspot
is always correct. The simple example posted by Goetz in
taskqueue-failures.txt shows quite the opposite.

We really think that our changes are mature (our product uses them
since years), proven (on many big customer installations), easy to
understand and non-intrusive in the sense that they don't affect your
supported platforms in a measurable way. And they are absolutely
necessary if you want to seriously port the HotSpot to a weak-memory
architecture like Power or Itanium.

Therefor we would be deeply grateful if you could consider these
changes for inclusion into OpenJDK. Should you encounter any problems
on your platforms with these changes we would me more that happy to
assist you.


Thanks a lot for your patience and support,
Volker

On Thu, Dec 20, 2012 at 2:36 AM, David Holmes <david.holmes at oracle.com> wrote:
> On 20/12/2012 2:48 AM, Lindenmaier, Goetz wrote:
>>
>> Hi David,
>>
>> first, thanks for the fast replies. Sorry it took me a day now!
>
>
> No problem. I can't guarantee response times either :)
>
>
>>> I didn't say that. I said these two need to be ordered and asked which
>>> other paired accesses to these variables need to be ordered.
>>
>> Together with a colleague I looked at the code. To us it seems
>> very hard to identify places where access ordering is not needed.
>> Especially we think there will be places in the users of the
>> task queue that would need barriers to ensure ordering.
>> But we think, the algorithm should be self-contained, doing all the
>> required ordering itself.
>
>
> I agree the algorithm should be self-contained. My point is that in other
> places where we implement lock-free algorithm was put in the barriers/fences
> in the algorithmic code in the places needed we don't automatically wrap all
> of the variables accesses with code that introduces pre/post barriers (of
> whatever form).
>
>
>> Further, a barrier will spoil performance on x86 etc, because there it
>> issues a lock instruction, which is not needed.
>
>
> You issue the appropriate barrier/fence for the situation and they should be
> implemented in whatever form needed for the given platform (modulo
> OrderAccess is far from perfect and has some outstanding issues).
>
>
>>> I said that what you were doing was effectively treating the C++
>>> variables
>>> as if they were Java volatile variables. That's not the way we handle
>>> these issues in the VM we tend to place fences/barriers at the places in
>>> lock-free algorithms that need them, rather than wrap the variables in
>>> accessors that impose fences/barriers that are not always needed.
>>
>>
>> As we understand, the C compilers on x86, sparc etc, enforce far more
>> than required to implement C++ volatile, basically treating the fields
>> like Java volatiles.  Therefore your code is fine on your platforms.
>
>
> I think you are confusing things. The existing C/C++ compilers do very
> little in regard to "volatile". It is only the latest multi-threading
> variant of the C++ standard that even addresses this! Our C/C++ compilers
> never issue any kind of memory barriers in response to a load/store of a
> volatile variable.
>
>
>> The C++ compilers on PPC optimize far more, maxing out
>> what the standard allows and the processer supports, leading to
>> immediate errors, but also to very subtle ones.
>>
>> Our change makes explicit what the C compilers (except for PPC) do
>> anyways, and what is required by the algorithm. (Maybe one could now
>> even omit the volatile keywords - but we dare not try.) Thus, we get
>> it right without any overhead on your platforms.
>
>
> This is getting things confused. The lock-free algorithms should be written
> for the most lenient of memory-models and make no assumptions about
> ordering. They should use the appropriate OrderAccess method where needed
> for algorithmic correctness.
>
> You proposal wraps the C variable in accessors that use OrderAccess
> operations on every access - in a similar way to how we have to handle Java
> volatile accesses. And as I have said that is not the style we use in
> hotspot's lock-free code. This has nothing to do with our compiler versus
> your compiler (though if you have one that issues memory-barriers directly
> then you'll get too much overhead.)
>
>
>>> Though we have not run into this particular issue. Did
>>> it arise in the context of a particular GC?
>>
>> Issues are only on PPC, at least in parallelScavenge. I reproduced
>> some of the immediate errors, but I don't think this really helps with
>> the discussion.  I will try G1 soon.  Tell me if you want to know
>> more, then I'll try to come up with a wrap-up of our past fixes.
>
>
> We have a PPC product and as I said we have not encountered this problem,
> but we also run with limited GCs so may not have been exercising this code.
>
> Thanks,
> David
>
>
>> Best regards,
>>    Goetz.
>>
>>
>> -----Original Message-----
>> From: David Holmes [mailto:david.holmes at oracle.com]
>> Sent: Dienstag, 18. Dezember 2012 12:19
>> To: Lindenmaier, Goetz
>> Cc: hotspot-dev at openjdk.java.net
>> Subject: Re: RFR(M): Memory ordering in taskqueue.hpp
>>
>> Hi Goetz,
>>
>> On 18/12/2012 8:47 PM, Lindenmaier, Goetz wrote:
>>>
>>> Hi David,
>>>
>>> why do you think only these two accesses have to be ordered?
>>
>>
>> I didn't say that. I said these two need to be ordered and asked which
>> other paired accesses to these variables need to be ordered.
>>
>>> As I understand, any two accesses in one thread to these fields have to
>>> be
>>> orderered and must be written to memory to become visible in the proper
>>> order
>>> to other threads.
>>
>>
>> That depends on how they are used. It is the overall lock-free algorithm
>> that is important here. There may be places where the order of access
>> does not matter (and there may not be).
>>
>>> The PPC compiler really takes all advantages given by the
>>> C-Standard to optimize this, and the processor reorders heavily.
>>
>>
>> Yes I understand. Though we have not run into this particular issue. Did
>> it arise in the context of a particular GC?
>>
>>> The specification of volatile in
>>>          Age get() const volatile { return _data; }
>>> says that it must be read from memory, not that it can not be reordered.
>>
>>
>> I didn't say anything about C/C++ volatile (which doesn't even say it
>> must be read from memory - except perhaps in latest MT C++ standard?). I
>> said that what you were doing was effectively treating the C++ variables
>> as if they were Java volatile variables. That's not the way we handle
>> these issues in the VM we tend to place fences/barriers at the places in
>> lock-free algorithms that need them, rather than wrap the variables in
>> accessors that impose fences/barriers that are not always needed.
>>
>> Cheers,
>> David
>>
>>> Also, doing
>>>         OrderAccess::load_acquire((volatile idx_t*)&(_age._fields._top))
>>> is better wrt. to optimizations by C-compilers, as there first
>>> is the whole address computation, and then the cast to volatile.
>>> If we use
>>>         _age.top()
>>> with
>>>         idx_t top() const volatile { return
>>> OrderAccess::load_acquire((volatile idx_t*)&(_fields._top); }
>>> compilers don't optimize that well.  (We saw this on hpia64.)
>>> Further, the access to, e.g., old_age.top() in pop_local_slow() would do
>>> the OrderAccess, which is useless overhead.
>>>
>>> Best regards,
>>>     Goetz.
>>>
>>>
>>> -----Original Message-----
>>> From: David Holmes [mailto:david.holmes at oracle.com]
>>> Sent: Dienstag, 18. Dezember 2012 01:24
>>> To: Lindenmaier, Goetz
>>> Cc: hotspot-dev at openjdk.java.net
>>> Subject: Re: RFR(M): Memory ordering in taskqueue.hpp
>>>
>>> Hi Goetz,
>>>
>>> On 17/12/2012 10:58 PM, Lindenmaier, Goetz wrote:
>>>>
>>>> Hi,
>>>>
>>>> Once more, with webrev:
>>>> http://cr.openjdk.java.net/~goetz/webrevs/webrev-taskqueue/
>>>
>>>
>>> I can see that this function needs a storestore barrier:
>>>
>>>     237   void set_empty() {
>>>     238     _bottom = 0;
>>>     239     _age.set(0);
>>>     240   }
>>>
>>> to preserve ordering. Are there other functions to which we can
>>> constrain the loadload and storestore barriers rather than using the
>>> setters/getters the way you have defined? This is always about a pair
>>> (sometimes groups) of accesses so I'd rather deal with the pairs than
>>> treat each field as if it were a Java volatile.
>>>
>>> Thanks,
>>> David Holmes
>>>
>>>
>>>> Sorry for that.
>>>>
>>>> I would like to contribute fixes wrt. memory ordering in taskqueue.hpp.
>>>>
>>>> On Platfoms with weak memory ordering the taskqueue is not accessed
>>>> properly, as the accesses to the fields _bottom and _age are not ordered
>>>> correctly. Volatile is not sufficient to enforce this, because it
>>>> depends on
>>>> what the compiler assumes to be necessary for volatile variables.
>>>>
>>>> The fix is to pull getter/setter routines from Age to TaskQueueSuper and
>>>> use methods from
>>>> OrderAccess to access the fields in _age as well as _bottom.   Further
>>>> the code must always
>>>> use the getter/setter methods to access _bottom, _age or the subfields
>>>> of _age.
>>>> On the other hand we can relax constraints for accesses to locals oldAge
>>>> and newAge.
>>>>
>>>> The OrderAccess routines do simple load/stores on x86_64.
>>>>
>>>> I want to discuss this change here and would like very much to see it
>>>> taking
>>>> it's way into OpenJDK to support ports on platforms with weak memory
>>>> ordering, and, in particular, our PPC port.
>>>>
>>>> Best regards,
>>>>      Goetz.
>>>>
>>>>
>


More information about the hotspot-dev mailing list