RFR: 8006971: More missing barriers in taskqueues for RMO architectures
David Chase
david.r.chase at oracle.com
Mon Aug 5 22:11:35 PDT 2013
It doesn't quite look like Chase-Lev, because I don't see the queue being reallocated.
It looks like ABP (Arora Blumofe Plaxton) which is the ancestor of all of these, but with
an additional little tag frob plus an overflow check. (Self-memory-managed Chase-Lev is
scary, also relies on 64 bits being a large number.)
My inclination, if we are interested in getting this fixed this week, is to go with the tested
solution that works on PPC, especially since the authors read the paper that purports to
add the necessary barriers, especially since the barriers in that paper look like the ones that
we added when we first wrote the algorithm, before it was "ported" to SPAA's mythical SC
architecture.
Every barrier in that paper -- our draft, or the subsequent paper with proofs and one more barrier --
corresponds to a these-two-things-must-be-ordered point; the only thing that lets you back
off from that at all is if the actual memory order is not complete unordered jelly.
But I just now looked at what Wikipedia claims we are up against on Armv7 and Power,
and I think we need most of them, if not all of them.
I'll keep looking at it.
David
On 2013-08-06, at 12:04 AM, David Holmes <david.holmes at oracle.com> wrote:
> Hi David,
>
> On 6/08/2013 12:14 PM, David Chase wrote:
>> I'm not a reviewer, but if that's ABP queues (a comment pointing to the relevant literature would sure be nice, no matter what it is; this stuff is toxically hard) I am in theory qualified to eyeball it, and I can perhaps nab some people from SSRG upstairs. (It's late and I am still jetlagged, so I am not looking hard tonight, but it sure looks like ABP).
>>
>> If it's not based on ABP queues, say so, and I will look more cautiously and with less authority.
>
> I believe, based on other correspondence, it is based on Chase-Lev work-stealing queue - but I'm not familiar with the ABP queues. I don't know who actually wrote this code but I suspect they are no longer working on the JDK.
>
> There is quite a bit of history with this one. The TaskQueue has been around for a while and works beautifully (as far as we know :) ) on TSO systems. The SAP folk doing the PPC64 port encountered some issues and proposed some fixes. Long story short we settled on this particular patch (a full fence() on the non-TSO systems) for 7u40 as it appeared to fix the known problems (even if potentially a heavier barrier than needed - hence the conditional compilation to not affect TSO systems).
>
> I considered this the "wrong fix" as to me the need for the fence() indicated a missing barrier elsewhere - plus the algorithm should be written without ifdefs - and I hoped we would resolve that before fixing it in JDK 8. But we haven't been able to resolve that and we need this fix in place ASAP - hence the fallback to using the same form as was accepted for 7u40.
>
> So basically this is a forward port of an existing fix and we just need to rubber-stamp it. But one of our original stampers is no longer with us so ...
>
> Thanks,
> David H.
>
>> David
>>
>> On 2013-08-05, at 9:01 PM, David Holmes <david.holmes at oracle.com> wrote:
>>
>>> Hi Vlad
>>>
>>> On 6/08/2013 12:43 AM, Vladimir Danushevsky wrote:
>>>> On Aug 3, 2013, at 6:53 AM, David Holmes wrote:
>>>>> On 2/08/2013 11:57 PM, Vladimir Danushevsky wrote:
>>>>>> The issue of missing memory barriers in the GC taskqueue code was first flagged here:
>>>>>> http://mail.openjdk.java.net/pipermail/hotspot-dev/2013-March/008848.html
>>>>>>
>>>>>> JDK7u40 fix for the same issue is located here:
>>>>>> http://cr.openjdk.java.net/~dholmes/8012144/webrev.hs24/
>>>>>>
>>>>>>
>>>>>> Initially we planned to port same solution to JDK8 however after reviewing the algorithm more we've started questioning a need for a full fence in between 'age' and 'bottom' elements. Since the intent is to keep 'bottom' memory reference from being executed before 'age' would a LoadLoad barrier (which in many cases is a cheaper solution) be sufficient? If so, the webrev below could possible be an adequate solution.
>>>>>>
>>>>>> http://cr.openjdk.java.net/~vladidan/8006971/webrev.00/
>>>>>>
>>>>>> We have tested both cases (fence and LL) on a hexa-core Power5 box running several test suites that currently expose the problem. The results are positive.
>>>>>
>>>>> The loadload() should not be in any ifdef. The loadload() is part of the algorithmic correctness. The loadload() will become a no-op on any platform that does not need to do anything special to preserve the ordering.
>>>>
>>>> As I understand LL is not an issue on platforms that are excluded from emitting the barrier in the provided patch.
>>>>
>>>> However I went to read further discussion at
>>>> http://mail.openjdk.java.net/pipermail/hotspot-dev/2013-March/008848.html
>>>> and seems the concern is the Store is not guaranteed to propagate to all observers if read before the Writer's side 'sync'. I speculate that might not be an issue on PowerPC implementations with L1 cache snooping though, but even if this a case there is no robust way to detect that in runtime.
>>>
>>> That sounds like the fence() is missing on the writer side.
>>>
>>>> But that is likely not an issue on ARM (not sure about IA64, as it was listed in the very first webrev from Goetz) therefore we might inject OrderAccess::fence() for PPC (both 32- and 64-bit) and OrderAccess::loadload() for ARM (again, need info on IA64).
>>>> That being said , for simplicity we can go with fence() for ARM case too since current ARMv7 implementations do not imply a separate barrier instruction for Loads.
>>>>
>>>> So in other words use same patch as in JDK7u40:
>>>> http://cr.openjdk.java.net/~vladidan/8006971/webrev.01
>>>
>>> I think this is masking the real problem with the algorithm but unless we get some true expertise involved here I don't see how we will resolve this. And time is now critical.
>>>
>>> So I will accept the 7u version of the fix under the same rationale as for 7u. We do need this to be fixed. Reviewed.
>>>
>>> I hope this is not the end of it though as this data structure needs to be revisited in my opinion.
>>>
>>> We need a second reviewer.
>>>
>>> Thanks,
>>> David
>>>
>>>> Thanks,
>>>> Vlad
>>>>
>>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>>> As a side note -
>>>>>> perhaps it is possible to eliminate age/bottom potential reordering by loading both simultaneously through an Atomic class method. That would require though some structural changes to the layout of TaskQueueSuper class to align both fields together and ensure proper integer alignment (depending on 32/64-bit port), therefore this solution is less practical for the short term.
>>>>>>
>>>>>> Thanks,
>>>>>> Vlad
>>>>>>
>>>>
>>
More information about the hotspot-dev
mailing list