RFR: 8186838: Generalize Atomic::inc/dec with templates
Erik Österlund
erik.osterlund at oracle.com
Mon Sep 4 08:14:48 UTC 2017
Hi Andrew,
On 2017-09-02 10:31, Andrew Haley wrote:
> On 01/09/17 15:15, Erik Österlund wrote:
>> It is not the simplest solution I can think of. The simplest solution I
>> can think of is to remove all specialized versions of Atomic::inc/dec
>> and just have it call Atomic::add directly. That would remove the
>> optimizations we have today, for whatever reason we have them. It would
>> lead to slightly more conservative fencing on PPC/S390,
> I see. Can you say what instructions would be different?
Sure.
Specializations exist on x86, PPC and S390. Removing these
specializations would have the following consequences:
-------------------------------------------------------------------
On x86 Atomic::inc of 4 byte sized types:
lock addl $immediateAddend,(rDest)
becomes
lock xaddl rAddend,(rDest) # stores the value that was there back in
rAddend upon completion
So the inc optimization currently makes sure the addend can be encoded
as an immediate value in the code stream, and exploits that we do not
need to see the returned value. Therefore a lock addl is good enough for
those purposes and does not require the use of an extra register. But it
is not obvious that the slimmer encoding makes any significant
difference on a modern machine. In the contended case it
arguably will not matter.
Similar arguments apply for 8 byte sized types and the Atomic::dec variants.
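To make the "just call Atomic::add" option concrete, here is a minimal
sketch using std::atomic rather than the HotSpot Atomic class. The names
add_and_fetch, inc and dec are hypothetical stand-ins, and the default
std::atomic ordering is not claimed to match HotSpot's fencing. The point
is only that inc/dec discard the result that add has to produce, which is
why a compiler or a hand-written specialization can get away with a plain
lock add instead of lock xadd:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for an add that returns the new value
// (add_and_fetch semantics). Not the HotSpot implementation.
template <typename T>
T add_and_fetch(std::atomic<T>& dest, T addend) {
  // fetch_add returns the old value, so add the addend once more.
  return dest.fetch_add(addend) + addend;
}

// inc expressed through the generic add; the result is thrown away.
template <typename T>
void inc(std::atomic<T>& dest) {
  (void)add_and_fetch(dest, T(1));
}

// dec likewise; the fetched value is not needed by the caller.
template <typename T>
void dec(std::atomic<T>& dest) {
  (void)dest.fetch_sub(T(1));
}
```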
-------------------------------------------------------------------
On PPC Atomic::inc/dec and Atomic::add have the following differences:
Atomic::inc/dec uses addic between the LL and SC instructions with an
immediate value for adding, whereas Atomic::add uses the add instruction
with an extra register.
Atomic::add has a leading lwsync fence and Atomic::inc/dec has no
leading fence.
Atomic::add has a trailing isync fence and Atomic::inc/dec has no
trailing fence.
So the current implementation of Atomic::add uses heavier fencing than
Atomic::inc/dec. I can imagine that does matter for performance today.
However, the documented semantics of Atomic::inc/dec require a leading
sync fence - so they are both arguably too weak and should have stronger
fencing than they do today. And I would argue that if both conformed to
the fencing required by our public API, then the difference would
probably be small.
If dodging those fences on PPC is crucial for performance, then I
believe the right fix is to introduce relaxed atomics.
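As a rough illustration of what I mean by relaxed atomics, here is a
sketch in terms of std::atomic memory orders. The helper names are
hypothetical, and std::memory_order_seq_cst is only an approximation of
the conservative fencing our API documents, not an exact equivalent:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper: increment with no ordering guarantees beyond
// atomicity of the read-modify-write itself. On PPC this kind of
// operation would not need leading or trailing fences.
inline void inc_relaxed(std::atomic<int32_t>& counter) {
  counter.fetch_add(1, std::memory_order_relaxed);
}

// Hypothetical helper: increment with the strongest std::atomic
// ordering, standing in for the conservatively fenced default.
inline void inc_conservative(std::atomic<int32_t>& counter) {
  counter.fetch_add(1, std::memory_order_seq_cst);
}
```

With something like this, the call site states explicitly that it does
not rely on the fences, instead of getting weaker fencing implicitly
from a platform specialization.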
-------------------------------------------------------------------
On S390 Atomic::inc/dec and Atomic::add look almost identical. But I
spotted the following tiny differences:
Atomic::inc on 4-byte sized types loads the increment with LGHI, whereas
Atomic::add loads it with LGFR.
Similarly, Atomic::inc calculates the new value with AGHI and
Atomic::add calculates the new value with AR.
I am not too familiar with S390, but if I get this right then
Atomic::add uses a fetch_and_add instruction, and then adds the
increment to the fetched value in the assembly to conform to
add_and_fetch semantics.
Atomic::inc also uses a fetch_and_add instruction and seems to also
calculate the add_and_fetch result value, without returning it or in any
other way using it.
If the native fetch_and_add instruction is not available, it resorts to
using a load-link add CAS loop - and they look identical except for
using an immediate value for Atomic::inc.
The same applies for Atomic::dec and 8 byte sized types.
Either way, the differences between add and inc/dec currently seem to
be mostly about using immediate values vs a register, if I get it
right. And I would be surprised if that makes a huge difference.
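The two code shapes described above can be sketched in portable C++ as
follows. This is not the S390 assembly; the function names are
hypothetical and std::atomic stands in for the native instructions. The
first variant derives the add_and_fetch result from a native
fetch_and_add, and the second is the CAS-loop fallback:

```cpp
#include <atomic>
#include <cstdint>

// Variant 1: a native fetch_and_add returns the old value, so the new
// value is computed afterwards to give add_and_fetch semantics.
template <typename T>
T add_and_fetch_native(std::atomic<T>& dest, T addend) {
  T old_value = dest.fetch_add(addend);
  return old_value + addend;
}

// Variant 2: fallback when no native fetch_and_add is assumed:
// load, add, then compare-and-swap until no other thread interfered.
template <typename T>
T add_and_fetch_cas(std::atomic<T>& dest, T addend) {
  T old_value = dest.load();
  // On failure compare_exchange_weak reloads old_value, so the new
  // value is recomputed from the fresh contents on each retry.
  while (!dest.compare_exchange_weak(old_value, old_value + addend)) {
  }
  return old_value + addend;
}
```

An inc built on either variant would pass a constant 1 as the addend and
ignore the return value, which is where the immediate vs register
difference comes from.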
-------------------------------------------------------------------
All in all, I would not be unhappy about dropping Atomic::inc
specializations in the name of simplicity, and potentially introducing
relaxed atomics instead for the platforms that rely on fence elision,
should that be required.
Thanks,
/Erik
>
>> and would lead to slightly less optimal machine encoding on x86
>> (without immediate values in the instructions). But it would be
>> simpler for sure. I did not put any judgement into whether our
>> existing optimizations are worthwhile or not. But if you want to
>> prioritize simplicity, removing those optimizations is one possible
>> solution. Would you prefer that?
> Is this really about optimization? If we cared about getting this
> stuff as optimized as possible we'd use intrinsics on GCC/x86 targets.
> These have been supported for a long time. But it seems we're
> determined to preserve the legacy assembly-language implementations
> and use them everywhere, even where they are not necessary.
>