RFR: 8186838: Generalize Atomic::inc/dec with templates

Mon Sep 4 08:14:48 UTC 2017

Hi Andrew,

On 2017-09-02 10:31, Andrew Haley wrote:
> On 01/09/17 15:15, Erik Österlund wrote:
>> It is not the simplest solution I can think of. The simplest solution I
>> can think of is to remove all specialized versions of Atomic::inc/dec
>> and just have it call Atomic::add directly. That would remove the
>> optimizations we have today, for whatever reason we have them. It would
>> lead to slightly more conservative fencing on PPC/S390,
> I see.  Can you say what instructions would be different?

Sure.

Specializations exist on x86, PPC and S390. Removing these 
specializations would have the following consequences:

-------------------------------------------------------------------

On x86 Atomic::inc of 4 byte sized types:
lock addl $immediateAddend,(rDest)
becomes
lock xaddl rAddend,(rDest) # stores the value that was there back in
                           # rAddend upon completion

So the inc optimization currently makes sure the addend can be encoded
as an immediate value in the code stream, and exploits the fact that we
do not need to see the returned value. A lock addl is therefore good
enough and does not require the use of an extra register. But it is not
obvious that the slimmer encoding makes any significant difference on a
modern machine. In the contended case it arguably will not matter.

Similar arguments apply for 8 byte sized types and the Atomic::dec variants.
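
As a rough illustration of what "inc/dec just call add" would look like,
here is a minimal sketch using std::atomic rather than the HotSpot
Atomic class. The names atomic_add, atomic_inc and atomic_dec are
hypothetical stand-ins, not the code under review.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for Atomic::add with add_and_fetch semantics:
// atomically adds 'addend' to 'dest' and returns the NEW value.
template <typename T>
T atomic_add(std::atomic<T>& dest, T addend) {
  // fetch_add returns the previous value, so add the addend once more
  // to obtain the value that was stored.
  return dest.fetch_add(addend) + addend;
}

// Hypothetical stand-in for Atomic::inc without any specialization:
// forwards to atomic_add and ignores the result.
template <typename T>
void atomic_inc(std::atomic<T>& dest) {
  (void)atomic_add(dest, T(1));
}

// Hypothetical stand-in for Atomic::dec, same idea with an addend of -1.
template <typename T>
void atomic_dec(std::atomic<T>& dest) {
  (void)atomic_add(dest, T(-1));
}

// Small self-check showing the intended usage.
inline void inc_dec_usage_example() {
  std::atomic<int32_t> counter(0);
  atomic_inc(counter);
  atomic_inc(counter);
  atomic_dec(counter);
  assert(counter.load() == 1);
}
```

With a compiler-provided atomic like this, an optimizing compiler can
typically see that the result is unused and pick the cheaper encoding on
its own. With hand-written inline assembly that choice has to be made
manually, which is what the existing specializations do.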

-------------------------------------------------------------------

On PPC Atomic::inc/dec and Atomic::add have the following differences:

Atomic::inc/dec uses addic between the LL and SC instructions with an 
immediate value for adding, whereas Atomic::add uses the add instruction 
with an extra register.
Atomic::add has a leading lwsync fence and Atomic::inc/dec has no 
leading fence.
Atomic::add has a trailing isync fence and Atomic::inc/dec has no 
trailing fence.

So the current implementation of Atomic::add uses heavier fencing than
Atomic::inc/dec. I can imagine that does matter for performance today.
However, the documented semantics of Atomic::inc/dec require a leading
sync fence, so both are arguably too weak and should have stronger
fencing than they do today. And I would argue that if both conformed to
the fencing required by our public API, then the difference would
probably be small.

If dodging those fences on PPC is crucial for performance, then I
believe the right way of fixing that is to introduce relaxed atomics.
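
To make the relaxed atomics idea concrete, here is a minimal sketch in
terms of std::atomic, where the caller picks the ordering explicitly
instead of relying on a specialization that silently drops fences. The
name atomic_inc_ordered is hypothetical, and std::memory_order_seq_cst
is only an approximation of the conservative ordering our API documents,
not an exact match.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical increment with an explicit memory ordering. The default
// is the strongest standard ordering; callers that only need atomicity
// (e.g. statistics counters) can opt in to relaxed ordering.
template <std::memory_order Order = std::memory_order_seq_cst, typename T>
void atomic_inc_ordered(std::atomic<T>& dest) {
  dest.fetch_add(T(1), Order);
}

// Small self-check showing both call styles.
inline void ordered_inc_usage_example() {
  std::atomic<int> events(0);

  // Fully ordered increment: fencing is left to the implementation.
  atomic_inc_ordered(events);

  // Relaxed increment: still atomic, but imposes no ordering on
  // surrounding memory accesses, so weakly ordered platforms are free
  // to omit the fences.
  atomic_inc_ordered<std::memory_order_relaxed>(events);

  assert(events.load() == 2);
}
```

The point of the design is that fence elision becomes visible at the
call site rather than being a property of one platform's inc/dec code.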

-------------------------------------------------------------------

On S390 Atomic::inc/dec and Atomic::add look almost identical. But I
spotted the following tiny differences:

Atomic::inc on 4-byte sized types loads the increment with LGHI, whereas
Atomic::add loads it with LGFR.
Similarly, Atomic::inc calculates the new value with AGHI, whereas
Atomic::add calculates the new value with AR.

I am not too familiar with S390, but if I get this right then
Atomic::add uses a fetch_and_add instruction, and then adds the addend
to the fetched value in the assembly to conform to add_and_fetch
semantics. Atomic::inc also uses a fetch_and_add instruction and seems
to also calculate the add_and_fetch result value, without returning it
or using it in any other way.

If the native fetch_and_add instruction is not available, it resorts to 
using a load-link add CAS loop - and they look identical except for 
using an immediate value for Atomic::inc.

The same applies for Atomic::dec and 8 byte sized types.
Either way, the differences between add and inc/dec currently seem to be
mostly about using immediate values vs a register, if I get it right.
And I would be surprised if that makes a huge difference.
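
For reference, the two shapes described above (add_and_fetch built on a
native fetch_and_add, and the CAS loop fallback) can be sketched roughly
as follows with std::atomic. The function names are hypothetical and
this is not the S390 assembly, only the same logic in portable form.

```cpp
#include <atomic>
#include <cassert>

// add_and_fetch on top of a native fetch_and_add: the hardware returns
// the old value, so the new value is recomputed afterwards.
inline int add_and_fetch_native(std::atomic<int>& dest, int addend) {
  int old_value = dest.fetch_add(addend);
  return old_value + addend;
}

// Fallback when no native fetch_and_add is available: load the current
// value, compute the sum, and retry the compare-and-swap until no other
// thread has modified 'dest' in between.
inline int add_and_fetch_cas(std::atomic<int>& dest, int addend) {
  int old_value = dest.load();
  int new_value;
  do {
    new_value = old_value + addend;
    // On failure, compare_exchange_weak refreshes old_value with the
    // value currently stored in dest, so the loop recomputes the sum.
  } while (!dest.compare_exchange_weak(old_value, new_value));
  return new_value;
}

// Small self-check: both variants return the new value.
inline void add_and_fetch_usage_example() {
  std::atomic<int> value(10);
  assert(add_and_fetch_native(value, 1) == 11);
  assert(add_and_fetch_cas(value, 1) == 12);
}
```

An inc built on either variant simply passes 1 as the addend and drops
the return value, which is why only the immediate vs register operand
differs.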

-------------------------------------------------------------------

All in all, I would not be unhappy about dropping Atomic::inc 
specializations in the name of simplicity, and potentially introducing 
relaxed atomics instead for the platforms that rely on fence elision, 
should that be required.

Thanks,
/Erik

>
>> and would lead to slightly less optimal machine encoding on x86
>> (without immediate values in the instructions). But it would be
>> simpler for sure. I did not put any judgement into whether our
>> existing optimizations are worthwhile or not. But if you want to
>> prioritize simplicity, removing those optimizations is one possible
>> solution. Would you prefer that?
> Is this really about optimization?  If we cared about getting this
> stuff as optimized as possible we'd use intrinsics on GCC/x86 targets.
> These have been supported for a long time.  But it seems we're
> determined to preserve the legacy assembly-language implementations
> and use them everywhere, even where they are not necessary.
>