RFR: 8186838: Generalize Atomic::inc/dec with templates

Tue Sep 5 00:38:07 UTC 2017

> On Sep 4, 2017, at 10:59 PM, David Holmes <david.holmes at oracle.com> wrote:
> 
> Hi Erik,
> 
> On 4/09/2017 5:15 PM, Erik Österlund wrote:
>> Hi David,
>> On 2017-09-04 03:24, David Holmes wrote:
>>> Hi Erik,
>>> 
>>> On 1/09/2017 8:49 PM, Erik Österlund wrote:
>>>> Hi David,
>>>> 
>>>> The shared structure for all operations is the following:
>>>> 
>>>> An Atomic::something call creates a SomethingImpl function object that performs some basic type checking and then forwards the call straight to a PlatformSomething function object. This PlatformSomething object could decide to do anything. But to make life easier, it may inherit from a shared SomethingHelper function object with CRTP that calls back into the PlatformSomething function object to emit inline assembly.
>>> 
>>> Right, but! Lets look at some details.
>>> 
>>> Atomic::add
>>>   AddImpl
>>>     PlatformAdd<size>
>>>       FetchAndAdd
>>>       AddAndFetch
>>>       add_using_helper
>>> 
>>> Atomic::cmpxchg
>>>   CmpxchgImpl
>>>     PlatformCmpxchg<size>
>>>       cmpxchg_using_helper
>>> 
>>> Atomic::inc
>>>   IncImpl
>>>     PlatformInc<size, Adjustment>
>>>       IncUsingConstant
>>> 
>>> Why is it that the simplest operation (inc/dec) has the most complex platform template definition? Why do we need Adjustment? You previously said "Adjustment represents the increment/decrement value as an IntegralConstant - your template friend for passing around a constant with both a specified type and value in templates". But add passes around values and doesn't need this. Further inc/dec don't need to pass anything around anywhere - inc adds 1, dec subtracts 1! This "1" does not need to appear anywhere in the API or get passed across layers - the only place this "1" becomes evident is in the actual platform asm that does the logic of "add 1" or "subtract 1".
>>> 
>>> My understanding from previous discussions is that much of the template machinations was to deal with type management for "dest" and the values being passed around. But here, for inc/dec there are no values being passed so we don't have to make "dest" type-compatible with any value.
>> Dealing with different types being passed in is one part of the problem - a problem that almost all operations seems to have. But Atomic::add and inc/dec have more problems to deal with.
>> The Atomic::add operation has two more problems that cmpxchg does not have.
>> 1) It needs to scale pointer arithmetic. So if you have a P* and you add it by 2, then you really add the underlying value by 2 * sizeof(P), and the scaled addend needs to be of the right type - the type of the destination for integral types and ptrdiff_t for pointers. This is similar semantics to ++pointer.
> 
> I'll address this below - but yes I overlooked this aspect.
> 
>> 2) It connects backends with different semantics - either fetch_and_add or add_and_fetch to a common public interface with add_and_fetch semantics.
> 
> Not at all clear why this has to manifest in the upper/middle layers instead of being handled by the actual lowest-layer ??
> 
>> This is the reason that Atomic::add might appear more complicated than Atomic::cmpxchg. Because Atomic::cmpxchg only had the different type problems to deal with - no pointer arithmetics.
>> The reason why Atomic::inc/dec looks more complicated than Atomic::add is that it needs to preserve the pointer arithmetic as constants rather than values, because the scaled addend is embedded in the inline assembly as immediate values. Therefore it passes around an IntegralConstant that embeds both the type and size of the addend. And it is not just 1/-1. For integral destinations the constant used is 1/-1 of the type stored at the destination. For pointers the constant is ptrdiff_t with a value representing the size of the element pointed to.
> 
> This is insanely complicated (I think that counts as 'accidental complexity' per Andrew's comment ;-) ). Pointer arithmetic is a basic/fundamental part of C/C++, yet this template stuff has to jump through multiple inverted hoops to do something the language "just does"! All this complexity to manage a conversion addend -> addend * sizeof(*dest) ??
> 
> And the fact that inc/dec are simpler than add, yet result in far more complicated templates because the simpler addend is a constant, is just as unfathomable to me!
> 
>> Having said that - I am not opposed to simply removing the specializations of inc/dec if we are scared of the complexity of passing this constant to the platform layer. After running a bunch of benchmarks over the weekend, it showed no significant regressions after removal. Now of course that might not tell the full story - it could have missed that some critical operation in the JVM takes longer. But I would be very surprised if that was the case.
> 
> I can imagine we use an "add immediate" form for inc/dec of 1, do we actually use that for other values? I would expect inc_ptr/dec_ptr to always translate to add_ptr, with no special case for when ptr is char* and so we only add/sub 1. ??

[Delurking briefly.]

Sorry I've been silent until now in this discussion, but I'm on
vacation, and won't have time until next week to really pay attention.
But this seems to have gone somewhat awry, so I'm popping in briefly.

David objected to some of the complexity, apparently based on
forgetting the scaling for pointer arithmetic. That seems to have been
cleared up.

However, David (and I think others) are also objecting to other points
of complexity, and I think I agree. I was working on an approach that
was structurally similar to Atomic::add, but using IntegralContant to
retain access to the literal value, for those platforms that benefit
from that. Erik's proposal also uses IntegralContant for that purpose,
but (from a quick skim) I think got that part wrong, and that is a
source of additional complexity. Erik might want to re-read my handoff
email to him.  I don't know whether that approach would satisfy folks
though. 

I was also looking into the possibility that more platforms might be
able to just use Atomic::add to implement Atomic::inc (and maybe
Atomic::dec), without an change to the generated code for inc/dec.
This would be accomplished by improving the inline assembler for
Atomic::add, using an "n" constraint for the addend when appropriate.
In some cases this perhaps might be done by providing it as an
alternative (e.g. using an "nr" constraint). I hadn't gotten gotten
very far in exploring that possibility though, so it might not go
anywhere.

And I agree the existing barriers in inc/dec for powerpc (both aix and
linux) look contrary to the documented requirements.

I'm a little bit reluctant to just give up on per-platform
microoptimized inc/dec and simply transform those operations into
corresponding add operations.  Requiring an additional register and
it's initialization for some platforms seems like poor form.

If this discussion hasn't reached consensus by next week, I'll start
working with Erik then to get us there.