RFR: 8287925: AArch64: intrinsics for compareUnsigned method in Integer and Long

Mon Dec 5 11:44:09 UTC 2022

On Mon, 5 Dec 2022 09:13:16 GMT, Quan Anh Mai <qamai at openjdk.org> wrote:

>>> The motivation for these intrinsics, aside from unsigned comparison being a fairly basic operation, is for range checks of a load/store, which has the form of `0 <= offset && offset <= length - size`, by transforming this into `0 <= length - size && offset u<= length - size`, the first comparison as well as the computation of `length - size` can be hoisted out of the loop, which results in a single operation being loop varying. The reason this may not be recognised effectively by the idealiser is that if `size` is a constant, `(length - size) + MIN_VALUE` being folded into `length + (MIN_VALUE - size)`, breaks the pattern.
>>> 
>> @merykitty Thanks for your explanation. 
>> But I'm afraid I didn't fully get your point.
>> I think `offset u<= length - size` will be matched with `CmpU` node, rather than `CmpU3` node, right?
>> 
>>> @shqking I think your benchmark is not good, as the error is higher than the actual difference between samples, and the cost of the division may swallow everything inside the loop.
>>>
>> How about the case as suggested by aph, i.e. shown in the previous comment?
>> 
>> Thanks~
>
> @shqking
> 
>> But I'm afraid I didn't fully get your point.
> I think offset u<= length - size will be matched with CmpU node, rather than CmpU3 node, right?
> 
> `CmpU3` serves the same purpose as `CmpL3` or `CmpD3`, that is to be an intermediate node that gets elided immediately. For example, this piece of code:
> 
>     static int square(long x, long y) {
>         return x > y ? 1 : 0;
>     }
> 
> gets compiled into:
> 
>     static int square(long, long);
>         0: lload_0
>         1: lload_2
>         2: lcmp
>         3: ifle          10
>         6: iconst_1
>         7: goto          11
>        10: iconst_0
>        11: ireturn
> 
> `lcmp` is parsed into a `CmpL3`, so the comparison part is parsed as `CmpI (CmpL3 x y) 0`, which is then transformed into `CmpL x y`.
> 
> Similarly, `Integer.compareUnsigned(x, y) > 0` is parsed as `CmpI (CmpU3 x y) 0`, which is then transformed into `CmpU x y`.
> 
>> How about the case as suggested by aph, i.e. shown in the previous comment?
> 
> I think it is better but I still think measuring the performance of `CmpU3` node is quite irrelevent.
> 
> Thanks.

@merykitty Thanks a lot for your prompt explanation.

Here is my understanding of node `CmpU3`.
1) Generation: As an intrinsic, `Integer.compareUnsigned()` will be compiled into `CmpU3` node. The corresponding JMH test case is `compareUnsignedDirect()` function.
2) Optimization: Similar to `CmpL3`, `CmpF3` and `CmpD3` nodes, idealization optimization is conducted, transforming `CmpI (CmpU3 x y) 0` into `CmpU x y`. As you mentioned, in this case `CmpU3` node works as an intermediate node. The corresponding JMH test case is `compareUnsignedIndirect()` function.

Here list two quotes from your previous comments.
> IMO the direct result of the method is less important, because the contract does not have any promise with respect to the exact return value, and the only thing that can be done with it is to compare it with 0, which will certainly be folded into a CmpU node.
>
> I think it is better but I still think measuring the performance of CmpU3 node is quite irrelevent.
>
I guess your point is that it's irrelevant to evaluate the performance of `CmpU3` node mainly because **in most scenarios, the result of `Integer.compareUnsigned(x, y)` would be compared with 0**, and `CmpU3` will be optimized out finally.

If my understanding is correct, may I ask how can we draw this conclusion, from observation or statistical results of the usages of `Integer.compareUnsigned(x, y)` methods in exisiting popular Java application?

Besides, I believe you may understand what we (aph and I) are concerned about already, but I'd like to describe it again in case I didn't make myself understood clearly. Our discussion was actually the performance of `CmpU3` node, because in AArch64, `cmp; cset; cneg` sequence would be generated, which is predictable compared to the C2 generated code, where the advantage of branch prediction can be utilized. From the performance data shown previously, I'm afraid it's hard to say the intrinsification code is always better than C2 compiled one.

Thanks.

-------------

PR: https://git.openjdk.org/jdk/pull/11383