RFR: 8287925: AArch64: intrinsics for compareUnsigned method in Integer and Long
Hao Sun
haosun at openjdk.org
Thu Dec 1 09:01:26 UTC 2022
On Tue, 29 Nov 2022 10:00:02 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> > What would probably work better is to idealize `(cmp (cmp3 x y) < 0)` to `(cmpU x y)`
>>
>> I think this has been done in the original x86 patch. See https://github.com/openjdk/jdk/pull/9068/files#diff-054ecd9354722843f23556a38d2c24546c8a777b58b3442abea2d5e9fe6bb916R851
>
> Interesting. I didn't see this happening. I'll have another look.
Hi @theRealAph ,
> It seems to me that the enhancement here, if it exists, is in the noise. That may well be because the test is dominated by memory traffic.
Yes, I agree. We should use more `compareUnsigned` calls or less memory loading in the loop.
I will show you the new test data later.
> Part of the problem is that using `cmp; cset; cneg` doesn't take advantage of branch prediction. And more often than not, the result of a comparison is predictable.
Yes. I agree.
I'd like to share my investigation here.
### Motivation of this intrinsic
I checked the two PRs in [JDK-8283726](https://bugs.openjdk.org/browse/JDK-8283726), i.e. the initial x86 intrinsic task.
From the discussions of https://github.com/openjdk/jdk/pull/7975 and https://github.com/openjdk/jdk/pull/9068, I think there are two motivations for introducing this `compareUnsigned` intrinsic.
**motivation-1**:
> the compiler can recognise the pattern x + MIN_VALUE < y + MIN_VALUE and transforms it into x u< y. This transformation is fragile however if one of the arguments is in the form x + con
See the discussion [here](https://github.com/openjdk/jdk/pull/7975#issuecomment-1082464476).
I think that's why `bound - 16` is used rather than simply `bound` in [the JMH case](https://github.com/openjdk/jdk/pull/9068/files#diff-0a47e045a0f44094a7ec1fe7a251cc1d99d9fcd9c330cb33a79dd58e7f21b5d6R163).
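For context, the biased-compare idiom that **motivation-1** refers to can be sketched as follows (a minimal sketch of my own; the method names are not from the PR):

```java
public class UnsignedCompareIdiom {
    // Pre-JDK-8 idiom: bias both operands by Integer.MIN_VALUE so that a
    // signed '<' behaves like an unsigned '<'. C2 recognises this shape
    // and rewrites it into a single unsigned compare.
    static boolean lessUnsignedManual(int x, int y) {
        return x + Integer.MIN_VALUE < y + Integer.MIN_VALUE;
    }

    // The fragile variant from the linked discussion: once one operand
    // carries an extra constant (here y - 16), the pattern matching may
    // no longer fire.
    static boolean lessUnsignedWithCon(int x, int y) {
        return x + Integer.MIN_VALUE < (y - 16) + Integer.MIN_VALUE;
    }

    public static void main(String[] args) {
        // -1 is 0xFFFFFFFF, the largest unsigned 32-bit value.
        assert lessUnsignedManual(1, -1);
        assert !lessUnsignedManual(-1, 1);
        assert lessUnsignedManual(3, 7) == (Integer.compareUnsigned(3, 7) < 0);
    }
}
```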
**motivation-2**:
An optimization can be introduced, i.e. idealizing `(cmp (cmp3 x y) < 0)` to `(cmpU x y)`.
There is a `compareUnsignedIndirect` function in the JMH case to show the performance uplift of this optimization.
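To make **motivation-2** concrete, the two usage shapes the JMH case distinguishes can be sketched as follows (my own illustration, not code from the PR):

```java
public class CompareUnsignedShapes {
    // Direct: the full three-way result (negative/zero/positive) is
    // consumed, so the backend has to materialise it, e.g. with
    // cmp; cset; cneg on AArch64.
    static int direct(int x, int y) {
        return Integer.compareUnsigned(x, y);
    }

    // Indirect: only the sign of the result is tested. With the
    // idealization (cmp (cmp3 x y) < 0) -> (cmpU x y), C2 can emit a
    // single unsigned compare plus a conditional branch instead.
    static boolean indirect(int x, int y) {
        return Integer.compareUnsigned(x, y) < 0;
    }

    public static void main(String[] args) {
        assert direct(1, 2) < 0;
        assert direct(2, 2) == 0;
        assert direct(-1, 1) > 0;   // -1 is the largest unsigned int
        assert indirect(1, -1);
        assert !indirect(-1, 1);
    }
}
```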
### Update the JMH case
I'd like to use the `Integer` case as an illustrating example; I think `Long` can be handled in a similar way.
Here is my updated JMH case.
```java
// Compare between random values.
// Use big loops.
// Operate with another constant before cmpU3.
@Benchmark
public void compareUnsignedDirect2(Blackhole bh) {
    int r = 0;
    int inx1 = 0;
    int inx2 = 0;
    int i, j, e1, e2;
    for (i = 0; i < size; i++) {
        inx1 = intsTiny[i];
        inx2 = intsSmall[i];
        for (j = 0; j < size; j++) {  // use a separate index for the inner loop
            inx1 = (inx1 + j) % size;
            inx2 = (inx2 + j) % size;
            e1 = intsBig[inx1];
            e2 = intsBig[inx2];
            r += Integer.compareUnsigned(e1, e2 - 16);
            r -= Integer.compareUnsigned(r, e1 - 16);
            r += Integer.compareUnsigned(r, e2 - 16);
        }
    }
    bh.consume(r);
}
```
```java
@Benchmark
public void compareUnsignedIndirect2(Blackhole bh) {
    int r = 0;
    int inx1 = 0;
    int inx2 = 0;
    int i, j, e1, e2;
    for (i = 0; i < size; i++) {
        inx1 = intsTiny[i];
        inx2 = intsSmall[i];
        for (j = 0; j < size; j++) {  // use a separate index for the inner loop
            inx1 = (inx1 + j) % size;
            inx2 = (inx2 + j) % size;
            e1 = intsBig[inx1];
            e2 = intsBig[inx2];
            r += (Integer.compareUnsigned(e1, e2 - 16) < 0) ? 1 : 0;
            r -= (Integer.compareUnsigned(r, e1 - 16) < 0) ? 1 : 0;
            r += (Integer.compareUnsigned(r, e2 - 16) < 0) ? 1 : 0;
        }
    }
    bh.consume(r);
}
```
Note-1: random values are passed to `compareUnsigned` to evaluate the effect of branch prediction. Inevitably, memory loads are introduced in the inner loop, i.e. loading the values of `e1` and `e2`.
Note-2: big loops are used. In my evaluation, I passed `10240` as the `size` parameter.
Note-3: more `compareUnsigned` operations are done in the inner loop.
Note-4: similarly, `- 16` is applied to the arguments of `compareUnsigned`, and both `Direct/Indirect` versions are used.
### Evaluation data on x86 and aarch64
Note-1: `before` means the intrinsic `_compareUnsigned_i` is disabled, and `after` means it is enabled.
Note-2: the unit is `us/op`. The smaller, the better.
| Case | Arch | before (us/op) | after (us/op) |
| --- | --- | --- | --- |
| **compareUnsignedDirect2** | AArch64 | 51.423 ± 6.891 | 55.144 ± 8.887 |
| **compareUnsignedDirect2** | x86 | 67.781 ± 2.876 | 70.549 ± 3.729 |
| **compareUnsignedIndirect2** | AArch64 | 52.782 ± 8.264 | 48.915 ± 11.054 |
| **compareUnsignedIndirect2** | x86 | 70.231 ± 3.966 | 68.107 ± 3.768 |
From the evaluation data, we can see:
1) There is a **performance regression** for case **compareUnsignedDirect2** on both AArch64 and x86.
I checked the C2-generated code for this case on AArch64, and I don't think `e - 16` prevents the C2 transformation "x+min_val OP y+min_val -> x uOP y".
I think C2 already generates good enough code here, and our intrinsic doesn't win.
Perhaps **motivation-1** is not accurate?
2) There is a **slight performance uplift** for case **compareUnsignedIndirect2** on both AArch64 and x86.
Hence, the idealization optimization (see **motivation-2**) works.
I checked the generated code with this optimization, and the `cmp; cset; cneg` sequence is gone.
Given the investigation above, I personally don't have a strong reason to introduce this intrinsic on the AArch64 side.
WDYT? @theRealAph
Besides, @merykitty, it would be nice if you could take a look at this discussion as well.
Is there anything I missed or misunderstood? Thanks in advance.
-------------
PR: https://git.openjdk.org/jdk/pull/11383
More information about the hotspot-compiler-dev
mailing list