RFR: 8287925: AArch64: intrinsics for compareUnsigned method in Integer and Long
Hao Sun
haosun at openjdk.org
Thu Dec 1 09:01:26 UTC 2022
On Tue, 29 Nov 2022 10:00:02 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> > What would probably work better is to idealize `(cmp (cmp3 x y) < 0)` to `(cmpU x y)`
>>
>> I think this has been done in the original x86 patch. See https://github.com/openjdk/jdk/pull/9068/files#diff-054ecd9354722843f23556a38d2c24546c8a777b58b3442abea2d5e9fe6bb916R851
>
> Interesting. I didn't see this happening. I'll have another look.
Hi @theRealAph ,
> It seems to me that the enhancement here, if it exists, is in the noise. That may well be because the test is dominated by memory traffic.
Yes, I agree. We should use more `compareUnsigned` calls or less memory loading in the loop.
I will show you the new test data later.
> Part of the problem is that using `cmp; cset; cneg` doesn't take advantage of branch prediction. And more often than not, the result of a comparison is predictable.
Yes. I agree.
I'd like to share my investigation here.
### Motivation of this intrinsic
I checked the two PRs in [JDK-8283726](https://bugs.openjdk.org/browse/JDK-8283726), i.e. the initial x86 intrinsic task.
From the discussions of https://github.com/openjdk/jdk/pull/7975 and https://github.com/openjdk/jdk/pull/9068, I think there are two motivations for introducing this `compareUnsigned` intrinsic.
**motivation-1**:
> the compiler can recognise the pattern x + MIN_VALUE < y + MIN_VALUE and transforms it into x u< y. This transformation is fragile however if one of the arguments is in the form x + con
See the discussion [here](https://github.com/openjdk/jdk/pull/7975#issuecomment-1082464476).
I think that's why `bound - 16` is used rather than simply `bound` in [the JMH case](https://github.com/openjdk/jdk/pull/9068/files#diff-0a47e045a0f44094a7ec1fe7a251cc1d99d9fcd9c330cb33a79dd58e7f21b5d6R163).
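For context, the biased-compare idiom that **motivation-1** refers to can be sketched as follows (a minimal sketch of my own; the method names are not from the PR):

```java
public class UnsignedCompareIdiom {
    // Pre-JDK-8 idiom: bias both operands by Integer.MIN_VALUE so that a
    // signed '<' behaves like an unsigned '<'. C2 recognises this shape
    // and rewrites it into a single unsigned compare.
    static boolean lessUnsignedManual(int x, int y) {
        return x + Integer.MIN_VALUE < y + Integer.MIN_VALUE;
    }

    // The fragile variant from the linked discussion: once one operand
    // carries an extra constant (here y - 16), the pattern matching may
    // no longer fire.
    static boolean lessUnsignedWithCon(int x, int y) {
        return x + Integer.MIN_VALUE < (y - 16) + Integer.MIN_VALUE;
    }

    public static void main(String[] args) {
        // -1 is 0xFFFFFFFF, the largest unsigned 32-bit value.
        assert lessUnsignedManual(1, -1);
        assert !lessUnsignedManual(-1, 1);
        assert lessUnsignedManual(3, 7) == (Integer.compareUnsigned(3, 7) < 0);
    }
}
```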
**motivation-2**:
An optimization can be introduced, i.e. idealizing `(cmp (cmp3 x y) < 0)` to `(cmpU x y)`.
There is a `compareUnsignedIndirect` function in the JMH case to show the performance uplift of this optimization.
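To make **motivation-2** concrete, the two usage shapes the JMH case distinguishes can be sketched as follows (my own illustration, not code from the PR):

```java
public class CompareUnsignedShapes {
    // Direct: the full three-way result (negative/zero/positive) is
    // consumed, so the backend has to materialise it, e.g. with
    // cmp; cset; cneg on AArch64.
    static int direct(int x, int y) {
        return Integer.compareUnsigned(x, y);
    }

    // Indirect: only the sign of the result is tested. With the
    // idealization (cmp (cmp3 x y) < 0) -> (cmpU x y), C2 can emit a
    // single unsigned compare plus a conditional branch instead.
    static boolean indirect(int x, int y) {
        return Integer.compareUnsigned(x, y) < 0;
    }

    public static void main(String[] args) {
        assert direct(1, 2) < 0;
        assert direct(2, 2) == 0;
        assert direct(-1, 1) > 0;   // -1 is the largest unsigned int
        assert indirect(1, -1);
        assert !indirect(-1, 1);
    }
}
```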
### Update the JMH case
I'd like to use the `Integer` case as an illustrating example; I think `Long` can be handled in a similar way.
Here is my updated JMH case.
```java
// Compare between random values.
// Use big loops.
// Operate with another constant before cmpU3.
@Benchmark
public void compareUnsignedDirect2(Blackhole bh) {
    int r = 0;
    int inx1 = 0;
    int inx2 = 0;
    int i, j, e1, e2;
    for (i = 0; i < size; i++) {
        inx1 = intsTiny[i];
        inx2 = intsSmall[i];
        for (j = 0; j < size; j++) {  // use a separate index for the inner loop
            inx1 = (inx1 + j) % size;
            inx2 = (inx2 + j) % size;
            e1 = intsBig[inx1];
            e2 = intsBig[inx2];
            r += Integer.compareUnsigned(e1, e2 - 16);
            r -= Integer.compareUnsigned(r, e1 - 16);
            r += Integer.compareUnsigned(r, e2 - 16);
        }
    }
    bh.consume(r);
}
```
```java
@Benchmark
public void compareUnsignedIndirect2(Blackhole bh) {
    int r = 0;
    int inx1 = 0;
    int inx2 = 0;
    int i, j, e1, e2;
    for (i = 0; i < size; i++) {
        inx1 = intsTiny[i];
        inx2 = intsSmall[i];
        for (j = 0; j < size; j++) {  // use a separate index for the inner loop
            inx1 = (inx1 + j) % size;
            inx2 = (inx2 + j) % size;
            e1 = intsBig[inx1];
            e2 = intsBig[inx2];
            r += (Integer.compareUnsigned(e1, e2 - 16) < 0) ? 1 : 0;
            r -= (Integer.compareUnsigned(r, e1 - 16) < 0) ? 1 : 0;
            r += (Integer.compareUnsigned(r, e2 - 16) < 0) ? 1 : 0;
        }
    }
    bh.consume(r);
}
```
Note-1: random values are passed to `compareUnsigned` to evaluate the effect of branch prediction. Inevitably, memory loads are introduced in the inner loop, i.e. loading the values of `e1` and `e2`.
Note-2: big loops are used. In my evaluation, I passed `10240` as the `size` parameter.
Note-3: more `compareUnsigned` operations are done in the inner loop.
Note-4: similarly, `- 16` is applied to the arguments of `compareUnsigned`, and both `Direct/Indirect` versions are used.
### Evaluation data on x86 and aarch64
Note-1: `before` means the intrinsic `_compareUnsigned_i` is disabled, and `after` means it is enabled.
Note-2: the unit is `us/op`. The smaller, the better.
| Case | Arch | before (us/op) | after (us/op) |
| --- | --- | --- | --- |
| **compareUnsignedDirect2** | AArch64 | 51.423 ± 6.891 | 55.144 ± 8.887 |
| **compareUnsignedDirect2** | x86 | 67.781 ± 2.876 | 70.549 ± 3.729 |
| **compareUnsignedIndirect2** | AArch64 | 52.782 ± 8.264 | 48.915 ± 11.054 |
| **compareUnsignedIndirect2** | x86 | 70.231 ± 3.966 | 68.107 ± 3.768 |
From the evaluation data, we can see:
1) There is a **performance regression** for case **compareUnsignedDirect2** on both AArch64 and x86.
I checked the C2-generated code for this case on AArch64, and I don't think `e - 16` prevents the C2 transformation "x+min_val OP y+min_val -> x uOP y".
I think C2 already generates good enough code here, and our intrinsic doesn't win.
Perhaps **motivation-1** is not accurate?
2) There is a **slight performance uplift** for case **compareUnsignedIndirect2** on both AArch64 and x86.
Hence, the idealization optimization (see **motivation-2**) works.
I checked the generated code with this optimization, and the `cmp; cset; cneg` sequence is gone.
Given the investigation above, I personally don't have a strong reason to introduce this intrinsic on the AArch64 side.
WDYT? @theRealAph
Besides, @merykitty, it would be nice if you could take a look at this discussion as well.
Is there anything I missed or misunderstood? Thanks in advance.
-------------
PR: https://git.openjdk.org/jdk/pull/11383
More information about the hotspot-compiler-dev
mailing list