RFR: 8309583: AArch64: Optimize firstTrue() when amount of elements < 8 [v2]
Chang Peng
duke at openjdk.org
Mon Jun 19 01:45:59 UTC 2023
> This patch optimizes VectorMask.firstTrue() on Neon when there are 2 or 4 elements in vector registers.
>
> VectorMask.firstTrue() should return VLENGTH when the vector mask is all false [1]. The current implementation uses rbit and then clz [2] to count leading zeros, then uses csel [3] (conditional select) to take the smaller of VLENGTH and the number of unset lanes to ensure correctness.
>
> This patch sets the 16th or 32nd bit to 1 before rbit and clz when the boolean mask has only 2 or 4 elements. With this trick, the maximum value calculated in that case is VLENGTH (2 or 4), so the result no longer needs to be clamped.
>
> Test:
> All vector and vectorapi tests passed.
>
> Performance:
> The benchmark function looks like this:
>
>
> @Benchmark
> public static int testInt() {
>     int res = 0;
>     for (int i = 0; i < LENGTH; i += INT_SPECIES.length()) {
>         VectorMask<Integer> m = VectorMask.fromArray(INT_SPECIES, ia, i);
>         res += m.firstTrue();
>     }
>
>     return res;
> }
>
>
> The following data was collected on a 128-bit Neon machine.
>
> Benchmark      Before        After   Unit
> testInt     22214.740    25627.833   ops/ms
> testLong    11649.898    13698.535   ops/ms
>
> [1]: https://docs.oracle.com/en/java/javase/20/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorMask.html#firstTrue()
> [2]: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_vector.ad#L5540
> [3]: https://developer.arm.com/documentation/ddi0602/2021-12/Base-Instructions/CSEL--Conditional-Select-
Chang Peng has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision:
- Merge branch 'openjdk:master' into optimize_firsttrue2e4e_neon
- 8309583: AArch64: Optimize firstTrue() when amount of elements < 8
This patch optimizes VectorMask.firstTrue() on Neon when there are 2
or 4 elements in vector registers.
VectorMask.firstTrue() should return VLENGTH when the vector mask is all
false [1]. The current implementation uses rbit and then clz [2] to count
leading zeros, then uses csel [3] (conditional select) to take the
smaller of VLENGTH and the number of unset lanes to ensure
correctness.
This patch sets the 16th or 32nd bit to 1 before rbit and clz when the
boolean mask has only 2 or 4 elements. With this trick, the maximum
value calculated in that case is VLENGTH (2 or 4), so the result no
longer needs to be clamped.
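The bit trick can be modelled in plain Java. This is only a rough sketch, not the generated AArch64 code: it assumes each boolean lane occupies one byte of a 64-bit value with lane 0 in the least significant byte, and that the bit count is shifted right by 3 to turn it into a lane index. Long.reverse and Long.numberOfLeadingZeros stand in for rbit and clz, Math.min stands in for the csel step, and the class and method names are hypothetical.

```java
public class FirstTrueSketch {
    // Model of the existing sequence: rbit + clz gives the index of the
    // lowest set bit (64 if none), the shift converts bits to lanes, and
    // the min (csel in the real code) clamps the all-false case to vlength.
    static int firstTrueWithSelect(long maskBits, int vlength) {
        int lane = Long.numberOfLeadingZeros(Long.reverse(maskBits)) >>> 3;
        return Math.min(lane, vlength);
    }

    // Model of the patched sequence for 2 or 4 lanes: a sentinel bit just
    // past the last lane (bit 16 or bit 32) caps the result at vlength,
    // so no clamping is needed afterwards.
    static int firstTrueWithSentinel(long maskBits, int vlength) {
        long withSentinel = maskBits | (1L << (8 * vlength));
        return Long.numberOfLeadingZeros(Long.reverse(withSentinel)) >>> 3;
    }

    public static void main(String[] args) {
        long allFalse = 0L;          // no lane set
        long laneOneTrue = 0x0100L;  // only lane 1 set
        for (int vlength : new int[] {2, 4}) {
            System.out.println("vlength=" + vlength
                + " allFalse: select=" + firstTrueWithSelect(allFalse, vlength)
                + " sentinel=" + firstTrueWithSentinel(allFalse, vlength)
                + " laneOneTrue: select=" + firstTrueWithSelect(laneOneTrue, vlength)
                + " sentinel=" + firstTrueWithSentinel(laneOneTrue, vlength));
        }
    }
}
```

Under these assumptions both methods agree for every mask, and the sentinel variant simply trades the compare-and-select for a single bitwise OR.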
Test:
All vector and vectorapi tests passed.
Performance:
The benchmark function looks like this:
```
@Benchmark
public static int testInt() {
    int res = 0;
    for (int i = 0; i < LENGTH; i += INT_SPECIES.length()) {
        VectorMask<Integer> m = VectorMask.fromArray(INT_SPECIES, ia, i);
        res += m.firstTrue();
    }
    return res;
}
```
The following data was collected on a 128-bit Neon machine.
Benchmark      Before        After   Unit
testInt     22214.740    25627.833   ops/ms
testLong    11649.898    13698.535   ops/ms
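To put the throughput numbers in relative terms, the improvement can be computed directly from the Before and After columns. The helper below is a hypothetical sketch for that arithmetic only; by this calculation the gain is roughly 15% for testInt and roughly 18% for testLong.

```java
public class SpeedupSketch {
    // Relative throughput gain in percent: (after / before - 1) * 100.
    static double speedupPercent(double before, double after) {
        return (after / before - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // Values copied from the benchmark table (ops/ms).
        System.out.printf("testInt:  %.1f%%%n", speedupPercent(22214.740, 25627.833));
        System.out.printf("testLong: %.1f%%%n", speedupPercent(11649.898, 13698.535));
    }
}
```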
[1]: https://docs.oracle.com/en/java/javase/20/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorMask.html#firstTrue()
[2]: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_vector.ad#L5540
[3]: https://developer.arm.com/documentation/ddi0602/2021-12/Base-Instructions/CSEL--Conditional-Select-
Change-Id: I4a2de805ffa4469f88d510c96617eae165f0e025
-------------
Changes:
- all: https://git.openjdk.org/jdk/pull/14373/files
- new: https://git.openjdk.org/jdk/pull/14373/files/24b6d738..d8507105
Webrevs:
- full: https://webrevs.openjdk.org/?repo=jdk&pr=14373&range=01
- incr: https://webrevs.openjdk.org/?repo=jdk&pr=14373&range=00-01
Stats: 82117 lines in 1520 files changed: 59805 ins; 16698 del; 5614 mod
Patch: https://git.openjdk.org/jdk/pull/14373.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/14373/head:pull/14373
PR: https://git.openjdk.org/jdk/pull/14373