RFR: 8309583: AArch64: Optimize firstTrue() when amount of elements < 8

Andrew Haley aph at openjdk.org
Thu Jun 8 09:17:47 UTC 2023


On Thu, 8 Jun 2023 02:44:08 GMT, Chang Peng <duke at openjdk.org> wrote:

> This patch optimizes VectorMask.firstTrue() on Neon when there are 2 or 4 elements in vector registers.
> 
> VectorMask.firstTrue() should return VLENGTH when the vector mask is all false [1]. The current implementation uses rbit followed by clz [2] to count the trailing zero bits of the mask, then uses csel [3] (conditional select) to take the smaller of VLENGTH and the number of unset lanes, which ensures correctness.
> 
> When the boolean mask has only 2 or 4 elements, this patch sets bit 16 or bit 32 to 1, respectively, before the rbit and clz. With this trick, the maximum value calculated in these cases is VLENGTH (2 or 4), so the csel is no longer needed.
> 
> Test:
> All vector and vectorapi tests passed.
> 
> Performance:
> The benchmark function looks like this:
> 
> 
> @Benchmark
> public static int testInt() {
>     int res = 0;
>     for (int i = 0; i < LENGTH; i += INT_SPECIES.length()) {
>         VectorMask<Integer> m = VectorMask.fromArray(INT_SPECIES, ia, i);
>         res += m.firstTrue();
>     }
> 
>     return res;
> }
> 
> 
> The following data was collected on a 128-bit Neon machine.
> 
> Benchmark     Before     After     Unit
> testInt       22214.740  25627.833 ops/ms
> testLong      11649.898  13698.535 ops/ms
> 
> [1]: https://docs.oracle.com/en/java/javase/20/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorMask.html#firstTrue()
> [2]: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_vector.ad#L5540
> [3]: https://developer.arm.com/documentation/ddi0602/2021-12/Base-Instructions/CSEL--Conditional-Select-
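
For readers following the description above, the sentinel-bit trick can be modelled in plain Java. This is only a sketch of the arithmetic, not the code in the patch: it assumes the boolean mask is held in a 64-bit value with one byte per lane, lane 0 in the least significant byte, and 0x01 for a true lane. The class and method names are hypothetical. Long.reverse followed by Long.numberOfLeadingZeros stands in for the rbit and clz pair, and Math.min stands in for the csel.

```java
public class FirstTrueSketch {

    // Pack boolean lanes into a long: one byte per lane, lane 0 in the
    // lowest byte, 0x01 for true (an assumed layout, for illustration only).
    static long pack(boolean... lanes) {
        long bits = 0L;
        for (int i = 0; i < lanes.length; i++) {
            if (lanes[i]) {
                bits |= 1L << (8 * i);
            }
        }
        return bits;
    }

    // Baseline: reverse bits, count leading zeros, divide by 8 to get a
    // lane index, then clamp to vlen (the clamp models the csel).
    static int firstTrueClamp(long maskBytes, int vlen) {
        int lane = Long.numberOfLeadingZeros(Long.reverse(maskBytes)) >>> 3;
        return Math.min(lane, vlen);
    }

    // Sentinel variant: set the lowest bit of the byte just past the last
    // lane (bit 16 for 2 lanes, bit 32 for 4 lanes), so the count can never
    // exceed 8 * vlen and no clamp is needed.
    static int firstTrueSentinel(long maskBytes, int vlen) {
        long withSentinel = maskBytes | (1L << (8 * vlen));
        return Long.numberOfLeadingZeros(Long.reverse(withSentinel)) >>> 3;
    }

    public static void main(String[] args) {
        long allFalse = pack(false, false, false, false);
        long second = pack(false, true, false, true);
        System.out.println(firstTrueClamp(allFalse, 4) + " " + firstTrueSentinel(allFalse, 4));
        System.out.println(firstTrueClamp(second, 4) + " " + firstTrueSentinel(second, 4));
    }
}
```

Under these assumptions both variants agree for every 2-lane and 4-lane mask; the sentinel version simply replaces the compare and select with a single OR.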

Where is the benchmark? You don't seem to have included it in this PR.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14373#issuecomment-1582198629


More information about the hotspot-compiler-dev mailing list