RFR: 8309583: AArch64: Optimize firstTrue() when amount of elements < 8 [v3]
Andrew Haley
aph at openjdk.org
Wed Jun 21 15:06:05 UTC 2023
On Mon, 19 Jun 2023 02:06:27 GMT, Chang Peng <duke at openjdk.org> wrote:
>> This patch optimizes VectorMask.firstTrue() on Neon when there are 2 or 4 elements in vector registers.
>>
>> VectorMask.firstTrue() should return VLENGTH when the vector mask is all false [1]. The current implementation uses rbit and then clz [2] to count leading zeros, then uses csel [3] (conditional select) to take the smaller of VLENGTH and the number of unset lanes to ensure correctness.
>>
>> This patch sets bit 16 or bit 32 to 1, when there are only 2 or 4 elements in the boolean mask, before the rbit and clz. With this trick, the maximum value calculated in these cases will be VLENGTH (2 or 4).
>>
>> Test:
>> All vector and vectorapi tests passed.
>>
>> Performance:
>> The benchmark functions are in MaskQueryOperationsBenchmark.java [4]. This patch also modifies the above benchmark to measure the performance of mask operations more effectively.
>>
>> The following data was collected on a 128-bit Neon machine.
>>
>> Benchmark (inputs) Mode Before After Units
>> MaskQueryOperationsBenchmark.testFirstTrueInt 1 thrpt 5952.670 7298.491 ops/ms
>> MaskQueryOperationsBenchmark.testFirstTrueInt 2 thrpt 5951.513 7297.620 ops/ms
>> MaskQueryOperationsBenchmark.testFirstTrueInt 3 thrpt 5953.048 7298.072 ops/ms
>> MaskQueryOperationsBenchmark.testFirstTrueLong 1 thrpt 3496.990 4003.188 ops/ms
>> MaskQueryOperationsBenchmark.testFirstTrueLong 2 thrpt 3497.755 4002.577 ops/ms
>> MaskQueryOperationsBenchmark.testFirstTrueLong 3 thrpt 3500.085 4002.471 ops/ms
>>
>> [1]: https://docs.oracle.com/en/java/javase/20/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorMask.html#firstTrue()
>> [2]: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_vector.ad#L5540
>> [3]: https://developer.arm.com/documentation/ddi0602/2021-12/Base-Instructions/CSEL--Conditional-Select-
>
> Chang Peng has updated the pull request incrementally with one additional commit since the last revision:
>
> Update MaskQueryOperationsBenchmark.java
So I'm looking at the results of the patch and I see:
Before:
Benchmark (inputs) Mode Cnt Score Error Units
MaskQueryOperationsBenchmark.testFirstTrueInt 1 avgt 3 69.547 ± 2.837 ns/op
MaskQueryOperationsBenchmark.testFirstTrueInt 2 avgt 3 69.549 ± 0.497 ns/op
MaskQueryOperationsBenchmark.testFirstTrueInt 3 avgt 3 69.506 ± 1.360 ns/op
After:
Benchmark (inputs) Mode Cnt Score Error Units
MaskQueryOperationsBenchmark.testFirstTrueInt 1 avgt 3 58.955 ± 0.838 ns/op
MaskQueryOperationsBenchmark.testFirstTrueInt 2 avgt 3 58.690 ± 2.940 ns/op
MaskQueryOperationsBenchmark.testFirstTrueInt 3 avgt 3 58.923 ± 1.088 ns/op
which corresponds with a change from
0x00000001158ef748: fmov x11, d16
0x00000001158ef74c: rbit x11, x11
0x00000001158ef750: clz x11, x11
0x00000001158ef754: lsr w11, w11, #3
;; 0x4
0x00000001158ef758: orr w8, wzr, #0x4
0x00000001158ef75c: cmp w11, w8
0x00000001158ef760: csel w11, w8, w11, ge // ge = tcont
to
0x0000000115f3f8e8: fmov x14, d16
0x0000000115f3f8ec: orr x14, x14, #0x100000000
0x0000000115f3f8f0: rbit x14, x14
0x0000000115f3f8f4: clz x14, x14
0x0000000115f3f8f8: lsr w14, w14, #3
That's a pretty decent speedup when you consider that the benchmark is dominated by memory operations and vector->core register moves.
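For anyone who wants to convince themselves that the two sequences agree, here is a rough scalar model in Java. It is only a sketch, not the HotSpot code: the class and method names are made up for illustration, it assumes the mask arrives in the general register as one byte per lane with a true lane stored as 0x01, and it models rbit followed by clz as Long.numberOfTrailingZeros.

```java
// Hypothetical scalar model of the two instruction sequences; not HotSpot code.
final class FirstTrueSketch {

    // Pack boolean lanes into a long, one byte per lane, lane 0 in the lowest byte.
    // Assumption: a true lane is stored as 0x01.
    static long packMask(boolean... lanes) {
        long bits = 0L;
        for (int i = 0; i < lanes.length; i++) {
            if (lanes[i]) {
                bits |= 1L << (8 * i);
            }
        }
        return bits;
    }

    // Models the "before" sequence: rbit + clz, lsr #3, then cmp/csel to clamp.
    static int firstTrueWithClamp(long bits, int vlength) {
        // rbit followed by clz counts trailing zeros; an all-zero input gives 64.
        int lane = Long.numberOfTrailingZeros(bits) >>> 3;
        return Math.min(lane, vlength);
    }

    // Models the "after" sequence: orr in a sentinel bit, then rbit + clz, lsr #3.
    // Only meaningful for vlength < 8, since the sentinel must fit in 64 bits.
    static int firstTrueWithSentinel(long bits, int vlength) {
        long withSentinel = bits | (1L << (8 * vlength));
        return Long.numberOfTrailingZeros(withSentinel) >>> 3;
    }

    public static void main(String[] args) {
        long allFalse = packMask(false, false, false, false);
        long thirdLane = packMask(false, false, true, false);
        System.out.println(firstTrueWithClamp(allFalse, 4));
        System.out.println(firstTrueWithSentinel(allFalse, 4));
        System.out.println(firstTrueWithSentinel(thirdLane, 4));
    }
}
```

With 4 lanes the sentinel is 1L << 32, which matches the #0x100000000 immediate in the disassembly above. The sentinel caps the trailing-zero count at 8 * VLENGTH, so the shift yields at most VLENGTH and no compare or select is needed.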
-------------
PR Comment: https://git.openjdk.org/jdk/pull/14373#issuecomment-1601008191