RFR: 8309583: AArch64: Optimize firstTrue() when amount of elements < 8 [v3]

Wed Jun 21 15:06:05 UTC 2023

On Mon, 19 Jun 2023 02:06:27 GMT, Chang Peng <duke at openjdk.org> wrote:

>> This patch optimizes VectorMask.firstTrue() on Neon when there are 2 or 4 elements in vector registers.
>> 
>> VectorMask.firstTrue() should return VLENGTH when the vector mask is all false [1]. The current implementation uses rbit and then clz [2] to count leading zeros, then uses csel [3] (conditional select) to pick the smaller of VLENGTH and the number of unset lanes, which ensures correctness.
>> 
>> This patch sets bit 16 or bit 32 to 1 before the rbit and clz when the boolean mask has only 2 or 4 elements. With this trick, the maximum value computed in that case is VLENGTH (2 or 4), so the csel is no longer needed.
>> 
>> Test:
>> All vector and vectorapi tests passed.
>> 
>> Performance:
>> The benchmark functions are in MaskQueryOperationsBenchmark.java [4]. This patch also modifies the above benchmark to measure the performance of mask operations more effectively.
>> 
>> The following data was collected on a 128-bit Neon machine.
>> 
>> Benchmark                                       (inputs)  Mode   Before    After     Units
>> MaskQueryOperationsBenchmark.testFirstTrueInt          1  thrpt  5952.670  7298.491  ops/ms
>> MaskQueryOperationsBenchmark.testFirstTrueInt          2  thrpt  5951.513  7297.620  ops/ms
>> MaskQueryOperationsBenchmark.testFirstTrueInt          3  thrpt  5953.048  7298.072  ops/ms
>> MaskQueryOperationsBenchmark.testFirstTrueLong         1  thrpt  3496.990  4003.188  ops/ms
>> MaskQueryOperationsBenchmark.testFirstTrueLong         2  thrpt  3497.755  4002.577  ops/ms
>> MaskQueryOperationsBenchmark.testFirstTrueLong         3  thrpt  3500.085  4002.471  ops/ms
>> 
>> [1]: https://docs.oracle.com/en/java/javase/20/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorMask.html#firstTrue()
>> [2]: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_vector.ad#L5540
>> [3]: https://developer.arm.com/documentation/ddi0602/2021-12/Base-Instructions/CSEL--Conditional-Select-
>
> Chang Peng has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Update MaskQueryOperationsBenchmark.java

So I'm looking at the results of the patch and I see:

Before:

Benchmark                                           (inputs)  Mode  Cnt   Score   Error  Units
MaskQueryOperationsBenchmark.testFirstTrueInt              1  avgt    3  69.547 ± 2.837  ns/op
MaskQueryOperationsBenchmark.testFirstTrueInt              2  avgt    3  69.549 ± 0.497  ns/op
MaskQueryOperationsBenchmark.testFirstTrueInt              3  avgt    3  69.506 ± 1.360  ns/op

After:

Benchmark                                           (inputs)  Mode  Cnt   Score   Error  Units
MaskQueryOperationsBenchmark.testFirstTrueInt              1  avgt    3  58.955 ± 0.838  ns/op
MaskQueryOperationsBenchmark.testFirstTrueInt              2  avgt    3  58.690 ± 2.940  ns/op
MaskQueryOperationsBenchmark.testFirstTrueInt              3  avgt    3  58.923 ± 1.088  ns/op

which corresponds with a change from

            0x00000001158ef748:   fmov	x11, d16
            0x00000001158ef74c:   rbit	x11, x11
            0x00000001158ef750:   clz	x11, x11
            0x00000001158ef754:   lsr	w11, w11, #3
           ;; 0x4
            0x00000001158ef758:   orr	w8, wzr, #0x4
            0x00000001158ef75c:   cmp	w11, w8
            0x00000001158ef760:   csel	w11, w8, w11, ge  // ge = tcont

to

            0x0000000115f3f8e8:   fmov	x14, d16
            0x0000000115f3f8ec:   orr	x14, x14, #0x100000000
            0x0000000115f3f8f0:   rbit	x14, x14
            0x0000000115f3f8f4:   clz	x14, x14
            0x0000000115f3f8f8:   lsr	w14, w14, #3

That's a pretty decent speedup (roughly 15% lower ns/op) when you consider that the benchmark is dominated by memory operations and vector->core register moves.
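
For anyone who wants to convince themselves that the two sequences above agree, here is a minimal Java sketch that models them on a 64-bit value. It is not code from the patch: the class and method names are made up for illustration, and it assumes each mask lane is one byte holding 0x00 or 0x01, packed from the low byte upward, as the fmov/lsr #3 sequence suggests. `Long.reverse` stands in for rbit and `Long.numberOfLeadingZeros` for clz.

```java
public class FirstTrueSketch {

    // Models the "before" sequence: rbit, clz, lsr #3, then cmp/csel to clamp
    // the result to vlength when no lane is set.
    static int firstTrueWithCsel(long maskBytes, int vlength) {
        int idx = Long.numberOfLeadingZeros(Long.reverse(maskBytes)) >>> 3;
        return Math.min(idx, vlength);
    }

    // Models the "after" sequence: orr in a sentinel bit just past the last
    // lane (0x10000 for 2 lanes, 0x100000000 for 4 lanes), then rbit, clz, lsr #3.
    static int firstTrueWithSentinel(long maskBytes, int vlength) {
        long sentinel = 1L << (8 * vlength);
        return Long.numberOfLeadingZeros(Long.reverse(maskBytes | sentinel)) >>> 3;
    }

    // Hypothetical helper: packs boolean lanes into bytes (0x00/0x01), lane 0 lowest.
    static long pack(boolean... lanes) {
        long bits = 0;
        for (int i = 0; i < lanes.length; i++) {
            if (lanes[i]) {
                bits |= 1L << (8 * i);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        // Exhaustively compare both variants for 2-lane and 4-lane masks.
        for (int vlength : new int[] {2, 4}) {
            for (int pattern = 0; pattern < (1 << vlength); pattern++) {
                boolean[] lanes = new boolean[vlength];
                for (int i = 0; i < vlength; i++) {
                    lanes[i] = ((pattern >> i) & 1) != 0;
                }
                long m = pack(lanes);
                int a = firstTrueWithCsel(m, vlength);
                int b = firstTrueWithSentinel(m, vlength);
                if (a != b) {
                    throw new AssertionError("mismatch for vlength=" + vlength + " pattern=" + pattern);
                }
            }
        }
        System.out.println("both variants agree");
    }
}
```

The sentinel caps the count of trailing zero bits at 8 * VLENGTH, so the shift by 3 yields VLENGTH for an all-false mask without needing the compare and conditional select.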

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14373#issuecomment-1601008191