RFR: 8309583: AArch64: Optimize firstTrue() when amount of elements < 8 [v3]

Andrew Haley aph at openjdk.org
Wed Jun 21 16:53:02 UTC 2023


On Mon, 19 Jun 2023 02:06:27 GMT, Chang Peng <duke at openjdk.org> wrote:

>> This patch optimizes VectorMask.firstTrue() on Neon when there are 2 or 4 elements in vector registers.
>> 
>> VectorMask.firstTrue() should return VLENGTH when the vector mask is all false [1]. The current implementation uses rbit and then clz [2] to count leading zeros, then uses csel [3] (conditional select) to take the smaller of VLENGTH and the number of unset lanes, which ensures correctness.
>> 
>> When the boolean mask has only 2 or 4 elements, this patch sets the 16th or 32nd bit to 1 before the rbit and clz. With this trick, the maximum value calculated in such cases is VLENGTH (2 or 4).
>> 
>> Test:
>> All vector and vectorapi tests passed.
>> 
>> Performance:
>> The benchmark functions are in MaskQueryOperationsBenchmark.java [4]. This patch also modifies that benchmark to measure the performance of mask operations more effectively.
>> 
>> The following data was collected on a 128-bit Neon machine.
>> 
>> Benchmark                                       (inputs)  Mode   Before    After     Units
>> MaskQueryOperationsBenchmark.testFirstTrueInt          1  thrpt  5952.670  7298.491  ops/ms
>> MaskQueryOperationsBenchmark.testFirstTrueInt          2  thrpt  5951.513  7297.620  ops/ms
>> MaskQueryOperationsBenchmark.testFirstTrueInt          3  thrpt  5953.048  7298.072  ops/ms
>> MaskQueryOperationsBenchmark.testFirstTrueLong         1  thrpt  3496.990  4003.188  ops/ms
>> MaskQueryOperationsBenchmark.testFirstTrueLong         2  thrpt  3497.755  4002.577  ops/ms
>> MaskQueryOperationsBenchmark.testFirstTrueLong         3  thrpt  3500.085  4002.471  ops/ms
>> 
>> [1]: https://docs.oracle.com/en/java/javase/20/docs/api/jdk.incubator.vector/jdk/incubator/vector/VectorMask.html#firstTrue()
>> [2]: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_vector.ad#L5540
>> [3]: https://developer.arm.com/documentation/ddi0602/2021-12/Base-Instructions/CSEL--Conditional-Select-
>
> Chang Peng has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Update MaskQueryOperationsBenchmark.java

If we care about memory ops, note that we can get a useful speedup with `match(Set dst (VectorMaskFirstTrue (LoadVector mem)))` but perhaps that's not worth doing.


Benchmark                                           (inputs)  Mode  Cnt   Score   Error  Units
MaskQueryOperationsBenchmark.testFirstTrueInt              1  avgt    3  49.591 ± 0.477  ns/op



I will say that, in general, if you have to work in the core integer processor on an in-memory vector, it might be worth loading straight into core registers rather than going via the SIMD regs. Maybe we should write a general-purpose function that bypasses the SIMD unit in all such cases.
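
To make the idea concrete, here is a rough Java model of the arithmetic, not HotSpot code. It assumes the mask sits in memory as one byte per lane (0 for false, 1 for true), reads 2 or 4 lanes directly as a little-endian integer, and applies the sentinel-bit trick from the patch description. The class and method names are hypothetical, and `Long.numberOfTrailingZeros` stands in for the rbit/clz pair.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; none of these names exist in the JDK sources.
final class ScalarFirstTrue {

    // Assumes one byte per lane (0 = false, 1 = true) and vlength of 2 or 4.
    static int firstTrue(byte[] mask, int offset, int vlength) {
        if (vlength != 2 && vlength != 4) {
            throw new IllegalArgumentException("only 2 or 4 lanes are modelled");
        }
        ByteBuffer buf = ByteBuffer.wrap(mask).order(ByteOrder.LITTLE_ENDIAN);

        // Load the lanes straight into an integer register, lane 0 in the low byte.
        long bits = (vlength == 2)
                ? (buf.getShort(offset) & 0xFFFFL)
                : (buf.getInt(offset) & 0xFFFFFFFFL);

        // Sentinel bit just above the last lane (bit 16 or bit 32), so an
        // all-false mask yields exactly vlength without a separate select.
        bits |= 1L << (8 * vlength);

        // Trailing zeros (rbit followed by clz on AArch64), 8 bits per lane.
        return Long.numberOfTrailingZeros(bits) >>> 3;
    }
}
```

For example, `ScalarFirstTrue.firstTrue(new byte[] {0, 0, 1, 0}, 0, 4)` gives 2, and an all-false 4-lane mask gives 4. Whether this pays off in generated code would need measuring.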

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14373#issuecomment-1601216501

