[vectorIntrinsics] RFR: Improve mask reduction operations on AVX
Mai Đặng Quân Anh
duke at openjdk.java.net
Tue Nov 2 16:22:22 UTC 2021
On Tue, 2 Nov 2021 10:58:19 GMT, Mai Đặng Quân Anh <duke at openjdk.java.net> wrote:
> Hi,
> This patch improves the vector mask reduction operations on AVX, especially for int, float, long and double masks, by using the vmovmskpd and vmovmskps instructions. I also did a little refactoring to reduce duplication in toLong. The patch temporarily disables these operations on Neon, though.
> Thank you very much.
The microbenchmark shows a significant improvement on my AVX2 machine.
Before:
Benchmark                Mode  Cnt   Score   Error  Units
MaskReduction.byte128    avgt   25   0.773 ± 0.006  ns/op
MaskReduction.byte256    avgt   25   0.778 ± 0.007  ns/op
MaskReduction.int128     avgt   25   1.061 ± 0.008  ns/op
MaskReduction.int256     avgt   25   1.553 ± 0.010  ns/op
MaskReduction.long128    avgt   25  43.008 ± 0.354  ns/op
MaskReduction.long256    avgt   25   1.271 ± 0.006  ns/op
MaskReduction.short128   avgt   25   0.989 ± 0.006  ns/op
MaskReduction.short256   avgt   25   0.919 ± 0.005  ns/op
After:
Benchmark                Mode  Cnt   Score   Error  Units
MaskReduction.byte128    avgt   25   0.566 ± 0.001  ns/op
MaskReduction.byte256    avgt   25   0.556 ± 0.003  ns/op
MaskReduction.int128     avgt   25   0.553 ± 0.003  ns/op
MaskReduction.int256     avgt   25   0.828 ± 0.002  ns/op
MaskReduction.long128    avgt   25  41.618 ± 0.241  ns/op
MaskReduction.long256    avgt   25   0.552 ± 0.001  ns/op
MaskReduction.short128   avgt   25   0.775 ± 0.004  ns/op
MaskReduction.short256   avgt   25   0.834 ± 0.006  ns/op
The benchmark simply loads two constant vectors, compares them, and returns the toLong of the result, as follows:
return BYTE_128_1.eq(BYTE_128_2).toLong();
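For readers unfamiliar with the operation, here is a rough scalar model of what toLong computes for an int mask: bit i of the result is set when lane i of the comparison is true. It also mirrors the idea behind vmovmskps, which gathers the sign bit of each 32-bit lane into a general-purpose register. This is only an illustrative sketch, not the code in the patch; the class MaskToLongSketch and its helpers eq and toLong are hypothetical names, and it assumes true lanes are represented as all-ones values.

```java
public class MaskToLongSketch {
    // Hypothetical scalar stand-in for a lane-wise compare:
    // a matching lane becomes all ones (-1), a non-matching lane becomes 0.
    static int[] eq(int[] a, int[] b) {
        int[] mask = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            mask[i] = (a[i] == b[i]) ? -1 : 0;
        }
        return mask;
    }

    // Scalar model of a movmskps-style reduction:
    // take the sign bit of lane i and place it in bit i of the result.
    static long toLong(int[] maskLanes) {
        long bits = 0L;
        for (int i = 0; i < maskLanes.length; i++) {
            long signBit = maskLanes[i] >>> 31; // 1 if the lane is "true", else 0
            bits |= signBit << i;
        }
        return bits;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = {1, 0, 3, 0};
        // Lanes 0 and 2 are equal, so bits 0 and 2 are set.
        System.out.println(Long.toBinaryString(toLong(eq(a, b)))); // prints 101
    }
}
```

The hardware instruction does this for all lanes at once, which is presumably why the int and long cases benefit most in the numbers above.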
-------------
PR: https://git.openjdk.java.net/panama-vector/pull/158