RFR: 8291118: [vectorapi] Optimize the implementation of lanewise FIRST_NONZERO

Fri Jul 29 04:09:05 UTC 2022

The Vector API binary op "`FIRST_NONZERO`" represents the lanewise operation "`a != 0 ? a : b`", which can be implemented with existing APIs such as "`compare + blend`". The current implementation is more complex than necessary, especially for floating-point vectors. Its main idea is:

1) mask = a.compare(0, ne);
2) b = b.blend(0, mask);
3) result = a | b;

For the floating-point types, it also needs to reinterpret the vectors between the floating-point type and the corresponding integral type, since the final "`OR`" operation is only valid for integral types.
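The three steps above can be sketched as a scalar model in which each array element stands in for one vector lane. This is only an illustration, not the JDK code: the class name `FirstNonzeroOrModel` and its methods are hypothetical, and the float variant assumes a lane counts as zero only when all of its bits are zero.

```java
import java.util.Arrays;

// Scalar model of the compare + blend + OR formulation (hypothetical helper,
// not the JDK implementation). One array element models one vector lane.
final class FirstNonzeroOrModel {

    // Integral lanes: the OR can be applied to the values directly.
    static int[] firstNonzero(int[] a, int[] b) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            boolean mask = a[i] != 0;      // 1) mask = a.compare(0, ne)
            int bKept = mask ? 0 : b[i];   // 2) b = b.blend(0, mask)
            r[i] = a[i] | bKept;           // 3) result = a | b
        }
        return r;
    }

    // Floating-point lanes: OR is only defined on bits, so both operands are
    // reinterpreted as int and the result is reinterpreted back to float.
    // Assumption: "zero" means an all-zero bit pattern, so -0.0f is nonzero.
    static float[] firstNonzero(float[] a, float[] b) {
        float[] r = new float[a.length];
        for (int i = 0; i < a.length; i++) {
            int aBits = Float.floatToRawIntBits(a[i]);
            boolean mask = aBits != 0;
            float bKept = mask ? 0.0f : b[i];
            int bits = aBits | Float.floatToRawIntBits(bKept);
            r[i] = Float.intBitsToFloat(bits);
        }
        return r;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                firstNonzero(new int[] {0, 5, -3, 0}, new int[] {7, 9, 0, 0})));
        System.out.println(Arrays.toString(
                firstNonzero(new float[] {0f, 1.5f}, new float[] {2.5f, 9f})));
    }
}
```

The float path performs two extra reinterpretations per lane purely to make the final OR legal.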

A simpler implementation is:

1) mask = a.compare(0, eq);
2) result = a.blend(b, mask);

This saves the final "`OR`" operation and the related reinterpretation between FP and integral types.
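As a rough illustration of the simpler two-step form, the scalar model below selects each lane directly instead of merging with OR. It is a sketch under assumptions rather than the patch itself: the class name `FirstNonzeroBlendModel` is hypothetical, one array element models one lane, and a float lane is treated as zero only when its bit pattern is all zeros.

```java
import java.util.Arrays;

// Scalar model of the compare + blend formulation (hypothetical helper,
// not the JDK implementation). One array element models one vector lane.
final class FirstNonzeroBlendModel {

    static int[] firstNonzero(int[] a, int[] b) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            boolean mask = a[i] == 0;      // 1) mask = a.compare(0, eq)
            r[i] = mask ? b[i] : a[i];     // 2) result = a.blend(b, mask)
        }
        return r;
    }

    // The selected float value is passed through unchanged, so no OR and no
    // reinterpretation of the result is needed.
    // Assumption: the zero test looks at the raw bits, so -0.0f is nonzero.
    static float[] firstNonzero(float[] a, float[] b) {
        float[] r = new float[a.length];
        for (int i = 0; i < a.length; i++) {
            boolean mask = Float.floatToRawIntBits(a[i]) == 0;
            r[i] = mask ? b[i] : a[i];
        }
        return r;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                firstNonzero(new int[] {0, 5, -3, 0}, new int[] {7, 9, 0, 0})));
        System.out.println(Arrays.toString(
                firstNonzero(new float[] {0f, 1.5f}, new float[] {2.5f, 9f})));
    }
}
```

Because the blend picks whole lanes, the result lanes are exactly the lanes of `a` or `b`, which is what "`a != 0 ? a : b`" asks for.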

Here is the performance data for the "`FIRST_NONZERO`" benchmarks (see [1] for the byte vector benchmark details) on an ARM NEON system:

Benchmark                            (size)  Mode   Cnt     Before      After  Units
ByteMaxVector.FIRST_NONZERO            1024  thrpt   15  12107.422  18385.157  ops/ms
ByteMaxVector.FIRST_NONZEROMasked      1024  thrpt   15   9765.282  14739.775  ops/ms
DoubleMaxVector.FIRST_NONZERO          1024  thrpt   15   1798.545   2331.214  ops/ms
DoubleMaxVector.FIRST_NONZEROMasked    1024  thrpt   15   1211.838   1810.644  ops/ms
FloatMaxVector.FIRST_NONZERO           1024  thrpt   15   3491.924   4377.167  ops/ms
FloatMaxVector.FIRST_NONZEROMasked     1024  thrpt   15   2307.085   3606.576  ops/ms
IntMaxVector.FIRST_NONZERO             1024  thrpt   15   3602.727   5610.258  ops/ms
IntMaxVector.FIRST_NONZEROMasked       1024  thrpt   15   2726.843   4210.741  ops/ms
LongMaxVector.FIRST_NONZERO            1024  thrpt   15   1819.886   2974.655  ops/ms
LongMaxVector.FIRST_NONZEROMasked      1024  thrpt   15   1337.737   2315.094  ops/ms
ShortMaxVector.FIRST_NONZERO           1024  thrpt   15   6603.642   9586.320  ops/ms
ShortMaxVector.FIRST_NONZEROMasked     1024  thrpt   15   5222.006   7991.443  ops/ms

A similar improvement can also be observed on x86 systems.

[1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/ByteMaxVector.java#L266

-------------

Commit messages:
 - 8291118: [vectorapi] Optimize the implementation of lanewise FIRST_NONZERO

Changes: https://git.openjdk.org/jdk/pull/9683/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9683&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8291118
  Stats: 86 lines in 7 files changed: 9 ins; 38 del; 39 mod
  Patch: https://git.openjdk.org/jdk/pull/9683.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/9683/head:pull/9683

PR: https://git.openjdk.org/jdk/pull/9683