Vector API - Performance of Vector.rearrange() with selecting lanes from two vectors

Kai Burjack kburjack at googlemail.com
Tue Mar 2 15:34:27 UTC 2021


Hi,

I was just checking on the performance of shuffle operations, and noticed
that the solution proposed in
https://mail.openjdk.java.net/pipermail/panama-dev/2020-May/009302.html has
been implemented, making single-vector shuffle/rearrange quite fast now,
thanks!

Next, I was implementing 4x4 matrix inversion with the Vector API and hit
another performance roadblock. Essentially, I needed some way to use SSE's
MOVHLPS and MOVLHPS, which are emitted, for example, by LLVM's
`__builtin_shufflevector(v0, v1, 0, 4, 1, 5)`.

While searching for a way to do this with the Vector API, I discovered the
two-vector overload of rearrange() and concluded that the above LLVM
builtin could be expressed as
`v0.rearrange(SPECIES_128.shuffleFromValues(0, 4, 1, 5), v1)`.
However, that two-vector overload has a rather big performance overhead,
in particular when the shuffle also contains indexes that select lanes
from the second vector argument, which the documentation of rearrange()
calls "exceptional indexes".

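For reference, the lane selection meant above can be modelled in scalar
Java. This is only a sketch of the intended semantics, assuming the
`__builtin_shufflevector` convention that indexes 0..n-1 pick lanes from
the first vector and n..2n-1 pick lanes from the second. The class and
method names are made up for illustration, and the code neither uses the
Vector API nor describes how rearrange() is implemented:

```java
import java.util.Arrays;

/** Scalar model of a two-source lane selection (illustrative only). */
final class TwoVectorShuffleSketch {

    /**
     * For each output lane, index i in [0, n) picks a[i] and index i in
     * [n, 2n) picks b[i - n], where n is the lane count of both inputs.
     */
    static float[] select(float[] a, float[] b, int... indexes) {
        int n = a.length;
        if (b.length != n || indexes.length != n) {
            throw new IllegalArgumentException("all inputs must have the same length");
        }
        float[] result = new float[n];
        for (int lane = 0; lane < n; lane++) {
            int i = indexes[lane];
            if (i < 0 || i >= 2 * n) {
                throw new IndexOutOfBoundsException("index out of range: " + i);
            }
            // Low indexes read from the first input, high ones from the second.
            result[lane] = (i < n) ? a[i] : b[i - n];
        }
        return result;
    }

    public static void main(String[] args) {
        float[] v0 = {10f, 11f, 12f, 13f};
        float[] v1 = {20f, 21f, 22f, 23f};
        // Interleaves the low halves of v0 and v1.
        System.out.println(Arrays.toString(select(v0, v1, 0, 4, 1, 5)));
        // prints [10.0, 20.0, 11.0, 21.0]
    }
}
```

With the indexes (0, 4, 1, 5), the result takes lanes 0 and 1 of v0 and
interleaves them with lanes 0 and 1 of v1.
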
I've augmented my Vector API JMH benchmark project with this:
https://github.com/JOML-CI/panama-vector-bench/blob/d68de7733a859b38aacfb0a4c47e2998370bf9da/src/bench/Matrix4fvArr.java#L95-L232

This shows a runtime of ~120 ns per invocation, whereas a scalar version of
a 4x4 matrix invert() took ~21 ns per invocation.
See:
https://github.com/JOML-CI/panama-vector-bench/blob/d68de7733a859b38aacfb0a4c47e2998370bf9da/src/bench/Matrix4f.java#L162-L210

Benchmarks for both methods:
https://github.com/JOML-CI/panama-vector-bench/blob/d68de7733a859b38aacfb0a4c47e2998370bf9da/src/bench/Bench.java#L32-L40

Even when not using "exceptional" indexes, this two-vector overload of
rearrange() is rather slow.
I compared this:

`v0.rearrange(SPECIES_128.shuffleFromValues(2, 3, 0, 1), v0)`

to the equivalent single-vector form:

`v0.rearrange(SPECIES_128.shuffleFromValues(2, 3, 0, 1))`

and the two-vector form was again considerably slower.

Is it simply that the two-vector overload of rearrange() hasn't been
optimized yet?

Thanks,
Kai.

