Vector API - Performance of Vector.rearrange() with selecting lanes from two vectors

Paul Sandoz paul.sandoz at oracle.com
Tue Mar 2 16:24:19 UTC 2021


Hi Kai,

Thanks for the further investigations.

On Mar 2, 2021, at 7:34 AM, Kai Burjack <kburjack at googlemail.com> wrote:

Hi,

I was just checking on the performance of shuffle operations, and noticed
that the solution proposed in
https://mail.openjdk.java.net/pipermail/panama-dev/2020-May/009302.html has
been implemented, making single-vector shuffle/rearrange quite fast now,
thanks!


Yes:

https://github.com/openjdk/jdk16/commit/50bf4330

There is some more work to do here. We are considering changing the specification of rearrange to wrap exceptional indexes, thereby avoiding the checks altogether.

See https://bugs.openjdk.java.net/browse/JDK-8261663


The two-vector overload is not yet optimized e.g. VectorShuffle.laneIsValid and VectorShuffle.wrapIndexes.

It’s a composition of two rearranges blended with the shuffle’s valid lane mask, so it should be possible to get it to work outside with a constant shuffle and mask.

If we modify the specification to wrap automatically, this arguably gets a little simpler, but we still need to extract a mask from the shuffle, or otherwise push this down as an intrinsic.
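To make the composition concrete, here is a minimal scalar sketch in plain Java. It is only a model of the semantics described above, not the JDK implementation: the class name and the array-based helpers are hypothetical stand-ins for VectorShuffle.laneIsValid, VectorShuffle.wrapIndexes, the single-vector rearrange and blend, and it assumes exceptional indexes lie in [-VLENGTH, -1].

```java
import java.util.Arrays;

// Scalar model (an illustration, not JDK code) of the two-vector rearrange as
// two single-vector rearranges blended under the shuffle's valid-lane mask.
final class TwoVectorRearrangeModel {

    // Stand-in for VectorShuffle.laneIsValid(): a lane is valid if its index is >= 0.
    static boolean[] laneIsValid(int[] shuffle) {
        boolean[] valid = new boolean[shuffle.length];
        for (int i = 0; i < shuffle.length; i++) {
            valid[i] = shuffle[i] >= 0;
        }
        return valid;
    }

    // Stand-in for VectorShuffle.wrapIndexes(): exceptional indexes in
    // [-VLENGTH, -1] are mapped to [0, VLENGTH-1] by adding VLENGTH.
    static int[] wrapIndexes(int[] shuffle) {
        int vlength = shuffle.length;
        int[] wrapped = new int[vlength];
        for (int i = 0; i < vlength; i++) {
            wrapped[i] = shuffle[i] < 0 ? shuffle[i] + vlength : shuffle[i];
        }
        return wrapped;
    }

    // Single-vector rearrange: lane i of the result is v[shuffle[i]].
    static float[] rearrange(float[] v, int[] shuffle) {
        float[] r = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            r[i] = v[shuffle[i]];
        }
        return r;
    }

    // Blend: take b where the mask is set, otherwise a.
    static float[] blend(float[] a, float[] b, boolean[] mask) {
        float[] r = new float[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? b[i] : a[i];
        }
        return r;
    }

    // Two-vector rearrange expressed as the composition.
    static float[] rearrange(float[] v0, int[] shuffle, float[] v1) {
        boolean[] valid = laneIsValid(shuffle);
        int[] ws = wrapIndexes(shuffle);
        float[] r0 = rearrange(v0, ws); // lanes taken from the first vector
        float[] r1 = rearrange(v1, ws); // lanes taken from the second vector
        return blend(r1, r0, valid);    // valid lanes from r0, exceptional lanes from r1
    }

    public static void main(String[] args) {
        float[] v0 = {10, 11, 12, 13};
        float[] v1 = {20, 21, 22, 23};
        int[] shuffle = {0, -4, 1, -3}; // -4 and -3 are exceptional indexes
        System.out.println(Arrays.toString(rearrange(v0, shuffle, v1)));
        // prints [10.0, 20.0, 11.0, 21.0]
    }
}
```

With a constant shuffle, both the mask and the wrapped shuffle are constants too, which is why the checks could in principle be hoisted out of a loop.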

Paul.

Next, I was implementing 4x4 matrix inversion with Vector API and hit
another performance roadblock. Essentially, I needed some way to use SSE's
MOVHLPS and MOVLHPS which is emitted for example by LLVM's
`__builtin_shufflevector(v0, v1, 0, 4, 1, 5)`.

While searching for an alternative on how to do this with Vector API, I
discovered the two-vector overload of rearrange() and came to the
conclusion that the above LLVM builtin could be expressed via Vector API by
using `v0.rearrange(SPECIES_128.shuffleFromValues(0, 4, 1, 5), v1)`,
however there is a rather big performance overhead in using that two-vector
overload. In particular, when also using indexes to select lanes from the
second vector argument, which in the documentation of rearrange() is called
"exceptional indexes".
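As I understand the documentation, the indexes 4 and 5 passed to shuffleFromValues are out of range for a 4-lane species and are partially wrapped to the exceptional indexes -4 and -3, which the two-vector rearrange() then resolves against the second vector. A rough scalar model in plain Java (the class and method names are hypothetical, and the wrapping rule is my reading of the javadoc, not code taken from the JDK):

```java
import java.util.Arrays;

// Scalar model (an assumption-based illustration) of how out-of-range shuffle
// indexes become exceptional indexes and how lanes are then selected.
final class ExceptionalIndexModel {

    // In-range indexes are kept; out-of-range indexes are wrapped into
    // [0, vlength-1] and then shifted into the exceptional range [-vlength, -1].
    static int[] partiallyWrap(int[] sourceIndexes, int vlength) {
        int[] result = new int[sourceIndexes.length];
        for (int i = 0; i < sourceIndexes.length; i++) {
            int idx = sourceIndexes[i];
            boolean inRange = idx >= 0 && idx < vlength;
            result[i] = inRange ? idx : Math.floorMod(idx, vlength) - vlength;
        }
        return result;
    }

    // Per-lane selection: a valid index I picks v0[I]; an exceptional index I
    // picks v1[I + VLENGTH].
    static float[] select(float[] v0, float[] v1, int[] shuffle) {
        int vlength = v0.length;
        float[] r = new float[vlength];
        for (int i = 0; i < vlength; i++) {
            int idx = shuffle[i];
            r[i] = idx >= 0 ? v0[idx] : v1[idx + vlength];
        }
        return r;
    }

    public static void main(String[] args) {
        int[] shuffle = partiallyWrap(new int[]{0, 4, 1, 5}, 4);
        System.out.println(Arrays.toString(shuffle));
        // prints [0, -4, 1, -3]

        float[] v0 = {1, 2, 3, 4};
        float[] v1 = {5, 6, 7, 8};
        System.out.println(Arrays.toString(select(v0, v1, shuffle)));
        // prints [1.0, 5.0, 2.0, 6.0]
    }
}
```

The result interleaves the low halves of v0 and v1, which is the lane selection I expect from the LLVM builtin with indexes 0, 4, 1, 5.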

I've augmented my Vector API benchmark JMH project with this:
https://github.com/JOML-CI/panama-vector-bench/blob/d68de7733a859b38aacfb0a4c47e2998370bf9da/src/bench/Matrix4fvArr.java#L95-L232

This shows a runtime of ~120 ns per invocation, whereas a scalar version of
a 4x4 matrix invert() took ~21 ns per invocation.
See:
https://github.com/JOML-CI/panama-vector-bench/blob/d68de7733a859b38aacfb0a4c47e2998370bf9da/src/bench/Matrix4f.java#L162-L210

Benchmarks for both methods:
https://github.com/JOML-CI/panama-vector-bench/blob/d68de7733a859b38aacfb0a4c47e2998370bf9da/src/bench/Bench.java#L32-L40

Even when not using "exceptional" indexes, this two-vector overload of
rearrange() is rather slow.
I compared this:

`v0.rearrange(SPECIES_128.shuffleFromValues(2, 3, 0, 1), v0)`

to the equivalent form:

`v0.rearrange(SPECIES_128.shuffleFromValues(2, 3, 0, 1))`

and there was also a rather big performance difference.

I was just wondering whether the two-vector overload of rearrange() hasn't
yet been optimized?

Thanks,
Kai.


