Performance of two-vector shuffles/rearranges
Kai Burjack
kburjack at googlemail.com
Fri Sep 17 15:12:32 UTC 2021
Hi,
I'd like to ask how the performance optimizations for two-vector
shuffles/rearranges are going, or whether there are any plans
to improve their performance.
What I am talking about is code like:
```C
// pick lane 0 and 1 from row0 and lane 0 and 1 from row1
_mm_shuffle_ps(row0, row1, 0x44);
// pick lane 2 and 3 from row0 and lane 2 and 3 from row1
_mm_shuffle_ps(row0, row1, 0xEE);
```
which I translated to Vector API like this:
```Java
// pick lane 0 and 1 from row0 and lane 0 and 1 from row1
// (shuffle indices 4..7 refer to lanes 0..3 of the second vector)
row0.rearrange(SPECIES_128.shuffleFromValues(0, 1, 4, 5), row1);
// pick lane 2 and 3 from row0 and lane 2 and 3 from row1
row0.rearrange(SPECIES_128.shuffleFromValues(2, 3, 6, 7), row1);
```
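To make the intended correspondence between the two snippets above explicit, here is a minimal scalar sketch in plain Java. It does not use the Vector API or any intrinsics; the class and method names (`TwoVectorShuffleModel`, `shufflePs`, `rearrange2`) are hypothetical and exist only for illustration. It assumes 4 float lanes, that `_mm_shuffle_ps` takes its two low result lanes from the first operand and its two high result lanes from the second, and that a two-vector shuffle index in 4..7 selects lane (index - 4) of the second vector.

```Java
import java.util.Arrays;

final class TwoVectorShuffleModel {
    // Scalar model of _mm_shuffle_ps(a, b, imm8): result lanes 0 and 1 come
    // from a, lanes 2 and 3 come from b, each selected by a 2-bit field of imm8.
    static float[] shufflePs(float[] a, float[] b, int imm8) {
        return new float[] {
            a[imm8 & 3],
            a[(imm8 >> 2) & 3],
            b[(imm8 >> 4) & 3],
            b[(imm8 >> 6) & 3]
        };
    }

    // Scalar model of a two-vector rearrange over 4 lanes (an assumption made
    // for illustration): index 0..3 selects from a, index 4..7 selects
    // lane (index - 4) from b.
    static float[] rearrange2(float[] a, float[] b, int... idx) {
        float[] r = new float[4];
        for (int i = 0; i < 4; i++) {
            r[i] = idx[i] < 4 ? a[idx[i]] : b[idx[i] - 4];
        }
        return r;
    }

    public static void main(String[] args) {
        float[] row0 = {0f, 1f, 2f, 3f};
        float[] row1 = {10f, 11f, 12f, 13f};
        // Each pair of lines should print the same four lanes.
        System.out.println(Arrays.toString(shufflePs(row0, row1, 0x44)));
        System.out.println(Arrays.toString(rearrange2(row0, row1, 0, 1, 4, 5)));
        System.out.println(Arrays.toString(shufflePs(row0, row1, 0xEE)));
        System.out.println(Arrays.toString(rearrange2(row0, row1, 2, 3, 6, 7)));
    }
}
```

Under these assumptions, the immediate 0x44 corresponds to the index list (0, 1, 4, 5) and 0xEE to (2, 3, 6, 7).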
The C compiler generates optimal SSE instructions for the above intrinsics,
while my Java translation is about two orders of magnitude
slower. Also, storing the results of the SPECIES_128.shuffleFromValues()
calls in private static final variables and using those variables in the
rearrange calls gives a performance increase of ~30%, which is rather odd,
since I would have assumed that everything is nicely inlined and optimized.
My actual use-case for the vector shuffles is computing 4x4 matrix inverse
and transpose.
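For context on why these particular patterns matter, a 4x4 transpose can be built entirely from two-vector shuffles. The following is a scalar sketch in plain Java, not the benchmark code: the names `TransposeByShuffle`, `pick` and `transpose` are hypothetical, and the index lists (0, 2, 4, 6) and (1, 3, 5, 7) are an assumed second stage modeled on the shuffle sequence commonly used for SSE transposes. As above, indices 4..7 are taken to select lanes 0..3 of the second operand.

```Java
final class TransposeByShuffle {
    // Scalar model of a two-vector shuffle over 4 lanes: index 0..3 selects
    // from a, index 4..7 selects lane (index - 4) from b.
    static float[] pick(float[] a, float[] b, int i0, int i1, int i2, int i3) {
        int[] idx = {i0, i1, i2, i3};
        float[] r = new float[4];
        for (int i = 0; i < 4; i++) {
            r[i] = idx[i] < 4 ? a[idx[i]] : b[idx[i] - 4];
        }
        return r;
    }

    // Transposes a 4x4 matrix given as four rows, using eight two-vector shuffles.
    static float[][] transpose(float[][] m) {
        // Stage 1: gather the low and high halves of each pair of rows.
        float[] t0 = pick(m[0], m[1], 0, 1, 4, 5); // m00 m01 m10 m11
        float[] t1 = pick(m[2], m[3], 0, 1, 4, 5); // m20 m21 m30 m31
        float[] t2 = pick(m[0], m[1], 2, 3, 6, 7); // m02 m03 m12 m13
        float[] t3 = pick(m[2], m[3], 2, 3, 6, 7); // m22 m23 m32 m33
        // Stage 2: interleave even and odd lanes to form the columns.
        return new float[][] {
            pick(t0, t1, 0, 2, 4, 6), // column 0
            pick(t0, t1, 1, 3, 5, 7), // column 1
            pick(t2, t3, 0, 2, 4, 6), // column 2
            pick(t2, t3, 1, 3, 5, 7)  // column 3
        };
    }
}
```

In a Vector API version, each `pick` call would presumably become one two-vector rearrange, so the cost of that single operation dominates the whole transpose.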
I've also written some benchmarks in
https://github.com/JOML-CI/panama-vector-bench to reflect my use-cases.
I am constantly testing against the current git tip of the vectorIntrinsics
and vectorIntrinsics+mask branches in the GitHub repo.
Are the rearrange patterns planned to be optimized in the future?
Thanks!
Kai