Performance of two-vector shuffles/rearranges

Fri Sep 17 15:12:32 UTC 2021

Hi,

I'd like to ask about the state of the performance optimizations for
two-vector shuffles/rearranges, and whether there are any plans to
improve their performance.

What I am talking about is code like:
```C
// pick lane 0 and 1 from row0 and lane 0 and 1 from row1
_mm_shuffle_ps(row0, row1, 0x44);
// pick lane 2 and 3 from row0 and lane 2 and 3 from row1
_mm_shuffle_ps(row0, row1, 0xEE);
```

which I translated to the Vector API like this:
```Java
// pick lane 0 and 1 from row0 and lane 0 and 1 from row1
row0.rearrange(SPECIES_128.shuffleFromValues(0, 1, 4, 5), row1);
// pick lane 2 and 3 from row0 and lane 2 and 3 from row1
row0.rearrange(SPECIES_128.shuffleFromValues(2, 3, 6, 7), row1);
```
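
To make the intended lane selection explicit, here is a minimal scalar
sketch of what both snippets are meant to compute. It is plain Java without
the Vector API; the class TwoVectorShuffleSketch and the helper shuffle2 are
hypothetical names used only for illustration. It assumes that indices 0..3
select from the first vector and indices 4..7 select from the second:
```Java
import java.util.Arrays;

// Scalar reference model of a two-vector shuffle.
// Hypothetical helper for illustration; not part of the Vector API.
final class TwoVectorShuffleSketch {

    // Assumption: indices below a.length pick lanes from 'a',
    // the remaining indices pick lanes from 'b' (offset by a.length).
    static float[] shuffle2(float[] a, float[] b, int... idx) {
        float[] r = new float[idx.length];
        for (int i = 0; i < idx.length; i++) {
            int s = idx[i];
            r[i] = s < a.length ? a[s] : b[s - a.length];
        }
        return r;
    }

    public static void main(String[] args) {
        float[] row0 = {0f, 1f, 2f, 3f};
        float[] row1 = {10f, 11f, 12f, 13f};
        // lanes 0 and 1 of row0, then lanes 0 and 1 of row1
        System.out.println(Arrays.toString(shuffle2(row0, row1, 0, 1, 4, 5)));
        // lanes 2 and 3 of row0, then lanes 2 and 3 of row1
        System.out.println(Arrays.toString(shuffle2(row0, row1, 2, 3, 6, 7)));
    }
}
```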

The C compiler generates optimal SSE instructions for the above intrinsics,
while my Java translation is about two orders of magnitude slower. Also,
storing the results of the SPECIES_128.shuffleFromValues() calls in
private static final fields and using those fields in the rearrange calls
improves performance by ~30%, which is rather odd, since I would have
assumed that everything gets nicely inlined and optimized either way.

My actual use-case for the vector shuffles is computing 4x4 matrix inverse
and transpose.
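
For context, a 4x4 transpose can be built entirely from two-source shuffles
like the ones above. The following is only a scalar sketch of one common
formulation, not necessarily what the benchmarks do. TransposeSketch and
shuffle2 are hypothetical names, and the index convention (0..3 from the
first vector, 4..7 from the second) is an assumption of the sketch:
```Java
import java.util.Arrays;

// Scalar sketch of a 4x4 transpose expressed as eight two-source shuffles.
// Hypothetical illustration; not taken from the Vector API or the benchmarks.
final class TransposeSketch {

    // Assumption: indices 0..3 pick lanes from 'a', 4..7 pick lanes from 'b'.
    static float[] shuffle2(float[] a, float[] b, int... idx) {
        float[] r = new float[idx.length];
        for (int i = 0; i < idx.length; i++) {
            int s = idx[i];
            r[i] = s < a.length ? a[s] : b[s - a.length];
        }
        return r;
    }

    static float[][] transpose(float[][] m) {
        // Step 1: gather the low and high halves of each pair of rows.
        float[] t0 = shuffle2(m[0], m[1], 0, 1, 4, 5);
        float[] t1 = shuffle2(m[2], m[3], 0, 1, 4, 5);
        float[] t2 = shuffle2(m[0], m[1], 2, 3, 6, 7);
        float[] t3 = shuffle2(m[2], m[3], 2, 3, 6, 7);
        // Step 2: interleave even and odd lanes to form the columns.
        return new float[][] {
            shuffle2(t0, t1, 0, 2, 4, 6),
            shuffle2(t0, t1, 1, 3, 5, 7),
            shuffle2(t2, t3, 0, 2, 4, 6),
            shuffle2(t2, t3, 1, 3, 5, 7),
        };
    }

    public static void main(String[] args) {
        float[][] m = {
            {0f, 1f, 2f, 3f},
            {4f, 5f, 6f, 7f},
            {8f, 9f, 10f, 11f},
            {12f, 13f, 14f, 15f},
        };
        System.out.println(Arrays.deepToString(transpose(m)));
    }
}
```

With the Vector API, each shuffle2 call would correspond to one two-vector
rearrange, so a transpose needs eight of them per matrix.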

I've also written some benchmarks in
https://github.com/JOML-CI/panama-vector-bench to reflect my use-cases.

I regularly test against the current tips of the vectorIntrinsics and
vectorIntrinsics+mask branches in the GitHub repo.

Are the rearrange patterns planned to be optimized in the future?

Thanks!
Kai