Performance of two-vector shuffles/rearranges

Fri Sep 17 15:15:28 UTC 2021

Sorry, the Java code from the previous mail:
```Java
// pick lane 0 and 1 from row0 and lane 0 and 1 from row1
row0.rearrange(s0145, row1, SPECIES_128.shuffleFromValues(0, 1, 4, 5));
// pick lane 2 and 3 from row0 and lane 2 and 3 from row1
row0.rearrange(s0145, row1, SPECIES_128.shuffleFromValues(2, 3, 6, 7));
```
should read as:
```Java
// pick lane 0 and 1 from row0 and lane 0 and 1 from row1
col0.rearrange(SPECIES_128.shuffleFromValues(0, 1, 4, 5), row1);
// pick lane 2 and 3 from row0 and lane 2 and 3 from row1
col0.rearrange(SPECIES_128.shuffleFromValues(2, 3, 6, 7), row1);
```
This is the actual code (also in the benchmark repo).
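
For anyone comparing the C immediates with the shuffle indices, here is a minimal scalar sketch of the lane selection both forms are meant to express. It is only a model on plain arrays, not the Vector API and not what the JIT emits. The class `ShuffleModel` and the helpers `shufflePs` and `select2` are hypothetical names made up for illustration, and `select2` assumes that indices 4..7 address lanes 0..3 of the second vector, which is how the snippets above use them.

```java
import java.util.Arrays;

// Scalar model for illustration only; all names here are hypothetical.
final class ShuffleModel {

    // Models _mm_shuffle_ps(a, b, imm8): each 2-bit field of imm8 is a lane index.
    // Result lanes 0 and 1 are read from a, result lanes 2 and 3 are read from b.
    static float[] shufflePs(float[] a, float[] b, int imm8) {
        return new float[] {
            a[imm8 & 3],
            a[(imm8 >> 2) & 3],
            b[(imm8 >> 4) & 3],
            b[(imm8 >> 6) & 3]
        };
    }

    // Assumed two-vector selection: an index below a.length picks a lane of a,
    // an index of a.length or more picks lane (index - a.length) of b.
    static float[] select2(float[] a, float[] b, int... idx) {
        float[] r = new float[idx.length];
        for (int i = 0; i < idx.length; i++) {
            r[i] = idx[i] < a.length ? a[idx[i]] : b[idx[i] - a.length];
        }
        return r;
    }

    public static void main(String[] args) {
        float[] row0 = {0f, 1f, 2f, 3f};
        float[] row1 = {10f, 11f, 12f, 13f};

        // 0x44 = 0b01_00_01_00 selects lanes 0, 1 of row0 and lanes 0, 1 of row1
        System.out.println(Arrays.toString(shufflePs(row0, row1, 0x44)));  // [0.0, 1.0, 10.0, 11.0]
        System.out.println(Arrays.toString(select2(row0, row1, 0, 1, 4, 5))); // [0.0, 1.0, 10.0, 11.0]

        // 0xEE = 0b11_10_11_10 selects lanes 2, 3 of row0 and lanes 2, 3 of row1
        System.out.println(Arrays.toString(shufflePs(row0, row1, 0xEE)));  // [2.0, 3.0, 12.0, 13.0]
        System.out.println(Arrays.toString(select2(row0, row1, 2, 3, 6, 7))); // [2.0, 3.0, 12.0, 13.0]
    }
}
```

Under that assumption, the index lists (0, 1, 4, 5) and (2, 3, 6, 7) pick the same lanes as the immediates 0x44 and 0xEE.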

On Fri, Sep 17, 2021 at 5:12 PM Kai Burjack <kburjack at googlemail.com> wrote:

> Hi,
>
> I'd like to get some news about how the performance optimizations for
> two-vector shuffles/rearrange are going or whether there are any plans
> to improve the performance of these.
>
> What I am talking about is code like:
> ```C
> // pick lane 0 and 1 from row0 and lane 0 and 1 from row1
> _mm_shuffle_ps(row0, row1, 0x44);
> // pick lane 2 and 3 from row0 and lane 2 and 3 from row1
> _mm_shuffle_ps(row0, row1, 0xEE);
> ```
>
> which I translated to Vector API like this:
> ```Java
> // pick lane 0 and 1 from row0 and lane 0 and 1 from row1
> row0.rearrange(s0145, row1, SPECIES_128.shuffleFromValues(0, 1, 4, 5));
> // pick lane 2 and 3 from row0 and lane 2 and 3 from row1
> row0.rearrange(s0145, row1, SPECIES_128.shuffleFromValues(2, 3, 6, 7));
> ```
>
> The C compiler generates optimal SSE instructions for the above intrinsics
> while my Java translation is about two orders of magnitude
> slower. Also, putting the result of the SPECIES_128.shuffleFromValues()
> calls into private static final variables and using these variables in the
> rearrange calls gives a performance increase of ~30%, which is rather odd,
> since I would have assumed that everything is nicely inlined and optimized.
>
> My actual use-case for the vector shuffles is computing 4x4 matrix inverse
> and transpose.
>
> I've also written some benchmarks in
> https://github.com/JOML-CI/panama-vector-bench to reflect my use-cases.
>
> I am constantly testing against the current git tip of the
> vectorIntrinsics and vectorIntrinsics+mask branches in the GitHub repo.
>
> Are the rearrange patterns planned to be optimized in the future?
>
> Thanks!
> Kai
>