Performance of two-vector shuffles/rearranges

Fri Sep 17 21:13:03 UTC 2021

Hi Kai,

We made some progress in JDK 17, but there are still some gaps that we will eventually need to fill. The current priority is completing mask support.

- As noted, you have to place the shuffle in a static final field for it to be treated as a constant shuffle. There is no intrinsic for shuffle creation, and to determine that a shuffle is constant we would also need to ensure that all of its elements are constant.

- VectorShuffle.wrapIndexes is not fully optimized if there are any exceptional values; that’s the main weakness.
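
To make "exceptional values" concrete, here is a minimal scalar sketch in plain Java. It is not the Vector API and not the JDK implementation. It assumes the partial-wrapping rule described in the VectorShuffle Javadoc: an out-of-range source index is stored as an exceptional index in [-VLENGTH, -1], and wrapIndexes adds VLENGTH back to exceptional indexes. The class and method names are hypothetical.

```java
import java.util.Arrays;

public class ShuffleIndexModel {
    // Hypothetical helper: models how a shuffle factory stores a source index.
    // In-range indexes are kept; out-of-range ones become exceptional (negative).
    static int partiallyWrap(int index, int vlen) {
        int wrapped = Math.floorMod(index, vlen);
        return wrapped == index ? index : wrapped - vlen;
    }

    // Hypothetical helper: models wrapIndexes by mapping an exceptional
    // index back into the normal range [0, vlen).
    static int wrap(int storedIndex, int vlen) {
        return storedIndex < 0 ? storedIndex + vlen : storedIndex;
    }

    public static void main(String[] args) {
        final int vlen = 4; // lanes in a 128-bit float vector
        int[] requested = {0, 1, 4, 5};
        int[] stored = Arrays.stream(requested).map(i -> partiallyWrap(i, vlen)).toArray();
        int[] wrapped = Arrays.stream(stored).map(i -> wrap(i, vlen)).toArray();
        System.out.println("stored:  " + Arrays.toString(stored));
        System.out.println("wrapped: " + Arrays.toString(wrapped));
    }
}
```

Under these assumptions, the requested indexes 4 and 5 are stored as -4 and -3, so a shuffle built from (0, 1, 4, 5) does contain exceptional values and takes the less optimized path.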

I think we can eventually get close to what you want, but it might be tricky to get it down to two instructions. The implementation is currently composed of two rearranges and a blend.
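
The "two rearranges and a blend" structure can be sketched as a scalar model over plain arrays. This is only an illustration under stated assumptions, not the JDK source: the shuffle is assumed to already hold exceptional (negative) indexes for lanes that come from the second vector, and all names below are hypothetical.

```java
import java.util.Arrays;

public class TwoVectorRearrangeModel {
    // Single-vector rearrange: result lane n takes src[idx[n]].
    // Every idx[n] must already be in [0, src.length).
    static float[] rearrange(float[] src, int[] idx) {
        float[] r = new float[src.length];
        for (int n = 0; n < r.length; n++) {
            r[n] = src[idx[n]];
        }
        return r;
    }

    // Blend: lane n takes a[n] where mask[n] is true, otherwise b[n].
    static float[] blend(float[] a, float[] b, boolean[] mask) {
        float[] r = new float[a.length];
        for (int n = 0; n < r.length; n++) {
            r[n] = mask[n] ? a[n] : b[n];
        }
        return r;
    }

    // Two-vector rearrange: non-negative stored indexes select from v0,
    // exceptional (negative) stored indexes select from v1 after wrapping.
    static float[] rearrange2(float[] v0, float[] v1, int[] stored) {
        int vlen = v0.length;
        int[] wrapped = new int[vlen];
        boolean[] valid = new boolean[vlen];
        for (int n = 0; n < vlen; n++) {
            valid[n] = stored[n] >= 0;
            wrapped[n] = valid[n] ? stored[n] : stored[n] + vlen;
        }
        float[] r0 = rearrange(v0, wrapped); // first rearrange
        float[] r1 = rearrange(v1, wrapped); // second rearrange
        return blend(r0, r1, valid);         // pick per lane
    }

    public static void main(String[] args) {
        float[] row0 = {0f, 1f, 2f, 3f};
        float[] row1 = {10f, 11f, 12f, 13f};
        // Stored form of (0, 1, 4, 5) and (2, 3, 6, 7) for vlen = 4,
        // assuming out-of-range indexes are stored as index - vlen.
        System.out.println(Arrays.toString(rearrange2(row0, row1, new int[] {0, 1, -4, -3})));
        System.out.println(Arrays.toString(rearrange2(row0, row1, new int[] {2, 3, -2, -1})));
    }
}
```

In this model each result costs three lane-wise operations, whereas the SSE version in the quoted mail uses a single shuffle instruction per result; that difference is the gap described above.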

Paul.

> On Sep 17, 2021, at 8:15 AM, Kai Burjack <kburjack at googlemail.com> wrote:
> 
> Sorry, the Java code from the previous mail:
> ```Java
> // pick lane 0 and 1 from row0 and lane 0 and 1 from row1
> row0.rearrange(s0145, row1, SPECIES_128.shuffleFromValues(0, 1, 4, 5));
> // pick lane 2 and 3 from row0 and lane 2 and 3 from row1
> row0.rearrange(s0145, row1, SPECIES_128.shuffleFromValues(2, 3, 6, 7));
> ```
> should read as:
> ```Java
> // pick lane 0 and 1 from row0 and lane 0 and 1 from row1
> col0.rearrange(SPECIES_128.shuffleFromValues(0, 1, 4, 5), row1);
> // pick lane 2 and 3 from row0 and lane 2 and 3 from row1
> col0.rearrange(SPECIES_128.shuffleFromValues(2, 3, 6, 7), row1);
> ```
> This is the actual code (also in the benchmark repo).
> 
> On Fri, Sep 17, 2021 at 5:12 PM Kai Burjack <kburjack at googlemail.com> wrote:
> 
>> Hi,
>> 
>> I'd like to get some news about how the performance optimizations for
>> two-vector shuffles/rearrange are going or whether there are any plans
>> to improve the performance of these.
>> 
>> What I am talking about is code like:
>> ```C
>> // pick lane 0 and 1 from row0 and lane 0 and 1 from row1
>> _mm_shuffle_ps(row0, row1, 0x44);
>> // pick lane 2 and 3 from row0 and lane 2 and 3 from row1
>> _mm_shuffle_ps(row0, row1, 0xEE);
>> ```
>> 
>> which I translated to Vector API like this:
>> ```Java
>> // pick lane 0 and 1 from row0 and lane 0 and 1 from row1
>> row0.rearrange(s0145, row1, SPECIES_128.shuffleFromValues(0, 1, 4, 5));
>> // pick lane 2 and 3 from row0 and lane 2 and 3 from row1
>> row0.rearrange(s0145, row1, SPECIES_128.shuffleFromValues(2, 3, 6, 7));
>> ```
>> 
>> The C compiler generates optimal SSE instructions for the above intrinsics
>> while the Java translation of mine is about two orders of magnitude
>> slower. Also, putting the result of the SPECIES_128.shuffleFromValues()
>> calls into private static final variables and using these variables in the
>> rearrange calls gives a performance increase of ~30%, which is rather odd,
>> since I would have assumed that everything is nicely inlined and optimized.
>> 
>> My actual use-case for the vector shuffles is computing 4x4 matrix inverse
>> and transpose.
>> 
>> I've also written some benchmarks in
>> https://github.com/JOML-CI/panama-vector-bench to reflect my use-cases.
>> 
>> I am constantly testing against the current git tip of the
>> vectorIntrinsics and vectorIntrinsics+mask branches in the GitHub repo.
>> 
>> Are the rearrange patterns planned to be optimized in the future?
>> 
>> Thanks!
>> Kai
>>