Performance of two-vector shuffles/rearranges
Paul Sandoz
paul.sandoz at oracle.com
Fri Sep 17 21:13:03 UTC 2021
Hi Kai,
We made some progress in JDK 17, but there are still some gaps that we will eventually need to fill. The current priority is completing mask support.
- As noted, you have to place the shuffle in a static final field for it to be treated as a constant shuffle. There is no intrinsic for shuffle creation, and to identify a constant shuffle we would further need to ensure that all the shuffle elements are constant too.
- VectorShuffle.wrapIndexes is not fully optimized if there are any exceptional values; that’s the main weakness.
I think we can eventually get close to what you want, but it might be tricky to get it down to two instructions. The implementation is currently composed of two rearranges and a blend.
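
As a rough illustration of that composition, here is a scalar model in plain Java. It is only a sketch of the semantics, not the JDK implementation: the class name, the method names, and the use of arrays in place of vectors are all hypothetical, and it assumes four lanes and source indexes in the range [0, 8).

```java
import java.util.Arrays;

// Hypothetical scalar model of a two-vector rearrange; not the JDK code.
final class TwoVectorRearrangeSketch {
    static final int VLENGTH = 4; // assumed lane count (128-bit float vector)

    // Single-vector rearrange: lane i of the result is src[wrapped[i]].
    // Every index must already lie in [0, VLENGTH).
    static float[] rearrange1(float[] src, int[] wrapped) {
        float[] r = new float[VLENGTH];
        for (int i = 0; i < VLENGTH; i++) {
            r[i] = src[wrapped[i]];
        }
        return r;
    }

    // Two-vector rearrange: indexes in [0, VLENGTH) select from a,
    // indexes in [VLENGTH, 2 * VLENGTH) select from b.
    static float[] rearrange2(float[] a, float[] b, int[] indexes) {
        int[] wrapped = new int[VLENGTH];
        boolean[] fromA = new boolean[VLENGTH];
        for (int i = 0; i < VLENGTH; i++) {
            fromA[i] = indexes[i] < VLENGTH;         // the "valid lane" mask
            wrapped[i] = indexes[i] & (VLENGTH - 1); // wrap into [0, VLENGTH)
        }
        float[] r0 = rearrange1(a, wrapped); // first rearrange
        float[] r1 = rearrange1(b, wrapped); // second rearrange
        float[] r = new float[VLENGTH];
        for (int i = 0; i < VLENGTH; i++) {
            r[i] = fromA[i] ? r0[i] : r1[i]; // blend under the mask
        }
        return r;
    }

    public static void main(String[] args) {
        float[] row0 = {10f, 11f, 12f, 13f};
        float[] row1 = {20f, 21f, 22f, 23f};
        // Same lane selection as shuffleFromValues(0, 1, 4, 5)
        System.out.println(Arrays.toString(rearrange2(row0, row1, new int[] {0, 1, 4, 5})));
        // prints [10.0, 11.0, 20.0, 21.0]
    }
}
```

The model makes the cost visible: each two-vector rearrange does two full permutes plus a select, whereas the C intrinsic maps to a single shuffle instruction.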
Paul.
> On Sep 17, 2021, at 8:15 AM, Kai Burjack <kburjack at googlemail.com> wrote:
>
> Sorry, the Java code from the previous mail:
> ```Java
> // pick lane 0 and 1 from row0 and lane 0 and 1 from row1
> row0.rearrange(s0145, row1, SPECIES_128.shuffleFromValues(0, 1, 4, 5));
> // pick lane 2 and 3 from row0 and lane 2 and 3 from row1
> row0.rearrange(s0145, row1, SPECIES_128.shuffleFromValues(2, 3, 6, 7));
> ```
> should read as:
> ```Java
> // pick lane 0 and 1 from row0 and lane 0 and 1 from row1
> col0.rearrange(SPECIES_128.shuffleFromValues(0, 1, 4, 5), row1);
> // pick lane 2 and 3 from row0 and lane 2 and 3 from row1
> col0.rearrange(SPECIES_128.shuffleFromValues(2, 3, 6, 7), row1);
> ```
> This is the actual code (also in the benchmark repo).
>
> On Fri, Sep 17, 2021 at 5:12 PM Kai Burjack <kburjack at googlemail.com> wrote:
>
>> Hi,
>>
>> I'd like to get some news about how the performance optimizations for
>> two-vector shuffles/rearrange are going or whether there are any plans
>> to improve the performance of these.
>>
>> What I am talking about is code like:
>> ```C
>> // pick lane 0 and 1 from row0 and lane 0 and 1 from row1
>> _mm_shuffle_ps(row0, row1, 0x44);
>> // pick lane 2 and 3 from row0 and lane 2 and 3 from row1
>> _mm_shuffle_ps(row0, row1, 0xEE);
>> ```
>>
>> which I translated to Vector API like this:
>> ```Java
>> // pick lane 0 and 1 from row0 and lane 0 and 1 from row1
>> row0.rearrange(s0145, row1, SPECIES_128.shuffleFromValues(0, 1, 4, 5));
>> // pick lane 2 and 3 from row0 and lane 2 and 3 from row1
>> row0.rearrange(s0145, row1, SPECIES_128.shuffleFromValues(2, 3, 6, 7));
>> ```
>>
>> The C compiler generates optimal SSE instructions for the above intrinsics
>> while the Java translation of mine is about two orders of magnitude
>> slower. Also, putting the result of the SPECIES_128.shuffleFromValues()
>> calls into private static final variables and using these variables in the
>> rearrange calls gives a performance increase of ~30%, which is rather odd,
>> since I would have assumed that everything is nicely inlined and optimized.
>>
>> My actual use-case for the vector shuffles is computing 4x4 matrix inverse
>> and transpose.
>>
>> I've also written some benchmarks in
>> https://github.com/JOML-CI/panama-vector-bench to reflect my use-cases.
>>
>> I am constantly testing against the current git tip of the
>> vectorIntrinsics and vectorIntrinsics+mask branches in the GitHub repo.
>>
>> Are the rearrange patterns planned to be optimized in the future?
>>
>> Thanks!
>> Kai
>>