Vector API performance of two-vector rearrange() overload

Paul Sandoz paul.sandoz at oracle.com
Mon Apr 26 23:49:52 UTC 2021


Hi Kai,

[Apologies for the delay in replying. Emails from non-members were queued up for approval and we forgot to approve 'em]

The rearrange overload that accepts a shuffle and a second vector, this.rearrange(shuffle, v), is composed of the following:

var valid = shuffle.laneIsValid();
var s = shuffle.wrapIndexes();
var r0 = this.rearrange(s);
var r1 = v.rearrange(s);
return r1.blend(r0, valid);
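
To make the lane-level effect of those five steps concrete, here is a minimal scalar sketch. It uses plain arrays instead of jdk.incubator.vector so that it runs without the incubator module; it is not the library's implementation. The class name TwoVectorRearrangeModel and the helper partiallyWrap are hypothetical, and the partial-wrapping rule (an out-of-range source index is stored as an exceptional index in [-VLENGTH, -1]) is my reading of the VectorShuffle documentation, so treat it as an assumption.

```java
// Scalar model only: plain arrays stand in for vectors and shuffles.
// Hypothetical illustration, not code from the JDK.
final class TwoVectorRearrangeModel {

    // Assumed model of how a shuffle stores source indexes: in-range
    // indexes are kept, out-of-range ones are "partially wrapped" into
    // the exceptional range [-vlength, -1].
    static int[] partiallyWrap(int[] sourceIndexes, int vlength) {
        int[] out = new int[sourceIndexes.length];
        for (int i = 0; i < out.length; i++) {
            int idx = sourceIndexes[i];
            out[i] = (idx >= 0 && idx < vlength)
                    ? idx
                    : Math.floorMod(idx, vlength) - vlength;
        }
        return out;
    }

    // Models a.rearrange(shuffle, b), one lane at a time.
    // The shuffle array must have the same length as a and b.
    static float[] rearrange(float[] a, int[] shuffle, float[] b) {
        int vlength = a.length;
        float[] result = new float[vlength];
        for (int i = 0; i < vlength; i++) {
            int idx = shuffle[i];
            boolean valid = idx >= 0;                   // laneIsValid()
            int wrapped = valid ? idx : idx + vlength;  // wrapIndexes()
            float r0 = a[wrapped];                      // this.rearrange(s)
            float r1 = b[wrapped];                      // v.rearrange(s)
            result[i] = valid ? r0 : r1;                // r1.blend(r0, valid)
        }
        return result;
    }
}
```

In this model, with v0 = {10, 11, 12, 13} and v1 = {20, 21, 22, 23}, the source indexes (0, 4, 1, 5) are stored as (0, -4, 1, -3) and the result is {10, 20, 11, 21}: the two exceptional lanes are taken from v1. With the indexes (2, 3, 0, 1) every lane is valid, so the blend selects r0 everywhere and the second rearrange contributes nothing to the result, although it is still computed.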

We recently (in March) improved VectorShuffle.laneIsValid() and VectorShuffle.wrapIndexes() (but not for exceptional indexes). This should generate good code for non-exceptional indexes, but it will likely still generate more instructions than you would prefer for your case.
 
C2 currently does not support detecting patterns of use with constant shuffle values (see also the recent email on horizontal add). In theory we could detect such patterns, but I worry about the fragility of doing so with very specific, specialized code. Perhaps it is easier in this case than in the horizontal add case. Rearrange is definitely powerful; it could be more so if we were more sophisticated about detecting common shuffle patterns passed to it.

Paul.

> On Feb 12, 2021, at 2:12 AM, Kai Burjack <kburjack at googlemail.com> wrote:
> 
> I was just checking on the performance of shuffle operations, and noticed
> that the solution proposed in
> https://mail.openjdk.java.net/pipermail/panama-dev/2020-May/009302.html has
> been implemented, making single-vector shuffle/rearrange quite fast now,
> thanks!
> 
> Next, I was implementing 4x4 matrix inversion with Vector API and hit
> another performance roadblock. Essentially, I needed some way to use SSE's
> MOVHLPS and MOVLHPS, which are emitted, for example, by LLVM's
> `__builtin_shufflevector(v0, v1, 0, 4, 1, 5)`.
> 
> While searching for an alternative on how to do this with Vector API, I
> discovered the two-vector overload of rearrange() and came to the
> conclusion that the above LLVM builtin could be expressed via Vector API by
> using `v0.rearrange(SPECIES_128.shuffleFromValues(0, 4, 1, 5), v1)`,
> however there is a HUGE performance overhead in using that two-vector
> overload. In particular, when also using indexes to actually select lanes
> from the second vector argument, which in the documentation of rearrange()
> is called "exceptional indexes".
> 
> Even when not using "exceptional" indexes, this two-vector overload of
> rearrange() is rather slow.
> I compared this:
> 
> `v0.rearrange(SPECIES_128.shuffleFromValues(2, 3, 0, 1), v0)`
> 
> to the equivalent form:
> 
> `v0.rearrange(SPECIES_128.shuffleFromValues(2, 3, 0, 1))`
> 
> and there was also a rather big performance difference.
> 
> I was just wondering whether the two-vector overload of rearrange() hasn't
> yet seen a fast intrinsification with MOVHLPS/MOVLHPS?
> 
> Thanks,
> Kai.


