RFR: 8310691: [REDO] [vectorapi] Refactor VectorShuffle implementation [v2]

Tue Oct 29 20:48:27 UTC 2024

On Sun, 6 Oct 2024 08:32:20 GMT, Quan Anh Mai <qamai at openjdk.org> wrote:

>> Hi,
>> 
>> This is just a redo of https://github.com/openjdk/jdk/pull/13093. mostly just the revert of the backout.
>> 
>> Regarding the related issues:
>> 
>> - [JDK-8306008](https://bugs.openjdk.org/browse/JDK-8306008) and [JDK-8309531](https://bugs.openjdk.org/browse/JDK-8309531) have been fixed before the backout.
>> - [JDK-8309373](https://bugs.openjdk.org/browse/JDK-8309373) was due to missing `ForceInline` on `AbstractVector::toBitsVectorTemplate`
>> - [JDK-8306592](https://bugs.openjdk.org/browse/JDK-8306592), I have not been able to find the root causes. I'm not sure if this is a blocker, now I cannot even build x86-32 tests.
>> 
>> Finally, I moved some implementation of public methods and methods that call into intrinsics to the concrete class as that may help the compiler know the correct types of the variables.
>> 
>> Please take a look and leave reviews. Thanks a lot.
>> 
>> The description of the original PR:
>> 
>> This patch reimplements `VectorShuffle` implementations to be a vector of the bit type. Currently, `VectorShuffle` is stored as a byte array, and would be expanded upon usage. This poses several drawbacks:
>> 
>> Inefficient conversions between a shuffle and its corresponding vector. This hinders the performance when the shuffle indices are not constant and are loaded or computed dynamically.
>> Redundant expansions in `rearrange` operations. On all platforms, it seems that a shuffle index vector is always expanded to the correct type before executing the `rearrange` operations.
>> Some redundant intrinsics are needed to support this handling as well as special considerations in the C2 compiler.
>> Range checks are performed using `VectorShuffle::toVector`, which is inefficient for FP types since both FP conversions and FP comparisons are more expensive than the integral ones.
>> Upon these changes, a `rearrange` can emit more efficient code:
>> 
>>     var species = IntVector.SPECIES_128;
>>     var v1 = IntVector.fromArray(species, SRC1, 0);
>>     var v2 = IntVector.fromArray(species, SRC2, 0);
>>     v1.rearrange(v2.toShuffle()).intoArray(DST, 0);
>> 
>>     Before:
>>     movabs $0x751589fa8,%r10            ;   {oop([I{0x0000000751589fa8})}
>>     vmovdqu 0x10(%r10),%xmm2
>>     movabs $0x7515a0d08,%r10            ;   {oop([I{0x00000007515a0d08})}
>>     vmovdqu 0x10(%r10),%xmm1
>>     movabs $0x75158afb8,%r10            ;   {oop([I{0x000000075158afb8})}
>>     vmovdqu 0x10(%r10),%xmm0
>>     vpand  -0xddc12(%rip),%xmm0,%xmm0        # Stub::vector_int_to_byt...
>
> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains one commit:
> 
>   [vectorapi] Refactor VectorShuffle implementation

This is a nice change. It's as if `VectorShuffle<E>` has a payload of `Vector<F>` of the same shape as the shuffle and where` F` is the bit size equivalent integral type of `E`, and where the lane elements of the vector are constrained to be within `[-VLENGTH, VLENGTH-1]` (I do wonder if we might be able to refactor towards that more explicit representation later on with Valhalla.)

That simplifies things and opens up more optimizations and complements the modifications we recently did to `rearrange`/`selectFrom`.

(Recommend you do a merge with master to get latest Vector API changes just in case there is some impact.)

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractVector.java line 228:

> 226:         }
> 227: 
> 228:         AbstractVector<?> iota = vspecies().asIntegral().iota();

I suspect the non-power of two code is more efficient. (Even better if the MUL could be transformed to a shift for power of two values.)

Separately, it makes me wonder if we should revisit the shuffle factories if it is now much more efficient to construct a shuffle from a vector.

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Int256Vector.java line 870:

> 868:         @Override
> 869:         public final Int256Shuffle rearrange(VectorShuffle<Integer> shuffle) {
> 870:             return (Int256Shuffle) toBitsVector().rearrange(((Int256Shuffle) shuffle)

I think the cast is redundant for all vector kinds. Similarly the explicit cast is redundant for the integral vectors, perhaps in the template separate out the expressions to avoid it where not needed?

We could also refer to `VSPECIES` directly rather than calling `vspecies()`, same applies in other methods in the concrete vector classes.

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Int256Vector.java line 908:

> 906:         }
> 907: 
> 908:         private static boolean indicesInRange(int[] indices) {

Since this method is only called from an assert statement in the constructor we could avoid the clever checking that assertions are enabled and the explicit throwing on an AssertionError by using a second expression that produces an error message when the assertion fails : e.g.,

            assert indicesInRange(indices) : outOfBoundsAssertMessage(indices);

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/IntVector.java line 2392:

> 2390:             this, shuffle, null,
> 2391:             (v1, s_, m_) -> v1.uOp((i, a) -> {
> 2392:                 int ei = Integer.remainderUnsigned(s_.laneSource(i), v1.length());

Note to self - the intrinsic performs the wrapping of shuffle values using bitwise AND.
Nice use of method (equiv to `Math.floorMod` for the range on input arguments).

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/IntVector.java line 2473:

> 2471:     final <F>
> 2472:     VectorShuffle<F> toShuffle(AbstractSpecies<F> dsp, boolean wrap) {
> 2473:         assert(dsp.elementSize() == vspecies().elementSize());

Even though we force inline I cannot quite decide if it is better to forego the assert since it unduly increases method size. Regardless it may be useful to place the partial wrapping logic in a separate method, given it is less likely to be used.

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/X-VectorBits.java.template line 1150:

> 1148:         @Override
> 1149:         @ForceInline
> 1150:         public void intoArray(int[] a, int offset) {

Separately, we might consider optimizing `shuffleFromArray`.

-------------

PR Review: https://git.openjdk.org/jdk/pull/21042#pullrequestreview-2402948659
PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1821489034
PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1821471669
PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1821478372
PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1821450485
PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1821456333
PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1821501842