RFR: 8310691: [REDO] [vectorapi] Refactor VectorShuffle implementation [v8]

Jatin Bhateja jbhateja at openjdk.org
Tue Dec 10 07:50:46 UTC 2024


On Mon, 9 Dec 2024 14:12:19 GMT, Quan Anh Mai <qamai at openjdk.org> wrote:

>> Hi,
>> 
>> This is just a redo of https://github.com/openjdk/jdk/pull/13093. mostly just the revert of the backout.
>> 
>> Regarding the related issues:
>> 
>> - [JDK-8306008](https://bugs.openjdk.org/browse/JDK-8306008) and [JDK-8309531](https://bugs.openjdk.org/browse/JDK-8309531) have been fixed before the backout.
>> - [JDK-8309373](https://bugs.openjdk.org/browse/JDK-8309373) was due to missing `ForceInline` on `AbstractVector::toBitsVectorTemplate`
>> - [JDK-8306592](https://bugs.openjdk.org/browse/JDK-8306592), I have not been able to find the root causes. I'm not sure if this is a blocker, now I cannot even build x86-32 tests.
>> 
>> Finally, I moved some implementation of public methods and methods that call into intrinsics to the concrete class as that may help the compiler know the correct types of the variables.
>> 
>> Please take a look and leave reviews. Thanks a lot.
>> 
>> The description of the original PR:
>> 
>> This patch reimplements `VectorShuffle` implementations to be a vector of the bit type. Currently, `VectorShuffle` is stored as a byte array, and would be expanded upon usage. This poses several drawbacks:
>> 
>> Inefficient conversions between a shuffle and its corresponding vector. This hinders the performance when the shuffle indices are not constant and are loaded or computed dynamically.
>> Redundant expansions in `rearrange` operations. On all platforms, it seems that a shuffle index vector is always expanded to the correct type before executing the `rearrange` operations.
>> Some redundant intrinsics are needed to support this handling as well as special considerations in the C2 compiler.
>> Range checks are performed using `VectorShuffle::toVector`, which is inefficient for FP types since both FP conversions and FP comparisons are more expensive than the integral ones.
>> Upon these changes, a `rearrange` can emit more efficient code:
>> 
>>     var species = IntVector.SPECIES_128;
>>     var v1 = IntVector.fromArray(species, SRC1, 0);
>>     var v2 = IntVector.fromArray(species, SRC2, 0);
>>     v1.rearrange(v2.toShuffle()).intoArray(DST, 0);
>> 
>>     Before:
>>     movabs $0x751589fa8,%r10            ;   {oop([I{0x0000000751589fa8})}
>>     vmovdqu 0x10(%r10),%xmm2
>>     movabs $0x7515a0d08,%r10            ;   {oop([I{0x00000007515a0d08})}
>>     vmovdqu 0x10(%r10),%xmm1
>>     movabs $0x75158afb8,%r10            ;   {oop([I{0x000000075158afb8})}
>>     vmovdqu 0x10(%r10),%xmm0
>>     vpand  -0xddc12(%rip),%xmm0,%xmm0        # Stub::vector_int_to_byt...
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
> 
>   adverb order

Hi @merykitty ,

Nice work!,  over all looks good to me with some mincro comments.

Kindly address.

src/hotspot/share/opto/library_call.hpp line 358:

> 356:   bool inline_vector_shuffle_to_vector();
> 357:   bool inline_vector_wrap_shuffle_indexes();
> 358:   bool inline_vector_shuffle_iota();

FTR, x86 ISA does not support a direct byte multiplier instruction, so we first unpack to a short vector, multiply at a short granularity, and then pack it back to byte vector. This was somewhat costly since now shuffle backing storage matches the lane size of the corresponding vector. Hence, computation with a non-unit scalar should improve.

src/hotspot/share/opto/vectornode.hpp line 1691:

> 1689: };
> 1690: 
> 1691: // The machine may not directly support the rearrange operation of an element type. In those cases,

``
Suggestion:

// The target may not directly support the rearrange operation for an element type. In those cases,

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Byte128Vector.java line 822:

> 820:         static final Class<Byte> ETYPE = byte.class; // used by the JVM
> 821: 
> 822:         Byte128Shuffle(byte[] indices) {

We still cannot accommodate all the indexes for the 2048 bit scalable vector for ARM SVE. Max index accommodable is 127 since byte is a signed type with value range b/w [-128 , 127].

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Byte512Vector.java line 1046:

> 1044:                     String msg = ("index "+si+"out of range ["+length+"] in "+
> 1045:                                   java.util.Arrays.toString(indices));
> 1046:                     throw new AssertionError(msg);

Why not directly throw IndexOutOfBoundsException here?

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Double128Vector.java line 859:

> 857:                         .reinterpretAsInts()
> 858:                         .intoArray(a, offset);
> 859:                 default -> {

These cases for length() = 4, 8, and 16 looks redundant for 128-bit DeoubleVector.

-------------

PR Review: https://git.openjdk.org/jdk/pull/21042#pullrequestreview-2491206953
PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1877513236
PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1877491672
PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1877459622
PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1877452008
PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1877466991


More information about the hotspot-compiler-dev mailing list