RFR: 8310691: [REDO] [vectorapi] Refactor VectorShuffle implementation [v6]

Emanuel Peter epeter at openjdk.org
Mon Dec 9 08:21:45 UTC 2024


On Fri, 6 Dec 2024 17:00:15 GMT, Quan Anh Mai <qamai at openjdk.org> wrote:

>> Hi,
>> 
>> This is just a redo of https://github.com/openjdk/jdk/pull/13093. mostly just the revert of the backout.
>> 
>> Regarding the related issues:
>> 
>> - [JDK-8306008](https://bugs.openjdk.org/browse/JDK-8306008) and [JDK-8309531](https://bugs.openjdk.org/browse/JDK-8309531) have been fixed before the backout.
>> - [JDK-8309373](https://bugs.openjdk.org/browse/JDK-8309373) was due to missing `ForceInline` on `AbstractVector::toBitsVectorTemplate`
>> - [JDK-8306592](https://bugs.openjdk.org/browse/JDK-8306592), I have not been able to find the root causes. I'm not sure if this is a blocker, now I cannot even build x86-32 tests.
>> 
>> Finally, I moved some implementation of public methods and methods that call into intrinsics to the concrete class as that may help the compiler know the correct types of the variables.
>> 
>> Please take a look and leave reviews. Thanks a lot.
>> 
>> The description of the original PR:
>> 
>> This patch reimplements `VectorShuffle` implementations to be a vector of the bit type. Currently, `VectorShuffle` is stored as a byte array, and would be expanded upon usage. This poses several drawbacks:
>> 
>> Inefficient conversions between a shuffle and its corresponding vector. This hinders the performance when the shuffle indices are not constant and are loaded or computed dynamically.
>> Redundant expansions in `rearrange` operations. On all platforms, it seems that a shuffle index vector is always expanded to the correct type before executing the `rearrange` operations.
>> Some redundant intrinsics are needed to support this handling as well as special considerations in the C2 compiler.
>> Range checks are performed using `VectorShuffle::toVector`, which is inefficient for FP types since both FP conversions and FP comparisons are more expensive than the integral ones.
>> Upon these changes, a `rearrange` can emit more efficient code:
>> 
>>     var species = IntVector.SPECIES_128;
>>     var v1 = IntVector.fromArray(species, SRC1, 0);
>>     var v2 = IntVector.fromArray(species, SRC2, 0);
>>     v1.rearrange(v2.toShuffle()).intoArray(DST, 0);
>> 
>>     Before:
>>     movabs $0x751589fa8,%r10            ;   {oop([I{0x0000000751589fa8})}
>>     vmovdqu 0x10(%r10),%xmm2
>>     movabs $0x7515a0d08,%r10            ;   {oop([I{0x00000007515a0d08})}
>>     vmovdqu 0x10(%r10),%xmm1
>>     movabs $0x75158afb8,%r10            ;   {oop([I{0x000000075158afb8})}
>>     vmovdqu 0x10(%r10),%xmm0
>>     vpand  -0xddc12(%rip),%xmm0,%xmm0        # Stub::vector_int_to_byt...
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
> 
>   add comment, extract cast into local variable

@merykitty looks quite reasonable, though I only looked at the VM changes, only scanned the Java library code.

src/hotspot/cpu/x86/x86.ad line 2215:

> 2213: 
> 2214: // Return true if Vector::rearrange needs preparation of the shuffle argument
> 2215: bool Matcher::vector_needs_load_shuffle(BasicType elem_bt, int vlen) {

I commented on this before. This needs to have a more expressive name. Is it just about rearrange? Because now it sounds like maybe all vectors may need a shuffle. Or just all loads? Confusing.

src/hotspot/share/opto/vectorIntrinsics.cpp line 1757:

> 1755: 
> 1756:   int num_elem = vlen->get_con();
> 1757:   bool need_load_shuffle = Matcher::vector_needs_load_shuffle(shuffle_bt, num_elem);

Maybe it could be renamed to `vector_rearrange_requires_load_shuffle`?

src/hotspot/share/opto/vectorIntrinsics.cpp line 1800:

> 1798:   Node* v1 = unbox_vector(argument(5), vbox_type, elem_bt, num_elem);
> 1799:   Node* shuffle = unbox_vector(argument(6), shbox_type, shuffle_bt, num_elem);
> 1800:   const TypeVect* vt = TypeVect::make(elem_bt, num_elem);

Is this one used?

src/hotspot/share/opto/vectornode.hpp line 1694:

> 1692: // we can transform the rearrange into a different element type. For example, on x86 before AVX512,
> 1693: // there is no rearrange instruction for short elements, what we will then do is to transform the
> 1694: // shuffle vector into 1 that we can do byte rearrange such that it would provide the same result.

Suggestion:

// shuffle vector into one that we can do byte rearrange such that it would provide the same result.

src/hotspot/share/opto/vectornode.hpp line 1696:

> 1694: // shuffle vector into 1 that we can do byte rearrange such that it would provide the same result.
> 1695: // This can be done in VectorRearrangeNode during code emission but we eagerly expand out this
> 1696: // because it is often the case that an index vector is reused in many rearrange operations.

Thanks for this explanation!

`This can be done in VectorRearrangeNode` -> **could have** be done, because we now don't do it, right?
What do you mean by `expand out this`? Do you mean we have a separate dedicated node, so that it could possibly fold away with other nodes, such as index vector?

-------------

Changes requested by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/21042#pullrequestreview-2487987908
PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1875503753
PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1875517727
PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1875519134
PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1875537761
PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1875540023


More information about the hotspot-compiler-dev mailing list