RFR: 8310691: [REDO] [vectorapi] Refactor VectorShuffle implementation [v7]

Emanuel Peter epeter at openjdk.org
Mon Dec 9 13:49:43 UTC 2024


On Mon, 9 Dec 2024 13:33:11 GMT, Quan Anh Mai <qamai at openjdk.org> wrote:

>> Hi,
>> 
>> This is just a redo of https://github.com/openjdk/jdk/pull/13093. mostly just the revert of the backout.
>> 
>> Regarding the related issues:
>> 
>> - [JDK-8306008](https://bugs.openjdk.org/browse/JDK-8306008) and [JDK-8309531](https://bugs.openjdk.org/browse/JDK-8309531) have been fixed before the backout.
>> - [JDK-8309373](https://bugs.openjdk.org/browse/JDK-8309373) was due to missing `ForceInline` on `AbstractVector::toBitsVectorTemplate`
>> - [JDK-8306592](https://bugs.openjdk.org/browse/JDK-8306592), I have not been able to find the root causes. I'm not sure if this is a blocker, now I cannot even build x86-32 tests.
>> 
>> Finally, I moved some implementation of public methods and methods that call into intrinsics to the concrete class as that may help the compiler know the correct types of the variables.
>> 
>> Please take a look and leave reviews. Thanks a lot.
>> 
>> The description of the original PR:
>> 
>> This patch reimplements `VectorShuffle` implementations to be a vector of the bit type. Currently, `VectorShuffle` is stored as a byte array, and would be expanded upon usage. This poses several drawbacks:
>> 
>> Inefficient conversions between a shuffle and its corresponding vector. This hinders the performance when the shuffle indices are not constant and are loaded or computed dynamically.
>> Redundant expansions in `rearrange` operations. On all platforms, it seems that a shuffle index vector is always expanded to the correct type before executing the `rearrange` operations.
>> Some redundant intrinsics are needed to support this handling as well as special considerations in the C2 compiler.
>> Range checks are performed using `VectorShuffle::toVector`, which is inefficient for FP types since both FP conversions and FP comparisons are more expensive than the integral ones.
>> Upon these changes, a `rearrange` can emit more efficient code:
>> 
>>     var species = IntVector.SPECIES_128;
>>     var v1 = IntVector.fromArray(species, SRC1, 0);
>>     var v2 = IntVector.fromArray(species, SRC2, 0);
>>     v1.rearrange(v2.toShuffle()).intoArray(DST, 0);
>> 
>>     Before:
>>     movabs $0x751589fa8,%r10            ;   {oop([I{0x0000000751589fa8})}
>>     vmovdqu 0x10(%r10),%xmm2
>>     movabs $0x7515a0d08,%r10            ;   {oop([I{0x00000007515a0d08})}
>>     vmovdqu 0x10(%r10),%xmm1
>>     movabs $0x75158afb8,%r10            ;   {oop([I{0x000000075158afb8})}
>>     vmovdqu 0x10(%r10),%xmm0
>>     vpand  -0xddc12(%rip),%xmm0,%xmm0        # Stub::vector_int_to_byt...
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
> 
>   address reviews

Ok, looks better.

I've thought about this a little. And I am wondering if we cannot make the use of Rearrange generally easier.

What if we want to use the `VectorRearrangeNode` elsewhere?
One would assume that one could just check
`arch_supports_vector(Op_VectorRearrange, ...elem_bt)`
And then one could emit a `VectorRearrangeNode` for the given `elem_bt`. But that is not the case, the user would have to also check for `Matcher::vector_rearrange_requires_load_shuffle(shuffle_bt, num_elem)`, and possibly add a `VectorLoadShuffleNode`.
I don't like this - it makes the use quite cumbersome.

I have an idea for an alternative:
You add a `VectorRearrangeNode::Ideal`, which then transforms itself if we have `Matcher::vector_rearrange_requires_load_shuffle`.

I have not thought this through to the end, but it just seems it would be easier to use Rearrange in the future this way. But maybe we know that we will never use Rearrange in any other way, then I'm fine with this implementation. What do you think @PaulSandoz ?

src/hotspot/share/opto/vectornode.hpp line 1696:

> 1694: // shuffle vector into one that we can do byte rearrange such that it would provide the same
> 1695: // result. This could have been done in VectorRearrangeNode during code emission but we eagerly
> 1696: // expand out this because it is often the case that an index vector is reused in many rearrange

Suggestion:

// expand this out because it is often the case that an index vector is reused in many rearrange

-------------

Changes requested by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/21042#pullrequestreview-2488817071
PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1875989393


More information about the hotspot-compiler-dev mailing list