RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation [v2]

Tue Mar 21 16:16:33 UTC 2023

On Sun, 19 Mar 2023 19:38:04 GMT, Quan Anh Mai <qamai at openjdk.org> wrote:

>> Hi,
>> 
>> This patch reimplements `VectorShuffle` implementations to be a vector of the bit type. Currently, VectorShuffle is stored as a byte array, and would be expanded upon usage. This poses several drawbacks:
>> 
>> 1. Inefficient conversions between a shuffle and its corresponding vector. This hinders the performance when the shuffle indices are not constant and are loaded or computed dynamically.
>> 2. Redundant expansions in `rearrange` operations. On all platforms, it seems that a shuffle index vector is always expanded to the correct type before executing the `rearrange` operations.
>> 3. Some redundant intrinsics are needed to support this handling as well as special considerations in the C2 compiler.
>> 4. Range checks are performed using `VectorShuffle::toVector`, which is inefficient for FP types since both FP conversions and FP comparisons are more expensive than the integral ones.
>> 
>> Upon these changes, a `rearrange` can emit more efficient code:
>> 
>>     var species = IntVector.SPECIES_128;
>>     var v1 = IntVector.fromArray(species, SRC1, 0);
>>     var v2 = IntVector.fromArray(species, SRC2, 0);
>>     v1.rearrange(v2.toShuffle()).intoArray(DST, 0);
>> 
>>     Before:
>>     movabs $0x751589fa8,%r10            ;   {oop([I{0x0000000751589fa8})}
>>     vmovdqu 0x10(%r10),%xmm2
>>     movabs $0x7515a0d08,%r10            ;   {oop([I{0x00000007515a0d08})}
>>     vmovdqu 0x10(%r10),%xmm1
>>     movabs $0x75158afb8,%r10            ;   {oop([I{0x000000075158afb8})}
>>     vmovdqu 0x10(%r10),%xmm0
>>     vpand  -0xddc12(%rip),%xmm0,%xmm0        # Stub::vector_int_to_byte_mask
>>                                                             ;   {external_word}
>>     vpackusdw %xmm0,%xmm0,%xmm0
>>     vpackuswb %xmm0,%xmm0,%xmm0
>>     vpmovsxbd %xmm0,%xmm3
>>     vpcmpgtd %xmm3,%xmm1,%xmm3
>>     vtestps %xmm3,%xmm3
>>     jne    0x00007fc2acb4e0d8
>>     vpmovzxbd %xmm0,%xmm0
>>     vpermd %ymm2,%ymm0,%ymm0
>>     movabs $0x751588f98,%r10            ;   {oop([I{0x0000000751588f98})}
>>     vmovdqu %xmm0,0x10(%r10)
>> 
>>     After:
>>     movabs $0x751589c78,%r10            ;   {oop([I{0x0000000751589c78})}
>>     vmovdqu 0x10(%r10),%xmm1
>>     movabs $0x75158ac88,%r10            ;   {oop([I{0x000000075158ac88})}
>>     vmovdqu 0x10(%r10),%xmm2
>>     vpxor  %xmm0,%xmm0,%xmm0
>>     vpcmpgtd %xmm2,%xmm0,%xmm3
>>     vtestps %xmm3,%xmm3
>>     jne    0x00007fa818b27cb1
>>     vpermd %ymm1,%ymm2,%ymm0
>>     movabs $0x751588c68,%r10            ;   {oop([I{0x0000000751588c68})}
>>     vmovdqu %xmm0,0x10(%r10)
>>     
>> Please take a look and leave reviews. Thanks a lot.
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
> 
>   fix Matcher::vector_needs_load_shuffle

That looks like a very good simplification and performance enhancement, and removing a limitation, the byte[] representation. This should likely also help with Valhalla integration.

IIUC it has the same upper bound limitation for vector lengths greater than the maximum index size that can be represented as a lane element (although in practice there may not be any hardware where this can occur). Which is fine, i am not suggesting we try and fix this.

Perhaps it may be possible to move some methods on the concrete implementations to the abstract implementations as helper methods or template methods, thereby reducing the amount of generated code? It seems so in some cases, but i did not look very closely. It may require the introduction of an an element type specific abstract shuffle, and if that's the case it may not be worth it.

--

Relatedly, i would be interested in your opinion on the following. One annoyance in the API which propagates down into the implementation is `VectorShuffle<E>` and `VectorMask<E>` have `E` that is the lane element type. But, in theory they should not need E, and any shuffle or mask with the same lanes as the vector being operated on should be compatible, and it's an implementation detail of the shuffle/mask how its state represented as a hardware register. However, i don't have a good sense of the implications this has to the current HotSpot implementation and whether it is feasible.

-------------

PR Review: https://git.openjdk.org/jdk/pull/13093#pullrequestreview-1349108003