RFR: 8304450: [vectorapi] Refactor VectorShuffle implementation

Sun Mar 19 13:10:09 UTC 2023

Hi,

This patch reimplements `VectorShuffle` so that a shuffle is stored as a vector of the bit type. Currently, a `VectorShuffle` is stored as a byte array and is expanded on use. This has several drawbacks:

1. Inefficient conversions between a shuffle and its corresponding vector. This hurts performance when the shuffle indices are not constant and are loaded or computed dynamically.
2. Redundant expansions in `rearrange` operations. On all platforms, it seems that a shuffle index vector is always expanded to the correct type before the `rearrange` operation executes.
3. Supporting this handling requires some otherwise redundant intrinsics, as well as special-case handling in the C2 compiler.
4. Range checks are performed using `VectorShuffle::toVector`, which is inefficient for FP types, since both FP conversions and FP comparisons are more expensive than their integral counterparts.
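
The cost described in points 1 and 2 can be pictured with a scalar model. The sketch below is not code from the patch: the class and method names are hypothetical, and plain arrays stand in for vectors. It only contrasts a byte-array shuffle that has to be widened on every use with indices that already have the lane width.

```java
import java.util.Arrays;

// Scalar model only; class and method names are hypothetical, not from the patch.
final class ShuffleLayoutSketch {
    // Byte-array layout: indices are widened to the lane type on every use.
    static int[] rearrangeFromBytes(int[] src, byte[] shuffle) {
        int[] widened = new int[shuffle.length];
        for (int i = 0; i < shuffle.length; i++) {
            widened[i] = shuffle[i]; // the extra per-use expansion step
        }
        return rearrangeFromInts(src, widened);
    }

    // Lane-width layout: indices are used directly, with no expansion.
    static int[] rearrangeFromInts(int[] src, int[] shuffle) {
        int[] dst = new int[shuffle.length];
        for (int i = 0; i < shuffle.length; i++) {
            dst[i] = src[shuffle[i]];
        }
        return dst;
    }

    public static void main(String[] args) {
        int[] src = {10, 20, 30, 40};
        // Both calls select the same lanes; only the index storage differs.
        System.out.println(Arrays.toString(rearrangeFromBytes(src, new byte[] {3, 2, 1, 0})));
        System.out.println(Arrays.toString(rearrangeFromInts(src, new int[] {3, 2, 1, 0})));
    }
}
```

Both paths produce the same result; the first simply does an extra pass over the indices before it can select anything.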

With these changes, `rearrange` can emit more efficient code:

    var species = IntVector.SPECIES_128;
    var v1 = IntVector.fromArray(species, SRC1, 0);
    var v2 = IntVector.fromArray(species, SRC2, 0);
    v1.rearrange(v2.toShuffle()).intoArray(DST, 0);

    Before:
    movabs $0x751589fa8,%r10            ;   {oop([I{0x0000000751589fa8})}
    vmovdqu 0x10(%r10),%xmm2
    movabs $0x7515a0d08,%r10            ;   {oop([I{0x00000007515a0d08})}
    vmovdqu 0x10(%r10),%xmm1
    movabs $0x75158afb8,%r10            ;   {oop([I{0x000000075158afb8})}
    vmovdqu 0x10(%r10),%xmm0
    vpand  -0xddc12(%rip),%xmm0,%xmm0        # Stub::vector_int_to_byte_mask
                                                            ;   {external_word}
    vpackusdw %xmm0,%xmm0,%xmm0
    vpackuswb %xmm0,%xmm0,%xmm0
    vpmovsxbd %xmm0,%xmm3
    vpcmpgtd %xmm3,%xmm1,%xmm3
    vtestps %xmm3,%xmm3
    jne    0x00007fc2acb4e0d8
    vpmovzxbd %xmm0,%xmm0
    vpermd %ymm2,%ymm0,%ymm0
    movabs $0x751588f98,%r10            ;   {oop([I{0x0000000751588f98})}
    vmovdqu %xmm0,0x10(%r10)

    After:
    movabs $0x751589c78,%r10            ;   {oop([I{0x0000000751589c78})}
    vmovdqu 0x10(%r10),%xmm1
    movabs $0x75158ac88,%r10            ;   {oop([I{0x000000075158ac88})}
    vmovdqu 0x10(%r10),%xmm2
    vpxor  %xmm0,%xmm0,%xmm0
    vpcmpgtd %xmm2,%xmm0,%xmm3
    vtestps %xmm3,%xmm3
    jne    0x00007fa818b27cb1
    vpermd %ymm1,%ymm2,%ymm0
    movabs $0x751588c68,%r10            ;   {oop([I{0x0000000751588c68})}
    vmovdqu %xmm0,0x10(%r10)
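
Read lane by lane, the "After" listing compares a zeroed register against the index vector (`vpcmpgtd`), branches away if any lane matched (`vtestps` and `jne`), and otherwise permutes (`vpermd`). A scalar sketch of that shape follows. It assumes that a negative index marks an invalid lane and that the branch target is a failure path, neither of which the listing itself shows, and it assumes the remaining indices are already in range. The class name is hypothetical.

```java
import java.util.Arrays;

// Scalar model of the "After" sequence; names are hypothetical.
final class RearrangeCheckSketch {
    static int[] rearrange(int[] src, int[] indices) {
        boolean anyNegative = false;
        for (int index : indices) {
            anyNegative |= 0 > index; // models vpcmpgtd against a zeroed register
        }
        if (anyNegative) { // models vtestps + jne
            // Assumed failure path; the listing only shows the branch.
            throw new IndexOutOfBoundsException("negative shuffle index");
        }
        int[] dst = new int[indices.length];
        for (int i = 0; i < indices.length; i++) {
            dst[i] = src[indices[i]]; // models vpermd
        }
        return dst;
    }

    public static void main(String[] args) {
        int[] src = {1, 2, 3, 4};
        System.out.println(Arrays.toString(rearrange(src, new int[] {0, 0, 3, 1})));
    }
}
```

The check works on the integer indices directly, so no narrowing, widening, or FP conversion is needed before the comparison.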

Please take a look and leave a review. Thanks a lot.

-------------

Commit messages:
 - fix internal types, clean up
 - optimise laneIsValid
 - Merge branch 'master' into shufflerefactor
 - small beautifications
 - other architecture
 - fix mismatched fp vector payload types
 - draft

Changes: https://git.openjdk.org/jdk/pull/13093/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13093&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8304450
  Stats: 4440 lines in 62 files changed: 2567 ins; 651 del; 1222 mod
  Patch: https://git.openjdk.org/jdk/pull/13093.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/13093/head:pull/13093

PR: https://git.openjdk.org/jdk/pull/13093