RFR: 8303762: [vectorapi] Intrinsification of Vector.slice [v6]

Tue Apr 25 12:22:20 UTC 2023

On Tue, 4 Apr 2023 13:46:12 GMT, Quan Anh Mai <qamai at openjdk.org> wrote:

>> `Vector::slice` is a method at the top-level class of the Vector API that concatenates the 2 inputs into an intermediate composite and extracts a window equal to the size of the inputs into the result. It is used in vector conversion methods where the part number is not 0 to slice the parts to the correct positions. Slicing is also used in text processing such as utf8 and utf16 validation. x86 starting from SSSE3 has `palignr` which does vector slicing very efficiently. As a result, I think it is beneficial to add a C2 node for this operation as well as intrinsify `Vector::slice` method.
>> 
>> A slice is currently implemented as `v2.rearrange(iota).blend(v1.rearrange(iota), blendMask)` which requires preparation of the index vector and the blending mask. Even with the preparations being hoisted out of the loops, microbenchmarks show improvement using the slice instrinsics. Some have tremendous increases in throughput due to the limitation that a mask of length 2 cannot currently be intrinsified, leading to falling back to the Java implementations.
>> 
>> Please take a look and have some reviews. Thank you very much.
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
> 
>   style

src/hotspot/cpu/x86/x86.ad line 7953:

> 7951:       __ punpckldq($dst$$XMMRegister, $src$$XMMRegister);
> 7952:     }
> 7953:     __ psrldq($dst$$XMMRegister, $origin$$constant * type2aelembytes(bt));

Move it to a new macro assembly routine.

src/hotspot/cpu/x86/x86.ad line 7962:

> 7960:             !VM_Version::supports_ssse3());
> 7961:   match(Set dst (VectorSlice (Binary dst src) origin));
> 7962:   effect(TEMP xtmp);

Please also associate TEMP_DEF / TEMP with dst to avoid early source overwrite in case dst/src are allocated same register.

src/hotspot/cpu/x86/x86.ad line 7970:

> 7968:     __ movdqu($xtmp$$XMMRegister, $src$$XMMRegister);
> 7969:     __ pslldq($xtmp$$XMMRegister, 16 - shift_count);
> 7970:     __ por($dst$$XMMRegister, $xtmp$$XMMRegister);

Move to macro assembly routine.

src/hotspot/cpu/x86/x86.ad line 8007:

> 8005:       }
> 8006:       __ vpsrldq($dst$$XMMRegister, $dst$$XMMRegister, shift_count, Assembler::AVX_128bit);
> 8007:     }

Move to macro assembly routine.

src/hotspot/cpu/x86/x86.ad line 8063:

> 8061:             (type2aelembytes(Matcher::vector_element_basic_type(n)) * n->in(2)->get_int()) % 4U != 0 &&
> 8062:             (type2aelembytes(Matcher::vector_element_basic_type(n)) * n->in(2)->get_int() < 16 ||
> 8063:              type2aelembytes(Matcher::vector_element_basic_type(n)) * n->in(2)->get_int() > 48));

Move these bulky predications to source_hpp section, like done at 
https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L8786

src/hotspot/cpu/x86/x86.ad line 8082:

> 8080:             type2aelembytes(Matcher::vector_element_basic_type(n)) * n->in(2)->get_int() > 16 &&
> 8081:             type2aelembytes(Matcher::vector_element_basic_type(n)) * n->in(2)->get_int() < 48);
> 8082:   match(Set dst (VectorSlice (Binary src1 src2) origin));

Same as above.

src/hotspot/cpu/x86/x86.ad line 8099:

> 8097:             (Matcher::vector_length_in_bytes(n) == 64 ||
> 8098:              (Matcher::vector_length_in_bytes(n) == 32 &&
> 8099:               VM_Version::supports_avx512vl())));

Same as above.

src/hotspot/share/opto/vectorIntrinsics.cpp line 1914:

> 1912:   if (vector_klass->const_oop() == NULL || elem_klass->const_oop() == NULL ||
> 1913:       !vlen->is_con() || !origin_type->is_con()) {
> 1914:     if (C->print_intrinsics()) {

Hi @merykitty , your inline expander is not handling non-constant origin case, this will introduce performance regressions w.r.t to existing implementation.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176428922
PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176424735
PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176429190
PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176429410
PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176428080
PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176428309
PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176428542
PR Review Comment: https://git.openjdk.org/jdk/pull/12909#discussion_r1176407424