RFR: 8318650: Optimized subword gather for x86 targets. [v3]

Sandhya Viswanathan sviswanathan at openjdk.org
Fri Nov 3 21:27:14 UTC 2023


On Tue, 31 Oct 2023 07:19:55 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Hi All,
>> 
>> This patch optimizes sub-word gather operation for x86 targets with AVX2 and AVX512 features.
>> 
>> Following is the summary of changes:-
>> 
>> 1) Intrinsify sub-word gather with high performance backend implementation based on hybrid algorithm which initially partially unrolls scalar loop to accumulates values from gather indices into a quadword(64bit) slice followed by vector permutation to place the slice into appropriate vector lanes, it prevents code bloating and generates compact
>> JIT sequence. This coupled with savings from expansive array allocation in existing java implementation translates into significant performance of 1.3-5x gains with included micro.
>> 
>> 
>> ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d)
>> 
>> 
>> 2) Patch was also compared against modified java fallback implementation by replacing temporary array allocation with zero initialized vector and a scalar loops which inserts gathered values into vector. But, vector insert operation in higher vector lanes is a three step process which first extracts the upper vector 128 bit lane, updates it with gather subword value and then inserts the lane back to its original position. This makes inserts into higher order lanes costly w.r.t to proposed solution. In addition generated JIT code for modified fallback implementation was very bulky. This may impact in-lining decisions into caller contexts.
>> 
>> 3) Some minor adjustments in existing gather instruction pattens for double/quad words.
>> 
>> 
>> Kindly review and share your feedback.
>> 
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Restricting masked sub-word gather to AVX512 target to align with integral gather support.

src/hotspot/cpu/x86/assembler_x86.cpp line 13565:

> 13563: void Assembler::bt(Register dst, Register src) {
> 13564:   int encode = prefixq_and_encode(src->encoding(), dst->encoding());
> 13565:   emit_int24(0x0F, 0xA3, (encode | 0xC0));

(unsigned char)0xA3 would be better.

src/hotspot/cpu/x86/x86.ad line 1921:

> 1919:     case Op_LoadVectorGatherMasked:
> 1920:       if (!is_subword_type(bt) && size_in_bits < 512 && !VM_Version::supports_avx512vl()) {
> 1921:         return false;

Also need to return false when size_in_bits == 64 for non subword types.

src/hotspot/cpu/x86/x86.ad line 1939:

> 1937:       // fallthrough
> 1938:     case Op_LoadVectorGather:
> 1939:       if (!is_subword_type(bt) && size_in_bits == 64 ) {

Since this is a fallthrough case, the change also applies to StoreVectorScatter*.  The !is_subword_type() is needed only for LoadVectorGather.

src/hotspot/cpu/x86/x86.ad line 4074:

> 4072:     BasicType elem_bt = Matcher::vector_element_basic_type(this);
> 4073:     assert(!is_subword_type(elem_bt), "sanity"); // T_INT, T_LONG, T_FLOAT, T_DOUBLE
> 4074:     __ vpcmpeqd($mask$$XMMRegister, $mask$$XMMRegister, $mask$$XMMRegister, vlen_enc);

vpcmpeqd is expensive instruction as compared to movdqu and also unrelated to subword type  support.

src/hotspot/share/opto/matcher.cpp line 2495:

> 2493:         n->del_req(MemNode::ValueIn+1);
> 2494:         break;
> 2495:       }

The "break;" should be after the "}".

src/hotspot/share/opto/vectorIntrinsics.cpp line 1507:

> 1505: 
> 1506:   // Check that the vector holding indices is supported by architecture
> 1507:   if (!is_subword_type(elem_bt) && !arch_supports_vector(Op_LoadVector, num_elem, T_INT, VecMaskNotUsed)) {

A comment here would be good to indicate why we are not doing this check for subword type. Also as argument(8) is passed as null in this case so instead of checking !is_subword_type we could check for argument(8) to be not null.

src/hotspot/share/opto/vectorIntrinsics.cpp line 1551:

> 1549:   Node* index_vect = nullptr;
> 1550:   const TypeInstPtr* vbox_idx_type = TypeInstPtr::make_exact(TypePtr::NotNull, vbox_idx_klass);
> 1551:   if (!is_subword_type(elem_bt)) {

This could check for argument(8) to be not null instead.

src/hotspot/share/opto/vectorIntrinsics.cpp line 1579:

> 1577:   Node* index    = argument(11);
> 1578:   Node* indexMap = argument(12);
> 1579:   Node* indexM   = argument(13);

Could be renamed as follows for better understanding: index -> offset, indexM -> indexOffset. Also this should be moved under else part of if (is_scatter).

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1382214927
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1382184378
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1382180358
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1382188204
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1382147380
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1382152327
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1382155655
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1382168104


More information about the hotspot-compiler-dev mailing list