RFR: 8318650: Optimized subword gather for x86 targets. [v3]
Sandhya Viswanathan
sviswanathan at openjdk.org
Fri Nov 3 23:37:14 UTC 2023
On Tue, 31 Oct 2023 07:19:55 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> Hi All,
>>
>> This patch optimizes sub-word gather operation for x86 targets with AVX2 and AVX512 features.
>>
>> Following is the summary of changes:-
>>
>> 1) Intrinsify sub-word gather with high performance backend implementation based on hybrid algorithm which initially partially unrolls scalar loop to accumulates values from gather indices into a quadword(64bit) slice followed by vector permutation to place the slice into appropriate vector lanes, it prevents code bloating and generates compact
>> JIT sequence. This coupled with savings from expansive array allocation in existing java implementation translates into significant performance of 1.3-5x gains with included micro.
>>
>>
>> 
>>
>>
>> 2) Patch was also compared against modified java fallback implementation by replacing temporary array allocation with zero initialized vector and a scalar loops which inserts gathered values into vector. But, vector insert operation in higher vector lanes is a three step process which first extracts the upper vector 128 bit lane, updates it with gather subword value and then inserts the lane back to its original position. This makes inserts into higher order lanes costly w.r.t to proposed solution. In addition generated JIT code for modified fallback implementation was very bulky. This may impact in-lining decisions into caller contexts.
>>
>> 3) Some minor adjustments in existing gather instruction pattens for double/quad words.
>>
>>
>> Kindly review and share your feedback.
>>
>>
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
> Restricting masked sub-word gather to AVX512 target to align with integral gather support.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1576:
> 1574: Label* larr[] = { &case0, &case1, &case2, &case3, &case4, &case5, &case6, &case7 };
> 1575: for (int i = 0; i < 8; i++) {
> 1576: bt(mask, midx);
Could we not use smaller length bt and inc instructions (e.g. 32 bit one) here as we know that we dont need 64 bits of mask here? That way we will have smaller instruction encoding.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1606:
> 1604: void C2_MacroAssembler::vpgather8b_offset(BasicType elem_bt, XMMRegister dst, Register base, Register idx_base,
> 1605: Register offset, Register rtmp, int vlen_enc) {
> 1606: vpxor(dst, dst, dst, vlen_enc);
Why the vpxor here and in vpgather8b. These are not masked gathers.
src/hotspot/cpu/x86/x86.ad line 4133:
> 4131: %}
> 4132:
> 4133: instruct vgather_subwordGT8B(vec dst, memory mem, rRegP idx, immI_0 offset, rRegP tmp, rRegP idx_base_temp,
We need to add rFlagsReg as being killed in many of these new instructs where implementation uses add instruction etc.
src/hotspot/cpu/x86/x86.ad line 4152:
> 4150:
> 4151: instruct vgather_subwordLE8B_off(vec dst, memory mem, rRegP idx, rRegI offset, rRegP tmp, rRegI rtmp) %{
> 4152: predicate(is_subword_type(Matcher::vector_element_basic_type(n)) && Matcher::vector_length_in_bytes(n) <= 8);
!VM_Version::supports_avx512bw() is missing for non avx3 cases.
src/hotspot/cpu/x86/x86.ad line 4258:
> 4256: __ kmovql($rtmp2$$Register, $mask$$KRegister);
> 4257: __ vgather_subword(elem_bt, $dst$$XMMRegister, $tmp$$Register, $idx_base_temp$$Register, $offset$$Register, $rtmp2$$Register, $xtmp1$$XMMRegister,
> 4258: $xtmp2$$XMMRegister, $xtmp3$$XMMRegister, $rtmp$$Register, $midx$$Register, $length$$Register, vector_len, vlen_enc);
The naming vpgather vs vgather could be made consistent.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1382282928
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1382278440
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1382277389
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1382270171
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1382270783
More information about the hotspot-compiler-dev
mailing list