RFR: 8318650: Optimized subword gather for x86 targets.
Jatin Bhateja
jbhateja at openjdk.org
Fri Oct 27 05:12:34 UTC 2023
On Thu, 26 Oct 2023 07:21:28 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:
>> Hi All,
>>
>> This patch optimizes sub-word gather operation for x86 targets with AVX2 and AVX512 features.
>>
>> Following is the summary of changes:-
>>
>> 1) Intrinsify sub-word gather with high performance backend implementation based on hybrid algorithm which initially partially unrolls scalar loop to accumulates values from gather indices into a quadword(64bit) slice followed by vector permutation to place the slice into appropriate vector lanes, it prevents code bloating and generates compact
>> JIT sequence. This coupled with savings from expansive array allocation in existing java implementation translates into significant performance of 1.3-5x gains with included micro.
>>
>>
>> ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d)
>>
>>
>> 2) Patch was also compared against modified java fallback implementation by replacing temporary array allocation with zero initialized vector and a scalar loops which inserts gathered values into vector. But, vector insert operation in higher vector lanes is a three step process which first extracts the upper vector 128 bit lane, updates it with gather subword value and then inserts the lane back to its original position. This makes inserts into higher order lanes costly w.r.t to proposed solution. In addition generated JIT code for modified fallback implementation was very bulky. This may impact in-lining decisions into caller contexts.
>>
>> 3) Some minor adjustments in existing gather instruction pattens for double/quad words.
>>
>>
>> Kindly review and share your feedback.
>>
>>
>> Best Regards,
>> Jatin
>
> src/hotspot/share/opto/vectorIntrinsics.cpp line 1486:
>
>> 1484: // Check whether the predicated gather/scatter node is supported by architecture.
>> 1485: VectorMaskUseType mask = (is_scatter || !is_subword_type(elem_bt)) ? (VectorMaskUseType) (VecMaskUseLoad | VecMaskUsePred) : VecMaskUseLoad;
>> 1486: if (!arch_supports_vector(is_scatter ? Op_StoreVectorScatterMasked : Op_LoadVectorGatherMasked, num_elem, elem_bt, mask)) {
>
> What is the difference between subword-type load gather and others? It seems only check `VecMaskUseLoad` is enough for all kinds of masked gather/scatter. Only checking `is_match_rule_supported_vector` for these ops are ok. WDYT?
sub-word gather do not emit any predicated instructions, thus only VectorMaskUseLoad is relevant in this context, however AVX512 and SVE does have a direct predicated gather instructions for 32/64 bit types.
> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java line 3062:
>
>> 3060: // Constant folding should sweep out following conditonal logic.
>> 3061: if (isp.length() > IntVector.SPECIES_PREFERRED.length()) {
>> 3062: lsp = IntVector.SPECIES_PREFERRED;
>
> What happens if the needed index vector length is larger that the max vector length of int type?
> For example, byte vectors with `SPECIES_128`, may needs an index vector with `IntVector.SPECIES_512` species? And on some architectures like NEON, or SVE with 128-bit max vector size, the preferred species is only `SPECIES_128`.
For x86 backend implementation we emit a gather loop hence index species is not used for sub-word, I am performing the index out of bound in a loop which operate a max integer vector granularity supported by target.
https://github.com/openjdk/jdk/pull/16354/files#diff-13cc2d6ec18e487ddae05cda671bdb6bb7ffd42ff7bc51a2e00c8c5e622bd55dR3641
Because there is cap on max integral vector size, thus for targets supporting direct sub-word gather instruction will have to emit an instruction sequence comprising of partial indexMap loads into integral vectors and issue multiple gather operations.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1374077991
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1374077927
More information about the hotspot-compiler-dev
mailing list