RFR: 8291600: [vectorapi] vector cast op check is not always needed for vector mask cast [v3]

Thu Aug 25 03:07:33 UTC 2022

On Mon, 22 Aug 2022 04:07:17 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

>> Recently we found the performance of "`FIRST_NONZERO`" for double type is much worse than the other types on x86 when `UseAVX=2`. The main reason is that the "`VectorCastL2X`" op is not supported by the backend when the dst element type is `T_DOUBLE`. This makes the check for the `VectorCast` op fail before intrinsifying "`VectorMask.cast()`", which is used in the
>> "`FIRST_NONZERO`" java implementation (see [1]). However, the compiler will not generate the `VectorCast` op for `VectorMask.cast()` if:
>> 
>>  1) the current platform supports the predicated feature
>>  2) the element size (in bytes) of the src and dst type is the same
>> 
>> So the check for the "`VectorCast`" op is unnecessary in such cases. To fix it, this patch:
>> 
>>  1) limits the specified vector cast op check to vectors
>>  2) adds the related mask cast op check for `VectorMask.cast()`
>>  3) cleans up the unnecessary code
>> 
>> Here is the performance of the "`FIRST_NONZERO`" benchmark [2] on an x86 machine with `UseAVX=2`:
>> 
>> Benchmark                            (size)  Mode   Cnt  Before  After     Units
>> DoubleMaxVector.FIRST_NONZERO        1024    thrpt  15   49.266  2460.886  ops/ms
>> DoubleMaxVector.FIRST_NONZEROMasked  1024    thrpt  15   49.554  1892.223  ops/ms
>> 
>> [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/DoubleVector.java#L770
>> [2] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/DoubleMaxVector.java#L246
>
> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Fix x86 codegen issue

src/hotspot/cpu/x86/c2_MacroAssembler_x86.hpp line 337:

> 335:   void vector_mask_cast(XMMRegister dst, XMMRegister src, BasicType dst_bt, BasicType src_bt, int vlen);
> 336: 
> 337:   void vector_mask_cast_with_tmp(XMMRegister dst, XMMRegister src, XMMRegister xtmp1,

I would prefer to name this `vector_mask_cast` as well.

src/hotspot/cpu/x86/x86.ad line 8452:

> 8450:   predicate(Matcher::vector_length(n) == Matcher::vector_length(n->in(1)) &&
> 8451:             Matcher::vector_length_in_bytes(n) > Matcher::vector_length_in_bytes(n->in(1)) &&
> 8452:             UseAVX == 1 &&

Since most x86 machines would run with `UseAVX` > 1, I would suggest testing `UseAVX == 1` first and then the other conditions.
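
To illustrate the ordering point, here is a minimal standalone sketch, not HotSpot code. The helpers `lengths_match()` and `dst_is_wider()` are hypothetical stand-ins for the `Matcher::` calls, and the counter exists only to make the short-circuit behaviour of `&&` visible:

```cpp
// Hypothetical stand-ins for the Matcher:: checks; NOT HotSpot code.
// The counter records how many of these checks were evaluated.
static int expensive_calls = 0;

static bool lengths_match() { ++expensive_calls; return true; }
static bool dst_is_wider()  { ++expensive_calls; return true; }

// Flag test placed last: both other checks run before the flag is read.
bool predicate_flag_last(int use_avx) {
  return lengths_match() && dst_is_wider() && use_avx == 1;
}

// Flag test placed first: && short-circuits as soon as use_avx != 1,
// so the remaining checks are skipped entirely.
bool predicate_flag_first(int use_avx) {
  return use_avx == 1 && lengths_match() && dst_is_wider();
}
```

With `use_avx == 2`, `predicate_flag_last` evaluates both helper checks before rejecting, while `predicate_flag_first` rejects without calling either. Both orderings return the same result for every input, so this is purely a question of how much work is done on the common path.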

src/hotspot/share/opto/vectornode.cpp line 1628:

> 1626:             // directly. This could avoid the transformation ordering issue from
> 1627:             // "VectorStoreMask (VectorLoadMask vmask) => vmask".
> 1628:             return new VectorMaskCastNode(value, vmask_type);

Why did you change this code?
Is it required to enable this optimization?

-------------

PR: https://git.openjdk.org/jdk/pull/9737