RFR: 8291600: [vectorapi] vector cast op check is not always needed for vector mask cast

Fri Aug 5 02:01:48 UTC 2022

On Thu, 4 Aug 2022 06:08:44 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

> Recently we found that the performance of "`FIRST_NONZERO`" for the double type is much worse than for the other types on x86 when `UseAVX=2`. The main reason is that the "`VectorCastL2X`" op is not supported by the backend when the dst element type is `T_DOUBLE`. This makes the check of the `VectorCast` op fail before intrinsifying "`VectorMask.cast()`", which is used in the
> "`FIRST_NONZERO`" Java implementation (see [1]). However, the compiler will not generate the `VectorCast` op for `VectorMask.cast()` if:
> 
>  1) the current platform supports the predicated feature
>  2) the element size (in bytes) of the src and dst type is the same
> 
> So the check of "`VectorCast`" op is needless for such cases. To fix it, this patch:
> 
>  1) limits the specified vector cast op check to vectors
>  2) adds the corresponding mask cast op check for `VectorMask.cast()`
>  3) cleans up the unnecessary code
> 
> Here is the performance of the "`FIRST_NONZERO`" benchmark [2] on an x86 machine with `UseAVX=2`:
> 
> Benchmark                            (size)  Mode   Cnt  Before  After     Units
> DoubleMaxVector.FIRST_NONZERO        1024    thrpt  15   49.266  2460.886  ops/ms
> DoubleMaxVector.FIRST_NONZEROMasked  1024    thrpt  15   49.554  1892.223  ops/ms
> 
> [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/DoubleVector.java#L770
> [2] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/DoubleMaxVector.java#L246

Thanks a lot for looking at this PR!

> May I ask if we can use the `VectorMaskCast` nodes for the non-predicated mask casts? As the value of the elements can only be 0 or -1, we can generate better code on AVX, as we don't need to truncate the elements first as in the general cases. Thanks a lot.

If the element sizes of the src and dst types are equal, we can use `VectorMaskCast`, which doesn't need to emit any instructions. But for other casts such as `S -> I` or `S -> B`, I'm afraid that we need extending or narrowing instructions for non-predicated mask casts.

For example, for a short vector mask A with 128 bits: 

A:  0000 1111 0000 1111 0000 1111 0000 1111

If we want to cast it to an int vector mask B with 256 bits, the result should be extended to:

B: 00000000 11111111 00000000 11111111 00000000 11111111 00000000 11111111

And if we want to cast it to a byte vector mask C with 64 bits, the result should be narrowed to:

C: 00 11 00 11 00 11 00 11
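
To make the lane arithmetic above concrete, here is a minimal plain-Java sketch that models each mask lane as an integer that is either 0 or -1 (all bits set). It only illustrates the expected result of the cast. It is not the HotSpot implementation and does not use the Vector API, and the class and method names (`MaskCastSketch`, `widenShortMaskToInt`, `narrowShortMaskToByte`) are hypothetical.

```java
import java.util.Arrays;

public class MaskCastSketch {
    // Hypothetical helper: widen each short mask lane (0 or -1) to an int lane.
    // Sign extension keeps an all-ones lane all-ones: 0xFFFF -> 0xFFFFFFFF.
    static int[] widenShortMaskToInt(short[] mask) {
        int[] out = new int[mask.length];
        for (int i = 0; i < mask.length; i++) {
            out[i] = mask[i]; // implicit sign-extending conversion
        }
        return out;
    }

    // Hypothetical helper: narrow each short mask lane (0 or -1) to a byte lane.
    // Truncation keeps an all-ones lane all-ones: 0xFFFF -> 0xFF.
    static byte[] narrowShortMaskToByte(short[] mask) {
        byte[] out = new byte[mask.length];
        for (int i = 0; i < mask.length; i++) {
            out[i] = (byte) mask[i]; // keeps only the low 8 bits
        }
        return out;
    }

    public static void main(String[] args) {
        // Mask A from the example: 8 short lanes (128 bits), alternating unset/set.
        short[] a = {0, -1, 0, -1, 0, -1, 0, -1};
        int[] b = widenShortMaskToInt(a);    // 8 int lanes (256 bits), like B
        byte[] c = narrowShortMaskToByte(a); // 8 byte lanes (64 bits), like C
        System.out.println("B: " + Arrays.toString(b));
        System.out.println("C: " + Arrays.toString(c));
    }
}
```

In both directions the lane count stays the same and only the per-lane width changes, which is why some extending or narrowing work is needed when the masks live in ordinary vector registers.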

And yes, for such cases we can also generate a `VectorMaskCast` node, together with the corresponding backend match rules, when the element sizes differ. I think the generated code would be the same as with `VectorCast`. This makes the mid-end code cleaner, but may need some duplicate match rules. I'm not very familiar with AVX instructions, so for such cases, does AVX have better instructions than the ones used for `VectorCast`? If so, I'd like to always use `VectorMaskCast` on all platforms and add the backend rules where needed. Also, since I'm not familiar with the x86 instructions, the change may need your help. WDYT?

-------------

PR: https://git.openjdk.org/jdk/pull/9737