RFR: 8286941: Add mask IR for partial vector operations for ARM SVE
Xiaohong Gong
xgong at openjdk.java.net
Wed Jun 8 03:00:37 UTC 2022
On Wed, 8 Jun 2022 02:25:48 GMT, Vladimir Kozlov <kvn at openjdk.org> wrote:
>> The Vector API SVE backend supports vector operations whose vector length is smaller than the maximum vector length the current hardware supports. We call them partial vector operations. For some partial operations, such as vector load/store and reductions, we need to generate a mask based on the real vector length and use it to control the operation so that the results are correct.
>>
>> For example, if the user defines an IntVector with a 256-bit species and runs it on SVE hardware whose maximum vector size is 512 bits, all the 256-bit int vector operations are partial. For some ops, a mask is generated in which all lanes above the real vector length are set to 0.
>>
>> Currently the mask is generated in the backend, together with the code generation for each op in its match rule. This generates many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant and could be hoisted out of the loop.
>>
>> Here is an example for vector load and add reduction inside a loop:
>>
>> ptrue p0.s, vl8 ; mask generation
>> ld1w {z16.s}, p0/z, [x14] ; load vector
>>
>> ptrue p0.s, vl8 ; mask generation
>> uaddv d17, p0, z16.s ; add reduction
>> smov x14, v17.s[0]
>>
>> As we can see, the mask generation code "`ptrue`" is duplicated. To improve this, the patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can then be optimized out by GVN and hoisted out of the loop.
>>
>> Note that for masked vector operations there is no need to generate an additional mask, even when the vector length is smaller than the max vector register size, because the higher bits of the original input mask have already been cleared.
>>
>> Here is the performance gain for 256-bit vector reductions on an SVE 512-bit system:
>>
>> Benchmark                 Size   Gain
>> Byte256Vector.ADDLanes    1024   0.999
>> Byte256Vector.ANDLanes    1024   1.065
>> Byte256Vector.MAXLanes    1024   1.064
>> Byte256Vector.MINLanes    1024   1.062
>> Byte256Vector.ORLanes     1024   1.072
>> Byte256Vector.XORLanes    1024   1.041
>> Short256Vector.ADDLanes   1024   1.017
>> Short256Vector.ANDLanes   1024   1.044
>> Short256Vector.MAXLanes   1024   1.049
>> Short256Vector.MINLanes   1024   1.049
>> Short256Vector.ORLanes    1024   1.089
>> Short256Vector.XORLanes   1024   1.047
>> Int256Vector.ADDLanes     1024   1.045
>> Int256Vector.ANDLanes     1024   1.078
>> Int256Vector.MAXLanes     1024   1.123
>> Int256Vector.MINLanes     1024   1.129
>> Int256Vector.ORLanes      1024   1.078
>> Int256Vector.XORLanes     1024   1.072
>> Long256Vector.ADDLanes    1024   1.059
>> Long256Vector.ANDLanes    1024   1.101
>> Long256Vector.MAXLanes    1024   1.079
>> Long256Vector.MINLanes    1024   1.099
>> Long256Vector.ORLanes     1024   1.098
>> Long256Vector.XORLanes    1024   1.110
>> Float256Vector.ADDLanes   1024   1.033
>> Float256Vector.MAXLanes   1024   1.156
>> Float256Vector.MINLanes   1024   1.151
>> Double256Vector.ADDLanes  1024   1.062
>> Double256Vector.MAXLanes  1024   1.145
>> Double256Vector.MINLanes  1024   1.140
>>
>> This patch also adds 32-bit variants of the SVE whileXX instructions, with one more matching rule for `VectorMaskGen (ConvI2L src)`. So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, as shown below:
>>
>> sxtw x14, w14
>> whilelo p0.s, xzr, x14 => whilelo p0.s, wzr, w14
>
> src/hotspot/share/opto/vectornode.cpp line 868:
>
>> 866: default:
>> 867: node->add_req(mask);
>> 868: node->add_flag(Node::Flag_is_predicated_vector);
>
> Add an assert that only VectorMaskOpNode and ReductionNode are expected here.
Other vector nodes such as `VectorMaskCmp`, `MaskAll` and `VectorLoadMask` also need to append the mask here. Actually, most masked vector nodes accept the mask as an input, except for load/store/gather/scatter. In the future we may also extend this to normal vector nodes whose vector length is full-size rather than partial, since SVE needs a predicate for most instructions. So the default path will be used for most vector nodes.
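To make the dispatch described above concrete, here is a minimal toy sketch in standalone C++. It is not the HotSpot code: `ToyOp`, `ToyNode` and `toy_attach_mask` are hypothetical names, and the assumption that the four memory ops are rewritten into dedicated masked nodes (rather than getting an extra input) is only a model of what the switch does.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-ins for a few C2 vector opcodes; not the real HotSpot enum.
enum class ToyOp {
    LoadVector, StoreVector, LoadVectorGather, StoreVectorScatter,
    AddReductionVI, VectorMaskCmp, MaskAll, VectorLoadMask
};

struct ToyNode {
    ToyOp op;
    std::vector<std::string> inputs;    // names of the input nodes
    bool predicated = false;            // models Flag_is_predicated_vector
    bool needs_masked_variant = false;  // models "replace with a masked node"
};

// Hypothetical helper: attach the generated mask to a partial vector op.
void toy_attach_mask(ToyNode& n, const std::string& mask) {
    switch (n.op) {
    case ToyOp::LoadVector:
    case ToyOp::StoreVector:
    case ToyOp::LoadVectorGather:
    case ToyOp::StoreVectorScatter:
        // Assumption: memory ops are rewritten into their masked
        // counterparts instead of receiving an extra input.
        n.needs_masked_variant = true;
        break;
    default:
        // Default path: append the mask and mark the node as predicated.
        n.inputs.push_back(mask);
        n.predicated = true;
        break;
    }
}
```

In this model a reduction, `VectorMaskCmp`, `MaskAll` or `VectorLoadMask` all fall through to the default branch, which is why an assert restricted to two node classes would be too narrow.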
> src/hotspot/share/opto/vectornode.cpp line 951:
>
>> 949:
>> 950: Node* LoadVectorNode::Ideal(PhaseGVN* phase, bool can_reshape) {
>> 951: const TypeVect* vt = as_LoadVector()->vect_type();
>
> Why do you need `as_LoadVector()` for `this`? Same in `StoreVectorNode::Ideal()`.
Good catch, thanks! We can use `vect_type()` directly here. I will change this later.
> src/hotspot/share/opto/vectornode.cpp line 988:
>
>> 986: }
>> 987: }
>> 988: return LoadNode::Ideal(phase, can_reshape);
>
> Should this call `LoadVectorNode::Ideal`?
> I understand you did this optimization because `vector_needs_partial_operations` is false for `LoadVectorMaskedNode` in the aarch64 case. But what if it is different on some other (not current) platform?
Right, calling `LoadVectorNode::Ideal()` is better. I will change this later. Thanks.
> src/hotspot/share/opto/vectornode.cpp line 1008:
>
>> 1006: }
>> 1007: }
>> 1008: return StoreNode::Ideal(phase, can_reshape);
>
> Should this call `StoreVectorNode::Ideal`?
ditto
> src/hotspot/share/opto/vectornode.cpp line 1821:
>
>> 1819: // Transform (MaskAll m1 (VectorMaskGen len)) ==> (VectorMaskGen len)
>> 1820: // if the vector length in bytes is lower than the MaxVectorSize.
>> 1821: if (is_con_M1(in(1)) && length_in_bytes() < MaxVectorSize) {
>
> Due to #8877, such a length check may not be correct here.
> And I don't see `in(2)->Opcode() == Op_VectorMaskGen` check.
I think the changes in #8877 influence the max vector length in superword? And since `MaskAll` is used for the Vector API, `MaxVectorSize` is still the right reference? @jatin-bhateja, could you please help check whether this has any influence on x86 AVX-512 systems? Thanks so much!
> And I don't see in(2)->Opcode() == Op_VectorMaskGen check.
Yes, `Op_VectorMaskGen` is not generated for `MaskAll` when its input is a constant. We directly transform the `MaskAll` to `VectorMaskGen` here, since the two have the same meaning. Thanks!
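As a rough illustration of why the two have the same meaning for a partial vector, here is a small standalone C++ model. It is only a sketch: `Predicate`, `toy_vector_mask_gen` and `toy_mask_all` are hypothetical names, and a predicate register is modeled as one bool per lane of the maximum vector size, with lanes above the real vector length kept at 0 as described in the PR summary.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of an SVE predicate register: one bool per lane of the
// hardware's maximum vector size. This is not HotSpot code.
using Predicate = std::vector<bool>;

// Models VectorMaskGen(len): the first `len` lanes are active.
Predicate toy_vector_mask_gen(std::size_t max_lanes, std::size_t len) {
    Predicate p(max_lanes, false);
    for (std::size_t i = 0; i < len && i < max_lanes; ++i) {
        p[i] = true;
    }
    return p;
}

// Models MaskAll(value) for a vector with `lanes` lanes: every lane that
// belongs to the vector gets `value`; lanes above the real vector length
// stay inactive (assumption taken from the PR description).
Predicate toy_mask_all(std::size_t max_lanes, std::size_t lanes, bool value) {
    Predicate p(max_lanes, false);
    for (std::size_t i = 0; i < lanes && i < max_lanes; ++i) {
        p[i] = value;
    }
    return p;
}
```

For a 256-bit int vector on 512-bit hardware, `toy_mask_all(16, 8, true)` and `toy_vector_mask_gen(16, 8)` produce the same predicate, which is the equivalence the transformation relies on.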
-------------
PR: https://git.openjdk.java.net/jdk/pull/9037