RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v3]

Jatin Bhateja jbhateja at openjdk.org
Thu Jun 16 12:26:30 UTC 2022


On Tue, 14 Jun 2022 08:59:38 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

>> The VectorAPI SVE backend supports vector operations whose vector length is smaller than the maximum vector length the current hardware supports. We call these partial vector operations. For some partial operations, like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operation so that the results are correct.
>> 
>> For example, if the user defines an IntVector with a 256-bit species and runs it on SVE hardware whose maximum vector size is 512 bits, all the 256-bit int vector operations are partial, and for some ops a mask is generated in which all lanes beyond the real vector length are set to 0.
>> 
>> Currently the mask is generated in the backend, together with the code generation for each op in the match rule. This generates many duplicate instructions for operations that share the same vector type. Besides, the mask generation is loop invariant and could be hoisted out of the loop.
>> 
>> Here is an example for vector load and add reduction inside a loop:
>> 
>>   ptrue   p0.s, vl8             ; mask generation
>>   ld1w    {z16.s}, p0/z, [x14]  ; load vector
>> 
>>   ptrue   p0.s, vl8             ; mask generation
>>   uaddv   d17, p0, z16.s        ; add reduction
>>   smov    x14, v17.s[0]
>> 
>> As we can see, the mask generation code "`ptrue`" is duplicated. To improve this, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can then be optimized out by GVN and hoisted out of the loop.
>> 
>> Note that for masked vector operations there is no need to generate an additional mask, even though the vector length is smaller than the max vector register size, as the original higher input mask bits have already been cleared.
>> 
>> Here is the performance gain for 256-bit vector reductions on a 512-bit SVE system:
>> 
>>   Benchmark                  size   Gain
>>   Byte256Vector.ADDLanes     1024   0.999
>>   Byte256Vector.ANDLanes     1024   1.065
>>   Byte256Vector.MAXLanes     1024   1.064
>>   Byte256Vector.MINLanes     1024   1.062
>>   Byte256Vector.ORLanes      1024   1.072
>>   Byte256Vector.XORLanes     1024   1.041
>>   Short256Vector.ADDLanes    1024   1.017
>>   Short256Vector.ANDLanes    1024   1.044
>>   Short256Vector.MAXLanes    1024   1.049
>>   Short256Vector.MINLanes    1024   1.049
>>   Short256Vector.ORLanes     1024   1.089
>>   Short256Vector.XORLanes    1024   1.047
>>   Int256Vector.ADDLanes      1024   1.045
>>   Int256Vector.ANDLanes      1024   1.078
>>   Int256Vector.MAXLanes      1024   1.123
>>   Int256Vector.MINLanes      1024   1.129
>>   Int256Vector.ORLanes       1024   1.078
>>   Int256Vector.XORLanes      1024   1.072
>>   Long256Vector.ADDLanes     1024   1.059
>>   Long256Vector.ANDLanes     1024   1.101
>>   Long256Vector.MAXLanes     1024   1.079
>>   Long256Vector.MINLanes     1024   1.099
>>   Long256Vector.ORLanes      1024   1.098
>>   Long256Vector.XORLanes     1024   1.110
>>   Float256Vector.ADDLanes    1024   1.033
>>   Float256Vector.MAXLanes    1024   1.156
>>   Float256Vector.MINLanes    1024   1.151
>>   Double256Vector.ADDLanes   1024   1.062
>>   Double256Vector.MAXLanes   1024   1.145
>>   Double256Vector.MINLanes   1024   1.140
>> 
>> This patch also adds 32-bit variants of the SVE whileXX instructions, with one more matching rule for `VectorMaskGen (ConvI2L src)`. After this patch we save one `sxtw` instruction in most VectorMaskGen cases, as below:
>> 
>>   sxtw    x14, w14
>>   whilelo p0.s, xzr, x14  =>  whilelo p0.s, wzr, w14
>
> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits:
> 
>  - Address review comments, revert changes for gatherL/scatterL rules
>  - Merge branch 'jdk:master' into JDK-8286941
>  - Revert transformation from MaskAll to VectorMaskGen, address review comments
>  - 8286941: Add mask IR for partial vector operations for ARM SVE

src/hotspot/share/opto/vectornode.cpp line 864:

> 862:   // Generate a vector mask for vector operation whose vector length is lower than the
> 863:   // hardware supported max vector length.
> 864:   if (vt->length_in_bytes() < MaxVectorSize) {

For completeness, the length comparison could be done against MIN(SuperWordMaxVectorSize, MaxVectorSize), even though SuperWordMaxVectorSize differs from MaxVectorSize only on certain x86 targets, and this control flow is currently executed only for AArch64 SVE targets.

src/hotspot/share/opto/vectornode.cpp line 1013:

> 1011:     }
> 1012:   }
> 1013:   return LoadVectorNode::Ideal(phase, can_reshape);

These predicated nodes are concrete ones with a fixed species and carry a user-specified mask, so I am not clear why we need a mask re-computation for predicated nodes.

The higher lanes of a predicated operand should already be zero, and the mask attached to a predicated node should be correct by construction, since the mask lane count is always equal to the vector lane count.

src/hotspot/share/opto/vectornode.cpp line 1033:

> 1031:     }
> 1032:   }
> 1033:   return StoreVectorNode::Ideal(phase, can_reshape);

Same as above.

src/hotspot/share/opto/vectornode.cpp line 1669:

> 1667:   if (Matcher::vector_needs_partial_operations(this, vt)) {
> 1668:     return VectorNode::try_to_gen_masked_vector(phase, this, vt);
> 1669:   }

This is the parent node of TrueCount/FirstTrue/LastTrue and MaskToLong, which perform mask-querying operations on concrete predicate operands, so a transformation here looks redundant to me.

-------------

PR: https://git.openjdk.org/jdk/pull/9037


More information about the hotspot-compiler-dev mailing list