RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v25]

Fei Yang fyang at openjdk.org
Mon Apr 24 06:53:58 UTC 2023


On Fri, 21 Apr 2023 13:29:50 GMT, Dingli Zhang <dzhang at openjdk.org> wrote:

>> Hi,
>> 
>> We have added support for masked vector arithmetic instructions. Please take a look and review. Thanks a lot!
>> This patch adds masked versions of the vector add/sub/mul/div instructions. It was implemented with reference to RVV v1.0 [1].
>> 
>> ## Load/Store/Cmp Mask
>> `VectorLoadMask`, `VectorMaskCmp`, and `VectorStoreMask` implement the mask datapath. The compilation log for `jdk/incubator/vector/Byte128VectorTests.java` shows how the data is passed:
>> 
>> 218     loadV V1, [R7]	# vector (rvv)
>> 220     vloadmask V0, V1
>> ...
>> 23c     vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0
>> 24c     vstoremask V1, V0
>> 258     storeV [R7], V1	# vector (rvv)
>> 
>> 
>> The corresponding generated jit assembly:
>> 
>> # loadV
>> 0x000000400c8ef958:   vsetvli t0,zero,e8,m1,tu,mu
>> 0x000000400c8ef95c:   vle8.v  v1,(t2)
>> 
>> # vloadmask
>> 0x000000400c8ef960:   vsetvli t0,zero,e8,m1,tu,
>> 0x000000400c8ef964:   vmsne.vx    v0,v1,zero
>> 
>> # vmaskcmp_rvv_masked
>> 0x000000400c8ef97c:   vsetvli   t0,zero,e8,m1,tu,mu
>> 0x000000400c8ef980:   vmclr.m   v1
>> 0x000000400c8ef984:   vmseq.vv  v1,v4,v5,v0.t
>> 0x000000400c8ef988:   vmv1r.v   v0,v1
>> 
>> # vstoremask
>> 0x000000400c8ef98c:   vsetvli t0,zero,e8,m1,tu,mu
>> 0x000000400c8ef990:   vmv.v.x v1,zero
>> 0x000000400c8ef994:   vmerge.vim  v1,v1,1,v0
>> 
>> 
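As a reading aid for the `vloadmask` and `vstoremask` sequences quoted above, here is a minimal scalar sketch of what the two conversions compute per lane. It is an illustration only, not HotSpot code; the class and method names are hypothetical.

```java
import java.util.Arrays;

// Scalar model (illustration only) of the mask conversions shown in the quoted assembly.
final class MaskLoadStoreSketch {

    // Models vloadmask (vmsne.vx v0, v1, zero): a mask bit is set where the byte lane is non-zero.
    static boolean[] loadMask(byte[] lanes) {
        boolean[] mask = new boolean[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            mask[i] = lanes[i] != 0;
        }
        return mask;
    }

    // Models vstoremask (vmv.v.x v1, zero; vmerge.vim v1, v1, 1, v0):
    // start from all zeros, then write 1 into every lane whose mask bit is set.
    static byte[] storeMask(boolean[] mask) {
        byte[] lanes = new byte[mask.length];
        for (int i = 0; i < mask.length; i++) {
            lanes[i] = mask[i] ? (byte) 1 : (byte) 0;
        }
        return lanes;
    }

    public static void main(String[] args) {
        byte[] in = {1, 0, 1, 0};
        System.out.println(Arrays.toString(storeMask(loadMask(in)))); // prints [1, 0, 1, 0]
    }
}
```

A round trip through both helpers normalizes any non-zero byte to 1, which matches the boolean-array representation the Vector API uses for masks in memory.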
>> ## Masked vector arithmetic instructions (e.g. vadd)
>> AddMaskTestMerge case:
>> 
>> import jdk.incubator.vector.IntVector;
>> import jdk.incubator.vector.VectorMask;
>> import jdk.incubator.vector.VectorOperators;
>> import jdk.incubator.vector.VectorSpecies;
>> 
>> public class AddMaskTestMerge {
>> 
>>     static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;
>>     static final int SIZE = 1024;
>>     static int[] a = new int[SIZE];
>>     static int[] b = new int[SIZE];
>>     static int[] r = new int[SIZE];
>>     static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false};
>>     static {
>>         for (int i = 0; i < SIZE; i++) {
>>             a[i] = i;
>>             b[i] = i;
>>         }
>>     }
>> 
>>     static void workload(int idx) {
>>         VectorMask<Integer> vmask = VectorMask.fromArray(SPECIES, c, 0);
>>         IntVector av = IntVector.fromArray(SPECIES, a, idx);
>>         IntVector bv = IntVector.fromArray(SPECIES, b, idx);
>>         av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx);
>>     }
>> 
>>     public static void main(String[] args) {
>>         for (int i = 0; i < 30_0000; i++) {
>>             for (int j = 0; j < SIZE; j += SPECIES.length()) {
>>                 workload(j);
>>             }
>>         }
>>     }
>> }
>> 
>> 
>> This test case is reduced from the existing jtreg vector test Int128VectorTests.java[2]. It exercises the masked vector add instruction; the other masked instructions are similar.
>> 
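For reference, the result the quoted `AddMaskTestMerge` workload is expected to produce can be modelled with plain scalar Java. This is a sketch under the assumption of the Vector API's documented merge behaviour for masked `lanewise` (unset lanes keep the value of the first operand); the class and method names are hypothetical.

```java
import java.util.Arrays;

// Scalar model (illustration only) of av.lanewise(VectorOperators.ADD, bv, vmask).
final class MaskedAddSketch {

    // Lanes with a set mask bit receive a[i] + b[i]; unset lanes keep a[i].
    static int[] maskedAdd(int[] a, int[] b, boolean[] mask) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? a[i] + b[i] : a[i];
        }
        return r;
    }

    public static void main(String[] args) {
        // First four lanes of the quoted test: a[i] = b[i] = i, mask = {true, false, true, false}.
        int[] a = {0, 1, 2, 3};
        int[] b = {0, 1, 2, 3};
        boolean[] mask = {true, false, true, false};
        System.out.println(Arrays.toString(maskedAdd(a, b, mask))); // prints [0, 1, 4, 3]
    }
}
```

A scalar reference like this is handy for checking the vectorized result lane by lane when running under qemu.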
>> Before this patch, the compilation log did not contain any RVV-related instructions. Now the compilation log is as follows:
>> 
>> 
>> 0ae     B10: #	out( B25 B11 ) <- in( B9 )  Freq: 0.999991
>> 0ae     loadV V1, [R31]	# vector (rvv)
>> 0b6     vloadmask V0, V2
>> 0be     vadd.vv V3, V1, V0	#@vaddI_masked
>> 0c6     lwu  R28, [R7, #124]	# loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r
>> 0ca     decode_heap_oop  R28, R28	#@decodeHeapOop
>> 0cc     lwu  R7, [R28, #12]	# range, #@loadRange
>> 0d0     NullCheck R28
>> 
>> 
>> And the jit code is as follows:
>> 
>> 
>> 0x000000400c823cee:   vsetvli t0,zero,e32,m1,tu,mu
>> 0x000000400c823cf2:   vle32.v v1,(t6)                     ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0}
>>                                                           ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228)
>>                                                           ; - AddMaskTestMerge::workload at 46 (line 25)
>> 0x000000400c823cf6:   vsetvli t0,zero,e8,m1,tu,mu
>> 0x000000400c823cfa:   vmsne.vx        v0,v2,zero          ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0}
>>                                                           ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208)
>>                                                           ; - AddMaskTestMerge::workload at 7 (line 22)
>> 0x000000400c823cfe:   vsetvli t0,zero,e32,m1,tu,mu
>> 0x000000400c823d02:   vadd.vv v3,v3,v1,v0.t               ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0}
>>                                                           ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834)
>>                                                           ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291)
>>                                                           ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41)
>>                                                           ; - AddMaskTestMerge::workload at 39 (line 25)
>> 
>> 
>> ## Mask register allocation & mask bit operation
>> Since v0 is used as the mask register in the spec[1], we sometimes need two mask registers to perform vector mask logical ops like `AndVMask, OrVMask, XorVMask`. If only v0 and v31 are defined as mask registers, the corresponding C2 nodes will not be generated correctly because of register pressure[3].
>> When we use only v0 and v31 as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` does not emit the expected RVV mask instructions. Instead, the following compilation failure log[3] is generated:
>> 
>> <intrinsic id='_VectorBinaryOp' nodes='20'/>
>> <method_not_compilable_at_tier level='4'/>
>> <failure reason='failed spill-split-recycle sanity check' phase='compile'/>
>> <failure reason='failed spill-split-recycle sanity check'/>
>> <task_done success='0' nmsize='0' count='22784' stamp='16.146'/>
>> 
>> 
>> So we define v30 and v31 as mask registers too, and `AndVMask` emits C2 JIT code like:
>> 
>> vloadmask V0, V1
>> vloadmask V30, V2
>> vmask_and V0, V30, V0
>> 
>> We also modified the implementation of `spill_copy_vector_stack_to_stack` so that it no longer occupies the v0 register. In addition, we changed some nodes that use v0 internally, such as `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, so that C2 is aware that they use v0.
>> 
>> ## vector load/store - predicated & blend operation
>> 
>> Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile prints the following compilation log, which is generated by predicated vector load/store:
>> 
>> 152     B21: #	out( B22 ) <- in( B20 )  Freq: 0.499984
>> 152     vmask_gen_L V0, R12
>> 162     loadV_masked V1, V0, [R10]
>> 16e     storeV_masked [R11], V0, V1
>> 
>> 
>> And `VectorBlend` generates the following compilation log (part of a rotate operation):
>> 
>> 1ea     vlsrBS V6, V1, V3 V0
>> 1fe     vlslBS V5, V1, V2 V0
>> 212     vor.vv  V2, V5, V6	#@vor
>> 21a     vloadmask V0, V4
>> 222     vmerge_vvm V1, V1, V2	# vector blend
>> 22a     bgeu  R9, R30, B56	#@cmpU_branch  P=0.000001 C=-1.000000
>> 
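The quoted shift/or/blend sequence can be read as a masked rotate on byte lanes. The following scalar sketch shows that reading; it assumes a rotate-left built from a left shift and a logical right shift, and a blend that takes the rotated value only where the mask bit is set. It is not the generated code, and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Scalar model (illustration only) of a masked byte rotate built from shift, or, and blend.
final class MaskedRotateSketch {

    // Rotate an 8-bit lane left by s bits: (x << s) | (x >>> (8 - s)) on the unsigned value.
    static byte rotateLeft(byte x, int s) {
        int shift = s & 7;          // shift count taken modulo the lane width
        int u = x & 0xFF;           // treat the lane as unsigned
        return (byte) ((u << shift) | (u >>> (8 - shift)));
    }

    // Blend step: lanes with a set mask bit take the rotated value, the others keep the input.
    static byte[] maskedRotateLeft(byte[] a, int s, boolean[] mask) {
        byte[] r = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? rotateLeft(a[i], s) : a[i];
        }
        return r;
    }

    public static void main(String[] args) {
        byte[] a = {(byte) 0x81, (byte) 0x81};
        System.out.println(Arrays.toString(maskedRotateLeft(a, 1, new boolean[]{true, false})));
    }
}
```

Computing both shifted halves unconditionally and selecting at the end mirrors the order of the quoted log, where `vloadmask` and the merge come after the shifts and the `vor.vv`.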
>> 
>> At the same time, we added the predicated nodes of `RShiftV/LShiftV/URShiftV`. Since this duplicated some code from the corresponding non-masked nodes, a small refactoring was done.
>> 
>> 
>> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc
>> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java
>> [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526
>> [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java
>> 
>> ### Testing:
>> 
>> qemu with UseRVV:
>> - [x] Tier1 tests (release)
>> - [x] Tier2 tests (release)
>> - [x] Tier3 tests (release)
>> - [x] test/jdk/jdk/incubator/vector (release/fastdebug)
>
> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Add some vector pseudo instructions

Changes requested by fyang (Reviewer).

src/hotspot/cpu/riscv/riscv_v.ad line 179:

> 177: %}
> 178: 
> 179: instruct vmaskcmp_masked(vRegMask dst, vReg src1, vReg src2, immI cond, vRegMask_V0 vmask, vReg tmp) %{

I think we can introduce a new operand type (say 'vRegMaskNoV0') which excludes mask register 'v0', and use it for 'dst' here and in other places where 'v0' cannot be used as the destination register of a masked vector instruction, as required by the RVV spec. Then we could eliminate the use of the 'tmp' register and the 'vmv1r.v' instruction.

Also, I would like to further rename 'vRegMask_V0 vmask' into 'vRegMask_V0 v0'. The RVV spec says that the mask value used to control execution of a masked vector instruction is always supplied by vector register 'v0' for now.

-------------

PR Review: https://git.openjdk.org/jdk/pull/12682#pullrequestreview-1397254398
PR Review Comment: https://git.openjdk.org/jdk/pull/12682#discussion_r1174827238

