RFR: 8302908: RISC-V: Support masked vector arithmetic instructions for Vector API [v19]
Yanhong Zhu
yzhu at openjdk.org
Tue Apr 18 07:05:50 UTC 2023
On Mon, 17 Apr 2023 14:12:34 GMT, Dingli Zhang <dzhang at openjdk.org> wrote:
>> Hi,
>>
>> We have added support for masked vector arithmetic instructions. Please take a look and review. Thanks a lot!
>> This patch adds support for the masked versions of vector add/sub/mul/div. It was implemented with reference to RVV v1.0 [1].
>>
>> ## Load/Store/Cmp Mask
>> `VectorLoadMask, VectorMaskCmp, VectorStoreMask` implement the mask datapath. The compilation log for `jdk/incubator/vector/Byte128VectorTests.java` shows how the data is passed along:
>>
>> 218 loadV V1, [R7] # vector (rvv)
>> 220 vloadmask V0, V1
>> ...
>> 23c vmaskcmp_rvv_masked V0, V4, V5, V0, V1, #0
>> 24c vstoremask V1, V0
>> 258 storeV [R7], V1 # vector (rvv)
>>
>>
>> The corresponding generated JIT assembly:
>>
>> # loadV
>> 0x000000400c8ef958: vsetvli t0,zero,e8,m1,tu,mu
>> 0x000000400c8ef95c: vle8.v v1,(t2)
>>
>> # vloadmask
>> 0x000000400c8ef960: vsetvli t0,zero,e8,m1,tu,
>> 0x000000400c8ef964: vmsne.vx v0,v1,zero
>>
>> # vmaskcmp_rvv_masked
>> 0x000000400c8ef97c: vsetvli t0,zero,e8,m1,tu,mu
>> 0x000000400c8ef980: vmclr.m v1
>> 0x000000400c8ef984: vmseq.vv v1,v4,v5,v0.t
>> 0x000000400c8ef988: vmv1r.v v0,v1
>>
>> # vstoremask
>> 0x000000400c8ef98c: vsetvli t0,zero,e8,m1,tu,mu
>> 0x000000400c8ef990: vmv.v.x v1,zero
>> 0x000000400c8ef994: vmerge.vim v1,v1,1,v0
>>
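As a reading aid for the assembly above, here is a minimal scalar sketch of what the three mask nodes compute per lane. It is plain Java, not HotSpot code; the class and method names (`MaskDatapathSketch`, `loadMask`, `maskedEq`, `storeMask`) are hypothetical and only model the instruction sequences shown in the log.

```java
// Scalar model of the mask datapath; all names here are illustrative assumptions.
final class MaskDatapathSketch {

    // Models vloadmask (vmsne.vx v0, v1, zero): a lane is active when its byte is non-zero.
    static boolean[] loadMask(byte[] lanes) {
        boolean[] mask = new boolean[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            mask[i] = lanes[i] != 0;
        }
        return mask;
    }

    // Models the masked compare (vmclr.m v1; vmseq.vv v1, v4, v5, v0.t):
    // the result starts cleared, so only active lanes can become true.
    static boolean[] maskedEq(byte[] a, byte[] b, boolean[] mask) {
        boolean[] result = new boolean[a.length];
        for (int i = 0; i < a.length; i++) {
            result[i] = mask[i] && a[i] == b[i];
        }
        return result;
    }

    // Models vstoremask (vmv.v.x v1, zero; vmerge.vim v1, v1, 1, v0):
    // start from all zeros, then write 1 into every active lane.
    static byte[] storeMask(boolean[] mask) {
        byte[] lanes = new byte[mask.length];
        for (int i = 0; i < mask.length; i++) {
            if (mask[i]) {
                lanes[i] = 1;
            }
        }
        return lanes;
    }
}
```

Chaining `storeMask(maskedEq(a, b, loadMask(raw)))` mirrors the load, compare, store order in the log.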
>>
>> ## Masked vector arithmetic instructions (e.g. vadd)
>> AddMaskTestMerge case:
>>
>> import jdk.incubator.vector.IntVector;
>> import jdk.incubator.vector.VectorMask;
>> import jdk.incubator.vector.VectorOperators;
>> import jdk.incubator.vector.VectorSpecies;
>>
>> public class AddMaskTestMerge {
>>
>>     static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;
>>     static final int SIZE = 1024;
>>     static int[] a = new int[SIZE];
>>     static int[] b = new int[SIZE];
>>     static int[] r = new int[SIZE];
>>     static boolean[] c = new boolean[]{true,false,true,false,true,false,true,false};
>>     static {
>>         for (int i = 0; i < SIZE; i++) {
>>             a[i] = i;
>>             b[i] = i;
>>         }
>>     }
>>
>>     static void workload(int idx) {
>>         VectorMask<Integer> vmask = VectorMask.fromArray(SPECIES, c, 0);
>>         IntVector av = IntVector.fromArray(SPECIES, a, idx);
>>         IntVector bv = IntVector.fromArray(SPECIES, b, idx);
>>         av.lanewise(VectorOperators.ADD, bv, vmask).intoArray(r, idx);
>>     }
>>
>>     public static void main(String[] args) {
>>         for (int i = 0; i < 30_0000; i++) {
>>             for (int j = 0; j < SIZE; j += SPECIES.length()) {
>>                 workload(j);
>>             }
>>         }
>>     }
>> }
>>
>>
>> This test case is reduced from the existing jtreg vector test Int128VectorTests.java[2]. It exercises the masked vector add instruction; the other instructions are handled similarly.
>>
>> Before this patch, the compilation log did not contain RVV-related instructions. Now the compilation log is as follows:
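For reference, the merge behaviour that the test relies on can be written as a scalar loop. This is a sketch of the Vector API semantics of a masked `lanewise` call (lanes whose mask bit is unset keep the value from the receiver vector), not of the generated code; `MaskedAddSketch` and `maskedAdd` are hypothetical names.

```java
// Scalar reference for a masked lanewise ADD; names are illustrative assumptions.
final class MaskedAddSketch {

    // For each lane: add when the mask bit is set, otherwise keep a[i] unchanged.
    static int[] maskedAdd(int[] a, int[] b, boolean[] mask) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? a[i] + b[i] : a[i];
        }
        return r;
    }
}
```

With `a = b = {0, 1, 2, 3}` and the mask `{true, false, true, false}`, the loop gives `{0, 1, 4, 3}`: only lanes 0 and 2 are summed.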
>>
>>
>> 0ae B10: # out( B25 B11 ) <- in( B9 ) Freq: 0.999991
>> 0ae loadV V1, [R31] # vector (rvv)
>> 0b6 vloadmask V0, V2
>> 0be vadd.vv V3, V1, V0 #@vaddI_masked
>> 0c6 lwu R28, [R7, #124] # loadN, compressed ptr, #@loadN ! Field: AddMaskTestMerge.r
>> 0ca decode_heap_oop R28, R28 #@decodeHeapOop
>> 0cc lwu R7, [R28, #12] # range, #@loadRange
>> 0d0 NullCheck R28
>>
>>
>> The JIT code is as follows:
>>
>>
>> 0x000000400c823cee: vsetvli t0,zero,e32,m1,tu,mu
>> 0x000000400c823cf2: vle32.v v1,(t6) ;*invokestatic store {reexecute=0 rethrow=0 return_oop=0}
>> ; - jdk.incubator.vector.IntVector::intoArray at 43 (line 3228)
>> ; - AddMaskTestMerge::workload at 46 (line 25)
>> 0x000000400c823cf6: vsetvli t0,zero,e8,m1,tu,mu
>> 0x000000400c823cfa: vmsne.vx v0,v2,zero ;*invokestatic load {reexecute=0 rethrow=0 return_oop=0}
>> ; - jdk.incubator.vector.VectorMask::fromArray at 47 (line 208)
>> ; - AddMaskTestMerge::workload at 7 (line 22)
>> 0x000000400c823cfe: vsetvli t0,zero,e32,m1,tu,mu
>> 0x000000400c823d02: vadd.vv v3,v3,v1,v0.t ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0}
>> ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 192 (line 834)
>> ; - jdk.incubator.vector.Int128Vector::lanewise at 9 (line 291)
>> ; - jdk.incubator.vector.Int128Vector::lanewise at 4 (line 41)
>> ; - AddMaskTestMerge::workload at 39 (line 25)
>>
>>
>> ## Mask register allocation & mask bit operation
>> Since v0 is used as the mask register in the spec[1], we sometimes need two mask registers to perform vector mask logical operations such as `AndVMask, OrVMask, XorVMask`. If only v0 and v31 are defined as mask registers, the corresponding C2 nodes will not be generated correctly because of register pressure[3].
>> When only v0 and v31 are used as mask registers, jtreg testing of Byte128VectorTests.java[4] with `-XX:+PrintAssembly` and `-XX:LogFile` does not emit the expected RVV mask instructions. Instead, the following compilation failure log[3] is generated:
>>
>> <intrinsic id='_VectorBinaryOp' nodes='20'/>
>> <method_not_compilable_at_tier level='4'/>
>> <failure reason='failed spill-split-recycle sanity check' phase='compile'/>
>> <failure reason='failed spill-split-recycle sanity check'/>
>> <task_done success='0' nmsize='0' count='22784' stamp='16.146'/>
>>
>>
>> So we define v30 and v31 as mask registers too, and `AndVMask` emits C2 JIT code like:
>>
>> vloadmask V0, V1
>> vloadmask V30, V2
>> vmask_and V0, V30, V0
>>
>> We also modified the implementation of `spill_copy_vector_stack_to_stack` so that it no longer occupies the v0 register. In addition, we changed some nodes such as `vasr/vlsl/vlsr/vstring_x/varray_x/vclearArray_x`, which use v0 internally, so that C2 is aware that they use v0.
>>
>> ## Vector load/store - predicated & blend operation
>>
>> Jtreg testing of Byte128VectorTests.java[4] with -XX:+PrintOptoAssembly and -XX:LogFile prints the following compilation log, which is generated by predicated vector load/store:
>>
>> 152 B21: # out( B22 ) <- in( B20 ) Freq: 0.499984
>> 152 vmask_gen_L V0, R12
>> 162 loadV_masked V1, V0, [R10]
>> 16e storeV_masked [R11], V0, V1
>>
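To illustrate what a generated mask plus predicated load/store achieves, here is a scalar sketch of a tail copy. It assumes the Vector API convention that a masked load yields zero in inactive lanes and touches no memory for them; the class and method names are hypothetical, and the mapping to `vmask_gen_L`, `loadV_masked`, and `storeV_masked` is a reading of the log above rather than a description of the C2 implementation.

```java
// Scalar model of a predicated tail copy; all names are illustrative assumptions.
final class PredicatedCopySketch {

    // Models mask generation: the first `active` lanes are set, clamped to [0, lanes].
    static boolean[] maskGen(int lanes, int active) {
        boolean[] mask = new boolean[lanes];
        int n = Math.max(0, Math.min(lanes, active));
        for (int i = 0; i < n; i++) {
            mask[i] = true;
        }
        return mask;
    }

    // Models a masked load: inactive lanes are not read from memory and stay zero.
    static byte[] loadMasked(byte[] src, int offset, boolean[] mask) {
        byte[] v = new byte[mask.length];
        for (int i = 0; i < mask.length; i++) {
            if (mask[i]) {
                v[i] = src[offset + i];
            }
        }
        return v;
    }

    // Models a masked store: only active lanes are written back.
    static void storeMasked(byte[] dst, int offset, byte[] v, boolean[] mask) {
        for (int i = 0; i < mask.length; i++) {
            if (mask[i]) {
                dst[offset + i] = v[i];
            }
        }
    }
}
```

Because inactive lanes never index into the arrays, a 4-lane copy starting at offset 4 of a 5-element array stays in bounds when only one lane is active.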
>>
>> And `VectorBlend` will generate the following compilation log (part of a rotate operation):
>>
>> 1ea vlsrBS V6, V1, V3 V0
>> 1fe vlslBS V5, V1, V2 V0
>> 212 vor.vv V2, V5, V6 #@vor
>> 21a vloadmask V0, V4
>> 222 vmerge_vvm V1, V1, V2 # vector blend
>> 22a bgeu R9, R30, B56 #@cmpU_branch P=0.000001 C=-1.000000
>>
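The shift, or, and merge sequence in this log can be read as a rotate whose result is blended back under a mask. The scalar sketch below models that reading for byte lanes; it is an interpretation of the log, and `BlendSketch`, `rotl8`, `blend`, and `maskedRotl` are hypothetical names.

```java
// Scalar model of rotate-then-blend on byte lanes; names are illustrative assumptions.
final class BlendSketch {

    // Rotate an 8-bit value left: (x << n) | (x >>> (8 - n)), with n reduced to 0..7.
    static byte rotl8(byte x, int s) {
        int v = x & 0xFF;
        int n = s & 7;
        return (byte) ((v << n) | (v >>> (8 - n)));
    }

    // Models a vector blend: take b[i] where the mask is set, otherwise a[i].
    static byte[] blend(byte[] a, byte[] b, boolean[] mask) {
        byte[] r = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? b[i] : a[i];
        }
        return r;
    }

    // Masked rotate: compute the rotate for all lanes, then blend it in under the mask.
    static byte[] maskedRotl(byte[] a, int s, boolean[] mask) {
        byte[] rotated = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            rotated[i] = rotl8(a[i], s);
        }
        return blend(a, rotated, mask);
    }
}
```

Computing the unmasked result first and selecting afterwards keeps inactive lanes identical to the input.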
>>
>> At the same time, we added the predicated nodes of `RShiftV/LShiftV/URShiftV`. Since this duplicated some code from the corresponding non-masked nodes, a small refactoring was done.
>>
>>
>> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc
>> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int128VectorTests.java
>> [3] https://github.com/openjdk/jdk/blob/0deb648985b018653ccdaf193dc13b3cf21c088a/src/hotspot/share/opto/chaitin.cpp#L526
>> [4] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Byte128VectorTests.java
>>
>> ### Testing:
>>
>> qemu with UseRVV:
>> - [x] Tier1 tests (release)
>> - [x] Tier2 tests (release)
>> - [x] Tier3 tests (release)
>> - [x] test/jdk/jdk/incubator/vector (release/fastdebug)
>
> Dingli Zhang has updated the pull request incrementally with one additional commit since the last revision:
>
> Fix match_rule_supported_vector_masked
Marked as reviewed by yzhu (Author).
-------------
PR Review: https://git.openjdk.org/jdk/pull/12682#pullrequestreview-1389420655