RFR: 8295261: RISC-V: Support ReductionV instructions for Vector API

Fri Nov 4 06:23:30 UTC 2022

On Fri, 4 Nov 2022 02:23:30 GMT, Yadong Wang <yadongwang at openjdk.org> wrote:

>> Currently, certain vector-specific instructions in C2 are not implemented for RISC-V. This patch adds support for `AndReductionV`, `OrReductionV`, and `XorReductionV` on RISC-V. The implementation follows the SVE version for aarch64 and riscv-v-spec v1.0 [1].
>> 
>> For example, AndReductionV is implemented as follows:
>> 
>> 
>> diff --git a/src/hotspot/cpu/riscv/riscv_v.ad b/src/hotspot/cpu/riscv/riscv_v.ad
>> index 0ef36fdb292..c04962993c0 100644
>> --- a/src/hotspot/cpu/riscv/riscv_v.ad
>> +++ b/src/hotspot/cpu/riscv/riscv_v.ad
>> @@ -63,7 +63,6 @@ source %{
>>        case Op_ExtractS:
>>        case Op_ExtractUB:
>>        // Vector API specific
>> -      case Op_AndReductionV:
>>        case Op_OrReductionV:
>>        case Op_XorReductionV:
>>        case Op_LoadVectorGather:
>> @@ -785,6 +784,120 @@ instruct vnegD(vReg dst, vReg src) %{
>>    ins_pipe(pipe_slow);
>>  %}
>>  
>> +// vector and reduction
>> +
>> +instruct reduce_andI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) %{
>> +  predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_INT);
>> +  match(Set dst (AndReductionV src1 src2));
>> +  effect(TEMP tmp);
>> +  ins_cost(VEC_COST);
>> +  format %{ "vmv.s.x $tmp, $src1\t#@reduce_andI\n\t"
>> +            "vredand.vs $tmp, $src2, $tmp\n\t"
>> +            "vmv.x.s  $dst, $tmp" %}
>> +  ins_encode %{
>> +    __ vsetvli(t0, x0, Assembler::e32);
>> +    __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register);
>> +    __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg),
>> +                  as_VectorRegister($tmp$$reg));
>> +    __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg));
>> +  %}
>> +  ins_pipe(pipe_slow);
>> +%}
>> +
>> +instruct reduce_andL(iRegLNoSp dst, iRegL src1, vReg src2, vReg tmp) %{
>> +  predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_LONG);
>> +  match(Set dst (AndReductionV src1 src2));
>> +  effect(TEMP tmp);
>> +  ins_cost(VEC_COST);
>> +  format %{ "vmv.s.x $tmp, $src1\t#@reduce_andL\n\t"
>> +            "vredand.vs $tmp, $src2, $tmp\n\t"
>> +            "vmv.x.s  $dst, $tmp" %}
>> +  ins_encode %{
>> +    __ vsetvli(t0, x0, Assembler::e64);
>> +    __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register);
>> +    __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg),
>> +                  as_VectorRegister($tmp$$reg));
>> +    __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg));
>> +  %}
>> 
>> 
>> 
>> After this patch, the Vector API can use RVV for these nodes when Java programs are run with `-XX:+UseRVV` on a RISC-V RVV 1.0 platform. Tests [2] and [3] exercise the implementation of this node, and both pass.
>> 
>> hsdis currently cannot disassemble RVV instructions, so the test case was run with `-XX:+PrintAssembly -Xcomp -XX:-TieredCompilation -XX:+LogCompilation -XX:LogFile=compile.log`. The relevant OptoAssembly output in the compilation log is as follows:
>> 
>> 
>> 2a8     B22: #	out( B14 B23 ) <- in( B21 B31 )  Freq: 32.1131
>> 2a8     lwu  R28, [R9, #8]	# loadNKlass, compressed class ptr, #@loadNKlass
>> 2ac     decode_klass_not_null  R14, R28	#@decodeKlass_not_null
>> 2b8     ld  R30, [R14, #40]	# class, #@loadKlass
>> 2bc     li R7, #-1	# int, #@loadConI
>> 2c0     vmv.s.x V1, R7	#@reduce_andI
>> 	vredand.vs V1, V2, V1
>> 	vmv.x.s  R28, V1
>> 2d0     mv  R7, precise jdk/internal/vm/vector/VectorSupport$ReductionOperation: 0x000000408c4f6220:Constant:exact *	# ptr, #@loadConP
>> 2e8     beq  R30, R7, B14	#@cmpP_branch  P=0.830000 C=-1.000000
>> 
>> 
>> There is currently no hardware implementation of RISC-V RVV 1.0, so the tests were run on QEMU with the parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. Under QEMU with `-XX:+UseRVV` enabled, the `ANDReduceInt256VectorTests` and `ANDReduceLong256VectorTests` test cases take about 50.7% less time in this method than the RVV version without this node implemented. With this node implemented, enabling `-XX:+UseRVV` also reduces the number of C2 assembly instructions by about 50% compared to running without it [4].
>> 
>> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#vector-reduction-operations
>> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests
>> [3] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Long256VectorTests.java#ANDReduceLong256VectorTests
>> [4] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md
>> 
>> ## Testing:
>> - hotspot and jdk tier1 on an Unmatched board without new failures
>> - test/jdk/jdk/incubator/vector/Int256VectorTests.java with fastdebug on qemu
>> - test/jdk/jdk/incubator/vector/Long256VectorTests.java with fastdebug on qemu
>
> src/hotspot/cpu/riscv/riscv_v.ad line 814:
> 
>> 812:             "vmv.x.s  $dst, $tmp" %}
>> 813:   ins_encode %{
>> 814:     __ vsetvli(t0, x0, Assembler::e64);
> 
> Only the element basic type of the two code segments is different. Could you use Matcher::vector_element_basic_type() to simplify the code?

@yadongw Hello, thanks for the review. The current definition of the AndReductionV node for riscv follows the AndReductionV node of aarch64 and the AddReductionVI and AddReductionVL nodes of riscv. The parameter types of the int and long nodes are different, so for now Matcher::vector_element_basic_type() should not be used to simplify the code.

For example, the AndReductionV node of aarch64 defines the parameter types as follows:

instruct reduce_andI_sve(iRegINoSp dst, iRegIorL2I isrc, vReg vsrc, vRegD tmp)

instruct reduce_andL_sve(iRegLNoSp dst, iRegL isrc, vReg vsrc, vRegD tmp)

The AddReductionVI and AddReductionVL nodes of riscv define the parameter types as follows:

instruct reduce_addI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp)

instruct reduce_addL(iRegLNoSp dst, iRegL src1, vReg src2, vReg tmp)
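
As a rough illustration only (not part of the patch), the `vmv.s.x` / `vredand.vs` / `vmv.x.s` sequence can be modelled with scalar Java. The class and method names below are hypothetical. The int and long variants have the same body and differ only in the scalar type, which mirrors the operand-type difference between the int and long rules listed above:

```java
public class ReduceAndModel {
    // Scalar model of the e32 sequence (hypothetical helper, not HotSpot code).
    public static int reduceAndInt(int src1, int[] src2) {
        int tmp0 = src1;          // vmv.s.x: element 0 of tmp = scalar src1
        for (int e : src2) {
            tmp0 &= e;            // vredand.vs: tmp[0] = tmp[0] & src2[0] & src2[1] & ...
        }
        return tmp0;              // vmv.x.s: dst = element 0 of tmp
    }

    // Same logic for e64; only the scalar and element types change.
    public static long reduceAndLong(long src1, long[] src2) {
        long tmp0 = src1;
        for (long e : src2) {
            tmp0 &= e;
        }
        return tmp0;
    }

    public static void main(String[] args) {
        // -1 (all bits set) is the identity for AND, matching the
        // "li R7, #-1" seen in the OptoAssembly output.
        System.out.println(reduceAndInt(-1, new int[] {0xFF, 0x0F, 0x3C, 0x7E}));
        System.out.println(reduceAndLong(-1L, new long[] {0xF0F0L, 0xFF00L}));
    }
}
```

The model ignores vector length and masking; it only shows how the scalar input seeds the reduction and how the result is read back from element 0.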

-------------

PR: https://git.openjdk.org/jdk/pull/10691