RFR: 8295261: RISC-V: Support ReductionV instructions for Vector API

Gui Cao gcao at openjdk.org
Fri Nov 4 10:05:20 UTC 2022


On Fri, 4 Nov 2022 09:08:38 GMT, Eric Liu <eliu at openjdk.org> wrote:

>> Currently, certain vector-specific instructions in C2 are not implemented for RISC-V. This patch adds support for `AndReductionV`, `OrReductionV`, and `XorReductionV` on RISC-V. It was implemented with reference to the AArch64 SVE version and riscv-v-spec v1.0 [1].
>> 
>> For example, AndReductionV is implemented as follows:
>> 
>> 
>> diff --git a/src/hotspot/cpu/riscv/riscv_v.ad b/src/hotspot/cpu/riscv/riscv_v.ad
>> index 0ef36fdb292..c04962993c0 100644
>> --- a/src/hotspot/cpu/riscv/riscv_v.ad
>> +++ b/src/hotspot/cpu/riscv/riscv_v.ad
>> @@ -63,7 +63,6 @@ source %{
>>        case Op_ExtractS:
>>        case Op_ExtractUB:
>>        // Vector API specific
>> -      case Op_AndReductionV:
>>        case Op_OrReductionV:
>>        case Op_XorReductionV:
>>        case Op_LoadVectorGather:
>> @@ -785,6 +784,120 @@ instruct vnegD(vReg dst, vReg src) %{
>>    ins_pipe(pipe_slow);
>>  %}
>>  
>> +// vector and reduction
>> +
>> +instruct reduce_andI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) %{
>> +  predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_INT);
>> +  match(Set dst (AndReductionV src1 src2));
>> +  effect(TEMP tmp);
>> +  ins_cost(VEC_COST);
>> +  format %{ "vmv.s.x $tmp, $src1\t#@reduce_andI\n\t"
>> +            "vredand.vs $tmp, $src2, $tmp\n\t"
>> +            "vmv.x.s  $dst, $tmp" %}
>> +  ins_encode %{
>> +    __ vsetvli(t0, x0, Assembler::e32);
>> +    __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register);
>> +    __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg),
>> +                  as_VectorRegister($tmp$$reg));
>> +    __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg));
>> +  %}
>> +  ins_pipe(pipe_slow);
>> +%}
>> +
>> +instruct reduce_andL(iRegLNoSp dst, iRegL src1, vReg src2, vReg tmp) %{
>> +  predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_LONG);
>> +  match(Set dst (AndReductionV src1 src2));
>> +  effect(TEMP tmp);
>> +  ins_cost(VEC_COST);
>> +  format %{ "vmv.s.x $tmp, $src1\t#@reduce_andL\n\t"
>> +            "vredand.vs $tmp, $src2, $tmp\n\t"
>> +            "vmv.x.s  $dst, $tmp" %}
>> +  ins_encode %{
>> +    __ vsetvli(t0, x0, Assembler::e64);
>> +    __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register);
>> +    __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg),
>> +                  as_VectorRegister($tmp$$reg));
>> +    __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg));
>> +  %}
>> 
>> 
>> 
>> After this patch, the Vector API can use RVV with the `-XX:+UseRVV` parameter when executing Java programs on a RISC-V RVV 1.0 platform. Tests [2] and [3] exercise this node, and the implementation passes them.
>> 
>> hsdis is currently unable to disassemble RVV instructions, so the test case was run with the `-XX:+PrintAssembly -Xcomp -XX:-TieredCompilation -XX:+LogCompilation -XX:LogFile=compile.log` parameters instead. The relevant OptoAssembly output in the compilation log is as follows:
>> 
>> 
>> 2a8     B22: #	out( B14 B23 ) <- in( B21 B31 )  Freq: 32.1131
>> 2a8     lwu  R28, [R9, #8]	# loadNKlass, compressed class ptr, #@loadNKlass
>> 2ac     decode_klass_not_null  R14, R28	#@decodeKlass_not_null
>> 2b8     ld  R30, [R14, #40]	# class, #@loadKlass
>> 2bc     li R7, #-1	# int, #@loadConI
>> 2c0     vmv.s.x V1, R7	#@reduce_andI
>> 	vredand.vs V1, V2, V1
>> 	vmv.x.s  R28, V1
>> 2d0     mv  R7, precise jdk/internal/vm/vector/VectorSupport$ReductionOperation: 0x000000408c4f6220:Constant:exact *	# ptr, #@loadConP
>> 2e8     beq  R30, R7, B14	#@cmpP_branch  P=0.830000 C=-1.000000
>> 
>> 
>> There is no hardware implementation of RISC-V RVV 1.0, so the tests were performed on qemu with the parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. For the `ANDReduceInt256VectorTests` and `ANDReduceLong256VectorTests` test cases under qemu with `-XX:+UseRVV` turned on, implementing this node reduces the execution time of this method by about 50.7% compared to the RVV version without the node. With the node implemented, comparing the C2 output with and without `-XX:+UseRVV` shows that enabling the parameter reduces the number of assembly instructions by about 50% [4].
>> 
>> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#vector-reduction-operations
>> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests
>> [3] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Long256VectorTests.java#ANDReduceLong256VectorTests
>> [4] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md
>> 
>> ## Testing:
>> - hotspot and jdk tier1 on unmatched board without new failures
>> - test/jdk/jdk/incubator/vector/Int256VectorTests.java with fastdebug on qemu
>> - test/jdk/jdk/incubator/vector/Long256VectorTests.java with fastdebug on qemu
>
> src/hotspot/cpu/riscv/riscv_v.ad line 838:
> 
>> 836:     __ vredor_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg),
>> 837:                   as_VectorRegister($tmp$$reg));
>> 838:     __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg));
> 
> This is basically a shared code pattern for OrReductionV, AndReductionV, and XorReductionV. Maybe a common method could help simplify the code.

@TheShermanTanker Understood. I think AddReductionVI and AddReductionVL can be simplified in the same way. I will submit a new PR after testing and provide a general simplified method in it.
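
To make the shared pattern concrete, here is a minimal scalar sketch in Java of what the three-instruction sequence computes. It is only an illustration, not the code from the patch or the planned follow-up: the class `ReductionSketch` and its methods are hypothetical names, and the lanes of `src2` are modelled as a plain `int[]`.

```java
import java.util.function.IntBinaryOperator;

// Hypothetical scalar model of the reduction pattern; not HotSpot code.
final class ReductionSketch {

    // Models the common shape shared by the And/Or/Xor reductions.
    static int reduce(IntBinaryOperator op, int src1, int[] src2) {
        int tmp = src1;                      // vmv.s.x     tmp, src1
        for (int lane : src2) {              // vred<op>.vs tmp, src2, tmp
            tmp = op.applyAsInt(tmp, lane);
        }
        return tmp;                          // vmv.x.s     dst, tmp
    }

    static int reduceAnd(int src1, int[] src2) {
        return reduce((a, b) -> a & b, src1, src2);
    }

    static int reduceOr(int src1, int[] src2) {
        return reduce((a, b) -> a | b, src1, src2);
    }

    static int reduceXor(int src1, int[] src2) {
        return reduce((a, b) -> a ^ b, src1, src2);
    }
}
```

Only the operator differs between the three variants, which is why a single helper parameterized on the reduction opcode looks feasible. In the OptoAssembly log above, the scalar input of the AND reduction is -1 (all bits set), so in this model `reduceAnd(-1, lanes)` returns the AND of the lanes alone.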

-------------

PR: https://git.openjdk.org/jdk/pull/10691

