RFR: 8295261: RISC-V: Support ReductionV instructions for Vector API
Gui Cao
gcao at openjdk.org
Fri Nov 4 10:05:20 UTC 2022
On Fri, 4 Nov 2022 09:08:38 GMT, Eric Liu <eliu at openjdk.org> wrote:
>> Currently, certain vector-specific instructions in C2 are not implemented for RISC-V. This patch adds support for `AndReductionV`, `OrReductionV`, and `XorReductionV` on RISC-V. The implementation follows the AArch64 SVE version and riscv-v-spec v1.0 [1].
>>
>> For example, AndReductionV is implemented as follows:
>>
>>
>> diff --git a/src/hotspot/cpu/riscv/riscv_v.ad b/src/hotspot/cpu/riscv/riscv_v.ad
>> index 0ef36fdb292..c04962993c0 100644
>> --- a/src/hotspot/cpu/riscv/riscv_v.ad
>> +++ b/src/hotspot/cpu/riscv/riscv_v.ad
>> @@ -63,7 +63,6 @@ source %{
>> case Op_ExtractS:
>> case Op_ExtractUB:
>> // Vector API specific
>> - case Op_AndReductionV:
>> case Op_OrReductionV:
>> case Op_XorReductionV:
>> case Op_LoadVectorGather:
>> @@ -785,6 +784,120 @@ instruct vnegD(vReg dst, vReg src) %{
>> ins_pipe(pipe_slow);
>> %}
>>
>> +// vector and reduction
>> +
>> +instruct reduce_andI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) %{
>> + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_INT);
>> + match(Set dst (AndReductionV src1 src2));
>> + effect(TEMP tmp);
>> + ins_cost(VEC_COST);
>> + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andI\n\t"
>> + "vredand.vs $tmp, $src2, $tmp\n\t"
>> + "vmv.x.s $dst, $tmp" %}
>> + ins_encode %{
>> + __ vsetvli(t0, x0, Assembler::e32);
>> + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register);
>> + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg),
>> + as_VectorRegister($tmp$$reg));
>> + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg));
>> + %}
>> + ins_pipe(pipe_slow);
>> +%}
>> +
>> +instruct reduce_andL(iRegLNoSp dst, iRegL src1, vReg src2, vReg tmp) %{
>> + predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_LONG);
>> + match(Set dst (AndReductionV src1 src2));
>> + effect(TEMP tmp);
>> + ins_cost(VEC_COST);
>> + format %{ "vmv.s.x $tmp, $src1\t#@reduce_andL\n\t"
>> + "vredand.vs $tmp, $src2, $tmp\n\t"
>> + "vmv.x.s $dst, $tmp" %}
>> + ins_encode %{
>> + __ vsetvli(t0, x0, Assembler::e64);
>> + __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register);
>> + __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg),
>> + as_VectorRegister($tmp$$reg));
>> + __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg));
>> + %}
>>
>>
>>
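The three-instruction sequence in the diff above can be read as a plain scalar fold. The following is a minimal sketch in Java of what `reduce_andI` and `reduce_andL` compute, assuming every lane of `src2` is active; the class and method names are hypothetical and are not part of HotSpot.

```java
// Scalar model of the reduce_andI / reduce_andL sequences.
// AndReductionModel and its methods are illustrative names, not HotSpot APIs.
final class AndReductionModel {

    // Models: vmv.s.x tmp, src1 ; vredand.vs tmp, src2, tmp ; vmv.x.s dst, tmp
    static int reduceAndI(int src1, int[] src2) {
        int tmp0 = src1;            // vmv.s.x: element 0 of tmp receives the scalar input
        for (int lane : src2) {     // vredand.vs: AND each lane of src2 into tmp[0]
            tmp0 &= lane;
        }
        return tmp0;                // vmv.x.s: element 0 of tmp goes back to a scalar register
    }

    // Same pattern for 64-bit elements (the e64 variant).
    static long reduceAndL(long src1, long[] src2) {
        long tmp0 = src1;
        for (long lane : src2) {
            tmp0 &= lane;
        }
        return tmp0;
    }
}
```

Passing -1 (all bits set) as `src1` leaves the result equal to the AND of the lanes alone, since -1 is the identity for AND.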
>> After this patch, the Vector API can use RVV when Java programs are run with the `-XX:+UseRVV` parameter on a RISC-V RVV 1.0 platform. Tests [2] and [3] exercise this node, and the implementation passes them.
>>
>> The test case was executed with the `-XX:+PrintAssembly -Xcomp -XX:-TieredCompilation -XX:+LogCompilation -XX:LogFile=compile.log` parameters. Because hsdis is currently unable to disassemble RVV instructions, the relevant OptoAssembly output from the compilation log is shown instead:
>>
>>
>> 2a8 B22: # out( B14 B23 ) <- in( B21 B31 ) Freq: 32.1131
>> 2a8 lwu R28, [R9, #8] # loadNKlass, compressed class ptr, #@loadNKlass
>> 2ac decode_klass_not_null R14, R28 #@decodeKlass_not_null
>> 2b8 ld R30, [R14, #40] # class, #@loadKlass
>> 2bc li R7, #-1 # int, #@loadConI
>> 2c0 vmv.s.x V1, R7 #@reduce_andI
>> vredand.vs V1, V2, V1
>> vmv.x.s R28, V1
>> 2d0 mv R7, precise jdk/internal/vm/vector/VectorSupport$ReductionOperation: 0x000000408c4f6220:Constant:exact * # ptr, #@loadConP
>> 2e8 beq R30, R7, B14 #@cmpP_branch P=0.830000 C=-1.000000
>>
>>
>> There is no hardware implementation of RISC-V RVV 1.0, so the tests were performed on qemu with the parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. Running the `ANDReduceInt256VectorTests` and `ANDReduceLong256VectorTests` test cases under qemu with `-XX:+UseRVV` turned on reduces the execution time of this method by about 50.7% compared to the RVV version without this node implemented. With this node implemented, enabling `-XX:+UseRVV` also reduces the number of C2 assembly instructions by about 50% compared to running without it [4].
>>
>> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#vector-reduction-operations
>> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests
>> [3] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Long256VectorTests.java#ANDReduceLong256VectorTests
>> [4] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md
>>
>> ## Testing:
>> - hotspot and jdk tier1 on unmatched board without new failures
>> - test/jdk/jdk/incubator/vector/Int256VectorTests.java with fastdebug on qemu
>> - test/jdk/jdk/incubator/vector/Long256VectorTests.java with fastdebug on qemu
>
> src/hotspot/cpu/riscv/riscv_v.ad line 838:
>
>> 836: __ vredor_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg),
>> 837: as_VectorRegister($tmp$$reg));
>> 838: __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg));
>
> This is basically a shared code pattern for OrReductionV, AndReductionV, and XorReductionV. Maybe a common method can help to simplify the code.
@TheShermanTanker I get it. I think that AddReductionVI and AddReductionVL can also be simplified in the same way. I will submit a new PR after testing and provide a general simplified method in it.
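To illustrate the shared pattern being discussed, here is a rough sketch in Java of one routine parameterized by the reduction operator. It only models the semantics on scalar arrays; `SharedReductionModel`, `ReductionOp`, and `reduce` are hypothetical names and do not describe the helper that the follow-up PR will actually add.

```java
import java.util.function.IntBinaryOperator;

// Illustrative model only: one fold routine shared by several reduction kinds.
final class SharedReductionModel {

    // Each constant carries the binary operator that distinguishes the reductions.
    enum ReductionOp {
        AND((a, b) -> a & b),
        OR((a, b) -> a | b),
        XOR((a, b) -> a ^ b),
        ADD((a, b) -> a + b);

        final IntBinaryOperator fn;

        ReductionOp(IntBinaryOperator fn) {
            this.fn = fn;
        }
    }

    // Seed with the scalar input, fold every lane, return the scalar result.
    // Only the operator differs between AND, OR, XOR, and ADD.
    static int reduce(ReductionOp op, int src1, int[] src2) {
        int acc = src1;
        for (int lane : src2) {
            acc = op.fn.applyAsInt(acc, lane);
        }
        return acc;
    }
}
```

The seed, fold, and extract steps are identical across the operators, which is why a single method taking the operator as an argument could replace the per-operator copies.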
-------------
PR: https://git.openjdk.org/jdk/pull/10691