RFR: 8295261: RISC-V: Support ReductionV instructions for Vector API [v6]

Sat Nov 5 14:52:22 UTC 2022

> Currently, certain vector-specific instructions in C2 are not implemented on RISC-V. This patch adds support for `AndReductionV`, `OrReductionV`, and `XorReductionV` on RISC-V. The implementation follows the SVE version for AArch64 and riscv-v-spec v1.0 [1].
> 
> For example, AndReductionV is implemented as follows:
> 
> 
> diff --git a/src/hotspot/cpu/riscv/riscv_v.ad b/src/hotspot/cpu/riscv/riscv_v.ad
> index 0ef36fdb292..c04962993c0 100644
> --- a/src/hotspot/cpu/riscv/riscv_v.ad
> +++ b/src/hotspot/cpu/riscv/riscv_v.ad
> @@ -63,7 +63,6 @@ source %{
>        case Op_ExtractS:
>        case Op_ExtractUB:
>        // Vector API specific
> -      case Op_AndReductionV:
>        case Op_OrReductionV:
>        case Op_XorReductionV:
>        case Op_LoadVectorGather:
> @@ -785,6 +784,120 @@ instruct vnegD(vReg dst, vReg src) %{
>    ins_pipe(pipe_slow);
>  %}
>  
> +// vector and reduction
> +
> +instruct reduce_andI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) %{
> +  predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_INT);
> +  match(Set dst (AndReductionV src1 src2));
> +  effect(TEMP tmp);
> +  ins_cost(VEC_COST);
> +  format %{ "vmv.s.x $tmp, $src1\t#@reduce_andI\n\t"
> +            "vredand.vs $tmp, $src2, $tmp\n\t"
> +            "vmv.x.s  $dst, $tmp" %}
> +  ins_encode %{
> +    __ vsetvli(t0, x0, Assembler::e32);
> +    __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register);
> +    __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg),
> +                  as_VectorRegister($tmp$$reg));
> +    __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg));
> +  %}
> +  ins_pipe(pipe_slow);
> +%}
> +
> +instruct reduce_andL(iRegLNoSp dst, iRegL src1, vReg src2, vReg tmp) %{
> +  predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_LONG);
> +  match(Set dst (AndReductionV src1 src2));
> +  effect(TEMP tmp);
> +  ins_cost(VEC_COST);
> +  format %{ "vmv.s.x $tmp, $src1\t#@reduce_andL\n\t"
> +            "vredand.vs $tmp, $src2, $tmp\n\t"
> +            "vmv.x.s  $dst, $tmp" %}
> +  ins_encode %{
> +    __ vsetvli(t0, x0, Assembler::e64);
> +    __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register);
> +    __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg),
> +                  as_VectorRegister($tmp$$reg));
> +    __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg));
> +  %}
> 
> 
> 
> After this patch, the Vector API can use RVV for these nodes when Java programs are run with the `-XX:+UseRVV` parameter on a RISC-V RVV 1.0 platform. Tests [2] and [3] exercise the implementation of this node, and both pass.
> 
> hsdis is currently unable to disassemble RVV instructions, so the output below is taken from the OptoAssembly section of the compilation log instead. The log was produced by running the test case with `-XX:+PrintAssembly -Xcomp -XX:-TieredCompilation -XX:+LogCompilation -XX:LogFile=compile.log`:
> 
> 
> 2a8     B22: #	out( B14 B23 ) <- in( B21 B31 )  Freq: 32.1131
> 2a8     lwu  R28, [R9, #8]	# loadNKlass, compressed class ptr, #@loadNKlass
> 2ac     decode_klass_not_null  R14, R28	#@decodeKlass_not_null
> 2b8     ld  R30, [R14, #40]	# class, #@loadKlass
> 2bc     li R7, #-1	# int, #@loadConI
> 2c0     vmv.s.x V1, R7	#@reduce_andI
> 	vredand.vs V1, V2, V1
> 	vmv.x.s  R28, V1
> 2d0     mv  R7, precise jdk/internal/vm/vector/VectorSupport$ReductionOperation: 0x000000408c4f6220:Constant:exact *	# ptr, #@loadConP
> 2e8     beq  R30, R7, B14	#@cmpP_branch  P=0.830000 C=-1.000000
> 
> 
> There is no hardware implementation of RISC-V RVV 1.0, so the tests were performed on qemu with the parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. Under qemu with `-XX:+UseRVV` turned on, the execution time of the `ANDReduceInt256VectorTests` and `ANDReduceLong256VectorTests` test cases is reduced by about 50.7% compared to the RVV version without this node implemented. With this node implemented, enabling `-XX:+UseRVV` also reduces the number of C2 assembly instructions by about 50% compared to running without it [4].
> 
> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#vector-reduction-operations
> [2] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests
> [3] https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Long256VectorTests.java#ANDReduceLong256VectorTests
> [4] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md
> 
> ## Testing:
> - hotspot and jdk tier1 on an Unmatched board, without new failures
> - test/jdk/jdk/incubator/vector/Int256VectorTests.java with fastdebug on qemu
> - test/jdk/jdk/incubator/vector/Long256VectorTests.java with fastdebug on qemu
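
For readers unfamiliar with the RVV reduction instructions, the three-instruction sequence in the quoted `reduce_andI` and `reduce_andL` rules can be modelled in plain scalar Java. This is only a sketch of the semantics, assuming all lanes are active (as after `vsetvli t0, x0, ...`); the class and method names are hypothetical and are not part of the patch or of the JDK.

```java
public final class ReduceAndSketch {

    // Scalar model of reduce_andI (hypothetical helper, not JDK code).
    static int reduceAndInt(int src1, int[] src2) {
        int tmp0 = src1;            // vmv.s.x    tmp, src1      : tmp[0] = src1
        for (int lane : src2) {     // vredand.vs tmp, src2, tmp : tmp[0] = tmp[0] & AND(src2[*])
            tmp0 &= lane;
        }
        return tmp0;                // vmv.x.s    dst, tmp       : dst = tmp[0]
    }

    // Same model for reduce_andL, where the element width is e64.
    static long reduceAndLong(long src1, long[] src2) {
        long tmp0 = src1;
        for (long lane : src2) {
            tmp0 &= lane;
        }
        return tmp0;
    }

    public static void main(String[] args) {
        // Eight 32-bit lanes, matching vlen=256 with e32.
        int[] lanes = {0xFF, 0x7F, 0x3F, 0x1F, 0xFF, 0x7F, 0x3F, 0x1F};
        // -1 (all bits set) is the identity for AND, so the result is the
        // AND of the lanes alone.
        System.out.println(reduceAndInt(-1, lanes));   // prints 31
        // A different scalar input takes part in the reduction as well.
        System.out.println(reduceAndInt(0x10, lanes)); // prints 16
    }
}
```

In the OptoAssembly excerpt above, `li R7, #-1` loads this identity value into the scalar input before `vmv.s.x`, which is why the scalar operand does not change the result there.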

Gui Cao has updated the pull request incrementally with one additional commit since the last revision:

  Format code

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/10691/files
  - new: https://git.openjdk.org/jdk/pull/10691/files/3e31a773..a7db305d

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=10691&range=05
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10691&range=04-05

  Stats: 15 lines in 3 files changed: 1 ins; 0 del; 14 mod
  Patch: https://git.openjdk.org/jdk/pull/10691.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/10691/head:pull/10691

PR: https://git.openjdk.org/jdk/pull/10691