RFR: 8309419: RISC-V: Relax register constraint for AddReductionVF & AddReductionVD nodes
Fei Yang
fyang at openjdk.org
Mon Jun 5 08:54:05 UTC 2023
On Mon, 5 Jun 2023 06:09:55 GMT, Gui Cao <gcao at openjdk.org> wrote:
> Hi, we noticed that the C2 AddReductionVF and AddReductionVD nodes constrain the src1 and dst registers to be the same register. This is not required, so this patch relaxes the register constraint for AddReductionVF/AddReductionVD. For reference, other CPUs, such as x86 and Arm NEON, do not require the same register either [1]. arm64 SVE does constrain them to be the same register because it uses the FADDA instruction [2], which adds the SIMD&FP scalar source and all active lanes of the vector source, in strict order, and places the result in the SIMD&FP scalar source register. So for arm64 SVE, the two registers must be the same.
>
> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/aarch64/aarch64_vector.ad#L2897-L2907
> [2] https://developer.arm.com/documentation/ddi0596/2020-12/SVE-Instructions/FADDA--Floating-point-add-strictly-ordered-reduction--accumulating-in-scalar-
>
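As a rough illustration of the semantics discussed above, here is a minimal scalar sketch of an ordered floating-point add reduction. The class and method names (`OrderedAddReduction`, `reduceAdd`) are hypothetical and are not taken from the patch or from HotSpot; the sketch only models "scalar input plus all vector lanes, accumulated in lane order" and shows that the result does not have to overwrite the scalar input.

```java
public class OrderedAddReduction {
    // Scalar model (an assumption for illustration, not HotSpot code):
    // result = (((src1 + lanes[0]) + lanes[1]) + ...) in lane order.
    static float reduceAdd(float src1, float[] lanes) {
        float dst = src1;           // dst is a separate value; src1 is left untouched
        for (float lane : lanes) {
            dst += lane;            // accumulate strictly in order, low lane to high lane
        }
        return dst;
    }

    public static void main(String[] args) {
        float src1 = 1.0f;
        float[] lanes = {1.0f, 2.0f, 3.0f, 4.0f};
        float dst = reduceAdd(src1, lanes);
        System.out.println(dst);    // prints 11.0
        System.out.println(src1);   // prints 1.0 (the scalar input is still live)
    }
}
```

In this model nothing forces `dst` and `src1` to share storage, which is the intuition behind dropping the same-register constraint when the underlying instruction is not destructive on its scalar operand.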
> ### AddReductionVF/AddReductionVD
> We can use Float256VectorTests.java and Double256VectorTests.java to
> emit these nodes. The compilation log is as follows:
> #### AddReductionVF
> Before this patch:
>
> 0f6 B15: # out( B61 B16 ) <- in( B14 ) Freq: 55.8033
> 0f6 # castII of R19, #@castII
> 0f6 addw R10, R19, zr #@convI2L_reg_reg
> 0fa slli R10, R10, (#2 & 0x3f) #@lShiftL_reg_imm
> 0fc add R11, R31, R10 # ptr, #@addP_reg_reg
> 100 addi R11, R11, #16 # ptr, #@addP_reg_imm
> 102 loadV V1, [R11] # vector (rvv)
> 10a spill F0 -> F1 # spill size = 32
> 10e reduce_addF F1, F1, V1 # KILL V2
> 11e bgeu R19, R29, B61 #@cmpU_branch P=0.000001 C=-1.000000
>
> After this patch (saving a spill operation):
>
> 0f6 B15: # out( B61 B16 ) <- in( B14 ) Freq: 55.8033
> 0f6 # castII of R19, #@castII
> 0f6 addw R10, R19, zr #@convI2L_reg_reg
> 0fa slli R10, R10, (#2 & 0x3f) #@lShiftL_reg_imm
> 0fc add R11, R31, R10 # ptr, #@addP_reg_reg
> 100 addi R11, R11, #16 # ptr, #@addP_reg_imm
> 102 loadV V1, [R11] # vector (rvv)
> 10a reduce_addF F1, F0, V1 # KILL V2
> 11a bgeu R19, R29, B61 #@cmpU_branch P=0.000001 C=-1.000000
>
> #### AddReductionVD
> Before this patch:
>
> 0f4 B15: # out( B61 B16 ) <- in( B14 ) Freq: 55.8033
> 0f4 # castII of R9, #@castII
> 0f4 addw R10, R9, zr #@convI2L_reg_reg
> 0f8 slli R10, R10, (#3 & 0x3f) #@lShiftL_reg_imm
> 0fa add R11, R30, R10 # ptr, #@addP_reg_reg
> 0fe addi R11, R11, #16 # ptr, #@addP_reg_imm
> 100 loadV V1, [R11] # ve...
Looks reasonable to me. Thanks.
-------------
Marked as reviewed by fyang (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/14308#pullrequestreview-1462046926