RFR: 8296602: RISC-V: improve performance of copy_memory stub [v2]
Fei Yang
fyang at openjdk.org
Thu Nov 17 07:34:31 UTC 2022
On Wed, 9 Nov 2022 11:34:43 GMT, Vladimir Kempik <vkempik at openjdk.org> wrote:
>> Please review this change to improve the performance of copy_memory stub on risc-v
>>
>> This change has three parts
>> 1) use copy32 if possible to do 4 ld and 4 st per loop cycle
>> 2) don't produce precopy code if is_aligned is true, it's not executed.
>> 3) in the end of loop8 and loop32, remove data dependency between two addi opcodes, to allow them to be scheduled simultaneously
>>
>> testing: org.openjdk.bench.vm.compiler.ArrayCopyObject, hotspot_compiler_arraycopy, hotspot:tier1, hotspot:tier2 - all ok
>> hotspot:tier2 is on the way.
>>
>> and for the benchmark results, using
>> org.openjdk.bench.vm.compiler.ArrayCopyObject.conjoint_micro
>>
>> thead rvb-ice c910
>> thead
>>
>> Before ( copy8 only )
>> Benchmark (size) Mode Cnt Score Error Units
>> ArrayCopyObject.conjoint_micro 31 thrpt 25 6653.095 ± 251.565 ops/ms
>> ArrayCopyObject.conjoint_micro 63 thrpt 25 4933.970 ± 77.559 ops/ms
>> ArrayCopyObject.conjoint_micro 127 thrpt 25 3627.454 ± 34.589 ops/ms
>> ArrayCopyObject.conjoint_micro 2047 thrpt 25 368.249 ± 0.453 ops/ms
>> ArrayCopyObject.conjoint_micro 4095 thrpt 25 187.776 ± 0.306 ops/ms
>> ArrayCopyObject.conjoint_micro 8191 thrpt 25 94.477 ± 0.340 ops/ms
>>
>> after ( with copy32 )
>> ArrayCopyObject.conjoint_micro 31 thrpt 25 7620.546 ± 69.756 ops/ms
>> ArrayCopyObject.conjoint_micro 63 thrpt 25 6677.978 ± 33.112 ops/ms
>> ArrayCopyObject.conjoint_micro 127 thrpt 25 5206.973 ± 22.612 ops/ms
>> ArrayCopyObject.conjoint_micro 2047 thrpt 25 653.655 ± 31.494 ops/ms
>> ArrayCopyObject.conjoint_micro 4095 thrpt 25 352.905 ± 7.390 ops/ms
>> ArrayCopyObject.conjoint_micro 8191 thrpt 25 165.127 ± 0.832 ops/ms
>>
>> after ( copy32 with dead code elimination and independent addis )
>> ArrayCopyObject.conjoint_micro 31 thrpt 25 7576.346 ? 94.487 ops/ms
>> ArrayCopyObject.conjoint_micro 63 thrpt 25 6475.730 ? 252.590 ops/ms
>> ArrayCopyObject.conjoint_micro 127 thrpt 25 5221.764 ? 20.415 ops/ms
>> ArrayCopyObject.conjoint_micro 2047 thrpt 25 691.847 ? 1.102 ops/ms
>> ArrayCopyObject.conjoint_micro 4095 thrpt 25 360.269 ? 1.091 ops/ms
>> ArrayCopyObject.conjoint_micro 8191 thrpt 25 179.733 ? 3.012 ops/ms
>>
>> on hifive unmatched:
>>
>> before:
>> Benchmark (size) Mode Cnt Score Error Units
>> ArrayCopyObject.conjoint_micro 31 thrpt 25 5391.575 ± 152.984 ops/ms
>> ArrayCopyObject.conjoint_micro 63 thrpt 25 3700.946 ± 43.175 ops/ms
>> ArrayCopyObject.conjoint_micro 127 thrpt 25 2316.160 ± 24.734 ops/ms
>> ArrayCopyObject.conjoint_micro 2047 thrpt 25 188.616 ± 0.151 ops/ms
>> ArrayCopyObject.conjoint_micro 4095 thrpt 25 95.323 ± 0.053 ops/ms
>> ArrayCopyObject.conjoint_micro 8191 thrpt 25 46.935 ± 0.041 ops/ms
>>
>> after:
>> Benchmark (size) Mode Cnt Score Error Units
>> ArrayCopyObject.conjoint_micro 31 thrpt 25 6136.169 ± 330.409 ops/ms
>> ArrayCopyObject.conjoint_micro 63 thrpt 25 4924.020 ± 78.529 ops/ms
>> ArrayCopyObject.conjoint_micro 127 thrpt 25 3732.561 ± 89.606 ops/ms
>> ArrayCopyObject.conjoint_micro 2047 thrpt 25 431.103 ± 0.505 ops/ms
>> ArrayCopyObject.conjoint_micro 4095 thrpt 25 221.543 ± 0.363 ops/ms
>> ArrayCopyObject.conjoint_micro 8191 thrpt 25 100.586 ± 0.197 ops/ms
>
> Vladimir Kempik has updated the pull request incrementally with one additional commit since the last revision:
>
> remove excessive comments
Nice numbers! Overall looks good to me. Several minor nits.
src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 902:
> 900: * if ((dst % 8) == (src % 8)) {
> 901: * aligned;
> 902: * goto copy_big;
You might want to update code comment at line 884 too.
src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 1042:
> 1040: __ bind(copy32);
> 1041: if (is_backwards) {
> 1042: __ addi(src, src, -wordSize*4);
Can you leave a space before and after the operator here and other places?
src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 1059:
> 1057: __ addi(dst, dst, wordSize*4);
> 1058: }
> 1059: __ addi(tmp4, cnt, -(32+wordSize*4));
Can we use 'tmp' instead of 'tmp4' here? Then it will be consistent in register usage with other places where we check 'cnt'.
src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 1065:
> 1063: __ beqz(cnt, done); // if that's all - done
> 1064:
> 1065: __ addi(tmp4, cnt, -8); // if not - copy the reminder
Similar here.
src/hotspot/cpu/riscv/stubGenerator_riscv.cpp line 1079:
> 1077: __ addi(dst, dst, wordSize);
> 1078: }
> 1079: __ addi(tmp4, cnt, -(8+wordSize));
Similar here.
-------------
PR: https://git.openjdk.org/jdk/pull/11058
More information about the hotspot-compiler-dev
mailing list