Speeding up copy_memory stub
Vladimir Kempik
vladimir.kempik at gmail.com
Tue Nov 8 16:10:19 UTC 2022
Hello
I have found the issue; it was in what the code does after the copy32 loop ends:
__ beqz(cnt, done); //if that's all - done
__ addi(tmp4, cnt, -8); // if not - copy the remainder
__ bltz(tmp4, copy_small); // cnt < 8, go to copy_small, else fall through to copy8
When beqz(cnt, done) came after bltz(tmp4, copy_small), it was never taken: with cnt == 0, tmp4 is -8, so bltz branched to copy_small first and the stub copied more than required.
Moving beqz before bltz fixed the issue.
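To illustrate why the order matters, here is a minimal C++ model of the dispatch after the copy32 loop. It is only a sketch, not the stub itself: the enum and the function names (dispatch_fixed, dispatch_old, check_dispatch) are made up for illustration, and only the branch conditions mirror the assembly above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the tail dispatch; not HotSpot code.
enum class Target { Done, CopySmall, Copy8 };

// Fixed order: beqz(cnt, done) is checked before bltz(tmp4, copy_small).
Target dispatch_fixed(int64_t cnt) {
  if (cnt == 0) return Target::Done;       // beqz(cnt, done)
  int64_t tmp4 = cnt - 8;                  // addi(tmp4, cnt, -8)
  if (tmp4 < 0) return Target::CopySmall;  // bltz(tmp4, copy_small)
  return Target::Copy8;                    // fall through to copy8
}

// Old order: bltz is checked first, so cnt == 0 (tmp4 == -8) goes to
// copy_small and the beqz behind it can never be taken.
Target dispatch_old(int64_t cnt) {
  int64_t tmp4 = cnt - 8;                  // addi(tmp4, cnt, -8)
  if (tmp4 < 0) return Target::CopySmall;  // bltz(tmp4, copy_small)
  if (cnt == 0) return Target::Done;       // beqz(cnt, done): dead, cnt >= 8 here
  return Target::Copy8;
}

// The two orders differ only for cnt == 0.
void check_dispatch() {
  assert(dispatch_fixed(0) == Target::Done);
  assert(dispatch_old(0) == Target::CopySmall);
  for (int64_t cnt = 1; cnt < 32; ++cnt) {
    assert(dispatch_fixed(cnt) == dispatch_old(cnt));
  }
}
```

Calling check_dispatch() from a test driver exercises every remainder the copy32 loop can leave behind (0 to 31 bytes).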
I have updated the commit [1] for anyone who wants to try it.
I’ll run some more tests and, if no issues are found, submit a PR.
I also want to get rid of the data dependency here:
__ addi(cnt, cnt, -wordSize*4);
__ addi(tmp4, cnt, -32);
__ bgez(tmp4, copy32); // cnt >= 32, do next loop
by rewriting it this way:
__ addi(tmp4, cnt, -(32+wordSize*4));
__ addi(cnt, cnt, -wordSize*4);
__ bgez(tmp4, copy32); // cnt >= 32, do next loop
This makes the two addi instructions independent of each other, so they can be scheduled concurrently.
I have tested this change independently of the rest of this patch and found no performance difference on three different uarches (1 in-order, 2 OoO).
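As a sanity check that the reordering does not change behaviour, the two sequences can be modelled in plain C++. This is a sketch under the assumption that wordSize is 8 bytes (RV64); LoopState, step_dependent, step_independent and sequences_agree are hypothetical names used only here.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: wordSize is 8 bytes (RV64), so wordSize * 4 == 32.
constexpr int64_t wordSize = 8;

// Register values after the loop-control sequence (hypothetical helper type).
struct LoopState {
  int64_t cnt;
  int64_t tmp4;
};

// Original sequence: the second addi reads the cnt written by the first.
LoopState step_dependent(int64_t cnt) {
  cnt = cnt - wordSize * 4;   // addi(cnt, cnt, -wordSize*4)
  int64_t tmp4 = cnt - 32;    // addi(tmp4, cnt, -32), needs the new cnt
  return {cnt, tmp4};
}

// Proposed sequence: both addi read the old cnt, so neither waits on the other.
LoopState step_independent(int64_t cnt) {
  int64_t tmp4 = cnt - (32 + wordSize * 4);  // addi(tmp4, cnt, -(32+wordSize*4))
  cnt = cnt - wordSize * 4;                  // addi(cnt, cnt, -wordSize*4)
  return {cnt, tmp4};
}

// True if both sequences leave the same cnt and tmp4 for every count up to max_cnt.
bool sequences_agree(int64_t max_cnt) {
  for (int64_t cnt = 0; cnt <= max_cnt; ++cnt) {
    LoopState a = step_dependent(cnt);
    LoopState b = step_independent(cnt);
    if (a.cnt != b.cnt || a.tmp4 != b.tmp4) return false;
  }
  return true;
}
```

Since bgez(tmp4, copy32) only looks at tmp4, equal tmp4 values mean the loop takes the same branch in both versions.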
Any comments are welcome.
Regards, Vladimir
[1] https://github.com/VladimirKempik/jdk/commit/06d21c7f583b19009b5ac1f63462475d264257a4
> Hello.
> Currently (if RVV is not used), copy_memory is not doing so great.
> At best it copies just 8 bytes per loop iteration (copy8 label, one ld, one sd).
>
> I propose we use a faster version when possible:
> 4 ld in a row, then 4 sd, copying 32 bytes per loop iteration, similar to [1].
>
> I have made a prototype [2], check the copy32 label there. It also has some comments on other parts of the copy_memory stub.
> Here are the results of JMH testing on an rvb-ice T-Head C910 board:
>
> Before (copy8 only)
> Benchmark                       (size)   Mode  Cnt     Score     Error   Units
> ArrayCopyObject.conjoint_micro      31  thrpt   25  6653.095 ± 251.565  ops/ms
> ArrayCopyObject.conjoint_micro      63  thrpt   25  4933.970 ±  77.559  ops/ms
> ArrayCopyObject.conjoint_micro     127  thrpt   25  3627.454 ±  34.589  ops/ms
> ArrayCopyObject.conjoint_micro    2047  thrpt   25   368.249 ±   0.453  ops/ms
> ArrayCopyObject.conjoint_micro    4095  thrpt   25   187.776 ±   0.306  ops/ms
> ArrayCopyObject.conjoint_micro    8191  thrpt   25    94.477 ±   0.340  ops/ms
>
> After (with copy32)
>
> ArrayCopyObject.conjoint_micro      31  thrpt   25  7620.546 ±  69.756  ops/ms
> ArrayCopyObject.conjoint_micro      63  thrpt   25  6677.978 ±  33.112  ops/ms
> ArrayCopyObject.conjoint_micro     127  thrpt   25  5206.973 ±  22.612  ops/ms
> ArrayCopyObject.conjoint_micro    2047  thrpt   25   653.655 ±  31.494  ops/ms
> ArrayCopyObject.conjoint_micro    4095  thrpt   25   352.905 ±   7.390  ops/ms
> ArrayCopyObject.conjoint_micro    8191  thrpt   25   165.127 ±   0.832  ops/ms
>
> However, I still have some issues with the code: when the copy mode is (!is_aligned and !is_backward), I’m getting ClassNotFound exceptions from classLoader while trying to run the JMH tests.
> I think it’s related to my patch; I have made a simple workaround for this case [3] to be able to take some measurements.
>
> Any help with catching these bugs is highly appreciated.
>
> Best Regards, Vladimir.
> [1] https://github.com/eblot/newlib/blob/master/newlib/libc/string/memcpy.c
> [2] https://github.com/VladimirKempik/jdk/commit/e113d454dc2808889906eceaa1fb9cd560140fbc
> [3] https://github.com/VladimirKempik/jdk/commit/e113d454dc2808889906eceaa1fb9cd560140fbc#r89241535