Speeding up copy_memory stub

Tue Nov 8 16:10:19 UTC 2022

Hello,
I have found the issue. It was in what the code does after the copy32 loop ends:
    __ beqz(cnt, done); // if that's all, we are done

    __ addi(tmp4, cnt, -8); // if not, copy the remainder
    __ bltz(tmp4, copy_small); // cnt < 8, go to copy_small, else fall through to copy8

When beqz(cnt, done) came after bltz(tmp4, copy_small), it could never be taken: with cnt == 0, tmp4 is -8, so bltz had already branched to copy_small, and the stub copied more than required.
Moving beqz before bltz fixed the issue.
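
To illustrate why the order of the two branches matters, here is a minimal C++ model of the dispatch that follows the copy32 loop. It is only a sketch: the enum Next and the functions dispatch_fixed and dispatch_buggy are hypothetical names made up for this example, not code from the stub, and the buggy variant is my reading of the order described above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model: where control goes after the copy32 loop,
// given cnt = number of bytes still to copy.
enum class Next { Done, CopySmall, Copy8 };

// Fixed order: the zero check comes first.
Next dispatch_fixed(int64_t cnt) {
    if (cnt == 0) return Next::Done;        // beqz(cnt, done)
    int64_t tmp4 = cnt - 8;                 // addi(tmp4, cnt, -8)
    if (tmp4 < 0) return Next::CopySmall;   // bltz(tmp4, copy_small)
    return Next::Copy8;                     // fall through to copy8
}

// Assumed buggy order: the zero check comes after bltz.
Next dispatch_buggy(int64_t cnt) {
    int64_t tmp4 = cnt - 8;                 // addi(tmp4, cnt, -8)
    if (tmp4 < 0) return Next::CopySmall;   // taken for cnt == 0 as well
    if (cnt == 0) return Next::Done;        // never taken: cnt == 0 implies tmp4 < 0
    return Next::Copy8;
}
```

In this model the two variants differ only for cnt == 0: the fixed order goes to done, while the buggy order enters copy_small with nothing left to copy.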
I have updated the commit so that anyone can try it [1].
I’ll run some more tests and, if no issues are found, submit a PR.
I also want to get rid of the data dependency here:
    __ addi(cnt, cnt, -wordSize*4);
    __ addi(tmp4, cnt, -32);
    __ bgez(tmp4, copy32); // cnt >= 32, do next loop

by rewriting it as follows:

    __ addi(tmp4, cnt, -(32+wordSize*4));
    __ addi(cnt, cnt, -wordSize*4);
    __ bgez(tmp4, copy32); // cnt >= 32, do next loop

This makes the two addi instructions independent of each other, since both read the old value of cnt, and allows them to be scheduled concurrently.
I have tested this change independently of the rest of the patch and found no performance difference on three different microarchitectures (one in-order, two out-of-order).
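
As a sanity check that the two sequences compute the same values, here is a small C++ sketch of both loop tails. The struct Regs and the functions tail_dependent and tail_independent are hypothetical names for this example only, and wordSize = 8 is an assumption (64-bit target).

```cpp
#include <cassert>
#include <cstdint>

// Assumption: wordSize is 8 bytes, as on a 64-bit target.
constexpr int64_t wordSize = 8;

// Register values after the two addi instructions.
struct Regs {
    int64_t cnt;
    int64_t tmp4;
};

// Original sequence: the second addi needs the result of the first.
Regs tail_dependent(int64_t cnt) {
    cnt = cnt - wordSize * 4;                  // addi(cnt, cnt, -wordSize*4)
    int64_t tmp4 = cnt - 32;                   // addi(tmp4, cnt, -32), reads the new cnt
    return {cnt, tmp4};
}

// Proposed sequence: both addi instructions read the old cnt.
Regs tail_independent(int64_t cnt) {
    int64_t tmp4 = cnt - (32 + wordSize * 4);  // addi(tmp4, cnt, -(32+wordSize*4))
    cnt = cnt - wordSize * 4;                  // addi(cnt, cnt, -wordSize*4)
    return {cnt, tmp4};
}
```

Both versions leave the same cnt and tmp4, so bgez(tmp4, copy32) takes the same decision in either case. With the assumed wordSize the combined immediate is -64, which fits in the 12-bit signed immediate of addi.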

Any comments are welcome.

Regards, Vladimir

[1] https://github.com/VladimirKempik/jdk/commit/06d21c7f583b19009b5ac1f63462475d264257a4
> Hello.
> Currently (if RVV is not used), copy_memory is not very efficient.
> At best it copies just 8 bytes per loop iteration (copy8 label, one ld, one sd).
> 
> I propose we use a faster version when possible:
> 4 ld in a row, then 4 sd, copying 32 bytes per loop iteration, similar to [1].
> 
> I have made a prototype [2]; see the copy32 label there. It also has some comments on other parts of the copy_memory stub.
> Here are the results of JMH testing on an rvb-ice T-Head C910 board:
> 
> Before (copy8 only):
> Benchmark                         (size)   Mode  Cnt     Score     Error   Units
> ArrayCopyObject.conjoint_micro        31  thrpt   25  6653.095 ± 251.565  ops/ms
> ArrayCopyObject.conjoint_micro        63  thrpt   25  4933.970 ±  77.559  ops/ms
> ArrayCopyObject.conjoint_micro       127  thrpt   25  3627.454 ±  34.589  ops/ms
> ArrayCopyObject.conjoint_micro      2047  thrpt   25   368.249 ±   0.453  ops/ms
> ArrayCopyObject.conjoint_micro      4095  thrpt   25   187.776 ±   0.306  ops/ms
> ArrayCopyObject.conjoint_micro      8191  thrpt   25    94.477 ±   0.340  ops/ms
> 
> After (with copy32):
> Benchmark                         (size)   Mode  Cnt     Score     Error   Units
> ArrayCopyObject.conjoint_micro        31  thrpt   25  7620.546 ±  69.756  ops/ms
> ArrayCopyObject.conjoint_micro        63  thrpt   25  6677.978 ±  33.112  ops/ms
> ArrayCopyObject.conjoint_micro       127  thrpt   25  5206.973 ±  22.612  ops/ms
> ArrayCopyObject.conjoint_micro      2047  thrpt   25   653.655 ±  31.494  ops/ms
> ArrayCopyObject.conjoint_micro      4095  thrpt   25   352.905 ±   7.390  ops/ms
> ArrayCopyObject.conjoint_micro      8191  thrpt   25   165.127 ±   0.832  ops/ms
> 
> However, I still have some issues with the code: when the copy mode is (!is_aligned and !is_backward), I get ClassNotFound exceptions from the class loader while trying to run the JMH tests.
> I think this is related to my patch. I have made a simple workaround for this case [3] so that I can take some measurements.
> 
> Any help with catching these bugs is highly appreciated.
> 
> Best Regards, Vladimir.
> [1] https://github.com/eblot/newlib/blob/master/newlib/libc/string/memcpy.c
> [2] https://github.com/VladimirKempik/jdk/commit/e113d454dc2808889906eceaa1fb9cd560140fbc
> [3] https://github.com/VladimirKempik/jdk/commit/e113d454dc2808889906eceaa1fb9cd560140fbc#r89241535