RFR(M): 8158232: PPC64: improve byte, int and long array copy stubs by using VSX instructions

Wed Jun 1 22:12:01 UTC 2016

Hi Michihiro

A few things come to mind that could help address the questions
raised by Goetz:

* When implementing the short case, I could not see any gain from
unrolling the tight loop;
* I could see that setting an aggressive prefetch did help a lot;
* I think that aligning the backbranch target to at least 16 bytes is
the right thing to do, since according to [1]:

"Instructions read out of the I-cache are forwarded to the IBuffer as a
staging area for group formation. The IBuffer is arranged as a register
file where each row can hold up to four instructions (16-byte aligned
from the I-cache)"
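
To make the first two points concrete, here is a rough C++ sketch of a
disjoint copy loop that moves 16 long elements (128 bytes) per iteration
and prefetches ahead on both streams. It is only an illustration, not the
stub code from the webrev: the function name, the constant
kPrefetchDistance and its value of 256 bytes are my own assumptions, and
__builtin_prefetch is the GCC/Clang builtin, not a HotSpot API.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical tuning knob (not from the webrev): prefetch distance in bytes.
constexpr std::size_t kPrefetchDistance = 256;

// Compute the address `bytes` ahead of p without doing out-of-bounds
// pointer arithmetic; the result is only ever used as a prefetch hint.
static inline const void* ahead(const void* p, std::size_t bytes) {
    return reinterpret_cast<const void*>(
        reinterpret_cast<std::uintptr_t>(p) + bytes);
}

// Copy `count` 64-bit elements between disjoint arrays.
void copy_longs_disjoint(const std::int64_t* src, std::int64_t* dst,
                         std::size_t count) {
    std::size_t i = 0;
    // Main loop: 16 elements (128 bytes) per iteration.
    for (; i + 16 <= count; i += 16) {
        // Prefetch hints never fault, even past the end of the arrays.
        __builtin_prefetch(ahead(src + i, kPrefetchDistance), 0); // for read
        __builtin_prefetch(ahead(dst + i, kPrefetchDistance), 1); // for write
        for (std::size_t j = 0; j < 16; ++j) {
            dst[i + j] = src[i + j];
        }
    }
    // Tail: remaining elements, one at a time, not unrolled.
    for (; i < count; ++i) {
        dst[i] = src[i];
    }
}
```

Alignment of the backbranch target cannot be expressed in portable C++.
With GCC, -falign-loops=16 asks for the same effect on loop heads; in the
stub itself the alignment has to be requested from the assembler just
before the loop label is bound.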

And a nit: add a space after "//", capitalize the 'C', fix the typo in
"byte", and add a trailing period on:

//copy 16 elements (total 128 byte) a time

Regards,
Gustavo

[1] POWER8 Processor User’s Manual for the Single-Chip Module,
10 March 2015, Version 1.11, p. 207, section 10.1.6.

On 31-05-2016 12:36, Michihiro Horie wrote:
> 
> Dear all,
> 
> Could you please review the following webrev?
> 
> http://cr.openjdk.java.net/~mdoerr/8158232_PPC_vsx_copy/webrev.00/
> 
> This change improves performance of disjoint arraycopy of byte, int, and
> long by using VSX load/store instructions.
> 
> Discussion started from:
> http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/2016-May/002483.html
> 
> Performance improvement with micro benchmarks is shown in:
> http://mail.openjdk.java.net/pipermail/ppc-aix-port-dev/2016-May/002531.html
> 
> Thank you very much,
> 
> Best regards,
> --
> Michihiro Horie,
> IBM Research - Tokyo
>