[aarch64-port-dev ] arraycopy optimisations on aarch64

Edward Nevill edward.nevill at gmail.com
Wed Feb 17 19:29:18 UTC 2016


Hi,

There have been a number of ongoing efforts at optimising array copy recently.

Rather than have multiple webrevs and multiple JIRA issues I would like to collect all the efforts under a single JIRA issue. I have created the following JIRA issue for all work relating to optimising array copy.

https://bugs.openjdk.java.net/browse/JDK-8150082

We can then review all the array copy optimisation proposals on the aarch64-port-dev mailing list rather than cc'ing the whole of hotspot-compiler-dev with every intricate detail of array copies on aarch64.

Once we have a complete version of the array copy code that we are happy with, I can submit a single CR for review. All contributions will be acknowledged in the "Contributed-by" section.

To further muddy the waters I have two patches I would like to forward for your discussion.

1) http://cr.openjdk.java.net/~enevill/memopts/small.patch

This improves the performance of copying small (0 to 80 bytes) arrays. The copy code is inlined (rather than calling out to copy_longs).

The copy forwards and copy backwards cases are identical because the small copy code reads all data into registers before writing any. Thankfully aarch64 has plenty of registers.
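
To illustrate why reading everything before writing anything removes the need for separate forward and backward paths, here is a minimal Java sketch. It is not code from small.patch: the class name, the method name and the temporary array standing in for registers are all hypothetical, and it works in 8-byte words only.

```java
import java.util.Arrays;

// Hypothetical illustration only; not taken from small.patch.
final class SmallCopySketch {
    // 10 words of 8 bytes = 80 bytes, the small copy limit quoted in the post.
    static final int MAX_WORDS = 10;

    // Copies n words from a[src..] to a[dst..] within the same array.
    // All loads complete before the first store, so overlapping ranges
    // are handled the same way whichever direction the data moves.
    static void smallCopy(long[] a, int src, int dst, int n) {
        if (n < 0 || n > MAX_WORDS) {
            throw new IllegalArgumentException("n out of range: " + n);
        }
        long[] regs = new long[MAX_WORDS]; // stands in for the register file
        for (int i = 0; i < n; i++) {
            regs[i] = a[src + i];          // all loads first
        }
        for (int i = 0; i < n; i++) {
            a[dst + i] = regs[i];          // then all stores
        }
    }

    public static void main(String[] args) {
        long[] up = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
        smallCopy(up, 0, 2, 10);   // destination above source, overlapping
        long[] down = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
        smallCopy(down, 2, 0, 10); // destination below source, overlapping
        System.out.println(Arrays.toString(up));
        System.out.println(Arrays.toString(down));
    }
}
```

Both overlapping calls give the same result as System.arraycopy on the same array, without any direction test.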

The rationale for choosing 80 as the limit is that it provides a guarantee that copy_longs is always called with at least 64 bytes, even after worst case alignment fixup. This means the small case code in copy_longs can be deleted (I have put an assert in copy_longs to check it is never called with < 64 bytes).
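
The arithmetic behind the limit can be sketched as follows. The 80 and 64 byte figures come from the text above; the assumption that the alignment fixup brings the address to a 16-byte boundary (so consumes at most 15 bytes of a byte copy) is mine and is not stated in this post. The class and method names are hypothetical.

```java
// Hypothetical arithmetic check; the 16-byte alignment is an assumption.
final class ThresholdCheck {
    static final int SMALL_LIMIT = 80;       // from the post
    static final int MIN_BULK_BYTES = 64;    // from the post
    static final int ASSUMED_ALIGNMENT = 16; // assumption, not from the post

    // Fewest bytes that can remain for the bulk routine: the smallest copy
    // that bypasses the small path, minus the largest possible fixup.
    static int worstCaseBulkBytes(int smallLimit, int alignment) {
        int smallestLargeCopy = smallLimit + 1;
        int worstFixup = alignment - 1;
        return smallestLargeCopy - worstFixup;
    }

    public static void main(String[] args) {
        int with80 = worstCaseBulkBytes(SMALL_LIMIT, ASSUMED_ALIGNMENT);
        int with72 = worstCaseBulkBytes(72, ASSUMED_ALIGNMENT);
        System.out.println("limit 80: " + with80 + " bytes remain, ok = "
                + (with80 >= MIN_BULK_BYTES));
        System.out.println("limit 72: " + with72 + " bytes remain, ok = "
                + (with72 >= MIN_BULK_BYTES));
    }
}
```

Under that assumption, a limit of 80 leaves at least 66 bytes for the bulk routine, while the next lower multiple of 8 (72) would leave only 58.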

2) http://cr.openjdk.java.net/~enevill/memopts/simd.patch

This uses SIMD ldp/stp Qx, Qy instructions instead of scalar ldp/stp instructions, thereby loading/storing 32 bytes at a time instead of 16.

It also extends the small copy code to copy 0-96 instead of 0-80 (because 80 is not divisible by 32).
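
As a small worked example of the 80 versus 96 choice: rounding the small copy limit up to a whole number of load/store pairs leaves 80 unchanged for 16-byte pairs but gives 96 for 32-byte pairs. The helper below is a hypothetical illustration, not code from simd.patch.

```java
// Hypothetical helper; shows the rounding only, not how the patch is written.
final class PairRounding {
    // Rounds n up to the next multiple of step (step > 0, n >= 0).
    static int roundUp(int n, int step) {
        return (n + step - 1) / step * step;
    }

    public static void main(String[] args) {
        int scalarLimit = roundUp(80, 16); // scalar ldp/stp moves 16 bytes
        int simdLimit = roundUp(80, 32);   // ldp/stp Qx, Qy moves 32 bytes
        System.out.println("scalar limit: " + scalarLimit
                + " bytes, pairs: " + scalarLimit / 16);
        System.out.println("SIMD limit: " + simdLimit
                + " bytes, pairs: " + simdLimit / 32);
    }
}
```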

This improves performance on some micro-arches and not on others, so I have provided a -XX:+UseSIMDForMemoryOps switch which defaults to false (we could look at enabling this by default for micro-arches where we know SIMD is better).

I have prepared a set of performance measurements on memory copies between 0 & 96 bytes in steps of 1 (which shows the effect of the small copy optimisations) and also between 0 & 1024 in steps of 16. I have prepared these for 3 different micro-arches. The results are at
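
For anyone wanting to take similar measurements, a naive timing loop over the same two size ranges might look like the sketch below. This is not the harness used to produce the charts; the class name, the warm-up count and the iteration count are arbitrary assumptions, and a proper benchmark framework would give more trustworthy numbers.

```java
// Hypothetical timing sketch; not the harness behind the published charts.
final class ArrayCopyTiming {
    // Returns the sizes from, from + step, ... up to and including 'to'.
    static int[] sizes(int from, int to, int step) {
        int count = (to - from) / step + 1;
        int[] result = new int[count];
        for (int i = 0; i < count; i++) {
            result[i] = from + i * step;
        }
        return result;
    }

    // Average nanoseconds per System.arraycopy of 'size' bytes.
    static double nanosPerCopy(int size, int iterations) {
        byte[] src = new byte[size];
        byte[] dst = new byte[size];
        for (int i = 0; i < 20_000; i++) {       // warm-up (arbitrary count)
            System.arraycopy(src, 0, dst, 0, size);
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            System.arraycopy(src, 0, dst, 0, size);
        }
        return (double) (System.nanoTime() - start) / iterations;
    }

    public static void main(String[] args) {
        int iterations = 200_000;                // arbitrary choice
        for (int size : sizes(0, 96, 1)) {       // small copy range
            System.out.println(size + " " + nanosPerCopy(size, iterations));
        }
        for (int size : sizes(0, 1024, 16)) {    // larger range
            System.out.println(size + " " + nanosPerCopy(size, iterations));
        }
    }
}
```

On a build that includes simd.patch, the same class could be run once with and once without -XX:+UseSIMDForMemoryOps to compare the two code paths.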

http://cr.openjdk.java.net/~enevill/memopts/twoopts.pdf

In these charts the blue 'original' line is the jdk9 tip as of earlier today. The red 'small copy' line is after application of the small copy patch above. The yellow 'SIMD' line is after the cumulative application of the small copy patch and the simd patch.

The charts show time taken, so smaller is better. I have normalised the charts by varying the number of iterations so all results are in the 0-1200 range. Because the number of iterations was different for each micro-arch, no information should be inferred as to the relative performance of different micro-arches. The charts should only be used to compare the performance before and after application of the above patches.

All the best,
Ed.



