RFR: 8256488: [aarch64] Use ldpq/stpq instead of ld4/st4 for small copies in StubGenerator::copy_memory

Tue Nov 24 10:42:58 UTC 2020

On Tue, 24 Nov 2020 10:08:37 GMT, Andrew Haley <aph at openjdk.org> wrote:

> I think we need also some non-Neoverse N1 numbers. We need to keep in mind that this software runs on many implementations. 

On all modern Cortex-A cores, ldpq is either faster than ld4 or the same, e.g. see the calculation for Cortex-A72 above. I cannot find any optimization guides for Ampere eMAG, ThunderX/ThunderX2, or HiSilicon TSV110 to check what latencies and throughput ld4/ldpq have on them. I would appreciate it if someone could help with this. I don't expect non-Cortex implementations to differ much from Cortex.
The main issue with ld4 is its low throughput. As I understand it, ld4 is intended for loading data that will then be processed, rather than data that is simply copied.
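
To illustrate that difference in intent, here is a rough model in Python. It is only a sketch of the architectural behaviour as I understand it, not HotSpot code and not a performance measurement; the function names are made up for this example. ld4 de-interleaves 4-element structures across four registers, which a plain copy does not need, while ldp with q registers just loads two consecutive 16-byte chunks.

```python
# Illustrative model only: plain Python, not HotSpot code, no timing implied.
# All function names here are hypothetical helpers for this sketch.

def ld4_16b(mem, off=0):
    """Model of 'ld4 {v0.16b-v3.16b}, [x]': read 64 bytes and de-interleave,
    so that register k receives bytes k, k+4, k+8, ... of the chunk."""
    chunk = mem[off:off + 64]
    return [chunk[k::4] for k in range(4)]

def st4_16b(regs):
    """Model of the matching st4: re-interleave four 16-byte registers."""
    out = bytearray(64)
    for k, reg in enumerate(regs):
        out[k::4] = reg
    return bytes(out)

def ldpq(mem, off=0):
    """Model of 'ldp q0, q1, [x]': two consecutive 16-byte registers,
    with no shuffling of the data."""
    return [mem[off:off + 16], mem[off + 16:off + 32]]

def stpq(regs):
    """Model of 'stp q0, q1, [x]': store the two registers back to back."""
    return b"".join(regs)

src = bytes(range(64))

# Copy 64 bytes with one ld4/st4 pair: the data is shuffled and unshuffled.
via_ld4 = st4_16b(ld4_16b(src))

# Copy the same 64 bytes with two ldpq/stpq pairs: the data is never reordered.
via_ldpq = stpq(ldpq(src, 0)) + stpq(ldpq(src, 32))
```

Both paths reproduce the source bytes, so for a copy the de-interleaving done by ld4 is work that gets undone by st4.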

> I'll have a look at some others.

Could you please share more information about which CPUs you will check?

-------------

PR: https://git.openjdk.java.net/jdk/pull/1293