RFR: 8256488: [aarch64] Use ldpq/stpq instead of ld4/st4 for small copies in StubGenerator::copy_memory
Evgeny Astigeevich
github.com+42899633+eastig at openjdk.java.net
Tue Nov 24 10:42:58 UTC 2020
On Tue, 24 Nov 2020 10:08:37 GMT, Andrew Haley <aph at openjdk.org> wrote:
> I think we need also some non-Neoverse N1 numbers. We need to keep in mind that this software runs on many implementations.
On all modern Cortex-A cores, ldpq is either faster than or as fast as ld4; see, for example, the calculation for Cortex-A72 above. I cannot find any optimization guides for Ampere eMAG, ThunderX/ThunderX2, or HiSilicon TSV110 to check the latency and throughput of ld4 and ldpq on them. I would appreciate it if someone could help with this. I don't expect non-Cortex implementations to differ much from Cortex.
The main issue with ld4 is its low throughput. The intent of ld4, as I understand it, is to load data for subsequent processing, not for plain copying.
> I'll have a look at some others.
Could you please share more information about which CPUs you will check?
-------------
PR: https://git.openjdk.java.net/jdk/pull/1293