RFR: 8256488: [aarch64] Use ldpq/stpq instead of ld4/st4 for small copies in StubGenerator::copy_memory

Tue Nov 24 13:39:57 UTC 2020

On Tue, 24 Nov 2020 10:40:35 GMT, Evgeny Astigeevich <github.com+42899633+eastig at openjdk.org> wrote:

>> I think we also need some non-Neoverse N1 numbers. We need to keep in mind that this software runs on many implementations. I'll have a look at some others.
>
>> I think we also need some non-Neoverse N1 numbers. We need to keep in mind that this software runs on many implementations.
> 
> On all modern Cortex-A cores, ldpq is either faster than or the same as ld4; e.g., see the calculation for Cortex-A72 above. I cannot find any optimization guides for Ampere eMAG, ThunderX/ThunderX2, or HiSilicon TSV110 to check what latency and throughput ld4/ldpq have on them. I would appreciate it if someone could help with this. I don't expect non-Cortex implementations to differ much from Cortex.
> The main issue with ld4 is its low throughput. As I understand it, ld4 is intended for loading data that is then processed.
> 
>> I'll have a look at some others.
> 
> Could you please share more information about which CPUs you will check?

> _Mailing list message from [Andrew Haley](mailto:aph at redhat.com) on [hotspot-compiler-dev](mailto:hotspot-compiler-dev at openjdk.java.net):_
> 
> On 24/11/2020 10:19, Evgeny Astigeevich wrote:
> 
> > The microbenchmarks are ArrayCopy* microbenchmarks which are a part of OpenJDK: https://github.com/openjdk/jdk/tree/master/test/micro/org/openjdk/bench/java/lang
> 
> Sorry, my mistake. I'll try this now.
> 

Not a problem. I am new to the GitHub review process and the OpenJDK project, and I am still learning.
Let me know if I need to run any additional benchmarks.

-------------

PR: https://git.openjdk.java.net/jdk/pull/1293