RFR: 8256488: [aarch64] Use ldpq/stpq instead of ld4/st4 for small copies in StubGenerator::copy_memory

Jie He github.com+10233373+jhe33 at openjdk.java.net
Mon Dec 7 02:23:13 UTC 2020


On Fri, 4 Dec 2020 15:28:56 GMT, Evgeny Astigeevich <github.com+42899633+eastig at openjdk.org> wrote:

>> This is because A72 has only one load (L) and one store (S) pipeline, so ldpq/stpq have very low throughput. In contrast, N1 has two combined L/S pipelines, so ldpq/stpq have improved throughput.

Yes, I think so too.

>> Regarding COPY_SMALL and the other pd_con/disjoint_words functions: yes, they can be improved. If you have a workload that would benefit from this, please share it with me. I tried a gcstress microbenchmark with SerialGC and a 16 GB Java heap. Those functions took ~1.25% of the time, so, as in your case, there was no visible improvement. And this is a case where memory copying is on the critical path. In other GCs, memory copying is not on the critical path at all. This is also true for the compiler.

No specific workload. I just noticed it while profiling GC: object copying is a time-consuming operation.

-------------

PR: https://git.openjdk.java.net/jdk/pull/1293


More information about the hotspot-compiler-dev mailing list