RFR: 8256488: [aarch64] Use ldpq/stpq instead of ld4/st4 for small copies in StubGenerator::copy_memory
Evgeny Astigeevich
github.com+42899633+eastig at openjdk.java.net
Fri Dec 4 15:31:16 UTC 2020
On Fri, 4 Dec 2020 08:16:35 GMT, Jie He <github.com+10233373+jhe33 at openjdk.org> wrote:
> > > Hi @theRealAph,
> > > I also have a patch to fix the unaligned copying of small memory (< 16 bytes) when copying a big chunk of memory (> 96 bytes) in the function copy_memory_small(), but it doesn't impact performance much, so I'm not sure if it is worth pushing upstream. Please refer to [1].
> > >
> > > 1. [JBS-8149448](https://bugs.openjdk.java.net/browse/JDK-8149448)
> >
> >
> > Hi Jie,
> > Thank you for the information.
> > As Andrew wrote, nowadays most unaligned memory accesses have no penalties on Armv8 implementations. However, some accesses do have penalties. For example, these are the most common:
> >
> > 1. Load operations that cross a cache-line (64-byte) boundary.
> > 2. Store operations that cross a 16-byte boundary.
> >
> > On some Armv8 implementations, quad-word load operations can have penalties if they are not at least 4-byte aligned.
> > Regarding the unaligned copying of small memory, I think getting it aligned improves the function by a few percent (~2-5%). As most of the time is spent copying big chunks of memory, this improvement won't be noticeable. For example, if copy_memory_small takes 1% of the time and it is improved by 5%, then the total improvement will be:
> > 1 / (0.99 + (0.01/1.05)) = 1.000476 or 0.0476%
> > which is almost impossible to detect.
> > BTW, I tried to improve COPY_SMALL, _Copy_conjoint_words and _Copy_disjoint_words based on the results of a comparison with memcpy from the Arm Optimized Routines, but I did not get any overall performance improvement. See [JDK-8255795](https://bugs.openjdk.java.net/browse/JDK-8255795) for more information.
>
> Hi @eastig
>
> Thanks for the information.
>
> Yes, it's very hard to measure copy_memory_small performance. I designed a JMH case using Unsafe.copyMemory to test it, which just copies 100 bytes of unaligned data; we know that only the < 16 bytes of unaligned data will be handled by copy_memory_small.
> The test shows a ~1.5% improvement in this case.
>
> I also noticed that COPY_SMALL and the other pd_con/disjoint_words functions are not better than memcpy (glibc 2.27 in my env), even on a ThunderX2 machine. I think there is still room for improvement in the inline assembly code in copy_linux_aarch64.inline.hpp and the assembly code in copy_linux_aarch64.s. In addition, neither of them uses SIMD/FP instructions so far, like ldpq. A test shows a ~10% improvement on N1 when using 2 ldpq instead of 4 ldp to copy 64 bytes of data, but not on other A72 machines.
This is because the A72 has only one L (load) pipeline and one S (store) pipeline, so ldpq/stpq have very low throughput there. In contrast, the N1 has two combined L/S pipelines, and ldpq/stpq have improved throughput.
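For illustration only (this is not the actual stub or copy_linux_aarch64 code; the copy64_* names and register choices are mine), a standalone sketch of the two 64-byte copy variants being compared could look like this with GCC-style inline assembly. "ldpq/stpq" here means ldp/stp with 128-bit q registers:

#include <cstdint>

// Variant 1: four 16-byte integer ldp/stp pairs.
static inline void copy64_ldp(const void* src, void* dst) {
  uint64_t t0, t1, t2, t3, t4, t5, t6, t7;
  __asm__ volatile(
      "ldp %0, %1, [%8]\n\t"
      "ldp %2, %3, [%8, #16]\n\t"
      "ldp %4, %5, [%8, #32]\n\t"
      "ldp %6, %7, [%8, #48]\n\t"
      "stp %0, %1, [%9]\n\t"
      "stp %2, %3, [%9, #16]\n\t"
      "stp %4, %5, [%9, #32]\n\t"
      "stp %6, %7, [%9, #48]\n\t"
      : "=&r"(t0), "=&r"(t1), "=&r"(t2), "=&r"(t3),
        "=&r"(t4), "=&r"(t5), "=&r"(t6), "=&r"(t7)
      : "r"(src), "r"(dst)
      : "memory");
}

// Variant 2: two 32-byte SIMD ldp/stp pairs ("ldpq/stpq").
static inline void copy64_ldpq(const void* src, void* dst) {
  __asm__ volatile(
      "ldp q0, q1, [%0]\n\t"
      "ldp q2, q3, [%0, #32]\n\t"
      "stp q0, q1, [%1]\n\t"
      "stp q2, q3, [%1, #32]\n\t"
      :
      : "r"(src), "r"(dst)
      : "v0", "v1", "v2", "v3", "memory");
}

Both variants are limited by the single L and single S pipeline on A72, so the SIMD version gains little there. On N1, the q-register version issues half as many load/store instructions and can come out ~10% ahead, matching the measurement quoted above.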
Regarding COPY_SMALL and the other pd_con/disjoint_words functions: yes, they can be improved. If you have a workload which would benefit from this, please share it with me. I tried a gcstress microbenchmark with SerialGC and a 16 GB Java heap. Those functions took ~1.25% of the time, so, as in your case, there was no visible improvement. And this is a case where memory copying is on the critical path; in other GCs, memory copying is not on the critical path at all. The same is true for the compiler.
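For illustration, plugging that ~1.25% into the same arithmetic as in the quoted discussion above: even a (hypothetical) 2x speedup of those functions would give an overall improvement of about 1 / (0.9875 + (0.0125/2)) = 1.0063, i.e. ~0.63%, which is within run-to-run noise for most benchmarks.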
-------------
PR: https://git.openjdk.java.net/jdk/pull/1293