RFR: 8256488: [aarch64] Use ldpq/stpq instead of ld4/st4 for small copies in StubGenerator::copy_memory
Jie He
github.com+10233373+jhe33 at openjdk.java.net
Fri Dec 4 08:18:55 UTC 2020
On Tue, 24 Nov 2020 10:08:37 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> This patch fixes 27%-48% performance regressions of small arraycopies on Graviton2 (Neoverse N1) when UseSIMDForMemoryOps is enabled. For such copies ldpq/stpq are used instead of ld4/st4.
>> This follows what the Arm optimization guides (including the Neoverse N1 one) recommend: use discrete, non-writeback forms of load and store instructions while interleaving them.
>>
>> The patch passed jtreg tier1-2 and all gtest tests with a linux-aarch64-server-release build and UseSIMDForMemoryOps enabled.
>
> I think we also need some non-Neoverse N1 numbers. We need to keep in mind that this software runs on many implementations. I'll have a look at some others.
> > Hi @theRealAph,
> > I also have a patch to fix the unaligned copying of small memory (< 16 bytes) in copy_memory_small() when copying a big chunk of memory (> 96 bytes), but it doesn't impact performance much, so I'm not sure it is worth pushing upstream. Please refer to [1].
> >
> > 1. [JBS-8149448](https://bugs.openjdk.java.net/browse/JDK-8149448)
>
> Hi Jie,
>
> Thank you for the information.
> As Andrew wrote, nowadays most unaligned memory accesses don't have penalties on Armv8 implementations. However, some accesses do have penalties. These are the most common:
>
> 1. Load operations that cross a cache-line (64-byte) boundary.
> 2. Store operations that cross a 16-byte boundary.
>
> On some Armv8 implementations, quad-word load operations can have penalties if they are not at least 4-byte aligned.
>
> Regarding the unaligned copy of small memory: I think getting it aligned improves the function by a few percent (~2-5%). As most of the time is spent copying big chunks of memory, this improvement won't be noticeable. For example, if copy_memory_small takes 1% of the time and it is improved by 5%, then the total improvement will be:
> 1 / (0.99 + (0.01 / 1.05)) ≈ 1.000476, or 0.0476%,
> which is almost impossible to detect.
>
> BTW, I tried to improve COPY_SMALL, _Copy_conjoint_words and _Copy_disjoint_words based on the results of a comparison with memcpy from the Arm Optimized Routines, but I did not get any overall performance improvement. See [JDK-8255795](https://bugs.openjdk.java.net/browse/JDK-8255795) for more information.
Hi @eastig
Thanks for the information.
Yes, it's very hard to measure copy_memory_small performance. I designed a JMH case using Unsafe.copyMemory to test it, which just copies 100 bytes of unaligned data; as we know, only the unaligned portion of fewer than 16 bytes is handled by copy_memory_small.
The test shows a ~1.5% improvement in this case.
I also noticed that COPY_SMALL and the other pd_conjoint_words/pd_disjoint_words functions are no better than memcpy (glibc 2.27 in my environment), even on a ThunderX2 machine. I think there is still room for improvement in the inline assembly code in copy_linux_aarch64.inline.hpp and the assembly code in copy_linux_aarch64.s. In addition, neither of them uses SIMD/FP instructions such as ldpq so far. A test shows a ~10% improvement on N1 when using 2 ldpq instead of 4 ldp to copy 64 bytes of data, but not on other A72 machines; the sketch below shows the patterns being compared.
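For reference, here is a rough standalone sketch of the 64-byte copy patterns discussed in this thread, written as GCC extended inline asm for AArch64. The function names and the harness are illustrative only (mine, not the actual HotSpot stub or glibc code); the point is just the instruction selection: the interleaving ld4/st4 pattern the patch removes, the non-writeback ldpq/stpq pattern it introduces, and the 4 x ldp GPR pattern mentioned above.

```c++
// Illustrative 64-byte copy kernels for AArch64 (GCC extended inline asm).
// These are sketches, not the HotSpot stub code; src and dst must each
// point to at least 64 readable/writable bytes.

// Interleaving ld4/st4 pattern (what the patch replaces): one instruction
// pair moves 64 bytes through v0-v3 with element de-interleave/re-interleave,
// so the load/store round trip still reproduces the original bytes.
static inline void copy64_ld4(void* dst, const void* src) {
  __asm__ volatile(
      "ld4 {v0.2d, v1.2d, v2.2d, v3.2d}, [%1]\n\t"
      "st4 {v0.2d, v1.2d, v2.2d, v3.2d}, [%0]\n\t"
      :
      : "r"(dst), "r"(src)
      : "v0", "v1", "v2", "v3", "memory");
}

// Discrete, non-writeback SIMD pairs (what the patch uses): 2 x ldp/stp
// on q registers, 32 bytes per instruction.
static inline void copy64_ldpq(void* dst, const void* src) {
  __asm__ volatile(
      "ldp q0, q1, [%1]\n\t"
      "ldp q2, q3, [%1, #32]\n\t"
      "stp q0, q1, [%0]\n\t"
      "stp q2, q3, [%0, #32]\n\t"
      :
      : "r"(dst), "r"(src)
      : "v0", "v1", "v2", "v3", "memory");
}

// GPR variant for comparison: 4 x ldp/stp, 16 bytes per instruction.
static inline void copy64_ldp(void* dst, const void* src) {
  __asm__ volatile(
      "ldp x4, x5, [%1]\n\t"
      "ldp x6, x7, [%1, #16]\n\t"
      "ldp x8, x9, [%1, #32]\n\t"
      "ldp x10, x11, [%1, #48]\n\t"
      "stp x4, x5, [%0]\n\t"
      "stp x6, x7, [%0, #16]\n\t"
      "stp x8, x9, [%0, #32]\n\t"
      "stp x10, x11, [%0, #48]\n\t"
      :
      : "r"(dst), "r"(src)
      : "x4", "x5", "x6", "x7", "x8", "x9", "x10", "x11", "memory");
}
```

Each function copies exactly 64 bytes; in the real stubs the surrounding loop structure, alignment handling and tail copies matter at least as much as which pair instruction is chosen.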
-------------
PR: https://git.openjdk.java.net/jdk/pull/1293