RFR: 8256488: [aarch64] Use ldpq/stpq instead of ld4/st4 for small copies in StubGenerator::copy_memory
Evgeny Astigeevich
github.com+42899633+eastig at openjdk.java.net
Fri Nov 27 11:30:55 UTC 2020
On Tue, 24 Nov 2020 10:08:37 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> This patch fixes 27%-48% performance regressions of small arraycopies on Graviton2 (Neoverse N1) when UseSIMDForMemoryOps is enabled. For such copies ldpq/stpq are used instead of ld4/st4.
>> This follows what the Arm optimization guides, including the one for Neoverse N1, recommend: use discrete, non-writeback forms of load and store instructions and interleave them.
>>
>> The patch passed jtreg tier1-2 and all gtest tests with linux-aarch64-server-release build and UseSIMDForMemoryOps enabled.
>
> I think we also need some non-Neoverse N1 numbers. We need to keep in mind that this software runs on many implementations. I'll have a look at some others.
> Hi @theRealAph,
>
> I also have a patch to fix the unaligned copying of small memory (< 16 bytes) when copying a big chunk of memory (> 96 bytes) in copy_memory_small(), but it doesn't impact performance much, so I'm not sure it is worth pushing upstream. Please refer to [1].
>
> 1. [JBS-8149448](https://bugs.openjdk.java.net/browse/JDK-8149448)
Hi Jie,
Thank you for the information.
As Andrew wrote, nowadays most unaligned memory accesses don't have penalties on Armv8 implementations. However, some accesses do have penalties. The most common are:
1. Load operations that cross a cache-line (64-byte) boundary.
2. Store operations that cross a 16-byte boundary.
On some Armv8 implementations, quad-word load operations can also have penalties if they are not at least 4-byte aligned.
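To make those boundary conditions concrete, here is a small standalone sketch (the crosses_boundary helper is purely illustrative, not HotSpot code) that checks whether an access of a given size crosses an aligned boundary:

```c++
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Does an access of `size` bytes starting at `addr` cross a
// `boundary`-byte aligned boundary?
static bool crosses_boundary(uintptr_t addr, size_t size, size_t boundary) {
  return (addr / boundary) != ((addr + size - 1) / boundary);
}

int main() {
  // A 16-byte (quad-word) load at offset 56 spans bytes 56..71,
  // i.e. it crosses a 64-byte cache-line boundary -> possible penalty.
  printf("%d\n", crosses_boundary(56, 16, 64));  // 1
  // A 16-byte store at offset 8 spans bytes 8..23, crossing a
  // 16-byte boundary -> possible penalty on some implementations.
  printf("%d\n", crosses_boundary(8, 16, 16));   // 1
  return 0;
}
```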
Regarding the unaligned copying of small memory, I think getting it aligned improves the function by a few percent (~2-5%). As most of the time is spent copying big chunks of memory, this improvement won't be noticeable. For example, if copy_memory_small takes 1% of the time and is improved by 5%, then the total improvement will be:
1 / (0.99 + (0.01/1.05)) = 1.000476 or 0.0476%
which is almost impossible to detect.
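This is just the usual Amdahl's law calculation. A tiny sketch reproducing the number above (the 1% and 5% figures are the example values from this discussion, not measurements):

```c++
#include <cstdio>

// Amdahl's law: overall speedup when a fraction `p` of total time
// is accelerated by a local speedup `s`.
int main() {
  double p = 0.01;   // copy_memory_small takes ~1% of the time
  double s = 1.05;   // and gets ~5% faster
  double overall = 1.0 / ((1.0 - p) + p / s);
  printf("%.6f\n", overall);  // ~1.000476, i.e. ~0.0476% overall
  return 0;
}
```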
BTW, I tried to improve COPY_SMALL, _Copy_conjoint_words and _Copy_disjoint_words based on the results of a comparison with memcpy from the Arm Optimized Routines, but I did not get any overall performance improvement. See [JDK-8255795](https://bugs.openjdk.java.net/browse/JDK-8255795) for more information.
-------------
PR: https://git.openjdk.java.net/jdk/pull/1293