RFR: 8256488: [aarch64] Use ldpq/stpq instead of ld4/st4 for small copies in StubGenerator::copy_memory
Eugene Astigeevich
github.com+42899633+eastig at openjdk.java.net
Mon Nov 23 21:07:04 UTC 2020
On Wed, 18 Nov 2020 14:10:48 GMT, Eugene Astigeevich <github.com+42899633+eastig at openjdk.org> wrote:
> This patch fixes performance regressions of 27%-48% in small arraycopies on Graviton2 (Neoverse N1) when UseSIMDForMemoryOps is enabled. For such copies, ldpq/stpq are now used instead of ld4/st4.
> This follows the recommendation in the Arm optimization guides, including the one for Neoverse N1: use discrete, non-writeback forms of load and store instructions while interleaving them.
>
> The patch passed jtreg tier1-2 and all gtest tests with linux-aarch64-server-release build and UseSIMDForMemoryOps enabled.
Here is a demonstration of why ldpq/stpq is faster than ld4/st4 on Graviton2.
From the Arm Neoverse N1 Optimization Guide (Graviton2):
| instruction | execution latency (cycles) | throughput (instr/cycle) | pipelines |
|------|--------|-----|----------|
| ldp | 7 | 1 | L |
| stp | 3 | 1/2 | V/L |
| ld4 | 10 | 1/5 | V/L |
| st4 | 9 | 1/6 | V/L |
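The core has two L (load/store) pipelines and two V (FP/ASIMD) pipelines. In the tables below, "." marks a cycle in which nothing is issued.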
Estimated execution time for:
ld4
ldpq
st4
stpq
| cycle | instr issued |
|------|---------|
| 0 | ld4 (L0, V0), ldpq (L1) |
| 1 | . |
| 2 | . |
| 3 | . |
| 4 | . |
| 5 | . |
| 6 | . |
| 7 | stpq (L0, V0) |
| 8 | . |
| 9 | . |
| 10 | st4 (L0, V0) |
| 11 | . |
| 12 | . |
| 13 | . |
| 14 | . |
| 15 | . |
| 16 | . |
| 17 | . |
| 18 | . |
Estimated execution time for:
ldpq
ldpq
ldpq
stpq
stpq
stpq
| cycle | instr issued |
|------|---------|
| 0 | ldpq (L0), ldpq (L1) |
| 1 | ldpq (L0) |
| 2 | . |
| 3 | . |
| 4 | . |
| 5 | . |
| 6 | . |
| 7 | stpq (L0), stpq (L1) |
| 8 | . |
| 9 | stpq (L0) |
| 10 | . |
| 11 | . |
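So the estimate is 19 cycles for the sequence that uses ld4/st4 versus 12 cycles for the ldpq/stpq-only sequence. Both sequences copy the same 96 bytes.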
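The two estimates can be reproduced with a small scheduling model. This is only a sketch under simplifying assumptions, not a description of how the core actually schedules instructions: it tracks the two L pipelines only and ignores the V pipelines, it uses the ldp/stp rows of the table for ldpq/stpq, it treats 1/throughput as the number of cycles an L pipeline stays busy, and it assumes each store waits for the data of one matching load (ld4 feeds st4, ldpq feeds stpq). The names `Instr`, `issue` and `estimate_cycles` are made up for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    latency: int    # execution latency in cycles, from the table
    occupancy: int  # cycles an L pipeline stays busy, assumed to be 1 / throughput


# Values from the table above; ldpq/stpq use the ldp/stp rows.
LD4 = Instr("ld4", 10, 5)
ST4 = Instr("st4", 9, 6)
LDPQ = Instr("ldpq", 7, 1)
STPQ = Instr("stpq", 3, 2)


def issue(pipes, instr, ready):
    """Issue instr on the L pipeline that frees up first, no earlier than `ready`."""
    p = min(range(len(pipes)), key=lambda i: pipes[i])
    cycle = max(ready, pipes[p])
    pipes[p] = cycle + instr.occupancy
    return cycle


def estimate_cycles(pairs, num_l_pipes=2):
    """Estimate total cycles for a list of (load, store) pairs."""
    pipes = [0] * num_l_pipes
    # Loads issue in program order as soon as an L pipeline is free.
    pending = [(issue(pipes, load, 0) + load.latency, store) for load, store in pairs]
    # Each store issues once its data is available, earliest data first.
    finish = 0
    for data_ready, store in sorted(pending, key=lambda item: item[0]):
        finish = max(finish, issue(pipes, store, data_ready) + store.latency)
    return finish


mixed = estimate_cycles([(LD4, ST4), (LDPQ, STPQ)])
pairs_only = estimate_cycles([(LDPQ, STPQ)] * 3)
print(mixed, pairs_only)
```

Under these assumptions the model issues the instructions in the same cycles as the two tables above. Which of the two L pipelines it picks can differ from the tables, but that does not change the totals.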
-------------
PR: https://git.openjdk.java.net/jdk/pull/1293