RFR: 8256488: [aarch64] Use ldpq/stpq instead of ld4/st4 for small copies in StubGenerator::copy_memory

Eugene Astigeevich github.com+42899633+eastig at openjdk.java.net
Mon Nov 23 21:07:04 UTC 2020


On Thu, 19 Nov 2020 19:18:55 GMT, Eugene Astigeevich <github.com+42899633+eastig at openjdk.org> wrote:

>> This patch fixes 27%-48% performance regressions of small arraycopies on Graviton2 (Neoverse N1) when UseSIMDForMemoryOps is enabled. For such copies ldpq/stpq are used instead of ld4/st4.
>> This follows the recommendation in the Arm optimization guides, including the one for Neoverse N1: use discrete, non-writeback forms of load and store instructions while interleaving them.
>> 
>> The patch passed jtreg tier1-2 and all gtest tests with linux-aarch64-server-release build and UseSIMDForMemoryOps enabled.
>
> Here is a demonstration of why ldpq/stpq is faster than ld4/st4 on Graviton2:
> From the Arm Neoverse N1 Optimization Guide (Graviton2):
> | instr | exec latency (cycles) | throughput (instr/cycle) | pipelines |
> |------|--------|-----|----------|
> | ldp | 7 | 1 | L |
> | stp | 3 | 1/2 | V/L |
> | ld4 | 10 | 1/5 | V/L |
> | st4 | 9 | 1/6 | V/L |
> 
> There are two L pipelines and two V pipelines.
> Estimated execution time for:
> ld4
> ldpq
> st4
> stpq
> | cycle | instr issued |
> |------|---------|
> | 0 | ld4 (L0, V0), ldpq (L1) |
> | 1 | . |
> | 2 | . |
> | 3 | . |
> | 4 | . |
> | 5 | . |
> | 6 | . |
> | 7 | stpq (L0, V0) |
> | 8 | . |
> | 9 | . |
> | 10 | st4 (L0, V0) |
> | 11 | . |
> | 12 | . |
> | 13 | . |
> | 14 | . |
> | 15 | . |
> | 16 | . |
> | 17 | . |
> | 18 | . |
> 
> Estimated execution time for:
> ldpq
> ldpq
> ldpq
> stpq
> stpq
> stpq
> | cycle | instr issued |
> |------|---------|
> | 0 | ldpq (L0), ldpq (L1) |
> | 1 | ldpq (L0) |
> | 2 | . |
> | 3 | . |
> | 4 | . |
> | 5 | . |
> | 6 | . |
> | 7 | stpq (L0), stpq (L1) |
> | 8 | . |
> | 9 | stpq (L0) |
> | 10 | . |
> | 11 | . |
> 
> So it is 19 cycles for the ld4/st4 mix vs 12 cycles for ldpq/stpq only.
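
The Neoverse N1 estimate above can be reproduced with a small scheduling sketch. This is a simplified model, not a description of the real core: it assumes each instruction occupies one of the two L pipelines for 1/throughput cycles, ignores the V pipelines entirely, issues greedily in program order, and assumes each store waits for one particular load (that pairing is an assumption of the sketch). The names `Op`, `N1` and `estimate_cycles` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    latency: int    # cycles from issue until the result is available
    occupancy: int  # cycles the issuing pipeline stays busy (1 / throughput)


# Latency and throughput figures copied from the Neoverse N1 table above.
N1 = {
    "ldpq": Op(latency=7, occupancy=1),
    "stpq": Op(latency=3, occupancy=2),
    "ld4": Op(latency=10, occupancy=5),
    "st4": Op(latency=9, occupancy=6),
}


def estimate_cycles(program, num_pipes=2):
    """Estimate total cycles for a list of (op_name, producer_index_or_None).

    Assumption: only the L pipelines are modeled; V pipelines are ignored.
    """
    pipe_free = [0] * num_pipes  # cycle at which each pipeline is free again
    done = []                    # completion cycle of each instruction
    for name, dep in program:
        op = N1[name]
        ready = done[dep] if dep is not None else 0
        pipe = min(range(num_pipes), key=lambda i: pipe_free[i])
        issue = max(ready, pipe_free[pipe])
        pipe_free[pipe] = issue + op.occupancy
        done.append(issue + op.latency)
    return max(done)


# Assumed dependencies: st4 consumes ld4, each stpq consumes one ldpq.
mixed = [("ld4", None), ("ldpq", None), ("st4", 0), ("stpq", 1)]
pairs = [("ldpq", None)] * 3 + [("stpq", 0), ("stpq", 1), ("stpq", 2)]

print(estimate_cycles(mixed), estimate_cycles(pairs))  # prints "19 12"
```

Under these assumptions the sketch issues the instructions on the same cycles as the two tables above (st4 at cycle 10, the last stpq at cycle 9).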

Here is a demonstration of why ldpq/stpq is slightly faster than ld4/st4 on Graviton1:
From the Arm Cortex-A72 Optimization Guide (Graviton1):
| instr | exec latency (cycles) | throughput (instr/cycle) | pipelines |
|------|--------|-----|----------|
| ldp | 6 | 1/2 | L |
| stp | 4 | 1/4 | I/S |
| ld4 | 11 | 1/4 | V/L |
| st4 | 8 | 1/8 | V/S |

There is one L pipeline, one S pipeline and two V pipelines.
Estimated execution time for:
ld4
ldpq
st4
stpq
| cycle | instr issued |
|------|---------|
| 0 | ld4 (L, V0) |
| 1 | . |
| 2 | . |
| 3 | . |
| 4 | ldpq (L) |
| 5 | . |
| 6 | stpq (S, I0) |
| 7 | . |
| 8 | . |
| 9 | . |
| 10 | . |
| 11 | st4 (S, V0) |
| 12 | . |
| 13 | . |
| 14 | . |
| 15 | . |
| 16 | . |
| 17 | . |
| 18 | . |

Estimated execution time for:
ldpq
ldpq
ldpq
stpq
stpq
stpq
| cycle | instr issued |
|------|---------|
| 0 | ldpq (L) |
| 1 | . |
| 2 | ldpq (L) |
| 3 | . |
| 4 | ldpq (L) |
| 5 | . |
| 6 | stpq (S, I0) |
| 7 | . |
| 8 | . |
| 9 | . |
| 10 | stpq (S, I0) |
| 11 | . |
| 12 | . |
| 13 | . |
| 14 | stpq (S, I0) |
| 15 | . |
| 16 | . |
| 17 | . |

So it is 19 cycles for the ld4/st4 mix vs 18 cycles for ldpq/stpq only.
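
As a quick check of the arithmetic, each total in the tables above is the issue cycle of the last-finishing store plus that store's latency. The sketch below only restates numbers read from the tables; the names `schedules` and `total_cycles` are hypothetical, and the ratios are estimates from this pipeline model, not measurements.

```python
# (issue cycle of the last-finishing store, latency of that store),
# read from the schedule and latency tables above.
schedules = {
    ("Neoverse N1", "ld4/st4 mix"): (10, 9),  # st4 issued at cycle 10
    ("Neoverse N1", "ldpq/stpq"): (9, 3),     # last stpq issued at cycle 9
    ("Cortex-A72", "ld4/st4 mix"): (11, 8),   # st4 issued at cycle 11
    ("Cortex-A72", "ldpq/stpq"): (14, 4),     # last stpq issued at cycle 14
}


def total_cycles(last_issue, latency):
    """Total estimated cycles: the last store retires latency cycles after issue."""
    return last_issue + latency


totals = {key: total_cycles(*value) for key, value in schedules.items()}

for core in ("Neoverse N1", "Cortex-A72"):
    mix = totals[(core, "ld4/st4 mix")]
    pairs = totals[(core, "ldpq/stpq")]
    print(f"{core}: {mix} vs {pairs} cycles, estimated ratio {mix / pairs:.2f}")
```

The estimated ratio is about 1.58 on Neoverse N1 and about 1.06 on Cortex-A72, which matches the "faster" versus "slightly faster" wording.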

-------------

PR: https://git.openjdk.java.net/jdk/pull/1293


More information about the hotspot-compiler-dev mailing list