RFR: 8256488: [aarch64] Use ldpq/stpq instead of ld4/st4 for small copies in StubGenerator::copy_memory

Volker Simonis simonis at openjdk.java.net
Mon Nov 23 21:07:05 UTC 2020


On Thu, 19 Nov 2020 19:46:54 GMT, Eugene Astigeevich <github.com+42899633+eastig at openjdk.org> wrote:

>> Here is a demonstration of why ldpq/stpq is faster than ld4/st4 on Graviton2.
>> From the Arm Neoverse N1 Optimization Guide (Graviton2):
>> | instr | exec latency (cycles) | throughput (instr/cycle) | pipelines |
>> |------|--------|-----|----------|
>> | ldp | 7 | 1 | L |
>> | stp | 3 | 1/2 | V/L |
>> | ld4 | 10 | 1/5 | V/L |
>> | st4 | 9 | 1/6 | V/L |
>> 
>> There are two L pipelines and two V pipelines.
>> Estimated execution time for:
>> ld4
>> ldpq
>> st4
>> stpq
>> | cycle | instr issued |
>> |------|---------|
>> | 0 | ld4 (L0, V0), ldpq (L1) |
>> | 1 | . |
>> | 2 | . |
>> | 3 | . |
>> | 4 | . |
>> | 5 | . |
>> | 6 | . |
>> | 7 | stpq (L0, V0) |
>> | 8 | . |
>> | 9 | . |
>> | 10 | st4 (L0, V0) |
>> | 11 | . |
>> | 12 | . |
>> | 13 | . |
>> | 14 | . |
>> | 15 | . |
>> | 16 | . |
>> | 17 | . |
>> | 18 | . |
>> 
>> Estimated execution time for:
>> ldpq
>> ldpq
>> ldpq
>> stpq
>> stpq
>> stpq
>> | cycle | instr issued |
>> |------|---------|
>> | 0 | ldpq (L0), ldpq (L1) |
>> | 1 | ldpq (L0) |
>> | 2 | . |
>> | 3 | . |
>> | 4 | . |
>> | 5 | . |
>> | 6 | . |
>> | 7 | stpq (L0), stpq (L1) |
>> | 8 | . |
>> | 9 | stpq (L0) |
>> | 10 | . |
>> | 11 | . |
>> 
>> So it is 19 cycles for the sequence using ld4/st4 vs 12 cycles for the ldpq/stpq-only sequence.
>
> Here is a demonstration of why ldpq/stpq is slightly faster than ld4/st4 on Graviton1.
> From the Arm Cortex-A72 Optimization Guide (Graviton1):
> | instr | exec latency (cycles) | throughput (instr/cycle) | pipelines |
> |------|--------|-----|----------|
> | ldp | 6 | 1/2 | L |
> | stp | 4 | 1/4 | I/S |
> | ld4 | 11 | 1/4 | V/L |
> | st4 | 8 | 1/8 | V/S |
> 
> There is one L pipeline, one S pipeline and two V pipelines.
> Estimated execution time for:
> ld4
> ldpq
> st4
> stpq
> | cycle | instr issued |
> |------|---------|
> | 0 | ld4 (L, V0) |
> | 1 | . |
> | 2 | . |
> | 3 | . |
> | 4 | ldpq (L) |
> | 5 | . |
> | 6 | stpq (S, I0) |
> | 7 | . |
> | 8 | . |
> | 9 | . |
> | 10 | . |
> | 11 | st4 (S, V0) |
> | 12 | . |
> | 13 | . |
> | 14 | . |
> | 15 | . |
> | 16 | . |
> | 17 | . |
> | 18 | . |
> 
> Estimated execution time for:
> ldpq
> ldpq
> ldpq
> stpq
> stpq
> stpq
> | cycle | instr issued |
> |------|---------|
> | 0 | ldpq (L) |
> | 1 | . |
> | 2 | ldpq (L) |
> | 3 | . |
> | 4 | ldpq (L) |
> | 5 | . |
> | 6 | stpq (S, I0) |
> | 7 | . |
> | 8 | . |
> | 9 | . |
> | 10 | stpq (S, I0) |
> | 11 | . |
> | 12 | . |
> | 13 | . |
> | 14 | stpq (S, I0) |
> | 15 | . |
> | 16 | . |
> | 17 | . |
> 
> So it is 19 cycles for the sequence using ld4/st4 vs 18 cycles for the ldpq/stpq-only sequence.
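
The ldpq/stpq-only estimates quoted above can be reproduced with a very small issue model. The sketch below is only an illustration under stated assumptions, not how the numbers in the PR were produced: the names `Core`, `issue` and `copy_cycles` are hypothetical, each store is assumed to wait for the data of the matching load, a pipeline is assumed busy for 1/throughput cycles per instruction, and loads and stores are modelled on separate pipes (on Neoverse N1 they actually share the L pipelines, which does not matter here because all loads have issued before the first store can start). Latencies and throughputs are taken from the quoted tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Core:
    """Simplified per-core parameters, taken from the quoted tables."""
    name: str
    load_pipes: int      # pipelines that can issue ldp
    load_interval: int   # cycles a pipe stays busy per ldp (1 / throughput)
    load_latency: int    # cycles until the loaded data is available
    store_pipes: int     # pipelines that can issue stp
    store_interval: int  # cycles a pipe stays busy per stp (1 / throughput)
    store_latency: int   # cycles until the store completes


def issue(pipes_free, earliest, interval):
    """Issue on the pipe that frees up first and return the issue cycle."""
    idx = min(range(len(pipes_free)), key=pipes_free.__getitem__)
    cycle = max(earliest, pipes_free[idx])
    pipes_free[idx] = cycle + interval
    return cycle


def copy_cycles(core, pairs):
    """Estimated cycles for `pairs` ldpq instructions followed by `pairs` stpq."""
    load_free = [0] * core.load_pipes
    store_free = [0] * core.store_pipes
    # Cycle at which each load's data becomes available.
    ready = [issue(load_free, 0, core.load_interval) + core.load_latency
             for _ in range(pairs)]
    # Each store waits for its own load and for a free store pipe.
    done = [issue(store_free, r, core.store_interval) + core.store_latency
            for r in ready]
    return max(done)


# Assumed parameter sets: ldp/stp rows of the two tables above.
NEOVERSE_N1 = Core("Neoverse N1 (Graviton2)", 2, 1, 7, 2, 2, 3)
CORTEX_A72 = Core("Cortex-A72 (Graviton1)", 1, 2, 6, 1, 4, 4)

if __name__ == "__main__":
    for core in (NEOVERSE_N1, CORTEX_A72):
        print(core.name, copy_cycles(core, 3))
```

With three ldpq/stpq pairs this model gives 12 cycles for the Neoverse N1 parameters and 18 for the Cortex-A72 parameters, matching the second table of each analysis. It does not cover the mixed ld4/st4 sequences, which also occupy V pipelines.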

Hi Evgeny,

thanks for fixing this and for the detailed explanation.

The change looks good to me and I will sponsor it.
Can you please also post some performance numbers before and after your change?

@adinn, @theRealAph, @mo-beck, as this has only been tested on Graviton so far and we don't have access to other aarch64 implementations, could you please be so kind as to check this on your hardware to make sure we don't introduce any regressions?

-------------

PR: https://git.openjdk.java.net/jdk/pull/1293


More information about the hotspot-compiler-dev mailing list