RFR: 8256488: [aarch64] Use ldpq/stpq instead of ld4/st4 for small copies in StubGenerator::copy_memory

Volker Simonis simonis at openjdk.java.net
Mon Nov 23 21:07:05 UTC 2020


On Fri, 20 Nov 2020 17:57:03 GMT, Volker Simonis <simonis at openjdk.org> wrote:

>> Here is a demonstration of why ldpq/stpq is slightly faster than ld4/st4 on Graviton 1:
>> From Arm Cortex A72 Optimization Guide (Graviton 1):
>> | instr | exec latency | throughput | pipelines |
>> |------|--------|-----|----------|
>> | ldp | 6 | 1/2 | L |
>> | stp | 4 | 1/4 | I/S |
>> | ld4 | 11 | 1/4 | V/L |
>> | st4 | 8 | 1/8 | V/S |
>> 
>> There is one L (load) pipeline, one S (store) pipeline and two V pipelines.
>> Estimated execution time for:
>> ld4
>> ldpq
>> st4
>> stpq
>> | cycle | instr issued |
>> |------|---------|
>> | 0 | ld4 (L, V0) |
>> | 1 | . |
>> | 2 | . |
>> | 3 | . |
>> | 4 | ldpq (L) |
>> | 5 | . |
>> | 6 | stpq (S, I0) |
>> | 7 | . |
>> | 8 | . |
>> | 9 | . |
>> | 10 | . |
>> | 11 | st4 (S, V0) |
>> | 12 | . |
>> | 13 | . |
>> | 14 | . |
>> | 15 | . |
>> | 16 | . |
>> | 17 | . |
>> | 18 | . |
>> 
>> Estimated execution time for:
>> ldpq
>> ldpq
>> ldpq
>> stpq
>> stpq
>> stpq
>> | cycle | instr issued |
>> |------|---------|
>> | 0 | ldpq (L) |
>> | 1 | . |
>> | 2 | ldpq (L) |
>> | 3 | . |
>> | 4 | ldpq (L) |
>> | 5 | . |
>> | 6 | stpq (S, I0) |
>> | 7 | . |
>> | 8 | . |
>> | 9 | . |
>> | 10 | stpq (S, I0) |
>> | 11 | . |
>> | 12 | . |
>> | 13 | . |
>> | 14 | stpq (S, I0) |
>> | 15 | . |
>> | 16 | . |
>> | 17 | . |
>> 
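>> So it is 19 cycles (cycles 0 to 18) for the ld4/st4 sequence vs 18 cycles (cycles 0 to 17) for the ldpq/stpq-only sequence.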
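
The totals in the two schedules above can be rechecked with a small sketch. It is not part of the patch and not a model of the real core: it takes the issue cycles exactly as listed in the tables, assumes the ldp/stp rows apply to the ldpq/stpq forms, treats 1/throughput as the number of cycles an instruction keeps its L or S pipeline busy, and ignores the V and I pipelines. The names `BUSY`, `PIPE` and `total_cycles` are made up for this illustration.

```python
# Cycles each instruction keeps its pipeline busy (1 / throughput from the table).
# Assumption: the ldp/stp rows apply to the Q-register forms ldpq/stpq.
BUSY = {"ldpq": 2, "stpq": 4, "ld4": 4, "st4": 8}
PIPE = {"ldpq": "L", "ld4": "L", "stpq": "S", "st4": "S"}


def total_cycles(schedule):
    """Return the cycle count for a list of (issue_cycle, mnemonic) pairs."""
    free_at = {"L": 0, "S": 0}  # first cycle at which each pipeline is free again
    end = 0
    for cycle, op in sorted(schedule):
        pipe = PIPE[op]
        if cycle < free_at[pipe]:
            raise ValueError(f"{op} at cycle {cycle} overlaps on pipeline {pipe}")
        free_at[pipe] = cycle + BUSY[op]
        end = max(end, free_at[pipe])
    return end


# Issue cycles copied from the two tables above.
LD4_MIX = [(0, "ld4"), (4, "ldpq"), (6, "stpq"), (11, "st4")]
LDPQ_ONLY = [(0, "ldpq"), (2, "ldpq"), (4, "ldpq"),
             (6, "stpq"), (10, "stpq"), (14, "stpq")]

print(total_cycles(LD4_MIX), total_cycles(LDPQ_ONLY))
```

The sketch only replays the listed issue cycles and checks that no two instructions overlap on the same pipeline; it does not derive the issue cycles from data dependencies.
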
>
> Hi Evgeny,
> 
> thanks for fixing this and for the detailed explanation.
> 
> The change looks good to me and I will sponsor it.
> Can you please also post some performance numbers before and after your change?
> 
> @adinn, @theRealAph, @mo-beck as this has only been tested on Graviton so far and we don't have access to other aarch64 implementations, could you please be so kind as to check this on your hardware to make sure we don't introduce any regression?

> Thank you! Please allow for a few business days to verify that your employer has signed the OCA. Also, please note that pull requests that are pending an OCA check will not usually be evaluated, so your patience is appreciated!

Evgeny is part of the Amazon Corretto team and covered by Amazon's OCA.

-------------

PR: https://git.openjdk.java.net/jdk/pull/1293


More information about the hotspot-compiler-dev mailing list