[aarch64-port-dev ] Optimized memcpy() for Cortex

Andrew Haley aph at redhat.com
Fri Aug 4 15:29:53 UTC 2017


The Cortex®-A57/A72 processor manual contains this gem:

-----------------------------------------------------------------
The Cortex-A57 processor includes separate load and store pipelines,
which allow it to execute one load μop and one store μop every
cycle.

The following example shows a recommended instruction sequence for a
long memory copy in AArch32 state:

Loop_start:
    SUBS    r2,r2,#64
    LDRD    r3,r4,[r1,#0]
    STRD    r3,r4,[r0,#0]
    LDRD    r3,r4,[r1,#8]
    STRD    r3,r4,[r0,#8]
    LDRD    r3,r4,[r1,#16]
    STRD    r3,r4,[r0,#16]
    LDRD    r3,r4,[r1,#24]
    STRD    r3,r4,[r0,#24]
    LDRD    r3,r4,[r1,#32]
    STRD    r3,r4,[r0,#32]
    LDRD    r3,r4,[r1,#40]
    STRD    r3,r4,[r0,#40]
    LDRD    r3,r4,[r1,#48]
    STRD    r3,r4,[r0,#48]
    LDRD    r3,r4,[r1,#56]
    STRD    r3,r4,[r0,#56]
    ADD     r1,r1,#64
    ADD     r0,r0,#64
    BGT     Loop_start

A recommended copy routine for AArch64 would look similar to the
sequence above, but would use LDP/STP instructions.
-----------------------------------------------------------------

Our copy routines don't use this pattern.  I don't know whether it
would help.
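
For illustration only, here is a rough C sketch of the shape the manual
hints at for AArch64: each 64-byte iteration is four "load a pair, store
a pair" steps.  This is not taken from our copy stubs or from the
manual.  The names copy16 and copy_interleaved are hypothetical, the
buffers are assumed not to overlap, and it is only an assumption that a
compiler turns each pair of 8-byte moves into one LDP and one STP.  It
says nothing about whether the pattern is actually faster.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: move one 16-byte chunk through two 64-bit
   temporaries.  On AArch64 a compiler may lower the two loads to a
   single LDP and the two stores to a single STP, but that is an
   assumption, not a guarantee. */
static inline void copy16(unsigned char *dst, const unsigned char *src)
{
    uint64_t lo, hi;
    memcpy(&lo, src, 8);        /* load pair  */
    memcpy(&hi, src + 8, 8);
    memcpy(dst, &lo, 8);        /* store pair */
    memcpy(dst + 8, &hi, 8);
}

/* Hypothetical routine: copy n bytes between non-overlapping buffers.
   The main loop moves 64 bytes per iteration, alternating a load pair
   with a store pair as in the quoted loop; any remainder is copied
   one byte at a time. */
static void copy_interleaved(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));

    while (n >= 64) {
        copy16(d,      s);
        copy16(d + 16, s + 16);
        copy16(d + 32, s + 32);
        copy16(d + 48, s + 48);
        d += 64;
        s += 64;
        n -= 64;
    }
    while (n > 0) {             /* tail: fewer than 64 bytes left */
        *d++ = *s++;
        n--;
    }
}
```

Going through memcpy() for each 8-byte move keeps the sketch free of
alignment and aliasing problems.  A hand-written stub would use LDP/STP
directly.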

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

