[aarch64-port-dev ] Optimized memcpy() for Cortex
Andrew Haley
aph at redhat.com
Fri Aug 4 15:29:53 UTC 2017
The Cortex®-A57/A72 processor manual contains this gem:
-----------------------------------------------------------------
The Cortex-A57 processor includes separate load and store pipelines,
which allow it to execute one load μop and one store μop every
cycle.
The following example shows a recommended instruction sequence for a
long memory copy in AArch32 state:
Loop_start:
    SUBS r2,r2,#64
    LDRD r3,r4,[r1,#0]
    STRD r3,r4,[r0,#0]
    LDRD r3,r4,[r1,#8]
    STRD r3,r4,[r0,#8]
    LDRD r3,r4,[r1,#16]
    STRD r3,r4,[r0,#16]
    LDRD r3,r4,[r1,#24]
    STRD r3,r4,[r0,#24]
    LDRD r3,r4,[r1,#32]
    STRD r3,r4,[r0,#32]
    LDRD r3,r4,[r1,#40]
    STRD r3,r4,[r0,#40]
    LDRD r3,r4,[r1,#48]
    STRD r3,r4,[r0,#48]
    LDRD r3,r4,[r1,#56]
    STRD r3,r4,[r0,#56]
    ADD r1,r1,#64
    ADD r0,r0,#64
    BGT Loop_start
A recommended copy routine for AArch64 would look similar to the
sequence above, but would use LDP/STP instructions.
-----------------------------------------------------------------
Our copy routines don't do this. I don't know if it would help.
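
For illustration only, here is a rough C model of the access pattern the
manual describes: 64 bytes per iteration, with each paired load followed
immediately by the matching paired store. It is a sketch, not our copy
code. The name copy64_interleaved is hypothetical, and it assumes the
length is a nonzero multiple of 64 and that the buffers do not overlap.
Each pair of 8-byte moves stands in for one LDP or STP of two X
registers; whether a compiler actually emits LDP/STP for it is not
guaranteed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not taken from HotSpot.
 * Assumptions: n is a nonzero multiple of 64; dst and src do not overlap. */
static void copy64_interleaved(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    uint64_t a, b;

    do {
        /* Four load-pair/store-pair steps of 16 bytes each = 64 bytes. */
        for (size_t off = 0; off < 64; off += 16) {
            memcpy(&a, s + off, 8);        /* models LDP a, b, [src, #off] */
            memcpy(&b, s + off + 8, 8);
            memcpy(d + off, &a, 8);        /* models STP a, b, [dst, #off] */
            memcpy(d + off + 8, &b, 8);
        }
        s += 64;
        d += 64;
        n -= 64;
    } while (n > 0);
}
```

Like the AArch32 loop above, this always runs at least one iteration, so
a caller would have to handle short or unaligned-length copies
separately.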
--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671