[aarch64-port-dev ] AArch64: follow up array copy investigation on misaligned peeling
Hui Shi
hui.shi at linaro.org
Wed Feb 17 13:21:11 UTC 2016
Hi Andrew and all,
Following up on the earlier discussion about forward and backward array copy
performance, the current findings are:
1. Optimizing misaligned loads/stores in the backward array copy does not
help array copy performance, so I suggest leaving it unchanged for now.
2. There is some chance to optimize array copy peeling/tailing with a
combined 8-byte load/store, but it might introduce extra stubs and
complicate the code.
Would you please comment?
First, I removed the unaligned references by reordering the peeling copies
from small to large (copy 1 byte first, 8 bytes last). However, this is even
slightly slower than the original implementation.
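In Java terms the reordered peel looks roughly like the sketch below. This is
an illustrative model only, using java.nio.ByteBuffer as a stand-in for raw
memory; the method and parameter names are mine, not from the patch:

    // Returns the new source offset; the destination offset advances by the
    // same amount. Peeling 1, then 2, then 4, then 8 bytes (driven by the low
    // bits of the peel count) keeps each peeled access naturally aligned once
    // the smaller pieces have been copied.
    static int peelSmallToLarge(ByteBuffer src, ByteBuffer dst,
                                int s, int d, int peel) {
        if ((peel & 1) != 0) { dst.put(d, src.get(s));           s += 1; d += 1; }
        if ((peel & 2) != 0) { dst.putShort(d, src.getShort(s)); s += 2; d += 2; }
        if ((peel & 4) != 0) { dst.putInt(d, src.getInt(s));     s += 4; d += 4; }
        if ((peel & 8) != 0) { dst.putLong(d, src.getLong(s));   s += 8; d += 8; }
        return s; // the main 8-byte copy loop continues from here
    }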
The test case is http://people.linaro.org/~hui.shi/arraycopy/TestPeelAlign.java
Performance results are in
http://people.linaro.org/~hui.shi/arraycopy/arraycopy_align_and_combine_Test.pdf
The patch is http://people.linaro.org/~hui.shi/arraycopy/peelingFromSmall.patch
The test case is a typical backward array copy scenario (insert an element
into an array and move the tail of the array backward, roughly as sketched
below). Profiling shows that the UNALIGNED_LDST_SPEC event count drops a lot
with the patch, yet performance does not improve. My understanding is that a
load whose address crosses a cache line boundary might trigger the hardware
prefetcher earlier than an aligned access would, so fixing the unaligned
accesses does not seem to help in array copy peeling.
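For reference, the shape of the scenario is something like this (a minimal
illustration, not the actual benchmark code in TestPeelAlign.java):

    byte[] a = new byte[1024];
    int pos = 3; // arbitrary insertion point
    // Overlapping copy with dst > src, so it takes the backward copy path.
    System.arraycopy(a, pos, a, pos + 1, a.length - pos - 1);
    a[pos] = 42;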
Second, since unaligned access does not show any degradation in this case, a
further experiment folds the consecutive branch/load/store sequences into one
8-byte unaligned load/store; a sketch of the idea follows below, and the
updated stub code for byte array copy is shown at the end of this mail. This
is legal when the distance between src and dst is larger than 8 bytes, which
holds in cases like String.getChars and String.getBytes. Trying different
combinations, it works best for byte array copy and is still helpful for
short array copy; the "opt" column in the results PDF is for this
optimization.
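The idea of the fold, again as an illustrative ByteBuffer model rather than
the stub code itself (names are mine):

    // One unaligned 8-byte load/store replaces the tbz-guarded 1/2/4-byte
    // copies. The store writes up to 7 bytes past the real head; those bytes
    // are rewritten by the main loop, so this is only correct when dst is at
    // least 8 bytes away from src.
    static void copyHead(ByteBuffer src, int s, ByteBuffer dst, int d, int head) {
        assert 0 <= head && head < 8;
        dst.putLong(d, src.getLong(s)); // covers 8 bytes, of which 'head' are real
        // the caller continues the aligned main loop from s + head and d + head
    }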
For the StringConcat test
(http://people.linaro.org/~hui.shi/arraycopy/StringConcatTest.java), although
array copy takes only about 25% of the cycles in this test, the whole test
still sees a 3.5% improvement with this combined load/store optimization.
However, I wonder whether this is the proper way to improve these
test-bit-load-store code sequences. It would require extra, truly "disjoint"
array copy stub code, since the current disjoint array copy only means the
copy can safely be performed forward; the alternative is to introduce a "no
overlap" test at runtime, roughly as sketched below. My personal tradeoff is
to leave the array copy code unchanged for now and keep it simple and
consistent.
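The runtime guard would have to check something like the following
(hypothetical shape, names are mine):

    // The combined 8-byte head store must not clobber source bytes that have
    // not been read yet, so src and dst must be different arrays or at least
    // 8 bytes apart.
    static boolean combinedHeadIsSafe(byte[] src, int srcPos,
                                      byte[] dst, int dstPos) {
        return src != dst || Math.abs(dstPos - srcPos) >= 8;
    }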
Before patch
StubRoutines::jbyte_disjoint_arraycopy [0x0000ffff7897f7c0, 0x0000ffff7897f860[ (160 bytes)
0x0000ffff7897f7e0: tbz w9, #3, Stub::jbyte_disjoint_arraycopy+44 0x0000ffff7897f7ec
0x0000ffff7897f7e4: ldr x8, [x0],#8
0x0000ffff7897f7e8: str x8, [x1],#8
0x0000ffff7897f7ec: tbz w9, #2, Stub::jbyte_disjoint_arraycopy+56 0x0000ffff7897f7f8
0x0000ffff7897f7f0: ldr w8, [x0],#4
0x0000ffff7897f7f4: str w8, [x1],#4
0x0000ffff7897f7f8: tbz w9, #1, Stub::jbyte_disjoint_arraycopy+68 0x0000ffff7897f804
0x0000ffff7897f7fc: ldrh w8, [x0],#2
0x0000ffff7897f800: strh w8, [x1],#2
0x0000ffff7897f804: tbz w9, #0, Stub::jbyte_disjoint_arraycopy+80 0x0000ffff7897f810
0x0000ffff7897f808: ldrb w8, [x0],#1
0x0000ffff7897f80c: strb w8, [x1],#1
0x0000ffff7897f810: cmp x2, #0x10
0x0000ffff7897f814: b.lt Stub::jbyte_disjoint_arraycopy+96 0x0000ffff7897f820
0x0000ffff7897f818: lsr x9, x2, #3
0x0000ffff7897f81c: bl Stub::foward_copy_longs+28 0x0000ffff7897f5c0
Code after patch
StubRoutines::jbyte_disjoint_arraycopy [0x0000ffff6c97f7c0, 0x0000ffff6c97f87c[ (188 bytes)
// peeling for alignment
0x0000ffff6c97f7e0: tbz w9, #3, Stub::jbyte_disjoint_arraycopy+48 0x0000ffff6c97f7f0
0x0000ffff6c97f7e4: sub x9, x9, #0x8
0x0000ffff6c97f7e8: ldr x8, [x0],#8
0x0000ffff6c97f7ec: str x8, [x1],#8
0x0000ffff6c97f7f0: ldr x8, [x0]
0x0000ffff6c97f7f4: str x8, [x1]
0x0000ffff6c97f7f8: add x0, x0, x9
0x0000ffff6c97f7fc: add x1, x1, x9
0x0000ffff6c97f800: cmp x2, #0x10
0x0000ffff6c97f804: b.lt Stub::jbyte_disjoint_arraycopy+124 0x0000ffff6c97f83c
0x0000ffff6c97f808: lsr x9, x2, #3
0x0000ffff6c97f80c: bl Stub::foward_copy_longs+28 0x0000ffff6c97f5c0
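Comparing the two stubs, the alignment peel now has a single tbz instead of
four: the 4/2/1-byte peel copies are folded into the one unconditional
(possibly unaligned) 8-byte ldr/str, and x0/x1 are then advanced by the
remaining head count in x9.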
Regards
Hui