[aarch64-port-dev ] AArch64: follow up array copy investigation on misaligned peeling

Hui Shi hui.shi at linaro.org
Wed Feb 17 13:21:11 UTC 2016


Hi Andrew and all,

Following up on the earlier discussion about forward and backward array copy
performance, the current findings are:
1. Optimizing the misaligned loads/stores in the backward array copy does not
help array copy performance, so I suggest leaving it unchanged for now.
2. There is some opportunity to optimize the array copy peeling/tailing with a
combined 8-byte load/store, but this might introduce extra stubs and
complicate the code.
Could you please comment?

Firstly, I removed the unaligned references by reordering the peeling copies
from small to large (copy 1 byte first, 8 bytes last). However, this is even
slightly slower than the original implementation.
Test case: http://people.linaro.org/~hui.shi/arraycopy/TestPeelAlign.java
Performance results:
http://people.linaro.org/~hui.shi/arraycopy/arraycopy_align_and_combine_Test.pdf
Patch: http://people.linaro.org/~hui.shi/arraycopy/peelingFromSmall.patch
The test case is a typical backward array copy scenario (insert some elements
into an array and move the tail of the array backward). From profiling, the
UNALIGNED_LDST_SPEC event count drops a lot with the patch. My understanding is
that a load whose address crosses a cache line boundary may trigger the
hardware prefetcher earlier than an aligned access would, so fixing the
unaligned accesses does not seem helpful for array copy peeling.
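
For reference, a minimal C-level sketch of the small-to-large peel order (an
illustration only, not the HotSpot stub generator code; the 'peel' parameter
here is a hypothetical stand-in for the w9 peel count in the stubs below):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration: 'peel' is the number of leading bytes to copy
 * before the 8-byte bulk loop.  Copying the smallest units first means the
 * later, larger accesses see an address that the smaller steps have already
 * advanced to a better-aligned boundary. */
static void peel_small_to_large(uint8_t **src, uint8_t **dst, size_t peel)
{
    uint64_t v;
    if (peel & 1) { **dst = **src; *src += 1; *dst += 1; }          /* 1 byte  */
    if (peel & 2) { memcpy(&v, *src, 2); memcpy(*dst, &v, 2);
                    *src += 2; *dst += 2; }                         /* 2 bytes */
    if (peel & 4) { memcpy(&v, *src, 4); memcpy(*dst, &v, 4);
                    *src += 4; *dst += 4; }                         /* 4 bytes */
    if (peel & 8) { memcpy(&v, *src, 8); memcpy(*dst, &v, 8);
                    *src += 8; *dst += 8; }                         /* 8 bytes */
}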

Secondly, as the unaligned access does not show a degradation in this case, the
further experiment folds the consecutive branches/loads/stores into a single
8-byte unaligned load/store. The updated stub code for byte array copy is shown
below. This is legal when the distance between src and dst is greater than 8
bytes, which is safe in cases like String.getChars and String.getBytes. Trying
different combinations, it works best for byte array copy and is still helpful
for short array copy; the "opt" column in the pdf shows the results for this
optimization.

For the StringConcat test
(http://people.linaro.org/~hui.shi/arraycopy/StringConcatTest.java), although
array copy takes only 25% of the cycles in this test, the whole test still sees
a 3.5% improvement with this combined load/store optimization. However, I
wonder whether this is the proper way to improve these test-bit-load-store code
sequences. It would require extra, truly "disjoint" array copy stub code (the
current disjoint array copy only means the copy can safely be performed
forward), or a "no overlap" test introduced at runtime. My personal tradeoff is
to leave the array copy code unchanged and keep it simple and consistent for
now.
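
To illustrate what such a runtime "no overlap" test might look like (a
hypothetical sketch, not existing HotSpot code), the guard would only admit the
combined 8-byte peel when src and dst are at least 8 bytes apart:

#include <stdint.h>

/* Hypothetical guard: the combined 8-byte peel writes up to 7 extra bytes
 * past the real destination range, so it is only safe when src and dst are
 * at least 8 bytes apart, as noted above. */
static int can_use_combined_peel(const void *src, const void *dst)
{
    uintptr_t s = (uintptr_t)src;
    uintptr_t d = (uintptr_t)dst;
    uintptr_t dist = (s > d) ? (s - d) : (d - s);
    return dist >= 8;
}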

Before patch
StubRoutines::jbyte_disjoint_arraycopy [0x0000ffff7897f7c0, 0x0000ffff7897f860[ (160 bytes)
  0x0000ffff7897f7e0: tbz       w9, #3, Stub::jbyte_disjoint_arraycopy+44 0x0000ffff7897f7ec
  0x0000ffff7897f7e4: ldr       x8, [x0],#8
  0x0000ffff7897f7e8: str       x8, [x1],#8
  0x0000ffff7897f7ec: tbz       w9, #2, Stub::jbyte_disjoint_arraycopy+56 0x0000ffff7897f7f8
  0x0000ffff7897f7f0: ldr       w8, [x0],#4
  0x0000ffff7897f7f4: str       w8, [x1],#4
  0x0000ffff7897f7f8: tbz       w9, #1, Stub::jbyte_disjoint_arraycopy+68 0x0000ffff7897f804
  0x0000ffff7897f7fc: ldrh      w8, [x0],#2
  0x0000ffff7897f800: strh      w8, [x1],#2
  0x0000ffff7897f804: tbz       w9, #0, Stub::jbyte_disjoint_arraycopy+80 0x0000ffff7897f810
  0x0000ffff7897f808: ldrb      w8, [x0],#1
  0x0000ffff7897f80c: strb      w8, [x1],#1
  0x0000ffff7897f810: cmp       x2, #0x10
  0x0000ffff7897f814: b.lt      Stub::jbyte_disjoint_arraycopy+96 0x0000ffff7897f820
  0x0000ffff7897f818: lsr       x9, x2, #3
  0x0000ffff7897f81c: bl        Stub::foward_copy_longs+28 0x0000ffff7897f5c0

Code after patch
StubRoutines::jbyte_disjoint_arraycopy [0x0000ffff6c97f7c0, 0x0000ffff6c97f87c[ (188 bytes)
// peeling for alignment
  0x0000ffff6c97f7e0: tbz       w9, #3, Stub::jbyte_disjoint_arraycopy+48 0x0000ffff6c97f7f0
  0x0000ffff6c97f7e4: sub       x9, x9, #0x8
  0x0000ffff6c97f7e8: ldr       x8, [x0],#8
  0x0000ffff6c97f7ec: str       x8, [x1],#8
  0x0000ffff6c97f7f0: ldr       x8, [x0]
  0x0000ffff6c97f7f4: str       x8, [x1]
  0x0000ffff6c97f7f8: add       x0, x0, x9
  0x0000ffff6c97f7fc: add       x1, x1, x9
  0x0000ffff6c97f800: cmp       x2, #0x10
  0x0000ffff6c97f804: b.lt      Stub::jbyte_disjoint_arraycopy+124 0x0000ffff6c97f83c
  0x0000ffff6c97f808: lsr       x9, x2, #3
  0x0000ffff6c97f80c: bl        Stub::foward_copy_longs+28 0x0000ffff6c97f5c0
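
For clarity, a rough C-level equivalent of the combined peel above (an
illustration only, not the generated stub): after the optional aligned 8-byte
step, one unaligned 8-byte load/store covers the remaining 0-7 peel bytes and
deliberately over-copies, and the pointers then advance only by the real
leftover count. The over-copied bytes are rewritten correctly by the bulk
loop, which is why src and dst must be at least 8 bytes apart.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustration only: 'peel' plays the role of the x9/w9 peel count above. */
static void peel_combined(uint8_t **src, uint8_t **dst, size_t peel)
{
    uint64_t v;
    if (peel & 8) {                      /* tbz w9, #3 / ldr, str post-indexed */
        memcpy(&v, *src, 8);
        memcpy(*dst, &v, 8);
        *src += 8; *dst += 8;
        peel -= 8;                       /* sub x9, x9, #0x8                   */
    }
    memcpy(&v, *src, 8);                 /* one unaligned 8-byte load ...      */
    memcpy(*dst, &v, 8);                 /* ... and store, over-copying        */
    *src += peel;                        /* add x0, x0, x9                     */
    *dst += peel;                        /* add x1, x1, x9                     */
}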

Regards
Hui

