AArch64: follow up array copy investigation on misaligned peeling
Hi Andrew and all,

Following up on the earlier discussion about forward and backward array copy performance, my current findings are:

1. Optimizing misaligned load/store in the backward array copy does not help array copy performance; I suggest leaving it unchanged for now.

2. There is some opportunity to optimize array copy peeling/tailing with a combined 8-byte load/store, but it might introduce extra stubs and complicate the code.

Would you please help comment?

Firstly, I removed unaligned references by reordering the peeling copies from small to large (copy 1 byte first, 8 bytes last). However, this is even a little slower than the original implementation.

Test case: http://people.linaro.org/~hui.shi/arraycopy/TestPeelAlign.java
Performance results: http://people.linaro.org/~hui.shi/arraycopy/arraycopy_align_and_combine_Test...
Patch: http://people.linaro.org/~hui.shi/arraycopy/peelingFromSmall.patch

The test case is a typical backward array copy scenario (insert some elements into an array and move the tail of the array backward). From profiling, the UNALIGNED_LDST_SPEC event count drops a lot with the patch. My understanding is that a load whose address crosses a cache line boundary might trigger the hardware prefetcher earlier than an aligned access would, so fixing the unaligned accesses does not seem helpful in array copy peeling. A sketch of the reordering idea follows.
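To make the small-to-large ordering concrete, here is a minimal C model of what peelingFromSmall.patch does for the head bytes (the helper name and shape are hypothetical; only the copy order matters). Once the 1-byte unit is copied the destination is 2-aligned, after the 2-byte unit it is 4-aligned, and so on, so every later access is aligned to its own width:

    #include <stdint.h>
    #include <string.h>

    /* Copy the 'head' bytes that precede the 8-byte-aligned main loop,
     * smallest unit first, so each subsequent access is naturally
     * aligned (assuming the destination is being aligned to 8 bytes). */
    static void peel_head_small_first(uint8_t **src, uint8_t **dst, size_t head)
    {
        if (head & 1) { **dst = **src; *src += 1; *dst += 1; }
        if (head & 2) { uint16_t t; memcpy(&t, *src, 2); memcpy(*dst, &t, 2);
                        *src += 2; *dst += 2; }
        if (head & 4) { uint32_t t; memcpy(&t, *src, 4); memcpy(*dst, &t, 4);
                        *src += 4; *dst += 4; }
        if (head & 8) { uint64_t t; memcpy(&t, *src, 8); memcpy(*dst, &t, 8);
                        *src += 8; *dst += 8; }
    }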
Secondly, as unaligned access does not show degradation in this case, a further experiment is folding the consecutive branch/load/store sequences into one 8-byte unaligned load/store. Below is the before and after stub code for byte array copy. This is legal when the distance between src and dst is bigger than 8 bytes, which is safe in cases like String.getChars and String.getBytes. I performed different combination tests; it works best for byte array copy and is still helpful for short array copy. Check the results in the PDF: the "opt" column is for this optimization.

Before patch:

    StubRoutines::jbyte_disjoint_arraycopy [0x0000ffff7897f7c0, 0x0000ffff7897f860[ (160 bytes)
      0x0000ffff7897f7e0: tbz   w9, #3, Stub::jbyte_disjoint_arraycopy+44 0x0000ffff7897f7ec
      0x0000ffff7897f7e4: ldr   x8, [x0],#8
      0x0000ffff7897f7e8: str   x8, [x1],#8
      0x0000ffff7897f7ec: tbz   w9, #2, Stub::jbyte_disjoint_arraycopy+56 0x0000ffff7897f7f8
      0x0000ffff7897f7f0: ldr   w8, [x0],#4
      0x0000ffff7897f7f4: str   w8, [x1],#4
      0x0000ffff7897f7f8: tbz   w9, #1, Stub::jbyte_disjoint_arraycopy+68 0x0000ffff7897f804
      0x0000ffff7897f7fc: ldrh  w8, [x0],#2
      0x0000ffff7897f800: strh  w8, [x1],#2
      0x0000ffff7897f804: tbz   w9, #0, Stub::jbyte_disjoint_arraycopy+80 0x0000ffff7897f810
      0x0000ffff7897f808: ldrb  w8, [x0],#1
      0x0000ffff7897f80c: strb  w8, [x1],#1
      0x0000ffff7897f810: cmp   x2, #0x10
      0x0000ffff7897f814: b.lt  Stub::jbyte_disjoint_arraycopy+96 0x0000ffff7897f820
      0x0000ffff7897f818: lsr   x9, x2, #3
      0x0000ffff7897f81c: bl    Stub::foward_copy_longs+28 0x0000ffff7897f5c0

After patch:

    StubRoutines::jbyte_disjoint_arraycopy [0x0000ffff6c97f7c0, 0x0000ffff6c97f87c[ (188 bytes)
      // peeling for alignment
      0x0000ffff6c97f7e0: tbz   w9, #3, Stub::jbyte_disjoint_arraycopy+48 0x0000ffff6c97f7f0
      0x0000ffff6c97f7e4: sub   x9, x9, #0x8
      0x0000ffff6c97f7e8: ldr   x8, [x0],#8
      0x0000ffff6c97f7ec: str   x8, [x1],#8
      0x0000ffff6c97f7f0: ldr   x8, [x0]
      0x0000ffff6c97f7f4: str   x8, [x1]
      0x0000ffff6c97f7f8: add   x0, x0, x9
      0x0000ffff6c97f7fc: add   x1, x1, x9
      0x0000ffff6c97f800: cmp   x2, #0x10
      0x0000ffff6c97f804: b.lt  Stub::jbyte_disjoint_arraycopy+124 0x0000ffff6c97f83c
      0x0000ffff6c97f808: lsr   x9, x2, #3
      0x0000ffff6c97f80c: bl    Stub::foward_copy_longs+28 0x0000ffff6c97f5c0
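As a rough C model of what the patched peeling above does (again a hypothetical helper, not the generated code): after the optional first 8-byte chunk, one unaligned 8-byte load/store covers all of the remaining 0-7 head bytes at once, and the pointers advance by only the exact head count. The extra bytes written past the head are simply rewritten with the same values by the main copy loop, which is why a distance of more than 8 bytes between src and dst is mandatory:

    #include <stdint.h>
    #include <string.h>

    /* head is in 0..15; the caller is assumed to guarantee that at
     * least 8 more payload bytes follow and that src and dst are more
     * than 8 bytes apart (a truly disjoint copy). */
    static void peel_head_combined(uint8_t **src, uint8_t **dst, size_t head)
    {
        uint64_t t;
        if (head & 8) {                     /* tbz w9, #3 / sub x9, x9, #8 */
            memcpy(&t, *src, 8);            /* ldr x8, [x0], #8            */
            memcpy(*dst, &t, 8);            /* str x8, [x1], #8            */
            *src += 8; *dst += 8; head -= 8;
        }
        memcpy(&t, *src, 8);                /* ldr x8, [x0]                */
        memcpy(*dst, &t, 8);                /* str x8, [x1]                */
        *src += head;                       /* add x0, x0, x9              */
        *dst += head;                       /* add x1, x1, x9              */
    }

The runtime "no overlap" test mentioned below would conceptually just verify that the two regions are more than 8 bytes apart before taking this path.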
For the StringConcat test (http://people.linaro.org/~hui.shi/arraycopy/StringConcatTest.java), though array copy only takes 25% of the cycles in this test, the entire test still sees a 3.5% improvement with this combined load/store optimization. However, I wonder if this is the proper way to improve these test-bit-load-store code sequences. It would require an extra, really "disjoint" array copy stub: the current disjoint array copy only means it can safely perform a forward array copy. Alternatively, we could introduce a "no overlap" test at runtime. My personal tradeoff is to leave the array copy code unchanged and keep it simple and consistent for now.

Regards,
Hui

On 02/17/2016 01:21 PM, Hui Shi wrote:
> For the StringConcat test, though array copy only takes 25% of the
> cycles in this test, the entire test still sees a 3.5% improvement
> with this combined load/store optimization. However, I wonder if this
> is the proper way to improve these test-bit-load-store code
> sequences. [...] My personal tradeoff is to leave the array copy code
> unchanged and keep it simple and consistent for now.
OK, that makes sense. My plan (such as it is) for tidying up the tail code is to convert the three bit-test-and-branches into a single 8-way computed jump with an optimum sequence for each of the 8 cases. Sure, it will usually be mispredicted, but it's just a single jump.

But really, once we're down to 3.5% on a contrived string-concatenation-intensive test, it's questionable whether this is what we need to be spending time on.

Thanks,
Andrew.
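To illustrate the dispatch Andrew describes, here is a rough C model of an 8-way computed jump over the tail bytes (a sketch of the idea only, not actual HotSpot stub code). A compiler typically lowers a dense switch like this to a single indirect jump through a table, replacing the three tbz chains with one branch:

    #include <stdint.h>
    #include <string.h>

    /* Copy the final (count & 7) bytes with one indirect jump instead
     * of three bit-test-and-branch pairs; each case is the minimal
     * straight-line load/store sequence for that tail length. */
    static void copy_tail_8way(const uint8_t *src, uint8_t *dst, size_t count)
    {
        switch (count & 7) {
        case 7: memcpy(dst, src, 4); memcpy(dst + 4, src + 4, 2);
                dst[6] = src[6]; break;
        case 6: memcpy(dst, src, 4); memcpy(dst + 4, src + 4, 2); break;
        case 5: memcpy(dst, src, 4); dst[4] = src[4]; break;
        case 4: memcpy(dst, src, 4); break;
        case 3: memcpy(dst, src, 2); dst[2] = src[2]; break;
        case 2: memcpy(dst, src, 2); break;
        case 1: dst[0] = src[0]; break;
        case 0: break;
        }
    }

Each case is straight-line code, so the only unpredictable control transfer is the table jump itself, matching the single-misprediction cost described above.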