[aarch64-port-dev ] RFR: 8242029: AArch64: skip G1 array copy pre-barrier if marking not active
Andrew Haley
aph at redhat.com
Thu Apr 9 16:31:38 UTC 2020
On 4/9/20 9:59 AM, Nick Gasson wrote:
> It has a nice speedup on the ArrayCopy microbenchmarks, but I agree this
> sort of thing is a maintenance burden if it doesn't affect real
> workloads.
Now you've got me interested. :-)
I'm looking at the code we execute when we call the runtime. The
call_VM_leaf() we generate is
0x0000ffffa913a6ec: mov x0, x1
0x0000ffffa913a6f0: mov x1, x2
0x0000ffffa913a6f4: stp x8, x12, [sp, #-16]!
;; 0xFFFFBCE50CD4
0x0000ffffa913a6f8: mov x8, #0xcd4 // #3284
0x0000ffffa913a6fc: movk x8, #0xbce5, lsl #16
0x0000ffffa913a700: movk x8, #0xffff, lsl #32
0x0000ffffa913a704: blr x8
0x0000ffffa913a708: ldp x8, x12, [sp], #16
0x0000ffffa913a70c: isb
As discussed, we can lose the ISB here. If we're not called from the
interpreter we can also lose the saving of r12 and rscratch1.
This calls G1BarrierSetRuntime::write_ref_array_post_entry()
=> 0x0000ffffbd89d750 <+0>: adrp x2, 0xffffbe2ae000
0x0000ffffbd89d754 <+4>: adrp x4, 0xffffbe2aa000
0x0000ffffbd89d758 <+8>: and x3, x0, #0xfffffffffffffff8
0x0000ffffbd89d75c <+12>: ldr x2, [x2, #264]
0x0000ffffbd89d760 <+16>: ldr x4, [x4, #2024]
0x0000ffffbd89d764 <+20>: ldrsw x2, [x2]
0x0000ffffbd89d768 <+24>: madd x2, x2, x1, x0
0x0000ffffbd89d76c <+28>: ldr x0, [x4]
0x0000ffffbd89d770 <+32>: add x2, x2, #0x7
0x0000ffffbd89d774 <+36>: and x2, x2, #0xfffffffffffffff8
0x0000ffffbd89d778 <+40>: adrp x4, 0xffffbd895000
0x0000ffffbd89d77c <+44>: sub x2, x2, x3
0x0000ffffbd89d780 <+48>: add x4, x4, #0x640
0x0000ffffbd89d784 <+52>: ldr x5, [x0]
0x0000ffffbd89d788 <+56>: lsr x2, x2, #3
0x0000ffffbd89d78c <+60>: ldr x7, [x5, #88]
0x0000ffffbd89d790 <+64>: cmp x7, x4
0x0000ffffbd89d794 <+68>: b.ne 0xffffbd89d7a8
0x0000ffffbd89d798 <+72>: ldr x4, [x5, #56]
0x0000ffffbd89d79c <+76>: mov x1, x3
0x0000ffffbd89d7a0 <+80>: mov x16, x4
0x0000ffffbd89d7a4 <+84>: br x16
which seems to be a bunch of stuff to discover the addresses to scan,
aligning them properly, followed by a virtual dispatch to
G1BarrierSet::invalidate(), which contains the loop which scans the
card table:
0x0000ffffbda250a0 <+0>: cbz x2, 0xffffbda25170 <G1BarrierSet::invalidate(MemRegion)+208>
0x0000ffffbda250a4 <+4>: stp x29, x30, [sp, #-48]!
0x0000ffffbda250a8 <+8>: add x2, x1, x2, lsl #3
0x0000ffffbda250ac <+12>: mov x29, sp
0x0000ffffbda250b0 <+16>: str x21, [sp, #32]
0x0000ffffbda250b4 <+20>: sub x21, x2, #0x8
0x0000ffffbda250b8 <+24>: ldr x0, [x0, #64]
0x0000ffffbda250bc <+28>: ldr x0, [x0, #72]
0x0000ffffbda250c0 <+32>: add x1, x0, x1, lsr #9
0x0000ffffbda250c4 <+36>: add x21, x0, x21, lsr #9
0x0000ffffbda250c8 <+40>: cmp x21, x1
0x0000ffffbda250cc <+44>: b.cc 0xffffbda25164 <G1BarrierSet::invalidate(MemRegion)+196> // b.lo, b.ul, b.last
0x0000ffffbda250d0 <+48>: stp x19, x20, [sp, #16]
0x0000ffffbda250d4 <+52>: b 0xffffbda250e0 <G1BarrierSet::invalidate(MemRegion)+64>
0x0000ffffbda250d8 <+56>: cmp x21, x1
0x0000ffffbda250dc <+60>: b.cc 0xffffbda25160 <G1BarrierSet::invalidate(MemRegion)+192> // b.lo, b.ul, b.last
0x0000ffffbda250e0 <+64>: ldrb w0, [x1]
0x0000ffffbda250e4 <+68>: mov x19, x1
0x0000ffffbda250e8 <+72>: add x1, x1, #0x1
0x0000ffffbda250ec <+76>: and w0, w0, #0xff
0x0000ffffbda250f0 <+80>: cmp w0, #0x8
0x0000ffffbda250f4 <+84>: b.eq 0xffffbda250d8 <G1BarrierSet::invalidate(MemRegion)+56> // b.none
...
0x0000ffffbda25160 <+192>: ldp x19, x20, [sp, #16]
0x0000ffffbda25164 <+196>: ldr x21, [sp, #32]
0x0000ffffbda25168 <+200>: ldp x29, x30, [sp], #48
0x0000ffffbda2516c <+204>: ret
This clearly is a fair bit more than what we'd do by hand. The thing
that baffles me, I guess, is why the runtime does all this extra
stuff.
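For anyone following along, here is a rough C++ sketch of what the visible instructions in the two listings appear to compute. It is read off the disassembly, not taken from the HotSpot source, and every name in it is made up for illustration. The constants (8-byte alignment, shift by 9, comparison against 8) come straight from the instructions. Indexing the card array directly by address >> 9 stands in for the biased base pointer the real code loads, and the part of the loop hidden behind the "..." is deliberately not reconstructed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Constants taken from the disassembly; the names are hypothetical.
constexpr std::uintptr_t kWordMask  = ~std::uintptr_t{7};  // and ..., #0xfffffffffffffff8
constexpr unsigned       kWordShift = 3;                   // lsr x2, x2, #3
constexpr unsigned       kCardShift = 9;                   // add ..., lsr #9
constexpr std::uint8_t   kSkipValue = 8;                   // cmp w0, #0x8

struct WordRange {
  std::uintptr_t start;  // destination address aligned down to 8 bytes
  std::size_t    words;  // length of the aligned range in 8-byte words
};

// First listing: align the start down, align the end up, count the words.
// oop_size corresponds to the value fetched with ldrsw.
WordRange word_range(std::uintptr_t dst, std::size_t count, std::size_t oop_size) {
  std::uintptr_t start = dst & kWordMask;
  std::uintptr_t end   = (dst + oop_size * count + 7) & kWordMask;
  return {start, (end - start) >> kWordShift};
}

struct CardSpan {
  std::size_t first;  // card covering the first word
  std::size_t last;   // card covering the last word (inclusive)
};

// Second listing, prologue: map the first and last word to card indices.
// The real code returns early (cbz x2) when the range is empty.
CardSpan card_span(const WordRange& r) {
  assert(r.words > 0);
  std::uintptr_t last_word = r.start + (r.words << kWordShift) - 8;
  return {r.start >> kCardShift, last_word >> kCardShift};
}

// Second listing, visible loop: step over leading cards whose value is
// kSkipValue. Returns the index of the first card that differs, or
// s.last + 1 if every card in the span matches.
std::size_t skip_matching_cards(const std::vector<std::uint8_t>& cards, CardSpan s) {
  std::size_t i = s.first;
  while (i <= s.last && cards[i] == kSkipValue) {
    ++i;
  }
  return i;
}
```

Written out like this, the work itself is small: a handful of mask and shift operations followed by a byte-at-a-time scan, which is the part one would expect hand-written assembly to reproduce.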
--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671