[aarch64-port-dev ] RFR: 8242029: AArch64: skip G1 array copy pre-barrier if marking not active

Thu Apr 9 16:31:38 UTC 2020

On 4/9/20 9:59 AM, Nick Gasson wrote:
> It has a nice speedup on the ArrayCopy microbenchmarks, but I agree this
> sort of thing is a maintenance burden if it doesn't affect real
> workloads.

Now you've got me interested.  :-)

I'm looking at the code we execute when we call the runtime. The
call_VM_leaf() we generate is

  0x0000ffffa913a6ec:   mov	x0, x1
  0x0000ffffa913a6f0:   mov	x1, x2
  0x0000ffffa913a6f4:   stp	x8, x12, [sp, #-16]!
 ;; 0xFFFFBCE50CD4
  0x0000ffffa913a6f8:   mov	x8, #0xcd4                 	// #3284
  0x0000ffffa913a6fc:   movk	x8, #0xbce5, lsl #16
  0x0000ffffa913a700:   movk	x8, #0xffff, lsl #32
  0x0000ffffa913a704:   blr	x8
  0x0000ffffa913a708:   ldp	x8, x12, [sp], #16
  0x0000ffffa913a70c:   isb

As discussed, we can lose the ISB here. If we're not called from the
interpreter we can also lose the saving of r12 and rscratch1.

This calls G1BarrierSetRuntime::write_ref_array_post_entry()

=> 0x0000ffffbd89d750 <+0>:	adrp	x2, 0xffffbe2ae000
   0x0000ffffbd89d754 <+4>:	adrp	x4, 0xffffbe2aa000
   0x0000ffffbd89d758 <+8>:	and	x3, x0, #0xfffffffffffffff8
   0x0000ffffbd89d75c <+12>:	ldr	x2, [x2, #264]
   0x0000ffffbd89d760 <+16>:	ldr	x4, [x4, #2024]
   0x0000ffffbd89d764 <+20>:	ldrsw	x2, [x2]
   0x0000ffffbd89d768 <+24>:	madd	x2, x2, x1, x0
   0x0000ffffbd89d76c <+28>:	ldr	x0, [x4]
   0x0000ffffbd89d770 <+32>:	add	x2, x2, #0x7
   0x0000ffffbd89d774 <+36>:	and	x2, x2, #0xfffffffffffffff8
   0x0000ffffbd89d778 <+40>:	adrp	x4, 0xffffbd895000
   0x0000ffffbd89d77c <+44>:	sub	x2, x2, x3
   0x0000ffffbd89d780 <+48>:	add	x4, x4, #0x640
   0x0000ffffbd89d784 <+52>:	ldr	x5, [x0]
   0x0000ffffbd89d788 <+56>:	lsr	x2, x2, #3
   0x0000ffffbd89d78c <+60>:	ldr	x7, [x5, #88]
   0x0000ffffbd89d790 <+64>:	cmp	x7, x4
   0x0000ffffbd89d794 <+68>:	b.ne	0xffffbd89d7a8
   0x0000ffffbd89d798 <+72>:	ldr	x4, [x5, #56]
   0x0000ffffbd89d79c <+76>:	mov	x1, x3
   0x0000ffffbd89d7a0 <+80>:	mov	x16, x4
   0x0000ffffbd89d7a4 <+84>:	br	x16

which seems to be a bunch of stuff to discover the addresses to scan,
aligning them properly, followed by a virtual dispatch to
G1BarrierSet::invalidate(), which contains the loop which scans the
card table:

   0x0000ffffbda250a0 <+0>:	cbz	x2, 0xffffbda25170 <G1BarrierSet::invalidate(MemRegion)+208>
   0x0000ffffbda250a4 <+4>:	stp	x29, x30, [sp, #-48]!
   0x0000ffffbda250a8 <+8>:	add	x2, x1, x2, lsl #3
   0x0000ffffbda250ac <+12>:	mov	x29, sp
   0x0000ffffbda250b0 <+16>:	str	x21, [sp, #32]
   0x0000ffffbda250b4 <+20>:	sub	x21, x2, #0x8
   0x0000ffffbda250b8 <+24>:	ldr	x0, [x0, #64]
   0x0000ffffbda250bc <+28>:	ldr	x0, [x0, #72]
   0x0000ffffbda250c0 <+32>:	add	x1, x0, x1, lsr #9
   0x0000ffffbda250c4 <+36>:	add	x21, x0, x21, lsr #9
   0x0000ffffbda250c8 <+40>:	cmp	x21, x1
   0x0000ffffbda250cc <+44>:	b.cc	0xffffbda25164 <G1BarrierSet::invalidate(MemRegion)+196>  // b.lo, b.ul, b.last
   0x0000ffffbda250d0 <+48>:	stp	x19, x20, [sp, #16]
   0x0000ffffbda250d4 <+52>:	b	0xffffbda250e0 <G1BarrierSet::invalidate(MemRegion)+64>

   0x0000ffffbda250d8 <+56>:	cmp	x21, x1
   0x0000ffffbda250dc <+60>:	b.cc	0xffffbda25160 <G1BarrierSet::invalidate(MemRegion)+192>  // b.lo, b.ul, b.last
   0x0000ffffbda250e0 <+64>:	ldrb	w0, [x1]
   0x0000ffffbda250e4 <+68>:	mov	x19, x1
   0x0000ffffbda250e8 <+72>:	add	x1, x1, #0x1
   0x0000ffffbda250ec <+76>:	and	w0, w0, #0xff
   0x0000ffffbda250f0 <+80>:	cmp	w0, #0x8
   0x0000ffffbda250f4 <+84>:	b.eq	0xffffbda250d8 <G1BarrierSet::invalidate(MemRegion)+56>  // b.none

...

   0x0000ffffbda25160 <+192>:	ldp	x19, x20, [sp, #16]
   0x0000ffffbda25164 <+196>:	ldr	x21, [sp, #32]
   0x0000ffffbda25168 <+200>:	ldp	x29, x30, [sp], #48
   0x0000ffffbda2516c <+204>:	ret

This clearly is a fair bit more than what we'd do by hand. The thing
that baffles me, I guess, is why the runtime does all this extra
stuff.
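
For reference, here is a rough C++ reading of the parts of the two
listings that are actually shown above. It is a sketch reconstructed
from the disassembly, not the HotSpot source: the names (WordRange,
aligned_range, leading_skipped_cards) are made up for illustration, the
card table is modelled as a plain vector indexed by address >> 9 rather
than through a biased base pointer, and the constants (8-byte words,
512-byte cards, the card value 8) are simply read off the "and",
"lsr #9" and "cmp w0, #0x8" instructions. The elided part of
invalidate() is not modelled.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed constants, read off the disassembly rather than the HotSpot source.
constexpr std::uintptr_t kWordMask = 7;   // "and ..., #0xfffffffffffffff8": 8-byte words
constexpr unsigned kCardShift = 9;        // "lsr #9": one card byte per 512 bytes
constexpr std::uint8_t kSkipCard = 8;     // "cmp w0, #0x8": cards the loop steps over

// Hypothetical helper type: an 8-byte-aligned start address and a word count.
struct WordRange {
  std::uintptr_t start;
  std::size_t words;
};

// Mirrors the arithmetic in the first listing: align the start down,
// align the end up, and convert the byte span to a word count.
WordRange aligned_range(std::uintptr_t dst, std::size_t length, std::size_t oop_size) {
  std::uintptr_t start = dst & ~kWordMask;
  std::uintptr_t end = (dst + length * oop_size + kWordMask) & ~kWordMask;
  return {start, static_cast<std::size_t>((end - start) >> 3)};
}

// Mirrors the visible head of the second listing: returns how many cards
// at the front of the range hold kSkipCard. The caller must supply a
// table large enough to cover the range.
std::size_t leading_skipped_cards(const std::vector<std::uint8_t>& cards, WordRange r) {
  if (r.words == 0) {
    return 0;                                        // "cbz x2": empty region
  }
  std::uintptr_t last = r.start + r.words * 8 - 8;   // address of the last word
  std::size_t first_card = r.start >> kCardShift;
  std::size_t last_card = last >> kCardShift;
  std::size_t skipped = 0;
  for (std::size_t c = first_card; c <= last_card && cards[c] == kSkipCard; ++c) {
    ++skipped;
  }
  return skipped;
}
```

As a usage example under those assumptions: with 4-byte oops, a
destination of 0x404 and 300 elements, aligned_range() gives a start of
0x400 and 151 words, which spans cards 2 to 4. If cards 2 and 3 hold the
value 8 and card 4 does not, leading_skipped_cards() returns 2.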

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671