[aarch64-port-dev ] RFR: 8242029: AArch64: skip G1 array copy pre-barrier if marking not active

Nick Gasson nick.gasson at arm.com
Thu Apr 9 08:59:44 UTC 2020


On 04/08/20 20:38 pm, Andrew Haley wrote:
> On 4/8/20 7:22 AM, Nick Gasson wrote:
>> Do you think this is safe and worth doing?
>
> Please forgive me for turning this into a rather extreme thought
> experiment: if we hand-translate all GC runtime methods into all
> targets, we have an NxM problem, #collectors * #targets. So it's hard
> to justify without some heavy usage. And also, it means that if any of
> these runtime methods change, we'd risk falling behind on AArch64.
>
> Can you show us the assembly instructions that we'd save?

So I'm suggesting doing the following:

--- a/src/hotspot/cpu/aarch64/gc/g1/g1BarrierSetAssembler_aarch64.cpp
+++ b/src/hotspot/cpu/aarch64/gc/g1/g1BarrierSetAssembler_aarch64.cpp
@@ -87,13 +87,43 @@ void G1BarrierSetAssembler::gen_write_ref_array_pre_barrier(MacroAssembler* masm

 void G1BarrierSetAssembler::gen_write_ref_array_post_barrier(MacroAssembler* masm, DecoratorSet decorators,
                                                              Register start, Register count, Register scratch, RegSet saved_regs) {
-  __ push(saved_regs, sp);
-  assert_different_registers(start, count, scratch);
+
+  assert_different_registers(start, count, scratch, rscratch1, rscratch2);
   assert_different_registers(c_rarg0, count);
+
+  const Register card_addr = scratch;
+  const Register end_card_addr = rscratch1;
+
+  Label skip, slowpath, next;
+
+  __ cbz(count, skip);
+
+  __ lsr(card_addr, start, CardTable::card_shift);
+
+  __ lea(end_card_addr, Address(start, count, Address::lsl(LogBytesPerHeapOop)));
+  __ lsr(end_card_addr, end_card_addr, CardTable::card_shift);
+
+  __ load_byte_map_base(rscratch2);
+  __ add(card_addr, card_addr, rscratch2);
+  __ add(end_card_addr, end_card_addr, rscratch2);
+
+  __ bind(next);
+  __ ldrb(rscratch2, Address(card_addr));
+  __ cmpw(rscratch2, (int)G1CardTable::g1_young_card_val());
+  __ br(Assembler::NE, slowpath);
+  __ cmp(card_addr, end_card_addr);
+  __ br(Assembler::EQ, skip);
+  __ add(card_addr, card_addr, 1);
+  __ b(next);
+
+  __ bind(slowpath);
+  __ push(saved_regs, sp);
   __ mov(c_rarg0, start);
   __ mov(c_rarg1, count);
   __ call_VM_leaf(CAST_FROM_FN_PTR(address, G1BarrierSetRuntime::write_ref_array_post_entry), 2);
   __ pop(saved_regs, sp);
+
+  __ bind(skip);
 }


(And change the call sites so they no longer pass rscratch1 as scratch,
since the fast path now uses rscratch1 and rscratch2 itself.)
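
For anyone who would rather read the fast path as plain C++ than as
MacroAssembler calls, here is a rough model of the card scan in the diff
above. It is only a sketch: the constants, the flat table indexed from
address zero, and the name all_cards_young are assumptions made for
illustration, not HotSpot code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative constants only; the real values come from CardTable and
// G1CardTable in HotSpot and are not taken from this patch.
constexpr unsigned     kCardShift = 9;  // assumed: 512-byte cards
constexpr std::size_t  kOopSize   = 8;  // assumed: uncompressed oops
constexpr std::uint8_t kYoungCard = 2;  // placeholder for g1_young_card_val()

// Returns true when every card from the one covering `start` through the one
// covering `start + count * kOopSize` is young, i.e. when the generated code
// would branch to `skip` without calling write_ref_array_post_entry.
// Simplification: the table is indexed directly by (address >> kCardShift),
// whereas the real code adds the byte map base loaded into rscratch2.
bool all_cards_young(const std::vector<std::uint8_t>& cards,
                     std::uintptr_t start, std::size_t count) {
  if (count == 0) return true;                      // cbz(count, skip)
  std::size_t card = start >> kCardShift;
  const std::size_t end_card = (start + count * kOopSize) >> kCardShift;
  assert(end_card < cards.size());                  // keep the sketch in bounds
  for (;; ++card) {
    if (cards[card] != kYoungCard) return false;    // br(NE, slowpath)
    if (card == end_card) return true;              // br(EQ, skip)
  }
}
```

If this returns false the stub falls through to the existing runtime call,
so the slow path behaves as before; only the all-young case gets cheaper.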

It gives a nice speedup on the ArrayCopy microbenchmarks (roughly 26-44%
lower ns/op in the results below), but I agree this sort of thing is a
maintenance burden if it doesn't affect real workloads.

With JDK-8242029:

Benchmark                                    Mode  Cnt   Score   Error  Units
ArrayCopy.arrayCopyObject                    avgt   15  82.314 ± 0.641  ns/op
ArrayCopy.arrayCopyObjectNonConst            avgt   15  87.351 ± 6.820  ns/op
ArrayCopy.arrayCopyObjectSameArraysBackward  avgt   15  54.272 ± 1.445  ns/op
ArrayCopy.arrayCopyObjectSameArraysForward   avgt   15  54.596 ± 1.329  ns/op

With the above modification:

Benchmark                                    Mode  Cnt   Score   Error  Units
ArrayCopy.arrayCopyObject                    avgt   15  58.913 ± 1.265  ns/op
ArrayCopy.arrayCopyObjectNonConst            avgt   15  64.682 ± 8.147  ns/op
ArrayCopy.arrayCopyObjectSameArraysBackward  avgt   15  36.866 ± 1.319  ns/op
ArrayCopy.arrayCopyObjectSameArraysForward   avgt   15  30.445 ± 3.719  ns/op


Thanks,
Nick


