[aarch64-port-dev ] Caller registers protection inside loop hurts performance
Patrick Zhang
patrick.zhang at amperecomputing.com
Tue Aug 14 07:38:26 UTC 2018
I ran string-density-bench.LengthBench and dumped the generated assembly code. Inside the main loop (the do-while in the attached Java code snippet), several registers (x13, x14, x15, x19, x20 here) are spilled and filled around the call to the static function do_cmp(). As the loop count grows, this extra overhead becomes very heavy. If C2 could move these saves and restores out of the loop, based on local analysis inside the caller function test_avgt_jmhStub(), the total time could be reduced by ~20-25% per my tests with count=4096 (the default parameter value). Do we have any opportunity to optimize this in aarch64-port?
Regards
Patrick
Test Java code, simplified:
--------------------
http://cr.openjdk.java.net/~shade/density/string-density-bench.zip
public void test_avgt_jmhStub(…) {
    long operations = 0;
    long realTime = 0;
    result.startTime = System.nanoTime();
    do {
        for (int c = 0; c < count; c++) {
            do_cmp(datas[c]);
        }
        operations++;
    } while (!control.isDone);
    result.stopTime = System.nanoTime();
    result.realTime = realTime;
    result.measuredOps = operations;
}

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public static int do_cmp(String cmp) {
    return cmp.length();
}
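
For anyone who wants to look at the same call pattern without the JMH stub, here is a minimal standalone sketch. It is not the benchmark itself: the class name LengthLoop, the helpers makeData and runOnce, and the warm-up and pass counts are all made up for illustration. Only do_cmp and count=4096 come from the snippet above.

```java
// Hypothetical standalone harness; only do_cmp and count=4096 come from the post.
final class LengthLoop {

    // Same body as do_cmp in the post. JMH's @CompilerControl is not available here,
    // so inlining has to be suppressed from the command line instead (see note below).
    public static int do_cmp(String cmp) {
        return cmp.length();
    }

    // Illustrative input: decimal strings "0", "1", ..., one per slot.
    static String[] makeData(int count) {
        String[] datas = new String[count];
        for (int i = 0; i < count; i++) {
            datas[i] = Integer.toString(i);
        }
        return datas;
    }

    // One pass of the inner for-loop; the sum keeps the calls from being discarded.
    static long runOnce(String[] datas, int count) {
        long sum = 0;
        for (int c = 0; c < count; c++) {
            sum += do_cmp(datas[c]);
        }
        return sum;
    }

    public static void main(String[] args) {
        int count = 4096;          // default parameter value quoted in the post
        int passes = 20_000;       // arbitrary, chosen only for this sketch
        String[] datas = makeData(count);
        long sink = 0;

        // Warm-up, so that the loop has most likely been JIT-compiled before timing.
        for (int i = 0; i < passes; i++) {
            sink += runOnce(datas, count);
        }

        long start = System.nanoTime();
        for (int i = 0; i < passes; i++) {
            sink += runOnce(datas, count);
        }
        long elapsed = System.nanoTime() - start;

        System.out.println("ns per pass: " + (elapsed / (double) passes)
                + " (sink=" + sink + ")");
    }
}
```

To keep do_cmp out of line, this can be run with -XX:CompileCommand=dontinline,LengthLoop::do_cmp; dumping the assembly additionally needs -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly and the hsdis disassembler plugin. Whether C2 generates the same spill/fill sequence for this sketch as for the JMH stub is not guaranteed.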
Assembly code of the do-while loop:
--------------------
nop
e0:┌ ldr x19, [sp]
│ ldp x14, x20, [sp,#16]
│ ldr x15, [sp,#32]
│ ldr w10, [x13,#24]
│ lsl x10, x10, #3
│ ldr w12, [x10,#12]
│ add x11, x10, w29, sxtw #2
│ cmp w29, w12
│↓ b.cs 1a0
│ ldr w11, [x11,#16]
│ stp x20, x15, [sp,#24]
│ stp x13, x14, [sp,#8]
│ str x19, [sp]
│ lsl x1, x11, #3
│ bl ffffffffffcddc00 <static_call to do_cmp>
│ ldr x13, [sp,#8]
│ ldr w10, [x13,#12]
│ add w29, w29, #0x1
│ cmp w29, w10
└ b.lt e0
ldr x19, [sp]
ldp x14, x20, [sp,#16]
ldr x15, [sp,#32]
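
A rough tally of the stack traffic in the listing above, written as a small Java sketch. The counts are read directly off the listing and assume it shows the complete loop body; the class and constant names are invented for illustration. In the listing, x14, x15, x19 and x20 only appear in the fill and spill instructions, and x13 is read but never modified, so one reading is that only the reload of x13 after the call has to stay inside the loop. This counts instructions only and says nothing about how many cycles they cost.

```java
// Hypothetical tally of spill/fill instructions per pass, based on the listing above.
final class SpillTally {
    // ldr x19, [sp] / ldp x14, x20, [sp,#16] / ldr x15, [sp,#32]
    static final int FILLS_AT_LOOP_HEAD = 3;
    // stp x20, x15, [sp,#24] / stp x13, x14, [sp,#8] / str x19, [sp]
    static final int SPILLS_BEFORE_CALL = 3;
    // ldr x13, [sp,#8]
    static final int RELOADS_AFTER_CALL = 1;

    // Stack-access instructions executed in one pass over `count` elements.
    static long stackOpsPerPass(int count) {
        return (long) count
                * (FILLS_AT_LOOP_HEAD + SPILLS_BEFORE_CALL + RELOADS_AFTER_CALL);
    }

    // Assumption: everything except the x13 reload is loop-invariant and could move out.
    static long hoistableOpsPerPass(int count) {
        return (long) count * (FILLS_AT_LOOP_HEAD + SPILLS_BEFORE_CALL);
    }

    public static void main(String[] args) {
        int count = 4096; // default parameter value quoted in the post
        System.out.println("stack ops per pass:  " + stackOpsPerPass(count));
        System.out.println("possibly hoistable:  " + hoistableOpsPerPass(count));
    }
}
```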