[aarch64-port-dev ] Caller registers protection inside loop hurts performance

Tue Aug 14 07:38:26 UTC 2018

I ran string-density-bench.LengthBench and dumped the generated assembly. Inside the main loop (the do-while in the Java code snippet below), several registers (x13, x14, x15, x19 and x20 here) are spilled and filled around every call to the static function do_cmp(). As the loop count grows, this extra overhead becomes significant. If C2 could hoist these spills and fills out of the loop, based on local analysis of the caller test_avgt_jmhStub(), the total time could drop by roughly 20-25% according to my tests with count=4096 (the default parameter value). Is there any opportunity to optimize this in aarch64-port?

Regards
Patrick

Test Java code, simplified (benchmark source: http://cr.openjdk.java.net/~shade/density/string-density-bench.zip):
--------------------
public void test_avgt_jmhStub(…){
    long operations = 0;
    long realTime = 0;
    result.startTime = System.nanoTime();
    do {
        for (int c = 0; c < count; c++) {
            do_cmp(datas[c]);
        }
        operations++;
    } while(!control.isDone);
    result.stopTime = System.nanoTime();
    result.realTime = realTime;
    result.measuredOps = operations;
}

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public static int do_cmp(String cmp) {
    return cmp.length();
}
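
To look at the same loop shape without the JMH harness, a minimal standalone sketch may help. It assumes the pattern depends only on a counted loop around a non-inlined static call. The class name SpillLoopSketch, the array contents and the outer iteration count are hypothetical, and there is no guarantee that C2 emits exactly the same spill/fill sequence for it as for the JMH stub.

```java
import java.util.Arrays;

// Hypothetical standalone reproducer; not part of string-density-bench.
final class SpillLoopSketch {

    // Stand-in for do_cmp(). Without JMH's @CompilerControl, keep it out of line with
    //   -XX:CompileCommand=dontinline,SpillLoopSketch::doCmp
    static int doCmp(String cmp) {
        return cmp.length();
    }

    // One pass of the inner for-loop. The sum is returned so the calls stay live.
    static long runOnce(String[] datas) {
        long sum = 0;
        for (int c = 0; c < datas.length; c++) {
            sum += doCmp(datas[c]);
        }
        return sum;
    }

    public static void main(String[] args) {
        int count = 4096;                    // default parameter value quoted above
        String[] datas = new String[count];
        Arrays.fill(datas, "abcdefgh");      // hypothetical test data

        long total = 0;
        long start = System.nanoTime();
        for (int i = 0; i < 20_000; i++) {   // hypothetical repeat count, enough to trigger C2
            total += runOnce(datas);
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("total=" + total + " elapsed_ns=" + elapsed);
    }
}
```

Running it with -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly (which needs the hsdis disassembler library) together with the dontinline command above should show the code C2 generates around the call site.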

Assembly code of the do-while loop:
--------------------
       nop
 e0:┌  ldr    x19, [sp]
    │  ldp    x14, x20, [sp,#16]
    │  ldr    x15, [sp,#32]
    │  ldr    w10, [x13,#24]
    │  lsl    x10, x10, #3
    │  ldr    w12, [x10,#12]
    │  add    x11, x10, w29, sxtw #2
    │  cmp    w29, w12
    │↓ b.cs   1a0
    │  ldr    w11, [x11,#16]
    │  stp    x20, x15, [sp,#24]
    │  stp    x13, x14, [sp,#8]
    │  str    x19, [sp]
    │  lsl    x1, x11, #3
    │  bl     ffffffffffcddc00 <static_call to do_cmp>
    │  ldr    x13, [sp,#8]
    │  ldr    w10, [x13,#12]
    │  add    w29, w29, #0x1
    │  cmp    w29, w10
    └  b.lt   e0
       ldr    x19, [sp]
       ldp    x14, x20, [sp,#16]
       ldr    x15, [sp,#32]