[aarch64-port-dev ] Caller registers protection inside loop hurts performance

Thu Aug 16 02:25:11 UTC 2018

Regarding x86 protecting fewer registers, here is the assembly for comparison. Around the call, aarch64 has 5 loads + 5 stores to the stack in each loop iteration, while x86 (same code base) has about half as many. The loop count is 4096 by default, set by the string density benchmark.

aarch64
--------------------
e0:ldr    x19, [sp]
   ldp    x14, x20, [sp,#16]
   ldr    x15, [sp,#32]
ec:ldr    w10, [x13,#24]
   lsl    x10, x10, #3
   ldr    w12, [x10,#12]
   add    x11, x10, w29, sxtw #2
   cmp    w29, w12
   b.cs   1a0
   ldr    w11, [x11,#16]
   stp    x20, x15, [sp,#24]
   stp    x13, x14, [sp,#8]
   str    x19, [sp]
   lsl    x1, x11, #3
   bl     ffffffffffcddc00 <static_call to do_cmp>
   ldr    x13, [sp,#8]
   ldr    w10, [x13,#12]
   add    w29, w29, #0x1
   cmp    w29, w10
   b.lt   e0

x86-64
--------------------
e0:mov    (%rsp),%r8
   mov    0x18(%rsp),%rbx
e9:mov    0x10(%rsp),%r10
   mov    0x18(%r10),%r10d
   mov    0xc(%r12,%r10,8),%r11d
   cmp    %r11d,%ebp
   jae    19d
   mov    %rbx,0x18(%rsp)
   mov    %r8,(%rsp)
   shl    $0x3,%r10
   mov    0x10(%r10,%rbp,4),%r11d
   mov    %r11,%rsi
   shl    $0x3,%rsi
   xchg   %ax,%ax
   callq  ... <static_call to do_cmp>
   mov    0x10(%rsp),%r10
   mov    0xc(%r10),%r11d
   inc    %ebp
   cmp    %r11d,%ebp
   jl     e0
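
The load/store counts can be checked mechanically. The sketch below is not part of the original message: the class name SpillCount and its helpers are hypothetical, and it only understands the mnemonics that appear in the two listings above. It counts stack-relative accesses in registers, so an ldp or stp counts as two.

```java
import java.util.Arrays;
import java.util.List;

public class SpillCount {
    // Stack-relative instructions copied from the aarch64 listing (labels dropped),
    // plus one non-stack load to show that it is ignored.
    static final List<String> AARCH64 = Arrays.asList(
        "ldr x19, [sp]",
        "ldp x14, x20, [sp,#16]",
        "ldr x15, [sp,#32]",
        "ldr w10, [x13,#24]",
        "stp x20, x15, [sp,#24]",
        "stp x13, x14, [sp,#8]",
        "str x19, [sp]",
        "ldr x13, [sp,#8]");

    // Stack-relative instructions copied from the x86-64 listing (labels dropped),
    // plus one non-stack load to show that it is ignored.
    static final List<String> X86 = Arrays.asList(
        "mov (%rsp),%r8",
        "mov 0x18(%rsp),%rbx",
        "mov 0x10(%rsp),%r10",
        "mov 0x18(%r10),%r10d",
        "mov %rbx,0x18(%rsp)",
        "mov %r8,(%rsp)",
        "mov 0x10(%rsp),%r10");

    /** Returns {loads, stores}, counted in registers, for [sp...] accesses. */
    static int[] countAarch64(List<String> insns) {
        int loads = 0, stores = 0;
        for (String insn : insns) {
            String s = insn.trim();
            if (!s.contains("[sp")) continue;          // not a stack access
            String op = s.split("\\s+")[0];
            switch (op) {
                case "ldr": loads += 1; break;
                case "ldp": loads += 2; break;         // pair load: two registers
                case "str": stores += 1; break;
                case "stp": stores += 2; break;        // pair store: two registers
                default: break;
            }
        }
        return new int[] {loads, stores};
    }

    /** Returns {loads, stores} for mov instructions that touch (%rsp) (AT&T syntax). */
    static int[] countX86(List<String> insns) {
        int loads = 0, stores = 0;
        for (String insn : insns) {
            String s = insn.trim();
            if (!s.startsWith("mov") || !s.contains("(%rsp)")) continue;
            // In AT&T syntax the destination is last, so a trailing (%rsp) is a store.
            if (s.endsWith("(%rsp)")) stores++; else loads++;
        }
        return new int[] {loads, stores};
    }

    public static void main(String[] args) {
        int[] a = countAarch64(AARCH64);
        int[] x = countX86(X86);
        System.out.println("aarch64: " + a[0] + " loads, " + a[1] + " stores");
        System.out.println("x86-64:  " + x[0] + " loads, " + x[1] + " stores");
    }
}
```

On the listings above this gives 5 loads + 5 stores for aarch64 and 4 loads + 2 stores for x86-64.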

Attaching the Java code again for your convenience:
--------------------
http://cr.openjdk.java.net/~shade/density/string-density-bench.zip
public void test_avgt_jmhStub(?){
    long operations = 0;
    long realTime = 0;
    result.startTime = System.nanoTime();
    do {
        for (int c = 0; c < count; c++) {
            do_cmp(datas[c]);
        }
        operations++;
    } while(!control.isDone);
    result.stopTime = System.nanoTime();
    result.realTime = realTime;
    result.measuredOps = operations;
}

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public static int do_cmp(String cmp) {
    return cmp.length();
}
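
For readers without the benchmark zip, the loop shape can be approximated without JMH. This is only a sketch under assumptions: the class LoopSketch, its constructor, the generated strings, and the one-second run time are all hypothetical stand-ins, and the register allocation it produces is not guaranteed to match the listings. Without JMH's @CompilerControl, the callee would have to be kept out of line some other way, for example with a HotSpot option along the lines of -XX:CompileCommand=dontinline,LoopSketch::doCmp.

```java
public class LoopSketch {
    // Fields mirror the names used in the stub; their contents are made up here.
    final String[] datas;
    final int count;

    LoopSketch(int count) {
        this.count = count;
        this.datas = new String[count];
        for (int i = 0; i < count; i++) {
            datas[i] = "s" + i;                 // placeholder strings, not the benchmark data
        }
    }

    // Stand-in for do_cmp; the real benchmark marks it DONT_INLINE via JMH.
    static int doCmp(String cmp) {
        return cmp.length();
    }

    // One pass over the array, i.e. the inner for loop of the stub.
    // The sum is returned so the calls cannot be treated as dead code.
    long runOnce() {
        long sum = 0;
        for (int c = 0; c < count; c++) {
            sum += doCmp(datas[c]);
        }
        return sum;
    }

    public static void main(String[] args) {
        LoopSketch bench = new LoopSketch(4096); // default loop count from the message
        long operations = 0;
        long sink = 0;
        long start = System.nanoTime();
        do {
            sink += bench.runOnce();
            operations++;
        } while (System.nanoTime() - start < 1_000_000_000L); // run for about one second
        long elapsed = System.nanoTime() - start;
        System.out.println("ns per outer iteration: " + (double) elapsed / operations);
        System.out.println("sink: " + sink);     // keep the result observable
    }
}
```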

Regards
Patrick

-----Original Message-----
From: Andrew Haley <aph at redhat.com> 
Sent: Wednesday, August 15, 2018 3:56 PM
To: Patrick Zhang <patrick.zhang at amperecomputing.com>; aarch64-port-dev at openjdk.java.net
Subject: Re: [aarch64-port-dev ] Caller registers protection inside loop hurts performance


On 08/15/2018 05:30 AM, Patrick Zhang wrote:
> Compared with another port, and perhaps because it has more registers available, aarch64 has to protect more registers in this code snippet than x86 does: (5 loads + 5 stores) vs. (4 loads + 2 stores), almost double, so performance is correspondingly worse.

I would not expect that to be true.  The saved registers are those in use.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671