[aarch64-port-dev ] Caller registers protection inside loop hurts performance
Patrick Zhang
patrick.zhang at amperecomputing.com
Thu Aug 16 02:25:11 UTC 2018
On whether fewer registers are protected: here is the assembly for the same loop on both ports. The aarch64 loop body has 5 loads + 5 stores to the stack, while x86 (same code base) has 4 loads + 2 stores, roughly half. The loop count is 4096 by default, set by the string density benchmark.
aarch64
--------------------
e0: ldr x19, [sp]
    ldp x14, x20, [sp,#16]
    ldr x15, [sp,#32]
ec: ldr w10, [x13,#24]
    lsl x10, x10, #3
    ldr w12, [x10,#12]
    add x11, x10, w29, sxtw #2
    cmp w29, w12
    b.cs 1a0
    ldr w11, [x11,#16]
    stp x20, x15, [sp,#24]
    stp x13, x14, [sp,#8]
    str x19, [sp]
    lsl x1, x11, #3
    bl ffffffffffcddc00 <static_call to do_cmp>
    ldr x13, [sp,#8]
    ldr w10, [x13,#12]
    add w29, w29, #0x1
    cmp w29, w10
    b.lt e0
x86-64
--------------------
e0: mov (%rsp),%r8
    mov 0x18(%rsp),%rbx
e9: mov 0x10(%rsp),%r10
    mov 0x18(%r10),%r10d
    mov 0xc(%r12,%r10,8),%r11d
    cmp %r11d,%ebp
    jae 19d
    mov %rbx,0x18(%rsp)
    mov %r8,(%rsp)
    shl $0x3,%r10
    mov 0x10(%r10,%rbp,4),%r11d
    mov %r11,%rsi
    shl $0x3,%rsi
    xchg %ax,%ax
    callq ... <static_call to do_cmp>
    mov 0x10(%rsp),%r10
    mov 0xc(%r10),%r11d
    inc %ebp
    cmp %r11d,%ebp
    jl e0
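The load/store counts quoted for the two listings can be checked mechanically. The following is only a rough sketch for that purpose: the class name SpillTally and its helper methods are made up for illustration, and it assumes that every sp-relative ldr/ldp/str/stp (aarch64) or rsp-relative mov (x86-64) in the loop body is a register save or restore, with ldp/stp counted as two registers.

```java
import java.util.List;

public class SpillTally {
    // Loop bodies copied from the two listings above.
    static final List<String> AARCH64 = List.of(
        "e0:ldr x19, [sp]",
        "ldp x14, x20, [sp,#16]",
        "ldr x15, [sp,#32]",
        "ec:ldr w10, [x13,#24]",
        "lsl x10, x10, #3",
        "ldr w12, [x10,#12]",
        "add x11, x10, w29, sxtw #2",
        "cmp w29, w12",
        "b.cs 1a0",
        "ldr w11, [x11,#16]",
        "stp x20, x15, [sp,#24]",
        "stp x13, x14, [sp,#8]",
        "str x19, [sp]",
        "lsl x1, x11, #3",
        "bl ffffffffffcddc00 <static_call to do_cmp>",
        "ldr x13, [sp,#8]",
        "ldr w10, [x13,#12]",
        "add w29, w29, #0x1",
        "cmp w29, w10",
        "b.lt e0");

    static final List<String> X86_64 = List.of(
        "e0:mov (%rsp),%r8",
        "mov 0x18(%rsp),%rbx",
        "e9:mov 0x10(%rsp),%r10",
        "mov 0x18(%r10),%r10d",
        "mov 0xc(%r12,%r10,8),%r11d",
        "cmp %r11d,%ebp",
        "jae 19d",
        "mov %rbx,0x18(%rsp)",
        "mov %r8,(%rsp)",
        "shl $0x3,%r10",
        "mov 0x10(%r10,%rbp,4),%r11d",
        "mov %r11,%rsi",
        "shl $0x3,%rsi",
        "xchg %ax,%ax",
        "callq ... <static_call to do_cmp>",
        "mov 0x10(%rsp),%r10",
        "mov 0xc(%r10),%r11d",
        "inc %ebp",
        "cmp %r11d,%ebp",
        "jl e0");

    // Drops a leading "e0:"-style label and surrounding whitespace.
    static String stripLabel(String line) {
        int colon = line.indexOf(':');
        return (colon >= 0 ? line.substring(colon + 1) : line).trim();
    }

    // Returns {registers loaded from the stack, registers stored to the stack}.
    static int[] tallyAarch64(List<String> lines) {
        int loads = 0, stores = 0;
        for (String raw : lines) {
            String insn = stripLabel(raw);
            if (!insn.contains("[sp")) continue;        // only sp-relative accesses
            if (insn.startsWith("ldp ")) loads += 2;    // pair load = two registers
            else if (insn.startsWith("ldr ")) loads += 1;
            else if (insn.startsWith("stp ")) stores += 2;
            else if (insn.startsWith("str ")) stores += 1;
        }
        return new int[] {loads, stores};
    }

    // AT&T syntax is "mov src,dst": a trailing (%rsp) operand means a store.
    static int[] tallyX86(List<String> lines) {
        int loads = 0, stores = 0;
        for (String raw : lines) {
            String insn = stripLabel(raw);
            if (!insn.startsWith("mov ") || !insn.contains("(%rsp)")) continue;
            if (insn.endsWith("(%rsp)")) stores++; else loads++;
        }
        return new int[] {loads, stores};
    }

    public static void main(String[] args) {
        int[] a = tallyAarch64(AARCH64);
        int[] x = tallyX86(X86_64);
        System.out.printf("aarch64: %d loads, %d stores%n", a[0], a[1]);
        System.out.printf("x86-64:  %d loads, %d stores%n", x[0], x[1]);
    }
}
```

Under those assumptions the tally comes out at 5 loads + 5 stores for aarch64 and 4 loads + 2 stores for x86-64.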
Attaching the Java code again for your convenience:
--------------------
http://cr.openjdk.java.net/~shade/density/string-density-bench.zip
public void test_avgt_jmhStub(?){
    long operations = 0;
    long realTime = 0;
    result.startTime = System.nanoTime();
    do {
        for (int c = 0; c < count; c++) {
            do_cmp(datas[c]);
        }
        operations++;
    } while(!control.isDone);
    result.stopTime = System.nanoTime();
    result.realTime = realTime;
    result.measuredOps = operations;
}

@CompilerControl(CompilerControl.Mode.DONT_INLINE)
public static int do_cmp(String cmp) {
    return cmp.length();
}
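
To look at the same loop shape without the JMH harness, a minimal standalone sketch could look like the following. This is not the benchmark's code: the class name CallInLoop, the methods doCmp and runLoop, the generated test strings and the outer iteration count are all hypothetical, and the result is summed only so the call has a visible effect. The code the JIT generates for it may well differ from the listings above.

```java
public class CallInLoop {
    // Hypothetical stand-in for the benchmark's do_cmp; keep it out of line
    // with -XX:CompileCommand=dontinline,CallInLoop::doCmp when running.
    static int doCmp(String s) {
        return s.length();
    }

    // Inner loop with the same shape as the benchmark: one static call per element.
    static long runLoop(String[] datas, int count) {
        long total = 0;
        for (int c = 0; c < count; c++) {
            total += doCmp(datas[c]);
        }
        return total;
    }

    public static void main(String[] args) {
        int count = 4096; // default loop count mentioned above
        String[] datas = new String[count];
        for (int i = 0; i < count; i++) {
            datas[i] = "str" + i;
        }
        long total = 0;
        // Repeat enough times for the JIT to compile runLoop (assumed count).
        for (int iter = 0; iter < 20_000; iter++) {
            total += runLoop(datas, count);
        }
        System.out.println(total);
    }
}
```

Running it with -XX:CompileCommand=dontinline,CallInLoop::doCmp plays the role of the @CompilerControl annotation, and -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly (with the hsdis disassembler installed) prints the generated code for comparison.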
Regards
Patrick
-----Original Message-----
From: Andrew Haley <aph at redhat.com>
Sent: Wednesday, August 15, 2018 3:56 PM
To: Patrick Zhang <patrick.zhang at amperecomputing.com>; aarch64-port-dev at openjdk.java.net
Subject: Re: [aarch64-port-dev ] Caller registers protection inside loop hurts performance
On 08/15/2018 05:30 AM, Patrick Zhang wrote:
> To compare with another port, perhaps because of having more registers to use, aarch64 has to protect more in this code snippet than that of x86, say (5 load + 5 store) vs (4 load + 2 store), almost doubled, so the performance is worse accordingly.
I would not expect that to be true. The saved registers are those in use.
--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671