[aarch64-port-dev ] Caller registers protection inside loop hurts performance

Wed Aug 15 04:30:15 UTC 2018

Compared with another port, aarch64 has to protect more registers in this code snippet than x86 does, perhaps because it has more registers to use: (5 loads + 5 stores) vs (4 loads + 2 stores), i.e. 10 memory operations per iteration against 6, about two-thirds more, so the performance is correspondingly worse.
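
For readers without the attachment, the loop shape under discussion can be sketched roughly as below. This is only an illustration, not the benchmark source: the class name SpillSketch, the body of do_cmp(), and the locals base, step and limit are all assumptions. The point is that several values stay live across the call inside the do-while, so any of them held in caller-saved registers has to be stored before the call and reloaded after it on every iteration.

```java
public class SpillSketch {
    // Hypothetical stand-in for the benchmark's static do_cmp();
    // the real body is not shown in this thread.
    static int do_cmp(int a, int b) {
        return a < b ? -1 : (a == b ? 0 : 1);
    }

    // base, step, limit, acc and i are all live across the call to do_cmp().
    // If the call is not inlined, the compiler must keep them safe around it,
    // which is where per-iteration spill/fill code can appear.
    static long run(int count) {
        int base = 7;        // assumed loop-invariant value
        int step = 3;        // assumed loop-invariant value
        int limit = count;   // loop bound, also loop-invariant
        long acc = 0;
        int i = 0;
        do {
            acc += do_cmp(i, base) + step;
            i++;
        } while (i < limit);
        return acc;
    }

    public static void main(String[] args) {
        // 4096 matches the default count mentioned in the thread.
        System.out.println(run(4096));
    }
}
```

Whether C2 really emits a call here depends on its inlining decisions; a method this small may simply be inlined. To look at the spill/fill pattern, one option is to exclude the callee from inlining, for example with -XX:CompileCommand=dontinline,SpillSketch::do_cmp, and then inspect the generated code.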

Regards
Patrick

-----Original Message-----
From: Andrew Haley <aph at redhat.com> 
Sent: Tuesday, August 14, 2018 11:10 PM
To: Patrick Zhang <patrick.zhang at amperecomputing.com>; aarch64-port-dev at openjdk.java.net
Subject: Re: [aarch64-port-dev ] Caller registers protection inside loop hurts performance


On 08/14/2018 08:38 AM, Patrick Zhang wrote:
> I ran string-density-bench.LengthBench and dumped the assembly code. Inside the main loop (the do-while in the attached Java code snippet), a number of registers (x13, x14, x15, x19, x20 here) are spilled and filled around the call to the static function do_cmp(). As the loop count grows, this extra overhead becomes very heavy. If C2 could move these protections out of the loop, based on local analysis inside the caller function test_avgt_jmhStub(), the total time could be reduced by ~20-25% per my tests with count=4096 (the default parameter value). Is there any opportunity to optimize this in aarch64-port?

It's a known problem in the register allocator, and affects all ports.
Probably very hard to fix.

--
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671