RFR: 8347917: AArch64: Enable upper GPR registers in C1 [v4]

Thu Feb 13 08:52:09 UTC 2025

On Thu, 30 Jan 2025 08:32:25 GMT, Dmitry Chuyko <dchuyko at openjdk.org> wrote:

>> This small change enables the upper GPR registers in C1 so that they are used in a way similar to C2. r19-r26 are declared caller-saved and enabled; r27 (rheapbase) is also declared caller-saved; r27 (rheapbase) and r29 (fp) are enabled conditionally, as in C2. r29 is already handled in MacroAssembler::build_frame()/remove_frame().
>> 
>> r18 is excluded on macOS and Windows as before. r27 is excluded when `UseCompressedOops` is on and `CompressedOops::base() != nullptr`; r29 is excluded when `PreserveFramePointer` is on.
>> 
>> Registers are declared caller-saved in c1_FrameMap_aarch64.cpp. The conditionally enabled ones are at the tail of the enabled range, which is adjusted in c1_FrameMap_aarch64.hpp; the code there was made similar to x86 (JDK-6985015).
>> 
>> Register ranges are also updated in the linear scan itself and in OOP map generation.
>> 
>> Having more allocatable registers helps to avoid spills in register-hungry code, which improves performance and code density and simplifies compilation. In practice, code that operates on so many values is not very frequent, and the upper registers are used less often than the first ones. For testing, it turned out to be useful to run C1 in a special mode in which registers are allocated from upper to lower in LinearScanWalker::find_free_reg():
>> 
>> 
>> -  for (int i = _first_reg; i <= _last_reg; i++) {
>> +  for (int i = _last_reg; i >= _first_reg; i--) {
>> 
>> 
>> It was also useful to run the JVM with C1 compilation only, with different GCs and small heaps, for example `-XX:TieredStopAtLevel=1 -Xmx256m -XX:+UseSerialGC`.
>> 
>> Tier1-3 jtreg tests showed no regressions on linux-aarch64 (release, slowdebug, Xcomp) with either direct or reversed register allocation order. Windows and macOS were also tested to check r18 handling, and the +-CompressedOops and +-PreserveFramePointer combinations were tested as well.
>> 
>> The SHA3 Java implementation is an example of register-hungry code. Throughput results greatly depend on the actual CPU being used. On Graviton 2 the improvement in the dedicated micro-benchmark is ~**19%** for longer arrays (`-XX:TieredStopAtLevel=1 -XX:+UnlockDiagnosticVMOptions -XX:-UseSHA3Intrinsics -jar ../benchmarks.jar -f 1 -wi 2 -i 3 -p digesterName=SHA3-256 -p length=16384 -jvmArgsAppend="-XX:-UseCompressedOops -XX:-PreserveFramePointer -Xmx31g -Xlog:gc+heap+coops=debug" MessageDigests.digest$`).
>
> Dmitry Chuyko has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Accurate caller-saved regs definition
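
For readers following the quoted description, the exclusion rules for r18, r27 and r29 and the reversed scan used for testing can be sketched roughly as below. This is only an illustrative toy, not the HotSpot code: the `Config` struct, `is_allocatable()` and this free-standing `find_free_reg()` are hypothetical names, and registers the description does not mention are simply treated as available.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the platform and VM settings named in the description.
struct Config {
  bool r18_reserved;            // true on macOS and Windows
  bool compressed_oops;         // UseCompressedOops
  bool heap_base_nonnull;       // CompressedOops::base() != nullptr
  bool preserve_frame_pointer;  // PreserveFramePointer
};

// True if GPR number r may be handed out under the quoted rules.
// Only r18, r27 and r29 are modelled; other registers are outside the
// scope of this sketch and are reported as available.
bool is_allocatable(int r, const Config& c) {
  if (r == 18) return !c.r18_reserved;
  if (r == 27) return !(c.compressed_oops && c.heap_base_nonnull);
  if (r == 29) return !c.preserve_frame_pointer;
  return true;
}

// Toy free-register search over [first, last]: scans upward by default,
// downward when reverse is set (the testing mode from the description).
// Returns -1 if no register in the range is free and allocatable.
int find_free_reg(const std::vector<bool>& in_use, int first, int last,
                  bool reverse, const Config& c) {
  assert(first >= 0 && last < static_cast<int>(in_use.size()));
  const int step = reverse ? -1 : 1;
  for (int i = reverse ? last : first; i >= first && i <= last; i += step) {
    if (!in_use[i] && is_allocatable(i, c)) return i;
  }
  return -1;
}
```

With the reversed scan, the conditionally enabled registers at the tail of the range are the first candidates, which is presumably why that mode was useful for exercising them.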

On 2/12/25 22:25, Dmitry Chuyko wrote:
> Just a few things to keep here:
> 
>  1. Even for aarch64 just reversing allocation order is not enough (callee preserved regs are saved in a caller).
>  2. Register saving overhead for runtime calls is there, but making a call without saving is still expensive.

I don't quite understand what you're saying here. In the first sentence
you seem to imply that callee preserved regs are still saved in the caller,
unnecessarily. In the second sentence you say "saving overhead for runtime
calls is there," which seems to imply that there is some advantage to
using a callee-saved register for runtime calls.

Clearly this issue only applies to runtime calls, because Java has no
callee preserved regs.

What conclusion do you make from the benchmark you presented? That the
overhead of making a call from C1-compiled code is great, especially when
there are many spills?

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23152#issuecomment-2655906265