RFR: 8347917: AArch64: Enable upper GPR registers in C1

Dmitry Chuyko dchuyko at openjdk.org
Wed Feb 12 22:28:13 UTC 2025


On Sun, 26 Jan 2025 16:16:59 GMT, Andrew Haley <aph at openjdk.org> wrote:

> > > As for the different allocation order (to prefer platform callee-saved registers), do you think something simple like last->first order will work for all platforms?
> > 
> > 
> > It might. It's certainly an interesting thing to try. I'm particularly interested because it potentially reduces the overhead for type checks.
> 
> Let's do this in a separate patch.

Just a few things to keep in mind here:

1. Even for aarch64, just reversing the allocation order is not enough (callee-saved regs are saved in the caller).
2. The register-saving overhead for runtime calls is there, but making a call even without any saving is still expensive.

Consider a benchmark that keeps a few values alive and performs a runtime call:


    package org.openjdk.bench.vm.compiler;

    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.*;
    import org.openjdk.jmh.infra.Blackhole;

    @State(Scope.Thread) // per-thread state assumed
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public class VMCall {

        long[] arr;

        @Setup
        public void setup() {
            arr = new long[8];
        }

        @Benchmark
        public void test(Blackhole bh) {
            long v0 = arr[0]; long v1 = arr[1]; long v2 = arr[2]; long v3 = arr[3];
            long v4 = arr[4]; long v5 = arr[5]; long v6 = arr[6]; long v7 = arr[7];

            v1 += v0; v2 += v1; v3 += v2; v4 += v3; v5 += v4; v6 += v5; v7 += v6; v0 += v7;
            v1 *= v0; v2 *= v1; v3 *= v2; v4 *= v3; v5 *= v4; v6 *= v5; v7 *= v6; v0 *= v7;

            double d0 = Double.longBitsToDouble(v0);
            d0 = Math.sin(d0); // dsin is a C1 runtime call
            v0 = Double.doubleToRawLongBits(d0);

            v1 += v0; v2 += v1; v3 += v2; v4 += v3; v5 += v4; v6 += v5; v7 += v6; v0 += v7;
            v1 *= v0; v2 *= v1; v3 *= v2; v4 *= v3; v5 *= v4; v6 *= v5; v7 *= v6; v0 *= v7;

            bh.consume(v0); bh.consume(v1); bh.consume(v2); bh.consume(v3);
            bh.consume(v4); bh.consume(v5); bh.consume(v6); bh.consume(v7);
        }
    }



In '-XX:TieredStopAtLevel=1' mode I observe results around 28.337 ± 0.803 ns/op. If dsin is calculated and consumed at the end of the method, it is around 27.039 ± 0.182 ns/op. Without the call it is 22.595 ± 0.853 ns/op.
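
For reference, the "dsin at the end" variant looks roughly like the sketch below, a method in the same class as above (the method name and exact body are illustrative, not the actual benchmark source); all the long values are consumed before the runtime call, so nothing has to stay live across it:


    @Benchmark
    public void testDsinAtEnd(Blackhole bh) {
        long v0 = arr[0]; long v1 = arr[1]; long v2 = arr[2]; long v3 = arr[3];
        long v4 = arr[4]; long v5 = arr[5]; long v6 = arr[6]; long v7 = arr[7];

        v1 += v0; v2 += v1; v3 += v2; v4 += v3; v5 += v4; v6 += v5; v7 += v6; v0 += v7;
        v1 *= v0; v2 *= v1; v3 *= v2; v4 *= v3; v5 *= v4; v6 *= v5; v7 *= v6; v0 *= v7;

        v1 += v0; v2 += v1; v3 += v2; v4 += v3; v5 += v4; v6 += v5; v7 += v6; v0 += v7;
        v1 *= v0; v2 *= v1; v3 *= v2; v4 *= v3; v5 *= v4; v6 *= v5; v7 *= v6; v0 *= v7;

        bh.consume(v0); bh.consume(v1); bh.consume(v2); bh.consume(v3);
        bh.consume(v4); bh.consume(v5); bh.consume(v6); bh.consume(v7);

        // dsin is calculated and consumed only here, at the end of the method
        double d0 = Math.sin(Double.longBitsToDouble(v0));
        bh.consume(d0);
    }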

With the call, the hottest methods are distributed like this:


  89.12%         c1, level 1  org.openjdk.bench.vm.compiler.jmh_generated.VMCall_baseline_jmhTest::baseline_avgt_jmhStub, version 2, compile id 798 
  10.69%        runtime stub  StubRoutines::libmDsin

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23152#issuecomment-2654975579

