Request for discussion: rewrite invokeinterface dispatch, JMH benchmark
Andrew Haley
aph-open at littlepinkcloud.com
Mon Oct 7 14:18:51 UTC 2024
I've been looking at rewriting invokeinterface, with a view to making
it more efficient and predictable on today's hardware, hopefully with
(near-)O(1) execution time. Also, we (again, hopefully) wouldn't need
to dynamically generate and manage itable stubs.
I've been trying a few approaches and don't yet have anything ready to
present, but I've come across an interesting anomaly in our
benchmarking. No matter what I did, and however bad my experiment, the
performance barely changed at all! It was as though my changes were
doing nothing, but eyeballing the generated code showed it was
different.
org.openjdk.bench.vm.compiler.InterfaceCalls.test2ndInt5Types looks
like this:
    as[0] = new FirstClass();
    as[1] = new SecondClass();
    as[2] = new ThirdClass();
    as[3] = new FourthClass();
    as[4] = new FifthClass();
    // ...
    int l = 0;

    @Benchmark
    public int test2ndInt5Types() {
        SecondInterface ai = (SecondInterface) as[l];
        l = ++l % asLength;
        return ai.getIntSecond();
    }
That is to say, we serially step through an array, invoking the same
interface method on a different concrete class in turn.
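For context, here is a minimal sketch of the type hierarchy the snippet assumes. (The real JMH benchmark defines more interfaces and classes; the names below follow the snippet, and the return values are hypothetical.)

```java
// Hypothetical reconstruction of the types the benchmark snippet uses:
// one shared interface, five distinct concrete receivers. Because the
// call site sees five different receiver classes, the JIT cannot
// devirtualize the call to a single target.
interface SecondInterface {
    int getIntSecond();
}

class FirstClass  implements SecondInterface { public int getIntSecond() { return 1; } }
class SecondClass implements SecondInterface { public int getIntSecond() { return 2; } }
class ThirdClass  implements SecondInterface { public int getIntSecond() { return 3; } }
class FourthClass implements SecondInterface { public int getIntSecond() { return 4; } }
class FifthClass  implements SecondInterface { public int getIntSecond() { return 5; } }
```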
The performance (Apple M1) is sparklingly good:
InterfaceCalls.test2ndInt5Types 6.026 ± 0.009 ns/op
But this is so fast as to be incredible: at the M1's ~3.2 GHz, that is
only 19.3 clocks per invocation, including the control loop and the
called method. A load from L1 cache takes about 3-4 cycles, and there
are several dependent loads in the method lookup path. I suspected
that because this benchmark is unrealistically predictable, it does
not fairly represent real-world performance.
So, let's try mixing it up a little, and jump about rather than
cycling through the array:
    static final int scramble(int n) {
        int x = n;
        x ^= x << 13;
        x ^= x >>> 17;
        x ^= x << 5;
        return x == 0 ? 1 : x;
    }

    @Benchmark
    public int test2ndInt5TypesScrambled() {
        l = scramble(l);
        SecondInterface ai = (SecondInterface) as[Math.floorMod(l, asLength)];
        return ai.getIntSecond();
    }
This adds only a few instructions, but the measured performance is
radically different:
InterfaceCalls.test2ndInt5TypesScrambled 19.363 ± 0.084 ns/op
This is 62 clocks per invocation, and I suspect this result is far
more realistic. But is it really? Maybe invokeinterface calls are
generally very predictable, so the benchmark we already have is
representative.
Questions:
- Which benchmarks should we be optimizing for? I guess it could be
the scrambled one, but maybe that would have no benefit if
invokeinterface is generally (or overwhelmingly often) predictable.
- How many of the (micro-)benchmarks in HotSpot suffer from this
problem? I'm guessing a lot of them, and perhaps it's partly because
they were written in the days when speculative execution was less
aggressive and branch prediction wasn't so good.
--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671