Request for discussion: rewrite invokeinterface dispatch, JMH benchmark

Eric Caspole eric.caspole at oracle.com
Mon Oct 7 19:47:25 UTC 2024


Hi Andrew,

That is a great discovery. I tend to think we might want both cases, 
the totally predictable and the unpredictable, because sometimes we 
really do want to see differences between hardware. We rewrote this 
one a couple of years ago because the type-swapping previously used 
had much more overhead than whatever improvement we were evaluating 
in the itable stub. A very lightweight way to do this would be welcome.
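
For example, something like this might be a lightweight way to cover 
both cases with one benchmark body (a hypothetical, untested sketch; 
the class and field names just mirror the InterfaceCalls benchmark 
quoted below), using a JMH @Param to toggle the access pattern:

    import org.openjdk.jmh.annotations.*;

    @State(Scope.Thread)
    public class DispatchPredictability {

        interface SecondInterface { int getIntSecond(); }

        static class FirstClass  implements SecondInterface { public int getIntSecond() { return 1; } }
        static class SecondClass implements SecondInterface { public int getIntSecond() { return 2; } }
        static class ThirdClass  implements SecondInterface { public int getIntSecond() { return 3; } }
        static class FourthClass implements SecondInterface { public int getIntSecond() { return 4; } }
        static class FifthClass  implements SecondInterface { public int getIntSecond() { return 5; } }

        // One benchmark body, two access patterns.
        @Param({"false", "true"})
        public boolean scrambled;

        SecondInterface[] as;
        int asLength;
        int l;

        @Setup
        public void setup() {
            as = new SecondInterface[] { new FirstClass(), new SecondClass(),
                    new ThirdClass(), new FourthClass(), new FifthClass() };
            asLength = as.length;
            l = 1;
        }

        // Andrew's xorshift scramble, quoted below.
        static int scramble(int n) {
            int x = n;
            x ^= x << 13;
            x ^= x >>> 17;
            x ^= x << 5;
            return x == 0 ? 1 : x;
        }

        @Benchmark
        public int test2ndInt5TypesParam() {
            int i;
            if (scrambled) {           // unpredictable: jump about
                l = scramble(l);
                i = Math.floorMod(l, asLength);
            } else {                   // predictable: step serially
                i = l = (l + 1) % asLength;
            }
            return as[i].getIntSecond();
        }
    }

The branch on the param is itself perfectly predictable within a run, 
so it should not perturb the dispatch being measured by much.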

It is true that a lot of these JMH benchmarks were written more than 
10 years ago, but there are now so many of them to scan that we will 
just have to keep this kind of thing in mind as we use them.

Regards,

Eric


On 10/7/24 10:18 AM, Andrew Haley wrote:
> I've been looking at rewriting invokeinterface, with a view to making
> it more efficient and predictable on today's hardware, hopefully (near)
> O(1) execution time. Also, we (again, hopefully) wouldn't need to
> dynamically generate and manage itable stubs.
>
> I've been trying a few approaches and don't yet have anything ready to
> present, but I've come across an interesting anomaly in our
> benchmarking. No matter what I did, and however bad my experiment, the
> performance barely changed at all! It was as though my changes were
> doing nothing, but eyeballing the generated code showed it was
> different.
>
> org.openjdk.bench.vm.compiler.InterfaceCalls.test2ndInt5Types looks
> like this:
>
>     as[0] = new FirstClass();
>     as[1] = new SecondClass();
>     as[2] = new ThirdClass();
>     as[3] = new FourthClass();
>     as[4] = new FifthClass();
>
>     // ...
>
>     int l = 0;
>
>     @Benchmark
>     public int test2ndInt5Types() {
>         SecondInterface ai = (SecondInterface) as[l];
>         l = ++l % asLength;
>         return ai.getIntSecond();
>     }
>
> That is to say, we serially step through an array, invoking the same
> interface method on a different concrete class in turn.
>
> The performance (Apple M1) is sparklingly good:
>
> InterfaceCalls.test2ndInt5Types    6.026 ± 0.009      ns/op
>
> But this is so fast as to be incredible: only 19.3 clocks per
> invocation (6.026 ns at the M1's ~3.2 GHz), including the control
> loop and the called method. A load
> from L1 cache takes about 3-4 cycles, and there are several dependent
> loads in the method lookup path. I suspected that because this
> benchmark is unrealistically predictable, it does not fairly represent
> real-world performance.
>
> So, let's try mixing it up a little, and jump about rather than
> cycling through the array:
>
>     static final int scramble(int n) {
>         int x = n;
>         x ^= x << 13;
>         x ^= x >>> 17;
>         x ^= x << 5;
>         return x == 0 ? 1 : x;
>     }
>
>     @Benchmark
>     public int test2ndInt5TypesScrambled() {
>         l = scramble(l);
>         SecondInterface ai = (SecondInterface) as[Math.floorMod(l, asLength)];
>         return ai.getIntSecond();
>     }
>
> This adds only a few instructions, but the measured performance is
> radically different:
>
> InterfaceCalls.test2ndInt5TypesScrambled  19.363 ± 0.084 ns/op
>
> This is 62 clocks per invocation, and I suspect this result is far
> more realistic. But is it really? Maybe invokeinterface calls are
> generally very predictable, so the benchmark we already have is
> representative.
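>
> As a quick sanity check that the scrambled index stream really is
> spread across all five receiver types, one can count how often each
> index comes up (a throwaway harness, not part of the benchmark):
>
>     public class ScrambleCheck {
>         static int scramble(int n) {
>             int x = n;
>             x ^= x << 13;
>             x ^= x >>> 17;
>             x ^= x << 5;
>             return x == 0 ? 1 : x;
>         }
>
>         public static void main(String[] args) {
>             int asLength = 5;
>             int[] counts = new int[asLength];
>             int l = 1;
>             for (int i = 0; i < 1_000_000; i++) {
>                 l = scramble(l);
>                 counts[Math.floorMod(l, asLength)]++;
>             }
>             // Each index should come up roughly 200,000 times.
>             for (int i = 0; i < asLength; i++)
>                 System.out.println(i + ": " + counts[i]);
>         }
>     }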
>
> Questions:
>
> - Which benchmarks should we be optimizing for? I guess it could be
>   the scrambled one, but maybe that would have no benefit if
>   invokeinterface is generally (or overwhelmingly often) predictable.
>
> - How many of the (micro-)benchmarks in HotSpot suffer from this
>   problem? I'm guessing a lot of them, and perhaps it's partly because
>   they were written in the days when speculative execution was less
>   aggressive and branch prediction wasn't so good.
>

