Redesign interface dispatch: progress report

Tue Nov 12 17:00:11 UTC 2024

I've been trying to improve the way we do invokeinterface dispatch.

TL/DR: given current high-end processors it's unlikely that we'll find
anything significantly better than what we have now. Even megamorphic
interface dispatch is fast, provided the processor accurately predicts
the target.

I'm writing this partly in order to save anyone else tempted to try it
from some effort.

In more detail:

HotSpot's current implementation of invokeinterface uses techniques
such as inlining and specialization in the case where there are only a
couple of frequently-used implementations of an interface at a call
site, in which case there is no need to do a true multi-way (we call
it _megamorphic_) dispatch. Here I'm considering only the megamorphic
case, not the special cases that can be inlined. In practice this
means three or more implementations, invoked with similar frequency.

Megamorphic interface dispatch works by generating itable stubs.
Every interface method has an itable index, and every class that
implements an interface contains an array that maps an itable index to
the appropriate implementation. A class can implement several
interfaces, in which case it has an itable for each one of those
interfaces. [See
https://wiki.openjdk.org/display/HotSpot/InterfaceCalls for more
details.]

In order to find the appropriate itable, the itable stub first
performs a linear search to find an itable, then looks up the
implementation in the appropriate itable.

What I've tried:

I thought that I could improve on the current solution in two
different ways. First, get rid of the linear search by using a hash
table to do the itable lookup, then dispatching as usual. Secondly, if
we could generate a hash map which directly mapped from (interface
Method) -> (implementation) then we would be able to avoid the search
for any itable altogether.

There's a JMH benchmark called InterfaceCalls.java that measures the
time to call an interface method. While I was trying hashed dispatch
algorithms I ran this benchmark, and discovered that it was unfeasibly
fast even with the current implementation, running at about 4-5ns to
invoke an interface method. When you consider that a simple memory
load from L1 cache takes about 1ns, it's unlikely that this is
realistic.

So, I looked at the benchmark and wondered that it was perhaps too
predictable, and maybe the processor was correctly predicting and
prefetching every interface call. Simply randomizing the pattern of
interface calls proved that this was indeed the case, and the less
predictable benchmark took four times as long to run, on both Intel
and Arm processors. This is https://github.com/openjdk/jdk/pull/21581

Using a "fast" (well, the fastest I could think of) hash table instead
of a linear search doesn't help unless there are a lot of interfaces,
more than normal Java code. Otherwise the resulting hash table is
slower or takes up too much memory, even when compressed. I did try
piggybacking on the secondary-supers hash table (used by checkcast et
al) but that didn't help. It was no faster than a linear search.

I also tried a hash structure based on Bagwell's "Ideal Hash Trees"
[https://lampwww.epfl.ch/papers/idealhashtrees.pdf] to map directly
from (interface Method) -> (implementation). Again, it wasn't any
faster than our current solution.

Having said all of that, there was one advantage to my experiments:
there was only a single stub, rather than an array of them. I'm not
sure how important that is, really.

In truth, what matters for megamorphic calls on today's Great Big
Out-of-Order processors is not really the way that interface method
lookup works, but *predictability*. As long as the target of a
megamorphic invokeinterface is well predicted, it'll be fast. And if
it's truly unpredictable it'll be slow. Abstract superclasses versus
interfaces isn't really the issue here, although superclass dispatch,
being simpler, may well be more predictable.

As Aleksey Shipilev pointed out in 2015, mispredicted branches are the
killer.

In conclusion:

I don't intend to produce a PR from this work. It may be possible to
condense the array of itable stubs into a single one, but it's
questionable whether it's worth the effort, and the result may run
more slowly on some machines.

Even though I haven't produced any useful improvement, it was worth a
try, and it's always worth writing up negative results.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671