Redesign interface dispatch: progress report

Wed Nov 13 01:11:10 UTC 2024

This work sounds related to JDK-8221828, which says about interface 
dispatch:

"The proposal for invokeinterface calls is to give each interface method 
a unique number called "selector", and create a cuckoo-style hash table 
mapping selectors to code pointers for the concrete implementations of 
the interface method. The table is embedded straight into the native 
class. This allows dispatching the majority of the time with one 
indirect branch and one direct branch (compared to two direct branches 
for dynamically monomorphic calls today)."

dl

On 11/12/24 9:00 AM, Andrew Haley wrote:
> I've been trying to improve the way we do invokeinterface dispatch.
>
> TL/DR: given current high-end processors it's unlikely that we'll find
> anything significantly better than what we have now. Even megamorphic
> interface dispatch is fast, provided the processor accurately predicts
> the target.
>
> I'm writing this partly in order to save anyone else tempted to try it
> from some effort.
>
> In more detail:
>
> HotSpot's current implementation of invokeinterface uses techniques
> such as inlining and specialization in the case where there are only a
> couple of frequently-used implementations of an interface at a call
> site, in which case there is no need to do a true multi-way (we call
> it _megamorphic_) dispatch. Here I'm considering only the megamorphic
> case, not the special cases that can be inlined. In practice this
> means three or more implementations, invoked with similar frequency.
>
> Megamorphic interface dispatch works by generating itable stubs.
> Every interface method has an itable index, and every class that
> implements an interface contains an array that maps an itable index to
> the appropriate implementation. A class can implement several
> interfaces, in which case it has an itable for each one of those
> interfaces. [See
> https://wiki.openjdk.org/display/HotSpot/InterfaceCalls for more
> details.]
>
> In order to find the appropriate itable, the itable stub first
> performs a linear search to find an itable, then looks up the
> implementation in the appropriate itable.
>
> What I've tried:
>
> I thought that I could improve on the current solution in two
> different ways. First, get rid of the linear search by using a hash
> table to do the itable lookup, then dispatching as usual. Secondly, if
> we could generate a hash map which directly mapped from (interface
> Method) -> (implementation) then we would be able to avoid the search
> for any itable altogether.
>
> There's a JMH benchmark called InterfaceCalls.java that measures the
> time to call an interface method. While I was trying hashed dispatch
> algorithms I ran this benchmark, and discovered that it was unfeasibly
> fast even with the current implementation, running at about 4-5ns to
> invoke an interface method. When you consider that a simple memory
> load from L1 cache takes about 1ns, it's unlikely that this is
> realistic.
>
> So, I looked at the benchmark and wondered that it was perhaps too
> predictable, and maybe the processor was correctly predicting and
> prefetching every interface call. Simply randomizing the pattern of
> interface calls proved that this was indeed the case, and the less
> predictable benchmark took four times as long to run, on both Intel
> and Arm processors. This is https://github.com/openjdk/jdk/pull/21581
>
> Using a "fast" (well, the fastest I could think of) hash table instead
> of a linear search doesn't help unless there are a lot of interfaces,
> more than normal Java code. Otherwise the resulting hash table is
> slower or takes up too much memory, even when compressed. I did try
> piggybacking on the secondary-supers hash table (used by checkcast et
> al) but that didn't help. It was no faster than a linear search.
>
> I also tried a hash structure based on Bagwell's "Ideal Hash Trees"
> [https://lampwww.epfl.ch/papers/idealhashtrees.pdf] to map directly
> from (interface Method) -> (implementation). Again, it wasn't any
> faster than our current solution.
>
> Having said all of that, there was one advantage to my experiments:
> there was only a single stub, rather than an array of them. I'm not
> sure how important that is, really.
>
> In truth, what matters for megamorphic calls on today's Great Big
> Out-of-Order processors is not really the way that interface method
> lookup works, but *predictability*. As long as the target of a
> megamorphic invokeinterface is well predicted, it'll be fast. And if
> it's truly unpredictable it'll be slow. Abstract superclasses versus
> interfaces isn't really the issue here, although superclass dispatch,
> being simpler, may well be more predictable.
>
> As Aleksey Shipilev pointed out in 2015, mispredicted branches are the
> killer.
>
> In conclusion:
>
> I don't intend to produce a PR from this work. It may be possible to
> condense the array of itable stubs into a single one, but it's
> questionable whether it's worth the effort, and the result may run
> more slowly on some machines.
>
> Even though I haven't produced any useful improvement, it was worth a
> try, and it's always worth writing up negative results.
>