Redesign interface dispatch: progress report

Fri Nov 15 09:33:38 UTC 2024

Hi Andrew,

When I investigated this, my main driver really was to solve some bad interactions
between concurrent GC, concurrent code sweeping and the ICStubs of inline caches.

My first stab at this was to flatten all the tables and try to make invocations as fast
as possible so we could remove inline caches altogether, which seemed to play a
central role. It worked but was rather complex and sensitive to startup performance.

In the end I solved my problems by deleting the ICStubs and the sweeper instead. 
With them both gone, I did feel mostly happy about the state of things as all the bad
interactions that motivated my work had been dealt with. But I have wondered if there
was a missed opportunity to make megamorphic calls faster.

So a big thank you to you for picking this up and thoroughly investigating it. The answer
seems to be that there is no such missed opportunity, at this time. To me, your negative
result is very important. I think the key takeaway is that unpredictable call patterns are
going to fundamentally cause performance trouble anyway, making even the most carefully
tuned instruction sequences slow, because of the inherent unpredictability that hardware
can’t cope with very well.

Regarding 32 bit code pointers, I sort of tried that actually. If memory serves right,
I had support for a few more bits than 32, which I borrowed from the selector number
space. This way I could pack selector + code pointer in a 64 bit word still. It was fairly
compact really. But in the end, for megamorphic calls, the inherent unpredictability
is likely to still be painful, in terms of potential for accelerating megamorphic calls.

Thanks,
/Erik

> On 13 Nov 2024, at 15:59, Andrew Haley <aph-open at littlepinkcloud.com> wrote:
> 
> On 11/13/24 01:11, dean.long at oracle.com wrote:
> > This work sounds related to JDK-8221828, which says about interface
> > dispatch:
> >
> > "The proposal for invokeinterface calls is to give each interface
> > method a unique number called "selector", and create a cuckoo-style
> > hash table mapping selectors to code pointers for the concrete
> > implementations of the interface method. The table is embedded
> > straight into the native class. This allows dispatching the majority
> > of the time with one indirect branch and one direct branch (compared
> > to two direct branches for dynamically monomorphic calls today)."
> 
> Thank you for that reference. I discussed this at some length with
> Erik, the author of 8221828, at the time. The work I've been doing is
> a follow-on from his.
> 
> I believe this result is more general than just testing one particular
> way of hashed lookup. Exactly what kind of hashing you do, as long as
> it isn't bad, doesn't matter much. Maybe it would be possible to shave
> a nanosecond in the predictable case, but that's it. Misprediction
> would be always slow. The more complex your hash function is, no
> matter how theoretically good, the harder it is for branch prediction.
> The current linear probe works remarkably well, probably because it is
> very simple.
> 
> 8221828's proposal is similar to what I did, and I believe it would
> have similar performance. For this exercise, I used a small (16-way)
> embedded hash table in the Klass itself, with collisions handled by
> compressed hash tables. With a small number of interface methods we
> get few collisions, so its performance is very similar to any good
> hash table.
> 
> I did consider global numbering, but went for a 64-bit hash function
> based on the full name of the method. Simply using the address of the
> Method as a key would be the same as a global number in its runtime
> complexity.
> 
> The problem with embedding the entire hash table in a Klass is that of
> Klass size. I think it would be possible to compress code pointers to
> 32 bits, and also use 32-bit indexes, which would get that space back.
> But there is more pressure today, because of Liliput, on Klass size.
> 
> Cuckoo hashing is a nice trick, but it has its own problems. To begin
> with, there are two possible table entries for each target. Also,
> Cuckoo needs multiple hash functions and fallbacks for when hashing
> fails, which isn't ideal for a generic hash table like we need here.
> 
> What really matters is predictability, and that is more a function of
> user code. If the target of a megamorphic call really is random,
> that's going to hurt.
> 
> -- 
> Andrew Haley  (he/him)
> Java Platform Lead Engineer
> Red Hat UK Ltd. <https://www.redhat.com>
> https://keybase.io/andrewhaley
> EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671