How to ensure that a call to a virtual method in a single method is looked up in the virtual function table only once?
Krystal Mok
rednaxelafx at gmail.com
Wed Oct 20 07:09:39 UTC 2021
Hi Glavo,
Happened to bump into this thread. Some thoughts:
You're probably looking for a resulting code shape that looks like this:
private static void fun0(Iterator<?> it) {
  InstanceKlass kls = it._klass;
  if (kls != SomeExpectedIteratorImpl) uncommon_trap();
  // CheckCastPP it to SomeExpectedIteratorImpl in the rest of the method
  // CHA devirtualize it.hasNext() and it.next() to the concrete impl
  // methods on SomeExpectedIteratorImpl
  while (it.hasNext()) {
    it.next();
  }
}
C2's infrastructure is actually well capable of doing this.
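For illustration only, here is roughly what that shape looks like if you write it by hand at the Java source level. This is a sketch under assumptions: CountingIterator and GuardedLoop are made-up names, the else branch merely stands in for uncommon_trap() (which has no source-level equivalent), and the sketch does not show what the JIT will actually emit for it.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical final iterator standing in for "SomeExpectedIteratorImpl".
final class CountingIterator implements Iterator<Integer> {
    private final int limit;
    private int next;

    CountingIterator(int limit) { this.limit = limit; }

    @Override public boolean hasNext() { return next < limit; }

    @Override public Integer next() {
        if (next >= limit) throw new NoSuchElementException();
        return next++;
    }
}

final class GuardedLoop {
    // Returns the number of elements consumed so the effect is observable.
    static int drain(Iterator<?> it) {
        int n = 0;
        if (it.getClass() == CountingIterator.class) {
            // One exact-type check up front. After the cast the receiver's
            // static type is a final class, so each call has a single target.
            CountingIterator exact = (CountingIterator) it;
            while (exact.hasNext()) { exact.next(); n++; }
        } else {
            // Stand-in for uncommon_trap(): fall back to the generic loop
            // with ordinary interface calls.
            while (it.hasNext()) { it.next(); n++; }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(drain(new CountingIterator(5)));                // prints 5
        System.out.println(drain(java.util.List.of("a", "b").iterator())); // prints 2
    }
}
```

The point of the guard is that the type check happens once per method invocation instead of once per call inside the loop.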
Just curious: what types are fed into this example method from the caller?
BTW, those call sites in the C2-compiled code don't seem like vtable
dispatches. Rather, they're compiled inline caches (CompiledIC). The call
site makes a call to the "unverified entry point" (UEP) of the target
method, and the UEP code makes one direct type check. If the check fails,
the call is rejected (it bounces to the slow-path virtual/interface method
lookup); if it passes, you fall through into the "verified entry point"
(VEP), which performs the regular method entry work like stack banging and
stack frame setup.
; inline cache code
0x00007fd4efcfea31: movabs $0x80000b318,%rax
0x00007fd4efcfea3b: call 0x00007fd4e8129ce0
the target (UEP) likely looks something like this:
0x00007fd4e8129ce0: mov 0x8(%rsi),%r10d ; load narrow klass (when UseCompressedClassPointers)
0x00007fd4e8129ce4: shl $0x3,%r10       ; decode narrow klass to regular klass
0x00007fd4e8129ce8: cmp %r10,%rax       ; <- this is the direct type check in UEP
0x00007fd4e8129ceb: jne 0x7fcff1045ca0  ; {runtime_call}
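If it helps, the control flow of such a call site can be modelled in plain Java. This is only a conceptual sketch, not HotSpot code: InlineCacheSketch, CallSite and Target are hypothetical names, the lambda stands in for the resolved target method, and the sketch simply re-caches on every miss, which is not meant to reflect HotSpot's actual inline cache state transitions.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class InlineCacheSketch {
    // Stand-in for the compiled target method reached through the call.
    interface Target { boolean invoke(Iterator<?> receiver); }

    // Models a single hasNext() call site with a one-entry cache.
    static final class CallSite {
        Class<?> cachedKlass;  // plays the role of the constant loaded into rax
        Target cachedTarget;   // target associated with cachedKlass
        int misses;            // how often the slow path was taken

        boolean hasNext(Iterator<?> receiver) {
            // The "UEP" check: one compare of the receiver's class
            // against the class cached at the call site.
            if (receiver.getClass() != cachedKlass) {
                // Miss: model the slow-path lookup, then re-cache.
                misses++;
                cachedKlass = receiver.getClass();
                cachedTarget = r -> r.hasNext();
            }
            // Hit, or freshly resolved: go straight to the cached target.
            return cachedTarget.invoke(receiver);
        }
    }

    public static void main(String[] args) {
        CallSite site = new CallSite();
        Iterator<Integer> a = List.of(1, 2, 3).iterator();
        site.hasNext(a);
        site.hasNext(a);
        site.hasNext(new ArrayList<>(List.of(1)).iterator());
        System.out.println(site.misses); // prints 2
    }
}
```

As long as the receiver class stays the same, each call costs one compare plus a direct call; only a class change goes through the slow path.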
One thing worth mentioning is that there's a tradeoff made in C2 to
encapsulate the virtual dispatch logic such that the GetKlass(o) operation
(what I wrote in pseudocode above as it._klass) is not directly exposed in
the IR, not even at the MachNode level. So C2 wouldn't have the chance to
remove redundant computation of obj->_klass.vtable even without the
compiled inline caches (i.e. directly generating vtable dispatch).
- Kris
On Tue, Oct 19, 2021 at 8:09 PM Glavo <zjx001202 at gmail.com> wrote:
> (I am a novice in HotSpot, JIT compiler and assembly, if there is some
> misunderstanding, please forgive me.)
>
> I designed a very simple micro benchmark to test the performance of virtual
> function calls (Test.java: https://paste.ubuntu.com/p/VKc57mnWgp/).
>
> The method being tested is a very simple method:
>
> private static void fun0(Iterator<?> it) {
> while (it.hasNext()) it.next();
> }
>
> I know C2 will try to specialize the implementation for a few parameter
> types to completely devirtualize, so I call it with many different
> iterator types to avoid being devirtualized.
>
> I installed hsdis for my JDK, ran it with `java
> -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,Test::fun0
> Test.java`, and got this ASM code:
>
> https://paste.ubuntu.com/p/k2mg2svdcx/
>
> To my great surprise, it seems that every time a virtual method is called,
> it needs to be looked up in the virtual function table.
>
> In my understanding, this is unnecessary: The parameter `it` remains
> unchanged, so for calling a virtual method, we should only need to look it
> up in the virtual function table on the first call and cache the function
> pointer in a register or on the stack. Each subsequent call can be made
> through the function pointer without a virtual call.
>
> Unfortunately, it seems that C2 did not make this optimization for me, and
> I think this will cause a noticeable performance degradation to my code.
>
> I tried to use MethodHandle instead of virtual calls in methods, and
> designed microbenchmarks using JMH. The result disappointed me. It is
> always slower than making a virtual call directly.
>
> I have some questions about this:
>
> Will the resulting large number of virtual calls significantly degrade
> function performance?
>
> Can I use existing functions (like MethodHandle) in my code to avoid this
> overhead?
>
> Why doesn't C2 do this optimization? Because it's hard to implement?
> Because there is a lot of analysis overhead? Or something else?
>
More information about the hotspot-dev
mailing list