Bimorphic inlining not applied at a call site that was initially monomorphic

Sun Feb 18 16:31:24 UTC 2024

Hi all! Recently, I was looking into a severe performance regression in a
library.
The regression was introduced when some methods were hoisted from final
classes into their common abstract base class;
the hoisted methods still call abstract methods that are implemented only
in the subclasses.
I was able to reproduce the issue with JDK 21, as well as with a JDK
recently built from the main branch.

It turns out that if a virtual call site was monomorphic at the time its
method was compiled by C1 at tier 3,
and C1 was able to resolve the virtual call and inline it, then once the
call site becomes bimorphic,
C2 won't perform bimorphic inlining at that call site.
If C1 can't inline the virtual call at the initially monomorphic call
site, or if the call site is bimorphic right from the beginning,
C2 successfully performs bimorphic inlining.

I added a small benchmark to reproduce the issue [1]. The benchmark starts
with a monomorphic call site;
after a while, it makes the call site bimorphic by loading a second class
and calling the target method on instances
of two sibling classes.
There are two benchmark methods: staticallyResolvableTarget and
staticallyUnresolvableTarget.
The first one uses a class hierarchy in which C1 is able to inline the
virtual call as long as
the call site is monomorphic; in the second one, C1 can't (the
target method is package-private).
The first case is slower because C2 ends up not applying bimorphic
inlining:

> Benchmark                                                (alwaysBimorphic)  Mode  Cnt  Score   Error  Units
> BimorphicInliningBenchmark.staticallyResolvableTarget                false  avgt   25  3.447 ± 0.009  ns/op
> BimorphicInliningBenchmark.staticallyUnresolvableTarget              false  avgt   25  3.152 ± 0.005  ns/op
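
For readers who don't want to open the repository, here is a rough sketch
of the call-site shape described above. It is not the benchmark code from
[1]: the class and method names other than callSiteHolder and inlinee are
made up for illustration, the protected modifier on inlinee is an
assumption, and a plain loop like this is no substitute for a JMH run.

```java
public class BimorphicCallSiteSketch {
    // Abstract base class holding the hoisted method (hypothetical layout).
    abstract static class Base {
        // The virtual call site under discussion lives here.
        int callSiteHolder() {
            return inlinee();
        }

        // Assumed to be non-package-private in the "resolvable" hierarchy.
        protected abstract int inlinee();
    }

    static final class SubclassA extends Base {
        @Override
        protected int inlinee() {
            return 1;
        }
    }

    static final class SubclassB extends Base {
        @Override
        protected int inlinee() {
            return 2;
        }
    }

    // Phase 1: only SubclassA receivers reach the call site.
    static long monomorphicPhase(int iterations) {
        Base receiver = new SubclassA();
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += receiver.callSiteHolder();
        }
        return sum;
    }

    // Phase 2: SubclassB is first instantiated here, and the same call
    // site now sees two receiver types in alternation.
    static long bimorphicPhase(int iterations) {
        Base[] receivers = { new SubclassA(), new SubclassB() };
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += receivers[i & 1].callSiteHolder();
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(monomorphicPhase(1_000_000));
        System.out.println(bimorphicPhase(1_000_000));
    }
}
```

Running the monomorphic phase long enough before the bimorphic one is what
lets the earlier compilations happen while only one subclass has been seen.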

For the slow case (staticallyResolvableTarget), the compilation and
inlining sequence looks as follows:

> 568 3 org.example.BimorphicInliningBenchmark::staticallyResolvableTarget (19 bytes)(code size: 688)
>     @ 15 org.example.ClassHierarchyA::callSiteHolder succeed: inline (end time: 0.2520)
>       @ 1 org.example.ClassHierarchyA$SubclassA::inlinee succeed: inline (end time: 0.2520)
>
> 572 4 org.example.BimorphicInliningBenchmark::staticallyResolvableTarget (19 bytes)(code size: 488)
>     @ 15 org.example.ClassHierarchyA::callSiteHolder succeed: inline (hot) (end time: 0.2520)
>       @ 1 org.example.ClassHierarchyA$SubclassA::inlinee succeed: inline (hot) (end time: 0.2520)
>
> 572 make_not_entrant // the second class was loaded
>
> 616 4 org.example.BimorphicInliningBenchmark::staticallyResolvableTarget (19 bytes)(code size: 488)
>     @ 15 org.example.ClassHierarchyA::callSiteHolder succeed: inline (hot) (end time: 4.2740)
>       @ 1 org.example.ClassHierarchyA::inlinee fail: virtual call (end time: 0.0000)
>         type profile org.example.ClassHierarchyA -> org.example.ClassHierarchyA$SubclassA (19%)

The call site's profiling data looks as follows:

>   1 invokevirtual 3 <org/example/ClassHierarchyA.inlinee()I>
> 0 bci: 1 VirtualCallData count(25760) nonprofiled_count(0) entries(2)
>     'org/example/ClassHierarchyA$SubclassA'(9033 0.21)
>     'org/example/ClassHierarchyA$SubclassB'(8613 0.20)

Per-type counters were collected by the interpreter, and the regular
counter ("count(25760)")
was incremented by the C1-compiled code. It seems that this combination
of counter values stops C2
from treating the call site as bimorphic [2][3], so the inlining doesn't
happen (which is fair,
as the counters look the same as they would for a megamorphic call site).
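
To make the counter arithmetic concrete, here is a small sketch that
reproduces the per-receiver shares printed in the profile dumps in this
message and applies a deliberately simplified version of the rule described
above: two recorded receivers count as bimorphic only when the regular
counter is zero. This is a model of my reading of [2][3], not HotSpot's
code; MORPHISM_LIMIT, share, treatedAsBimorphic and describe are
hypothetical names, and the limit of 2 is assumed from the two-entry
profiles shown here.

```java
import java.util.Locale;

public class CallProfileSketch {
    // Assumed number of receiver rows a call site profile can hold.
    static final int MORPHISM_LIMIT = 2;

    // Share of one receiver type among all calls recorded at the site:
    // the regular counter plus the sum of the per-receiver counters.
    static double share(long receiverCount, long count, long receiversTotal) {
        return (double) receiverCount / (count + receiversTotal);
    }

    // Simplified rule (an assumption, not the real implementation): a full
    // receiver table is trusted only if no call was counted outside of it.
    static boolean treatedAsBimorphic(long count, long... receiverCounts) {
        return receiverCounts.length == MORPHISM_LIMIT && count == 0;
    }

    static String describe(long count, long first, long second) {
        long total = first + second;
        return String.format(Locale.ROOT, "%.2f %.2f bimorphic=%b",
                share(first, count, total),
                share(second, count, total),
                treatedAsBimorphic(count, first, second));
    }

    public static void main(String[] args) {
        // staticallyResolvableTarget: count(25760), receivers 9033 and 8613
        System.out.println(describe(25760, 9033, 8613));
        // staticallyUnresolvableTarget: count(0), receivers 34797 and 9843
        System.out.println(describe(0, 34797, 9843));
    }
}
```

The first line prints shares of 0.21 and 0.20 with bimorphic=false, the
second 0.78 and 0.22 with bimorphic=true, matching the fractions in the two
profile dumps.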

However, if C1 can't inline the virtual call at the initially monomorphic
call site,
the counters look slightly different, and C2 ends up doing the
inlining:

> 0 bci: 1 VirtualCallData trap/org.example.BimorphicInliningBenchmark::staticallyUnresolvableTarget(class_check recompiled) count(0) nonprofiled_count(0) entries(2)
>     'org/example/ClassHierarchyB$SubclassA'(34797 0.78)
>     'org/example/ClassHierarchyB$SubclassB'(9843 0.22)

> 619 4 org.example.BimorphicInliningBenchmark::staticallyUnresolvableTarget (19 bytes)(code size: 512)
>     @ 15 org.example.ClassHierarchyB::callSiteHolder succeed: inline (hot) (end time: 4.1750)
>       @ 1 org.example.ClassHierarchyB::inlinee (0 bytes) (end time: 0.0000)
>         type profile org.example.ClassHierarchyB -> org.example.ClassHierarchyB$SubclassA (95%)
>       @ 1 org.example.ClassHierarchyB$SubclassA::inlinee succeed: inline (hot) (end time: 4.1750)
>       @ 1 org.example.ClassHierarchyB$SubclassB::inlinee succeed: inline (hot) (end time: 4.1750)

You can find both benchmarking results and compilation logs in the repo
along with the benchmark [4].

The question is whether this behavior is intentional (I didn't find an
issue suggesting otherwise)
and, if it is, how it differs from scenarios where a call site is always
bimorphic and C2
successfully performs inlining (or where C1 simply can't inline the
monomorphic virtual call initially).

Thanks in advance,
Filipp.

[1] https://github.com/fzhinkin/bimorphic-inlining-issue/tree/main/
[2] https://github.com/openjdk/jdk/blob/7004c2724d9b150112c66febb7f24b781ff379dd/src/hotspot/share/ci/ciMethod.cpp#L489
[3] https://github.com/openjdk/jdk/blob/7004c2724d9b150112c66febb7f24b781ff379dd/src/hotspot/share/ci/ciMethod.cpp#L510
[4] https://github.com/fzhinkin/bimorphic-inlining-issue/tree/main/results