Request for discussion: rewrite invokeinterface dispatch, JMH benchmark

Wed Oct 9 14:34:46 UTC 2024

On 10/9/24 10:18, Dmitry Chuyko wrote:
 > Your observations are quite interesting. If you remember
 > https://github.com/openjdk/jdk/pull/13460, example micro-benchmark
 > improvements for x86 were ~10% and only ~3% in Naive Bayes.

In addition, if we compare and contrast your figures with my (rather
old) desktop machine, we see this:

Your benchmark results from the PR, in ns/op: before, after, and the
relative difference between the two:

CPU: AMD EPYC 7502P (2019)
InterfaceCalls.test1stInt2Types    5.157    5.135    0.43%
InterfaceCalls.test1stInt3Types    9.882    9.807    0.76%
InterfaceCalls.test1stInt5Types    9.864    9.802    0.63%
InterfaceCalls.test2ndInt2Types    6.664    5.432   18.49%
InterfaceCalls.test2ndInt3Types   10.411   10.046    3.51%
InterfaceCalls.test2ndInt5Types   10.49    10.075    3.96%
InterfaceCalls.testIfaceCall      46.789   46.72     0.15%
InterfaceCalls.testIfaceExtCall   50.724   46.55     8.23%
InterfaceCalls.testMonomorphic     4.823    4.826    0.06%
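
For reference, the third column above appears to be the absolute difference
between the two timings as a percentage of the "before" figure. A minimal
sketch of that arithmetic, assuming this reading of the table; the class and
method names here are made up for illustration and are not part of the PR:

```java
public class RelativeChange {
    // Absolute difference between two ns/op figures, as a percentage of "before".
    // Assumption: this is how the third column of the table was derived.
    static double percent(double before, double after) {
        return Math.abs(before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        // Two rows taken from the table above.
        System.out.printf("test2ndInt2Types  %.2f%%%n", percent(6.664, 5.432));
        System.out.printf("testIfaceExtCall  %.2f%%%n", percent(50.724, 46.55));
    }
}
```

Note that with this reading the sign is lost: testMonomorphic is very slightly
slower after the change (4.823 vs 4.826), not faster.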

My results today, on JDK head with an AMD Ryzen Threadripper 2950X (2018),
look much more like the Apple M1 (ns/op):

InterfaceCalls.test1stInt2Types  2.172
InterfaceCalls.test1stInt3Types  5.721
InterfaceCalls.test1stInt5Types  6.468
InterfaceCalls.test2ndInt2Types  2.202
InterfaceCalls.test2ndInt3Types  5.981
InterfaceCalls.test2ndInt5Types  5.992
InterfaceCalls.testIfaceCall     5.722
InterfaceCalls.testIfaceExtCall  5.947
InterfaceCalls.testMonomorphic   0.990

I think the 2950X has a faster clock, but the dramatic thing is that
testIface* run at the same speed as all the other "real" interface calls,
about 6 ns. I guess the 2950X also has better branch prediction.

Let's try to test that guess on 2950X:
                                           ns/op
InterfaceCalls.test2ndInt5Types            5.985   (regular)
InterfaceCalls.test2ndInt5TypesScrambled  20.853   (scrambled)

:-)

Looking at some more detailed stats, the scrambled version does twice as many
memory loads and has a missed branch on each iteration, suggesting that the
CPU always speculates an entire iteration, gets it wrong, then has to do it
all again.
                                                         Regular    Scrambled
InterfaceCalls.test2ndInt5Types:L1-dcache-loads:u          31.975     60.393
InterfaceCalls.test2ndInt5Types:branch-misses:u            ≈ 10⁻⁴      1.263
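
To make the regular-versus-scrambled distinction concrete, here is a rough
standalone sketch of the idea. It is not the code of the InterfaceCalls
benchmark: the interface, the five implementing classes, the seed and the
shuffle are all assumptions for illustration. Both arrays hold exactly the
same receivers; only the order in which the call site sees them differs, so
any timing gap comes from how predictable the sequence of targets is.

```java
import java.util.Random;

public class ScrambledReceivers {
    // Hypothetical interface with five implementations, standing in for "5Types".
    interface Op { int apply(int x); }
    static final class A implements Op { public int apply(int x) { return x + 1; } }
    static final class B implements Op { public int apply(int x) { return x + 2; } }
    static final class C implements Op { public int apply(int x) { return x + 3; } }
    static final class D implements Op { public int apply(int x) { return x + 4; } }
    static final class E implements Op { public int apply(int x) { return x + 5; } }

    // Receivers in a fixed repeating pattern A,B,C,D,E,A,B,... which a
    // branch predictor can learn.
    static Op[] regular(int n) {
        Op[] types = { new A(), new B(), new C(), new D(), new E() };
        Op[] out = new Op[n];
        for (int i = 0; i < n; i++) out[i] = types[i % types.length];
        return out;
    }

    // The same receivers in pseudo-random order (Fisher-Yates shuffle, fixed seed).
    static Op[] scrambled(int n, long seed) {
        Op[] out = regular(n);
        Random r = new Random(seed);
        for (int i = n - 1; i > 0; i--) {
            int j = r.nextInt(i + 1);
            Op t = out[i]; out[i] = out[j]; out[j] = t;
        }
        return out;
    }

    // One interface call per element; the call site sees all five types.
    static long run(Op[] ops) {
        long sum = 0;
        for (Op op : ops) sum += op.apply(0);
        return sum;
    }

    public static void main(String[] args) {
        Op[] reg = regular(100_000);
        Op[] scr = scrambled(100_000, 42L);
        // Crude timing only; a real measurement should use JMH.
        for (int round = 0; round < 20; round++) { run(reg); run(scr); }
        long t0 = System.nanoTime(); long s1 = run(reg);
        long t1 = System.nanoTime(); long s2 = run(scr);
        long t2 = System.nanoTime();
        System.out.printf("regular   %.3f ns/call (sum %d)%n", (t1 - t0) / (double) reg.length, s1);
        System.out.printf("scrambled %.3f ns/call (sum %d)%n", (t2 - t1) / (double) scr.length, s2);
    }
}
```

The hand-rolled timing in main is only there to make the sketch runnable; it
will not reproduce the figures above, which came from JMH with perf counters.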

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671