Redesign interface dispatch: Coda

Andrew Haley aph-open at littlepinkcloud.com
Mon Nov 18 15:47:54 UTC 2024


On 11/15/24 09:33, Erik Osterlund wrote:

 > So a big thank you to you for picking this up and thoroughly
 > investigating it. The answer seems to be that there is no such
 > missed opportunity, at this time. To me, your negative result is
 > very important. I think the key takeaway is that unpredictable call
 > patterns are going to fundamentally cause performance trouble
 > anyway, making even the most carefully tuned instruction sequences
 > slow, because of the inherent unpredictability that hardware can’t
 > cope with very well.

Indeed so, and as hardware gets wider the gap between predictable and
unpredictable code performance keeps growing. Given that some modern
processors can execute up to 10 instructions in parallel, and in
practice sustain 6-7 instructions per clock, that gap must continue to
widen.

I was wondering what the fastest possible sequence of instructions to
execute this benchmark might be, and I think I've found it.

This is the standard benchmark, running on an 8-wide Apple M1
processor: [1]

     @Benchmark
     public int test1stInt5Types() {
         FirstInterface ai = as[step(asLength)];
         return ai.getIntFirst();
     }
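
The benchmark scaffolding isn't repeated here; it lives in the earlier
thread. To make the snippet self-contained, here is a minimal sketch of
the shape I'm assuming: one interface, five implementing classes, an
array as[] holding a mix of them, and step() producing either a
predictable or a randomized sequence of indices. This is an
illustration, not the actual harness:

     interface FirstInterface { int getIntFirst(); }

     final class FirstClass implements FirstInterface {
         public int getIntFirst() { return 1; }
     }
     // SecondClass ... FifthClass have the same shape.

     // as[] is populated with a mix of all five classes; step() walks it
     // either in a repeating, predictable order or in a pre-shuffled
     // ("randomized = true") order, so the receiver type at the call
     // site is either predictable or not.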

Benchmark                (randomized)  Mode  Cnt   Score   Error      Units
test1stInt5Types                false  avgt    4   3.755 ± 0.016      ns/op
test1stInt5Types:IPC            false  avgt        6.581          insns/clk
test1stInt5Types:cycles:u       false  avgt       12.013               #/op
test1stInt5Types:instructions:u false  avgt       79.058               #/op

test1stInt5Types                 true  avgt    4  19.408 ± 0.041      ns/op
test1stInt5Types:IPC             true  avgt        1.324          insns/clk
test1stInt5Types:cycles:u        true  avgt       62.222               #/op
test1stInt5Types:instructions:u  true  avgt       82.378               #/op

Performance is very good when well predicted, about 6.5 IPC, and very
bad when mispredicted, at 1.3 IPC. (About 6.5 IPC seems to be the limit
on this CPU; at least I've never seen anything faster.)
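
As a sanity check, the counters above are self-consistent: cycles per
op is just instructions per op divided by IPC.

     predicted:     79.058 / 6.581 ≈ 12.0 cycles/op  (measured 12.013)
     mispredicted:  82.378 / 1.324 ≈ 62.2 cycles/op  (measured 62.222)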

We can transform the benchmark to this, based on the types of the operands:

     @Benchmark
     public int test1stInt5TypesSwitch() {
         FirstInterface ai = as[step(asLength)];
         return switch(ai) {
         case FirstClass o -> o.getIntFirst();
         case SecondClass o -> o.getIntFirst();
         case ThirdClass o -> o.getIntFirst();
         case FourthClass o -> o.getIntFirst();
         case FifthClass o -> o.getIntFirst();
         default -> ai.getIntFirst();
         };
     }

C2 generates excellent code for this version of the benchmark.
Everything is inlined, so there is no call sequence: no spills and
fills, no stubs, no entry or exit barriers, no post-call NOPs, and so
on:

Benchmark               (randomized)  Mode  Cnt   Score   Error   Units
Int5TypesSwitch                false  avgt    4   3.124 ± 0.068   ns/op
Int5TypesSwitch:cycles:u       false  avgt        9.987            #/op
Int5TypesSwitch:IPC            false  avgt        4.850       insns/clk
Int5TypesSwitch:instructions:u false  avgt       48.443            #/op

Int5TypesSwitch                 true  avgt    4  13.128 ± 0.052   ns/op
Int5TypesSwitch:cycles:u        true  avgt       42.037            #/op
Int5TypesSwitch:IPC             true  avgt        1.201       insns/clk
Int5TypesSwitch:instructions:u  true  avgt       50.482            #/op

The really interesting thing for me, though, is that there is almost
no difference in execution time between the two versions, despite the
switch version's greatly reduced instruction count: 79 versus 48.5.
Despite executing far fewer instructions, the switch version only
shaves the time per iteration from 12 cycles to 10.
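
Putting numbers on that, from the two tables above:

     extra instructions: 79.058 - 48.443 ≈ 30.6 per op
     extra cycles:       12.013 -  9.987 ≈  2.0 per op

so the dispatched version's extra ~30 instructions cost only about 2
cycles: the machine finds room for roughly 15 of them per cycle of
added latency.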

This result answers a mystery that has been bothering me: why, despite
the substantial increase in per-call overhead over the last couple of
years (entry and exit barriers plus post-call NOPs), have we not seen a
deterioration in performance? This is why. When we add all that
overhead, processors execute it in issue slots that would otherwise
have nothing to do.

(As Feynman put it in perhaps the greatest mic-drop moment ever, "I
believe that has some significance for our problem.")

It also explains something else I've noticed several times: why are JMH
warmup iterations (mostly C1-compiled) sometimes faster than the
C2-compiled ones? I strongly suspect it's because stupid C1-compiled
sequences may sometimes be more predictable than clever C2-compiled
sequences.
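
One way to test that suspicion - this is a sketch of mine, not
something from the original benchmark - is to pin a JMH fork to C1
with the standard HotSpot flag -XX:TieredStopAtLevel=1 and compare it
against the default tiered run:

     // Hypothetical variant: same body as test1stInt5Types, but this fork
     // stops compilation at C1, so C1-generated code can be compared
     // directly against the default (C2) configuration.
     @Fork(value = 1, jvmArgsAppend = {"-XX:TieredStopAtLevel=1"})
     @Benchmark
     public int test1stInt5TypesC1Only() {
         FirstInterface ai = as[step(asLength)];
         return ai.getIntFirst();
     }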

Finally, a note of caution. We can't keep adding overhead without
seeing real-world performance degrade, because we'll run out of
parallel resources. I don't know when that will be.

[1] See https://x.com/Cardyak/status/1375164173282377730/photo/1 for
more details about the architecture of the CPU.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

