Redesign interface dispatch: Coda
Andrew Haley
aph-open at littlepinkcloud.com
Mon Nov 18 15:47:54 UTC 2024
On 11/15/24 09:33, Erik Osterlund wrote:
> So a big thank you to you for picking this up and thoroughly
> investigating it. The answer seems to be that there is no such
> missed opportunity, at this time. To me, your negative result is
> very important. I think the key takeaway is that unpredictable call
> patterns are going to fundamentally cause performance trouble
> anyway, making even the most carefully tuned instruction sequences
> slow, because of the inherent unpredictability that hardware can’t
> cope with very well.
Indeed so, and as hardware gets wider and wider the gap between
predictable and unpredictable code performance continues to increase.
Given that we can execute up to 10 instructions in parallel on some
modern processors, and in practice sustain 6-7 instructions per clock
on some, the gap must continue to widen.
I was wondering what the fastest possible sequence of instructions to
execute this benchmark might be, and I think I've found it.
This is the standard benchmark, running on an 8-wide Apple M1
processor: [1]
@Benchmark
public int test1stInt5Types() {
    FirstInterface ai = as[step(asLength)];
    return ai.getIntFirst();
}
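(For readers without the harness to hand: the array `as`, its length `asLength`, and `step()` are defined elsewhere in the benchmark. A minimal plain-Java sketch of the same megamorphic call site — with hypothetical implementations, and a simple round-robin `step()` standing in for the real, optionally randomized, one — looks something like this:

```java
// Plain-Java sketch (no JMH) of the megamorphic interface call site.
// The five classes and step() are hypothetical stand-ins for the real
// harness, which is not shown in this message.
interface FirstInterface { int getIntFirst(); }

class FirstClass  implements FirstInterface { public int getIntFirst() { return 1; } }
class SecondClass implements FirstInterface { public int getIntFirst() { return 2; } }
class ThirdClass  implements FirstInterface { public int getIntFirst() { return 3; } }
class FourthClass implements FirstInterface { public int getIntFirst() { return 4; } }
class FifthClass  implements FirstInterface { public int getIntFirst() { return 5; } }

public class DispatchSketch {
    static final FirstInterface[] as = {
        new FirstClass(), new SecondClass(), new ThirdClass(),
        new FourthClass(), new FifthClass()
    };
    static final int asLength = as.length;
    static int cursor;

    // Round-robin step(); the real benchmark can randomize the order
    // (the "randomized" column), which is what defeats the predictor.
    static int step(int n) { return cursor = (cursor + 1) % n; }

    public static void main(String[] args) {
        int sum = 0;
        for (int i = 0; i < 10; i++) {
            FirstInterface ai = as[step(asLength)];
            sum += ai.getIntFirst();   // megamorphic itable dispatch
        }
        System.out.println(sum);
    }
}
```

With five receiver types at one call site, this is past the bimorphic inlining limit, so C2 must emit a real dispatch.)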
Benchmark                        (randomized)  Mode  Cnt   Score   Error  Units
test1stInt5Types                        false  avgt    4   3.755 ± 0.016  ns/op
test1stInt5Types:IPC                    false  avgt        6.581          insns/clk
test1stInt5Types:cycles:u               false  avgt       12.013          #/op
test1stInt5Types:instructions:u         false  avgt       79.058          #/op
test1stInt5Types                         true  avgt    4  19.408 ± 0.041  ns/op
test1stInt5Types:IPC                     true  avgt        1.324          insns/clk
test1stInt5Types:cycles:u                true  avgt       62.222          #/op
test1stInt5Types:instructions:u          true  avgt       82.378          #/op
Performance is very good when well predicted: about 6.5 IPC. When
mispredicted it drops to 1.3 IPC, which is very bad. (About 6.5 IPC
seems to be the limit on this CPU; at least I've never seen anything
faster.)
We can transform the benchmark to this, based on the types of the operands:
@Benchmark
public int test1stInt5TypesSwitch() {
    FirstInterface ai = as[step(asLength)];
    return switch (ai) {
        case FirstClass o  -> o.getIntFirst();
        case SecondClass o -> o.getIntFirst();
        case ThirdClass o  -> o.getIntFirst();
        case FourthClass o -> o.getIntFirst();
        case FifthClass o  -> o.getIntFirst();
        default            -> ai.getIntFirst();
    };
}
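(A self-contained illustration of the same transformation — again with five hypothetical implementations — shows why this helps: each arm of the type switch gives C2 a statically known receiver type, so the interface call in each arm can be devirtualized and inlined. Pattern matching for switch needs Java 21 or later:

```java
// Standalone sketch of the type-switch transformation (Java 21+).
// The interface and five classes are hypothetical stand-ins.
interface FirstInterface { int getIntFirst(); }

class FirstClass  implements FirstInterface { public int getIntFirst() { return 1; } }
class SecondClass implements FirstInterface { public int getIntFirst() { return 2; } }
class ThirdClass  implements FirstInterface { public int getIntFirst() { return 3; } }
class FourthClass implements FirstInterface { public int getIntFirst() { return 4; } }
class FifthClass  implements FirstInterface { public int getIntFirst() { return 5; } }

public class SwitchSketch {
    // Every case binds a concrete class, so getIntFirst() is a
    // statically resolvable call in each arm; only the default falls
    // back to a true interface dispatch.
    static int dispatch(FirstInterface ai) {
        return switch (ai) {
            case FirstClass o  -> o.getIntFirst();
            case SecondClass o -> o.getIntFirst();
            case ThirdClass o  -> o.getIntFirst();
            case FourthClass o -> o.getIntFirst();
            case FifthClass o  -> o.getIntFirst();
            default            -> ai.getIntFirst();
        };
    }

    public static void main(String[] args) {
        FirstInterface[] as = {
            new FirstClass(), new SecondClass(), new ThirdClass(),
            new FourthClass(), new FifthClass()
        };
        int sum = 0;
        for (FirstInterface ai : as) sum += dispatch(ai);
        System.out.println(sum);   // same answer as the virtual calls
    }
}
```

The transformation is behavior-preserving: the switch returns exactly what the virtual call would have returned for every element.)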
C2 generates excellent code for this version of the benchmark.
Everything is inlined, so there is no call sequence: no spills and
fills, no stubs, no entry or exit barriers, no post-call NOPs, and so
on:
Benchmark                        (randomized)  Mode  Cnt   Score   Error  Units
Int5TypesSwitch                         false  avgt    4   3.124 ± 0.068  ns/op
Int5TypesSwitch:cycles:u                false  avgt        9.987          #/op
Int5TypesSwitch:IPC                     false  avgt        4.850          insns/clk
Int5TypesSwitch:instructions:u          false  avgt       48.443          #/op
Int5TypesSwitch                          true  avgt    4  13.128 ± 0.052  ns/op
Int5TypesSwitch:cycles:u                 true  avgt       42.037          #/op
Int5TypesSwitch:IPC                      true  avgt        1.201          insns/clk
Int5TypesSwitch:instructions:u           true  avgt       50.482          #/op
The really interesting thing for me, though, is that there is almost
no difference in execution time between the two versions, despite the
switch version's greatly reduced instruction count - 79 versus 48.5.
Despite executing far fewer instructions, the time per iteration only
drops from 12 cycles to 10 cycles.
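(As a quick sanity check on the figures quoted above — IPC is just instructions divided by cycles:

```java
import java.util.Locale;

// Recomputing the reported IPC numbers from the per-op instruction and
// cycle counts in the two tables above (well-predicted runs).
public class IpcCheck {
    public static void main(String[] args) {
        double virtIpc   = 79.058 / 12.013;  // virtual-call version
        double switchIpc = 48.443 /  9.987;  // type-switch version
        System.out.printf(Locale.ROOT, "%.2f %.2f%n", virtIpc, switchIpc);
    }
}
```

which agrees with the 6.581 and 4.850 insns/clk rows: the switch version executes far fewer instructions, but also at a lower IPC, so the wall-clock difference nearly vanishes.)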
This result resolves a mystery that has been bothering me: why,
despite the substantial increase in per-call overhead over the last
couple of years (entry and exit barriers + post-call NOPs), have we
not seen a deterioration in performance? This is why: when we add all
that overhead, processors execute it in slots that would otherwise
have nothing to do.
(As Feynman put it in perhaps the greatest mic-drop moment ever, "I
believe that has some significance for our problem.")
It also answers another thing I've noticed several times. Why,
sometimes, are JMH warmup times (mostly C1-compiled) faster than
C2-compiled? I strongly suspect it's because stupid C1-compiled
sequences may sometimes be more predictable than clever C2-compiled
sequences.
Finally, a note of caution. We can't keep adding overhead without
seeing real-world performance degrade, because we'll run out of
parallel resources. I don't know when that will be.
[1] See https://x.com/Cardyak/status/1375164173282377730/photo/1 for
more details about the architecture of the CPU.
--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671