A question about bytecodes + unsigned load performance ./. add performance
Christian Thalinger
Christian.Thalinger at Sun.COM
Fri Feb 6 04:08:15 PST 2009
On Wed, 2009-01-21 at 19:11 +0100, Christian Thalinger wrote:
> You were absolutely right. Simply summing up the bytes as unsigned
> values is 22% faster.
>
> I hope I can try it on a SPARC box soon.
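For context: "summing up the bytes as unsigned values" presumably means
masking each byte before accumulating. A minimal sketch of the assumed
shape (illustrative names, not the actual benchmark code):

    static int sumSigned(byte[] a) {
        int sum = 0;
        for (int i = 0; i < a.length; i++)
            sum += a[i];          // byte is sign-extended to int: -128..127
        return sum;
    }

    static int sumUnsigned(byte[] a) {
        int sum = 0;
        for (int i = 0; i < a.length; i++)
            sum += a[i] & 0xFF;   // mask to the unsigned value: 0..255
        return sum;
    }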
I got a bit side-tracked but I'm still on it. Unfortunately I exceeded
my quota in sfbay and can't provide more test results, but at least I've
implemented unsigned load instructions on SPARC and ran the
DecoderBenchmark (btw. the builds are debug builds, but that shouldn't
make any difference):
$ gamma DecoderBenchmark
time for warm up 1: 30589 ms
time for warm up 2: 28152 ms
time for warm up 3: 28127 ms
time for warm up 4: 28127 ms
time for map[a & 0xFF]: 28083 ms
time for map[a + 0x80]: 28084 ms
time for inlined map[a & 0xFF]: 28082 ms
time for inlined map[a + 0x80]: 28082 ms
test loops ./. last warm up: 0.9984195
So, yes, the +0x80 trick is now superfluous on SPARC. Let's see what
the numbers are on x86 (solaris-amd64):
$ gamma DecoderBenchmark
time for warm up 1: 1790 ms
time for warm up 2: 1509 ms
time for warm up 3: 1431 ms
time for warm up 4: 1441 ms
time for map[a & 0xFF]: 1732 ms
time for map[a + 0x80]: 1439 ms
time for inlined map[a & 0xFF]: 1733 ms
time for inlined map[a + 0x80]: 1431 ms
test loops ./. last warm up: 1.0987756
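For reference, the two lookup variants being timed presumably look like
this (a hedged sketch with illustrative names, not the actual
DecoderBenchmark source):

    // Both tables hold 256 entries; the second is indexed by b + 0x80.
    static final char[] map  = new char[256];
    static final char[] map2 = new char[256];

    static char decodeMask(byte b)  { return map[b & 0xFF]; }   // mask signed byte to index 0..255
    static char decodeShift(byte b) { return map2[b + 0x80]; }  // shift -128..127 up to 0..255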
The generated code is almost the same for both methods:
& 0xFF:
0a0 B5: # B5 B6 <- B4 B5 Loop: B5-B5 inner stride: not constant main of N16 Freq: 4.27176e+09
0a0 movslq RAX, R9 # i2l
0a3 movq R10, R11 # spill
0a6 addq R10, RAX # ptr
0a9 movzbl R14, [R10 + #31 (8-bit)] # ubyte
0ae movzbl R13, [R10 + #30 (8-bit)] # ubyte
0b3 movzbl RBX, [R10 + #29 (8-bit)] # ubyte
0b8 movzbl RDI, [R10 + #28 (8-bit)] # ubyte
0bd movzbl RSI, [R10 + #27 (8-bit)] # ubyte
0c2 movzbl RDX, [R10 + #26 (8-bit)] # ubyte
0c7 movzbl RBP, [R10 + #25 (8-bit)] # ubyte
0cc movzbl R10, [R10 + #24 (8-bit)] # ubyte
0d1 movzwl R10, [RCX + #24 + R10 << #1] # ushort/char
0d7 movw [R8 + #24 + RAX << #1], R10 # char/short
0dd movzwl R10, [RCX + #24 + RBP << #1] # ushort/char
0e3 movw [R8 + #26 + RAX << #1], R10 # char/short
0e9 movzwl RDX, [RCX + #24 + RDX << #1] # ushort/char
0ee movw [R8 + #28 + RAX << #1], RDX # char/short
0f4 movzwl RSI, [RCX + #24 + RSI << #1] # ushort/char
0f9 movw [R8 + #30 + RAX << #1], RSI # char/short
0ff movzwl RDI, [RCX + #24 + RDI << #1] # ushort/char
104 movw [R8 + #32 + RAX << #1], RDI # char/short
10a movzwl RBX, [RCX + #24 + RBX << #1] # ushort/char
10f movw [R8 + #34 + RAX << #1], RBX # char/short
115 movzwl RBX, [RCX + #24 + R13 << #1] # ushort/char
11b movw [R8 + #36 + RAX << #1], RBX # char/short
121 movzwl RBX, [RCX + #24 + R14 << #1] # ushort/char
127 movw [R8 + #38 + RAX << #1], RBX # char/short
12d addl R9, #8 # int
131 cmpl R9, #16377
138 jl B5 # loop end P=0.999943 C=1740800.000000
So I'd say this is more or less optimal code.
+ 0x80:
1d0 B12: # B12 B13 <- B11 B12 Loop: B12-B12 inner stride: not constant main of N173 Freq: 4.10656e+09
1d0 movslq RBX, R10 # i2l
1d3 movsbq RDI, [R11 + #24 + RBX] # byte -> long
1d9 movsbq RDX, [R11 + #31 + RBX] # byte -> long
1df movzwl RDI, [RCX + #280 + RDI << #1] # ushort/char
1e7 movw [R8 + #24 + RBX << #1], RDI # char/short
1ed movsbq RAX, [R11 + #30 + RBX] # byte -> long
1f3 movsbq RBP, [R11 + #29 + RBX] # byte -> long
1f9 movsbq R13, [R11 + #28 + RBX] # byte -> long
1ff movsbq R14, [R11 + #27 + RBX] # byte -> long
205 movsbq RSI, [R11 + #26 + RBX] # byte -> long
20b movsbq RDI, [R11 + #25 + RBX] # byte -> long
211 movzwl RDI, [RCX + #280 + RDI << #1] # ushort/char
219 movw [R8 + #26 + RBX << #1], RDI # char/short
21f movzwl RSI, [RCX + #280 + RSI << #1] # ushort/char
227 movw [R8 + #28 + RBX << #1], RSI # char/short
22d movzwl RDI, [RCX + #280 + R14 << #1] # ushort/char
236 movw [R8 + #30 + RBX << #1], RDI # char/short
23c movzwl RSI, [RCX + #280 + R13 << #1] # ushort/char
245 movw [R8 + #32 + RBX << #1], RSI # char/short
24b movzwl RDI, [RCX + #280 + RBP << #1] # ushort/char
253 movw [R8 + #34 + RBX << #1], RDI # char/short
259 movzwl RSI, [RCX + #280 + RAX << #1] # ushort/char
261 movw [R8 + #36 + RBX << #1], RSI # char/short
267 movzwl RDX, [RCX + #280 + RDX << #1] # ushort/char
26f movw [R8 + #38 + RBX << #1], RDX # char/short
275 addl R10, #8 # int
279 cmpl R10, #16377
280 jl B12 # loop end P=0.999940 C=1445888.000000
It's slightly better code, and that's probably why it's faster. But
instruction-selection-wise the code is optimal in both cases.
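A hedged cross-check of the displacements above: the byte loads use a
base offset of 24 (the array header on this 64-bit build), chars are 2
bytes wide, and folding the constant +0x80 into the lookup address gives

    24 + 0x80 * 2 = 24 + 256 = 280

which matches the #280 displacement on the map in the second listing,
while the masked variant keeps the plain #24 offset.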
I just wonder why there is an I2L conversion in the first listing, even
though the loop variable is an integer register in both listings.
Also interesting is the fact that in the first listing the unsigned byte
values are loaded with plain base + displacement loads:
0xfffffd7ffa306d89: movzbl 0x1f(%r10),%r14d
0xfffffd7ffa306d8e: movzbl 0x1e(%r10),%r13d
0xfffffd7ffa306d93: movzbl 0x1d(%r10),%ebx
0xfffffd7ffa306d98: movzbl 0x1c(%r10),%edi
0xfffffd7ffa306d9d: movzbl 0x1b(%r10),%esi
0xfffffd7ffa306da2: movzbl 0x1a(%r10),%edx
0xfffffd7ffa306da7: movzbl 0x19(%r10),%ebp
0xfffffd7ffa306dac: movzbl 0x18(%r10),%r10d
but in the second (faster) listing the loads are indexed loads:
0xfffffd7ffa306ecd: movsbq 0x1e(%r11,%rbx,1),%rax
0xfffffd7ffa306ed3: movsbq 0x1d(%r11,%rbx,1),%rbp
0xfffffd7ffa306ed9: movsbq 0x1c(%r11,%rbx,1),%r13
0xfffffd7ffa306edf: movsbq 0x1b(%r11,%rbx,1),%r14
0xfffffd7ffa306ee5: movsbq 0x1a(%r11,%rbx,1),%rsi
0xfffffd7ffa306eeb: movsbq 0x19(%r11,%rbx,1),%rdi
Register allocation?
-- Christian