A question about bytecodes + unsigned load performance ./. add performance
Christian Thalinger
Christian.Thalinger at Sun.COM
Fri Feb 6 04:08:15 PST 2009
On Wed, 2009-01-21 at 19:11 +0100, Christian Thalinger wrote:
> You were absolutely right. Simply summing up the bytes as unsigned
> values is 22% faster.
>
> I hope I can try it on a SPARC box soon.
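For context: "summing up the bytes as unsigned values" presumably means
masking each byte before accumulating. A minimal sketch of the assumed
shape (illustrative names, not the actual benchmark code):

    static int sumSigned(byte[] a) {
        int sum = 0;
        for (int i = 0; i < a.length; i++)
            sum += a[i];          // byte is sign-extended to int: -128..127
        return sum;
    }

    static int sumUnsigned(byte[] a) {
        int sum = 0;
        for (int i = 0; i < a.length; i++)
            sum += a[i] & 0xFF;   // mask to the unsigned value: 0..255
        return sum;
    }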
I got a bit side-tracked but I'm still on it. Unfortunately I exceeded
my quota in sfbay and can't provide more test results, but at least I've
implemented unsigned load instructions on SPARC and ran the
DecoderBenchmark (btw. the builds are debug builds, but that shouldn't
make any difference):
$ gamma DecoderBenchmark
time for warm up 1: 30589 ms
time for warm up 2: 28152 ms
time for warm up 3: 28127 ms
time for warm up 4: 28127 ms
time for map[a & 0xFF]: 28083 ms
time for map[a + 0x80]: 28084 ms
time for inlined map[a & 0xFF]: 28082 ms
time for inlined map[a + 0x80]: 28082 ms
test loops ./. last warm up: 0.9984195
So, yes, the +0x80 trick is now superfluous on SPARC. Let's see what
the numbers are on x86 (solaris-amd64):
$ gamma DecoderBenchmark
time for warm up 1: 1790 ms
time for warm up 2: 1509 ms
time for warm up 3: 1431 ms
time for warm up 4: 1441 ms
time for map[a & 0xFF]: 1732 ms
time for map[a + 0x80]: 1439 ms
time for inlined map[a & 0xFF]: 1733 ms
time for inlined map[a + 0x80]: 1431 ms
test loops ./. last warm up: 1.0987756
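For reference, the two lookup variants being timed presumably look like
this (a hedged sketch with illustrative names, not the actual
DecoderBenchmark source):

    // Both tables hold 256 entries; the second is indexed by b + 0x80.
    static final char[] map  = new char[256];
    static final char[] map2 = new char[256];

    static char decodeMask(byte b)  { return map[b & 0xFF]; }   // mask signed byte to index 0..255
    static char decodeShift(byte b) { return map2[b + 0x80]; }  // shift -128..127 up to 0..255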
The generated code is almost the same for both methods:
& 0xFF:
0a0 B5: # B5 B6 <- B4 B5 Loop: B5-B5 inner stride: not constant main of N16 Freq: 4.27176e+09
0a0 movslq RAX, R9 # i2l
0a3 movq R10, R11 # spill
0a6 addq R10, RAX # ptr
0a9 movzbl R14, [R10 + #31 (8-bit)] # ubyte
0ae movzbl R13, [R10 + #30 (8-bit)] # ubyte
0b3 movzbl RBX, [R10 + #29 (8-bit)] # ubyte
0b8 movzbl RDI, [R10 + #28 (8-bit)] # ubyte
0bd movzbl RSI, [R10 + #27 (8-bit)] # ubyte
0c2 movzbl RDX, [R10 + #26 (8-bit)] # ubyte
0c7 movzbl RBP, [R10 + #25 (8-bit)] # ubyte
0cc movzbl R10, [R10 + #24 (8-bit)] # ubyte
0d1 movzwl R10, [RCX + #24 + R10 << #1] # ushort/char
0d7 movw [R8 + #24 + RAX << #1], R10 # char/short
0dd movzwl R10, [RCX + #24 + RBP << #1] # ushort/char
0e3 movw [R8 + #26 + RAX << #1], R10 # char/short
0e9 movzwl RDX, [RCX + #24 + RDX << #1] # ushort/char
0ee movw [R8 + #28 + RAX << #1], RDX # char/short
0f4 movzwl RSI, [RCX + #24 + RSI << #1] # ushort/char
0f9 movw [R8 + #30 + RAX << #1], RSI # char/short
0ff movzwl RDI, [RCX + #24 + RDI << #1] # ushort/char
104 movw [R8 + #32 + RAX << #1], RDI # char/short
10a movzwl RBX, [RCX + #24 + RBX << #1] # ushort/char
10f movw [R8 + #34 + RAX << #1], RBX # char/short
115 movzwl RBX, [RCX + #24 + R13 << #1] # ushort/char
11b movw [R8 + #36 + RAX << #1], RBX # char/short
121 movzwl RBX, [RCX + #24 + R14 << #1] # ushort/char
127 movw [R8 + #38 + RAX << #1], RBX # char/short
12d addl R9, #8 # int
131 cmpl R9, #16377
138 jl B5 # loop end P=0.999943 C=1740800.000000
So I'd say this is more or less optimal code.
+ 0x80:
1d0 B12: # B12 B13 <- B11 B12 Loop: B12-B12 inner stride: not constant main of N173 Freq: 4.10656e+09
1d0 movslq RBX, R10 # i2l
1d3 movsbq RDI, [R11 + #24 + RBX] # byte -> long
1d9 movsbq RDX, [R11 + #31 + RBX] # byte -> long
1df movzwl RDI, [RCX + #280 + RDI << #1] # ushort/char
1e7 movw [R8 + #24 + RBX << #1], RDI # char/short
1ed movsbq RAX, [R11 + #30 + RBX] # byte -> long
1f3 movsbq RBP, [R11 + #29 + RBX] # byte -> long
1f9 movsbq R13, [R11 + #28 + RBX] # byte -> long
1ff movsbq R14, [R11 + #27 + RBX] # byte -> long
205 movsbq RSI, [R11 + #26 + RBX] # byte -> long
20b movsbq RDI, [R11 + #25 + RBX] # byte -> long
211 movzwl RDI, [RCX + #280 + RDI << #1] # ushort/char
219 movw [R8 + #26 + RBX << #1], RDI # char/short
21f movzwl RSI, [RCX + #280 + RSI << #1] # ushort/char
227 movw [R8 + #28 + RBX << #1], RSI # char/short
22d movzwl RDI, [RCX + #280 + R14 << #1] # ushort/char
236 movw [R8 + #30 + RBX << #1], RDI # char/short
23c movzwl RSI, [RCX + #280 + R13 << #1] # ushort/char
245 movw [R8 + #32 + RBX << #1], RSI # char/short
24b movzwl RDI, [RCX + #280 + RBP << #1] # ushort/char
253 movw [R8 + #34 + RBX << #1], RDI # char/short
259 movzwl RSI, [RCX + #280 + RAX << #1] # ushort/char
261 movw [R8 + #36 + RBX << #1], RSI # char/short
267 movzwl RDX, [RCX + #280 + RDX << #1] # ushort/char
26f movw [R8 + #38 + RBX << #1], RDX # char/short
275 addl R10, #8 # int
279 cmpl R10, #16377
280 jl B12 # loop end P=0.999940 C=1445888.000000
It's slightly better code, and that's probably why it's faster. But
instruction-selection-wise the code is optimal in both cases.
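A hedged cross-check of the displacements above: the byte loads use a
base offset of 24 (the array header on this 64-bit build), chars are 2
bytes wide, and folding the constant +0x80 into the lookup address gives

    24 + 0x80 * 2 = 24 + 256 = 280

which matches the #280 displacement on the map in the second listing,
while the masked variant keeps the plain #24 offset.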
I just wonder why there is an I2L conversion in the first listing, even
though the loop variable is an integer register in both listings.
Also interesting is the fact that in the first listing the unsigned byte
values are loaded with plain base + displacement loads:
0xfffffd7ffa306d89: movzbl 0x1f(%r10),%r14d
0xfffffd7ffa306d8e: movzbl 0x1e(%r10),%r13d
0xfffffd7ffa306d93: movzbl 0x1d(%r10),%ebx
0xfffffd7ffa306d98: movzbl 0x1c(%r10),%edi
0xfffffd7ffa306d9d: movzbl 0x1b(%r10),%esi
0xfffffd7ffa306da2: movzbl 0x1a(%r10),%edx
0xfffffd7ffa306da7: movzbl 0x19(%r10),%ebp
0xfffffd7ffa306dac: movzbl 0x18(%r10),%r10d
but in the second (faster) listing the loads are indexed loads:
0xfffffd7ffa306ecd: movsbq 0x1e(%r11,%rbx,1),%rax
0xfffffd7ffa306ed3: movsbq 0x1d(%r11,%rbx,1),%rbp
0xfffffd7ffa306ed9: movsbq 0x1c(%r11,%rbx,1),%r13
0xfffffd7ffa306edf: movsbq 0x1b(%r11,%rbx,1),%r14
0xfffffd7ffa306ee5: movsbq 0x1a(%r11,%rbx,1),%rsi
0xfffffd7ffa306eeb: movsbq 0x19(%r11,%rbx,1),%rdi
Register allocation?
-- Christian