A question about bytecodes + unsigned load performance vs. add performance
Christian Thalinger
Christian.Thalinger at Sun.COM
Mon Jan 19 09:37:58 PST 2009
On Mon, 2009-01-19 at 15:43 +0100, Christian Thalinger wrote:
> As I've already written on hotspot-dev, the optimization works but it's
> generally not faster.
On i486 things look much better with unsigned-byte loads (first run
without the optimization, second run with it):
time for map[a & 0xFF]: 2510 ms
time for map[a + 0x80]: 1875 ms
vs.
time for map[a & 0xFF]: 2102 ms
time for map[a + 0x80]: 1870 ms
For the & 0xFF variant this is roughly a 20% speedup (2510 ms down to
2102 ms), but it is still slower than the +128 trick. The reason seems
to be that the generated code is very different. With unsigned-byte
loads the values are loaded in a burst and stored in temporary
locations, as on amd64. Unfortunately (or obviously) these temporaries
are on the stack:
0xf90d8b5a: movzbl 0x12(%edi),%ecx
0xf90d8b5e: mov %ecx,0x18(%esp)
0xf90d8b62: movzbl 0x11(%edi),%ecx
0xf90d8b66: mov %ecx,0x1c(%esp)
0xf90d8b6a: movzbl 0x10(%edi),%ecx
0xf90d8b6e: mov %ecx,0x24(%esp)
0xf90d8b72: movzbl 0xf(%edi),%ecx
0xf90d8b76: mov %ecx,0x28(%esp)
0xf90d8b7a: movzbl 0xe(%edi),%ecx
0xf90d8b7e: mov %ecx,0x2c(%esp)
0xf90d8b82: movzbl 0xd(%edi),%ecx
0xf90d8b86: mov %ecx,0x34(%esp)
And later loaded again:
0xf90d8b98: mov 0x34(%esp),%ecx
0xf90d8b9c: movzwl 0xc(%ebx,%ecx,2),%edi ;*caload
This is done for all 8 iterations of the unrolled loop, whereas the
+128 code is more interleaved and uses a temporary location for only 4
of the 8 iterations. The other 4 use:
0xf90d8c8d: movsbl 0xe(%esi,%edi,1),%eax
...
0xf90d8cac: movzwl 0x10c(%ebx,%eax,2),%eax ;*caload
0xf90d8cb4: mov %ax,0x10(%ebp,%edi,2) ;*castore
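For reference, the two source-level lookups behind these listings can be
sketched in Java roughly as follows. This is only a sketch: the benchmark
source is not part of this mail, so the class name, the method names and
the helper toBiased are hypothetical. The byte[] input decoded through a
char[] map is an assumption suggested by the caload/castore annotations.

```java
import java.util.Arrays;

// Hypothetical reconstruction of the two lookup variants being timed.
final class ByteMapLookup {

    // Variant 1: treat the byte as unsigned by masking, index a 256-entry map.
    static void decodeMasked(byte[] src, char[] dst, char[] map) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = map[src[i] & 0xFF];
        }
    }

    // Variant 2: keep the byte signed and shift it into 0..255 by adding 128.
    // The map must be laid out so that entry (b + 0x80) belongs to byte b.
    static void decodeBiased(byte[] src, char[] dst, char[] biasedMap) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = biasedMap[src[i] + 0x80];
        }
    }

    // Rearranges an unsigned-indexed map into the layout variant 2 expects.
    static char[] toBiased(char[] map) {
        char[] biased = new char[256];
        for (int b = -128; b <= 127; b++) {
            biased[b + 0x80] = map[b & 0xFF];
        }
        return biased;
    }

    public static void main(String[] args) {
        char[] map = new char[256];
        for (int i = 0; i < 256; i++) {
            map[i] = (char) i; // identity mapping, for illustration only
        }
        byte[] src = { -128, -1, 0, 1, 127 };
        char[] viaMask = new char[src.length];
        char[] viaBias = new char[src.length];
        decodeMasked(src, viaMask, map);
        decodeBiased(src, viaBias, toBiased(map));
        System.out.println("variants agree: " + Arrays.equals(viaMask, viaBias));
    }
}
```

If the 0xc displacement in the first caload is the char[] base offset,
then the 0x10c displacement in the +128 code would be 0xc + 2 * 0x80,
i.e. the add appears to be folded into the address computation rather
than executed as a separate instruction.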
Maybe someone can explain to me why the generated code is so different?
-- Christian