A question about bytecodes + unsigned load performance ./. add performance

Mon Jan 19 09:37:58 PST 2009

On Mon, 2009-01-19 at 15:43 +0100, Christian Thalinger wrote:
> As I've already written on hotspot-dev, the optimization works but it's
> generally not faster.

On i486 things are looking much better with unsigned-byte loads:

time for map[a & 0xFF]: 2510 ms
time for map[a + 0x80]: 1875 ms

vs.

time for map[a & 0xFF]: 2102 ms
time for map[a + 0x80]: 1870 ms

This is roughly a 20% speedup for the & 0xFF variant (2510 ms down to
2102 ms).  But it's still slower than the +128 trick.  The reason for
this seems to be that the generated code is very different.
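
For reference, the two lookup variants being timed can be sketched
roughly as follows.  This is an illustrative reconstruction, not the
actual benchmark: the class, method, and array names are made up, and
the table contents are placeholders chosen so that both variants
produce the same output.

```java
public class ByteMapSketch {
    // Hypothetical table for the masked variant: indexed by the
    // unsigned value of the byte (0..255).
    static char[] buildUnsignedMap() {
        char[] map = new char[256];
        for (int i = 0; i < 256; i++) {
            map[i] = (char) i;
        }
        return map;
    }

    // Hypothetical table for the +128 trick: the entry for signed
    // byte b lives at index b + 0x80, so no masking is needed.
    static char[] buildBiasedMap() {
        char[] map = new char[256];
        for (int b = -128; b < 128; b++) {
            map[b + 0x80] = (char) (b & 0xFF);
        }
        return map;
    }

    // map[a & 0xFF]: the mask turns the sign-extended byte into an
    // unsigned index; this is the pattern an unsigned-byte load
    // (movzbl) can implement directly.
    static void decodeMasked(byte[] src, char[] dst, char[] map) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = map[src[i] & 0xFF];
        }
    }

    // map[a + 0x80]: the byte stays signed (movsbl) and a constant
    // bias shifts the range -128..127 to 0..255.
    static void decodeBiased(byte[] src, char[] dst, char[] map) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = map[src[i] + 0x80];
        }
    }
}
```

With tables built this way, both methods map a given byte to the same
char; only the index computation differs.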

When using unsigned-byte loads the values are loaded in a burst and
stored in temporary locations, like on amd64.  Unfortunately (or
obviously) these temporaries are on the stack:

  0xf90d8b5a: movzbl 0x12(%edi),%ecx
  0xf90d8b5e: mov    %ecx,0x18(%esp)
  0xf90d8b62: movzbl 0x11(%edi),%ecx
  0xf90d8b66: mov    %ecx,0x1c(%esp)
  0xf90d8b6a: movzbl 0x10(%edi),%ecx
  0xf90d8b6e: mov    %ecx,0x24(%esp)
  0xf90d8b72: movzbl 0xf(%edi),%ecx
  0xf90d8b76: mov    %ecx,0x28(%esp)
  0xf90d8b7a: movzbl 0xe(%edi),%ecx
  0xf90d8b7e: mov    %ecx,0x2c(%esp)
  0xf90d8b82: movzbl 0xd(%edi),%ecx
  0xf90d8b86: mov    %ecx,0x34(%esp)

And later loaded again:

  0xf90d8b98: mov    0x34(%esp),%ecx
  0xf90d8b9c: movzwl 0xc(%ebx,%ecx,2),%edi  ;*caload

This is done for all 8 unrolled loop iterations, whereas the +128 code
is more interleaved and uses a temporary location for only 4 of the 8
iterations.  The other 4 use:

  0xf90d8c8d: movsbl 0xe(%esi,%edi,1),%eax
...
  0xf90d8cac: movzwl 0x10c(%ebx,%eax,2),%eax  ;*caload
  0xf90d8cb4: mov    %ax,0x10(%ebp,%edi,2)  ;*castore
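
One detail in the listing above: the 0x10c displacement in the caload
is consistent with the +0x80 bias having been folded into the
addressing mode.  The small sketch below just redoes that arithmetic.
It assumes the char[] element base offset is 0xc, as read off the
0xc(%ebx,%ecx,2) operand in the earlier listing; the class and
constant names are made up for illustration.

```java
public class DisplacementSketch {
    // Assumed from the 0xc(%ebx,%ecx,2) operand in the first listing.
    static final int CHAR_ARRAY_BASE = 0xc;
    // A Java char is 2 bytes, matching the scale factor of 2.
    static final int CHAR_SIZE = 2;
    // The constant added to the index in map[a + 0x80].
    static final int BIAS = 0x80;

    // base + (index + BIAS) * size == (base + BIAS * size) + index * size,
    // so the bias can be folded into the constant displacement.
    static int foldedDisplacement() {
        return CHAR_ARRAY_BASE + BIAS * CHAR_SIZE;
    }

    public static void main(String[] args) {
        System.out.printf("0x%x%n", foldedDisplacement()); // prints 0x10c
    }
}
```

If that reading is right, the add itself costs no extra instruction in
the +128 version.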

Maybe someone can explain to me why the generated code is so different?

-- Christian