A question about bytecodes + unsigned load performance ./. add performance

Christian Thalinger Christian.Thalinger at Sun.COM
Mon Jan 19 06:43:13 PST 2009


On Fri, 2009-01-16 at 12:29 -0800, John Rose wrote:
> Yes.  It's a valid ideal node type, not hardware specific and useful  
> to optimizations.

As I've already written on hotspot-dev, the optimization works but it's
generally not faster.

I'm not sure yet why this is the case, as the new code is denser.  Maybe
both code sequences are translated to the same micro-ops, but then the
performance should be at least equal.

The following timings are with my changes applied (Intel Core2 Duo T9300 @
2.5GHz):

time for map[a & 0xFF]: 1525 ms
time for map[a + 0x80]: 1461 ms

The first one boils down to:

  0xfffffd7ffa3029d2: movzbl 0x19(%r10),%r10d
  0xfffffd7ffa3029d7: movzwl 0x18(%r14,%r10,2),%r10d  ;*caload
  0xfffffd7ffa3029dd: mov    %r10w,0x1a(%rbp,%rdi,2)  ;*castore

and the second to:

  0xfffffd7ffa302b24: movsbl 0x19(%r13,%rdi,1),%r10d
  0xfffffd7ffa302b4e: movslq %r10d,%r10
  0xfffffd7ffa302b51: movzwl 0x118(%r14,%r10,2),%r10d  ;*caload
  0xfffffd7ffa302b5a: mov    %r10w,0x1a(%rbp,%rdi,2)  ;*castore

Maybe out-of-order execution and micro-op optimizations (I don't know
if there are any in an Intel CPU) can combine movsbl and movslq into one
micro-op, but even then both variants should only have the same
performance.

Generating movzbq instead of movzbl gives:

time for map[a & 0xFF]: 1533 ms

  0xfffffd7ffa303312: movzbq 0x19(%r10),%r10
  0xfffffd7ffa303317: movzwl 0x18(%r14,%r10,2),%r10d  ;*caload
  0xfffffd7ffa30331d: mov    %r10w,0x1a(%rbp,%rdi,2)  ;*castore
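
For reference, a loop pair of roughly the following shape would exercise
the two index expressions timed above.  The actual benchmark source is
not part of this message, so this is only a sketch: the class and method
names (UnsignedLoadSketch, viaMask, viaBias), the array sizes, and the
iteration counts are all assumptions.  Note that the two expressions map
a byte to different indices, so a real test would need a correspondingly
permuted table to produce identical output.

```java
// Hypothetical reconstruction of the two timed variants; not the original test.
final class UnsignedLoadSketch {
    // a & 0xFF reads the byte as unsigned: -128..127 becomes 128..255, 0..127.
    static void viaMask(byte[] src, char[] map, char[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = map[src[i] & 0xFF];
        }
    }

    // a + 0x80 biases the signed byte: -128..127 becomes 0..255.
    static void viaBias(byte[] src, char[] map, char[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = map[src[i] + 0x80];
        }
    }

    public static void main(String[] args) {
        byte[] src = new byte[1 << 16];        // assumed input size
        for (int i = 0; i < src.length; i++) {
            src[i] = (byte) i;
        }
        char[] map = new char[256];
        for (int i = 0; i < map.length; i++) {
            map[i] = (char) i;
        }
        char[] dst = new char[src.length];

        int rounds = 20000;                    // assumed iteration count
        for (int r = 0; r < rounds; r++) {     // warm-up so the loops get compiled
            viaMask(src, map, dst);
            viaBias(src, map, dst);
        }

        long t0 = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            viaMask(src, map, dst);
        }
        long t1 = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            viaBias(src, map, dst);
        }
        long t2 = System.nanoTime();

        System.out.println("time for map[a & 0xFF]: " + (t1 - t0) / 1000000 + " ms");
        System.out.println("time for map[a + 0x80]: " + (t2 - t1) / 1000000 + " ms");
    }
}
```

In the second listing above, the 0x118 displacement appears to be the
0x18 array base offset plus 0x80 * 2, i.e. the add has been folded into
the address of the char load.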

However, I think we should integrate my changes anyway, as they make
new optimizations, e.g. superword, easier to implement.  The unrolled
loop could then use a code sequence like:

pxor      xmm7, xmm7
movdqa    xmm1, xmm0 ; copy source
punpcklbw xmm0, xmm7 ; unpack the 8 low-end bytes
                     ; into 8 zero-extended 16-bit words
punpckhbw xmm1, xmm7 ; unpack the 8 high-end bytes
                     ; into 8 zero-extended 16-bit words

That processes 8 or 16 values at once, which should definitely be
faster...
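
To make the effect of the unpack sequence concrete, here is a scalar
Java model of what punpcklbw/punpckhbw compute when the second operand
is all zeros: each source byte becomes one zero-extended 16-bit value.
This is only an illustration of the semantics, not code the compiler
would emit; the class and method names (UnpackSketch, unpackLow,
unpackHigh) are made up for this sketch.

```java
// Scalar model of the SSE2 unpack-with-zero idiom; names are hypothetical.
final class UnpackSketch {
    // Models punpcklbw xmm0, xmm7 with xmm7 == 0:
    // the 8 low-end bytes become 8 zero-extended 16-bit words.
    static char[] unpackLow(byte[] v) {
        char[] out = new char[8];
        for (int i = 0; i < 8; i++) {
            out[i] = (char) (v[i] & 0xFF);
        }
        return out;
    }

    // Models punpckhbw xmm1, xmm7 with xmm7 == 0:
    // the 8 high-end bytes become 8 zero-extended 16-bit words.
    static char[] unpackHigh(byte[] v) {
        char[] out = new char[8];
        for (int i = 0; i < 8; i++) {
            out[i] = (char) (v[8 + i] & 0xFF);
        }
        return out;
    }
}
```

Both methods expect a 16-element array, matching one 128-bit XMM
register of bytes.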

-- Christian



