A question about bytecodes + unsigned load performance ./. add performance
Tom Rodriguez
Thomas.Rodriguez at Sun.COM
Tue Jan 20 11:22:53 PST 2009
On Jan 19, 2009, at 6:43 AM, Christian Thalinger wrote:
> On Fri, 2009-01-16 at 12:29 -0800, John Rose wrote:
>> Yes. It's a valid ideal node type, not hardware specific and useful
>> to optimizations.
>
> As I've already written on hotspot-dev, the optimization works but
> it's
> generally not faster.
I think on an out-of-order machine much of this simply gets
hidden, especially when there's a lot of load and store traffic. Try
measuring a loop without stores, maybe something that simply sums the
results. SPARC would benefit from this much more since it's in-order
and sign-extending loads introduce a bubble in the pipeline.
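A sketch of the kind of store-free microbenchmark Tom suggests, where a loop only sums lookup results so store traffic cannot hide the latency of the load sequence. The class and table names here are hypothetical, not taken from Christian's actual benchmark:

```java
// Hypothetical microbenchmark: sum table lookups driven by a
// zero-extended byte index, with no stores into a destination array.
public class SumLookup {
    static final char[] MAP = new char[256];
    static {
        for (int i = 0; i < 256; i++) MAP[i] = (char) (i * 31);
    }

    static long sumMasked(byte[] src) {
        long sum = 0;
        for (byte b : src) {
            sum += MAP[b & 0xFF];   // zero-extending byte load path
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] src = new byte[1 << 20];
        for (int i = 0; i < src.length; i++) src[i] = (byte) i;
        long t0 = System.nanoTime();
        long sum = sumMasked(src);
        long t1 = System.nanoTime();
        System.out.println("sum=" + sum + " in "
                + (t1 - t0) / 1000000 + " ms");
    }
}
```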
tom
>
>
> I'm not sure yet why this is the case, as the new code is denser.
> Maybe both code sequences get translated to the same micro-ops, but
> then the performance should be at least equal.
>
> The following example is with my changes (Intel Core2 Duo T9300 @
> 2.5GHz):
>
> time for map[a & 0xFF]: 1525 ms
> time for map[a + 0x80]: 1461 ms
>
> The first one boils down to:
>
> 0xfffffd7ffa3029d2: movzbl 0x19(%r10),%r10d
> 0xfffffd7ffa3029d7: movzwl 0x18(%r14,%r10,2),%r10d ;*caload
> 0xfffffd7ffa3029dd: mov %r10w,0x1a(%rbp,%rdi,2) ;*castore
>
> and the second to:
>
> 0xfffffd7ffa302b24: movsbl 0x19(%r13,%rdi,1),%r10d
> 0xfffffd7ffa302b4e: movslq %r10d,%r10
> 0xfffffd7ffa302b51: movzwl 0x118(%r14,%r10,2),%r10d ;*caload
> 0xfffffd7ffa302b5a: mov %r10w,0x1a(%rbp,%rdi,2) ;*castore
>
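The Java-level shape behind these two disassembly variants might look like the following sketch. The class and method names are hypothetical; the point is that `b & 0xFF` needs a zero-extending byte load, while `b + 0x80` sign-extends the byte and then biases it:

```java
// Hypothetical reconstruction of the two benchmark variants. A byte
// index is either masked to [0, 255] or biased by 128 from [-128, 127];
// both select the same logical entry if the tables are laid out to match.
public class LookupVariants {
    static char lookupMasked(char[] map, byte b) {
        return map[b & 0xFF];   // zero-extending byte load (movzbl)
    }

    static char lookupBiased(char[] map, byte b) {
        return map[b + 0x80];   // sign-extending byte load (movsbl) + bias
    }
}
```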
> Maybe out-of-order execution and micro-op optimizations (I don't know
> if there are any in an Intel CPU) can combine movsbl and movslq into
> one micro-op, but even so, both variants should have the same
> performance. Generating movzbq instead of movzbl gives:
>
> time for map[a & 0xFF]: 1533 ms
>
> 0xfffffd7ffa303312: movzbq 0x19(%r10),%r10
> 0xfffffd7ffa303317: movzwl 0x18(%r14,%r10,2),%r10d ;*caload
> 0xfffffd7ffa30331d: mov %r10w,0x1a(%rbp,%rdi,2) ;*castore
>
> However, I think we should integrate my changes, since they more
> easily open up the possibility of new optimizations, e.g. superword.
> The unrolled loop could then use a code sequence like:
>
> pxor xmm7, xmm7
> movdqa xmm1, xmm0 ; copy source
> punpcklbw xmm0, xmm7 ; unpack the 8 low-end bytes
> ; into 8 zero-extended 16-bit words
> punpckhbw xmm1, xmm7 ; unpack the 8 high-end bytes
> ; into 8 zero-extended 16-bit words
>
> This would process 8 or 16 values at once, and that should definitely
> be faster...
>
> -- Christian
>
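The scalar loop shape that Christian's punpcklbw sequence would vectorize is a byte-to-char widening with zero extension. A minimal sketch (hypothetical class name, not actual C2 output), assuming an explicit unsigned-load node makes the unrolled form a superword candidate:

```java
// Scalar loop of the shape punpcklbw/punpckhbw would vectorize:
// widen each 8-bit value to a zero-extended 16-bit word.
public class Widen {
    static void widen(byte[] src, char[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = (char) (src[i] & 0xFF);  // zero-extend byte -> 16 bits
        }
    }

    public static void main(String[] args) {
        byte[] src = {-1, 0, 127};
        char[] dst = new char[src.length];
        widen(src, dst);
        for (char c : dst) System.out.println((int) c);
    }
}
```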