A question about bytecodes + unsigned load performance ./. add performance
Tom Rodriguez
Thomas.Rodriguez at Sun.COM
Tue Jan 20 11:22:53 PST 2009
On Jan 19, 2009, at 6:43 AM, Christian Thalinger wrote:
> On Fri, 2009-01-16 at 12:29 -0800, John Rose wrote:
>> Yes. It's a valid ideal node type, not hardware specific and useful
>> to optimizations.
>
> As I've already written on hotspot-dev, the optimization works but
> it's
> generally not faster.
I think on an out-of-order machine much of this simply gets
hidden, especially when there's a lot of load and store traffic. Try
measuring a loop without stores, maybe something that simply sums the
results. SPARC would benefit from this much more since it's in-order
and sign-extending loads introduce a bubble in the pipeline.
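A sketch of the kind of store-free microbenchmark Tom suggests, where a loop only sums lookup results so store traffic cannot hide the latency of the load sequence. The class and table names here are hypothetical, not taken from Christian's actual benchmark:

```java
// Hypothetical microbenchmark: sum table lookups driven by a
// zero-extended byte index, with no stores into a destination array.
public class SumLookup {
    static final char[] MAP = new char[256];
    static {
        for (int i = 0; i < 256; i++) MAP[i] = (char) (i * 31);
    }

    static long sumMasked(byte[] src) {
        long sum = 0;
        for (byte b : src) {
            sum += MAP[b & 0xFF];   // zero-extending byte load path
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] src = new byte[1 << 20];
        for (int i = 0; i < src.length; i++) src[i] = (byte) i;
        long t0 = System.nanoTime();
        long sum = sumMasked(src);
        long t1 = System.nanoTime();
        System.out.println("sum=" + sum + " in "
                + (t1 - t0) / 1000000 + " ms");
    }
}
```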
tom
>
>
> I'm not sure yet why this is the case, as the new code is denser.
> Maybe both code sequences get translated to the same micro-ops, but
> then the performance should be at least equal.
>
> The following example is with my changes (Intel Core2 Duo T9300 @
> 2.5GHz):
>
> time for map[a & 0xFF]: 1525 ms
> time for map[a + 0x80]: 1461 ms
>
> The first one boils down to:
>
> 0xfffffd7ffa3029d2: movzbl 0x19(%r10),%r10d
> 0xfffffd7ffa3029d7: movzwl 0x18(%r14,%r10,2),%r10d ;*caload
> 0xfffffd7ffa3029dd: mov %r10w,0x1a(%rbp,%rdi,2) ;*castore
>
> and the second to:
>
> 0xfffffd7ffa302b24: movsbl 0x19(%r13,%rdi,1),%r10d
> 0xfffffd7ffa302b4e: movslq %r10d,%r10
> 0xfffffd7ffa302b51: movzwl 0x118(%r14,%r10,2),%r10d ;*caload
> 0xfffffd7ffa302b5a: mov %r10w,0x1a(%rbp,%rdi,2) ;*castore
>
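The Java-level shape behind these two disassembly variants might look like the following sketch. The class and method names are hypothetical; the point is that `b & 0xFF` needs a zero-extending byte load, while `b + 0x80` sign-extends the byte and then biases it:

```java
// Hypothetical reconstruction of the two benchmark variants. A byte
// index is either masked to [0, 255] or biased by 128 from [-128, 127];
// both select the same logical entry if the tables are laid out to match.
public class LookupVariants {
    static char lookupMasked(char[] map, byte b) {
        return map[b & 0xFF];   // zero-extending byte load (movzbl)
    }

    static char lookupBiased(char[] map, byte b) {
        return map[b + 0x80];   // sign-extending byte load (movsbl) + bias
    }
}
```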
> Maybe out-of-order execution and micro-op optimizations (I don't know
> if there are any in an Intel CPU) can combine movsbl and movslq into
> one micro-op, but even so, both variants should have the same
> performance. Generating movzbq instead of movzbl gives:
>
> time for map[a & 0xFF]: 1533 ms
>
> 0xfffffd7ffa303312: movzbq 0x19(%r10),%r10
> 0xfffffd7ffa303317: movzwl 0x18(%r14,%r10,2),%r10d ;*caload
> 0xfffffd7ffa30331d: mov %r10w,0x1a(%rbp,%rdi,2) ;*castore
>
> However, I think we should integrate my changes, since they more
> easily open up the possibility of new optimizations, e.g. superword.
> The unrolled loop could then use a code sequence like:
>
> pxor xmm7, xmm7
> movdqa xmm1, xmm0 ; copy source
> punpcklbw xmm0, xmm7 ; unpack the 8 low-end bytes
> ; into 8 zero-extended 16-bit words
> punpckhbw xmm1, xmm7 ; unpack the 8 high-end bytes
> ; into 8 zero-extended 16-bit words
>
> This would process 8 or 16 values at once, and that should definitely
> be faster...
>
> -- Christian
>
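The scalar loop shape that Christian's punpcklbw sequence would vectorize is a byte-to-char widening with zero extension. A minimal sketch (hypothetical class name, not actual C2 output), assuming an explicit unsigned-load node makes the unrolled form a superword candidate:

```java
// Scalar loop of the shape punpcklbw/punpckhbw would vectorize:
// widen each 8-bit value to a zero-extended 16-bit word.
public class Widen {
    static void widen(byte[] src, char[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = (char) (src[i] & 0xFF);  // zero-extend byte -> 16 bits
        }
    }

    public static void main(String[] args) {
        byte[] src = {-1, 0, 127};
        char[] dst = new char[src.length];
        widen(src, dst);
        for (char c : dst) System.out.println((int) c);
    }
}
```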