A question about bytecodes + unsigned load performance ./. add performance

Mon Jan 12 08:29:40 PST 2009

On Sat, 2009-01-10 at 10:39 -0800, John Rose wrote:

> It's already in there, to some degree, but hindered somehow by the
> peepholing problem.  See 'instruct loadUB' around line 6406 of:
>   http://hg.openjdk.java.net/jdk7/hotspot/hotspot/file/tip/src/cpu/x86/vm/x86_32.ad
> 
> 
> What that does is, when it is time to "match" (or lower) ideal to
> machine nodes in the IR graph, if a suitable AndI and LoadB are
> adjacent, and if the LoadB is unshared, they are coalesced into a
> loadUB machine node.
> 
> 
> It would be a detailed debugging exercise to find out why, in the case
> of your code, that optimization does not appear to kick in.

I tried to take a look at it, but now I'm stuck.

The ideal nodes in question are:

 129	LoadB	===  311  51  127  [[ 141 ]]  @byte[int:>=0]:exact+any *, idx=4; #byte !jvms: test::foo @ bci:28
 140	ConI	===  0  [[ 141  217  268  347  439  441 ]]  #int:255
 141	AndI	===  458  129  140  [[ 164 ]]  !orig=[377] !jvms: test::decode @ bci:4 test::foo @ bci:29

So loadUB should match but it does not (and I don't know why, yet).  The
opto output is:

102   B7: #	B6 B8 <- B6  Freq: 2
102   	movslq  R10, R11	# i2l
105   	movq    R8, [rsp + #8]	# spill
10a   	movsbl  R8, [R8 + #24 + R10]	# byte
110   	incl    R11	# int
113   	movzbl  R8, R8	# int & 0xFF
117   	movw    [R9 + #24 + R10 << #1], R8	# char/short
11d   	cmpl    R11, #1
121   	jl,s   B6	# loop end  P=0.500000 C=22950.000000

It seems the increment of the loop variable (the incl at 110) gets
scheduled between the LoadB (movsbl at 10a) and the AndI with immI_255
(movzbl at 113), and thus loadUB cannot match.
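
For reference, here is a minimal sketch of the kind of Java source that
would produce a LoadB feeding an AndI with the constant 255, followed by a
char store. Only the names test::foo and test::decode come from the node
dump above. The method bodies, the class name and the array arguments are
my assumptions, not the actual test case.

```java
// Hypothetical reconstruction: the bodies below are assumed, only the
// method names foo and decode appear in the jvms annotations.
final class UnsignedByteLoadSketch {

    // Masking with 0xFF maps the sign-extended byte to 0..255.
    // This is the AndI with ConI #int:255 in the ideal graph.
    static int decode(byte b) {
        return b & 0xFF;
    }

    // Each iteration loads a byte (LoadB), masks it (AndI after decode
    // is inlined) and stores the result as a char.
    static void foo(byte[] src, char[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = (char) decode(src[i]);
        }
    }

    public static void main(String[] args) {
        byte[] src = { 0, 1, 127, (byte) 0x80, (byte) 0xFF };
        char[] dst = new char[src.length];
        foo(src, dst);
        for (char c : dst) {
            System.out.println((int) c);
        }
    }
}
```

If loadUB matched, the load and the mask in the loop body would become a
single zero-extending load instead of the movsbl/movzbl pair shown above.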

I'm not sure yet when matching is applied, or whether my assumption
above is right.  I'm looking further...

-- Christian