A question about bytecodes + unsigned load performance ./. add performance

Fri Jan 9 17:05:13 PST 2009

On Jan 9, 2009, at 3:44 PM, Ulf Zibis wrote:

> On 10.01.2009 00:17, John Rose wrote:
>>
>>  Note that compilers tend to optimize expressions like
>> myByteArray[i] & 0xFF into unsigned loads, and packaging this into an
>> intrinsic method would add predictability of compilation (if
>> anybody cares), and the case is not frequent enough to warrant
>> shaving a few bytes off the instruction format.
>>
> ... but myByte + 0x80 is faster than myByte & 0xFF.  For me this is  
> an unintelligible mystery.
> source see here (line 141..144):
>      http://hg.openjdk.java.net/jdk7/tl/jdk/file/b89ba9a6d9a6/make/tools/src/build/tools/charsetmapping/GenerateSBCS.java
>
> How can adding be faster than unsigned load of a byte?

Probably because (1) the bias of 128 can be folded into the address  
mode of the instruction which implements cc[byte + 128], and (2) the
0xFF mask would need to be applied immediately when the byte is
loaded from memory, in order for the optimizer to fold (AndI 255  
(LoadB p)) to (LoadUB p).  If the AndI gets separated from the LoadB  
(perhaps via intervening Phi functions) then the optimizer cannot do  
the peepholing.  But if the AndI/LoadB expression is placed in a  
getUnsigned intrinsic method, then the compiler will be able to see
both operations within the same small "window".
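For illustration, here is a minimal sketch of what such a helper could
look like in plain Java. The class name and the lookupMasked method are
made up for this example, and getUnsigned here is an ordinary static
method, not an existing JDK intrinsic. The point is only that the load
and the mask sit in one small expression, so nothing can come between
them before the optimizer sees the pair.

```java
public class UnsignedLoadSketch {
    // Hypothetical helper (not a JDK API): the byte load and the 0xFF
    // mask are adjacent, which is the shape (AndI 255 (LoadB p)).
    static int getUnsigned(byte[] a, int i) {
        return a[i] & 0xFF;
    }

    // Mask form of a table lookup: the index is the unsigned value 0..255.
    static char lookupMasked(char[] table, byte[] src, int i) {
        return table[getUnsigned(src, i)];
    }

    public static void main(String[] args) {
        byte[] src = { 0, 1, 127, (byte) 0x80, (byte) 0xFF };
        for (int i = 0; i < src.length; i++) {
            // Prints 0, 1, 127, 128, 255 on separate lines.
            System.out.println(getUnsigned(src, i));
        }
    }
}
```

Whether a given JIT actually emits an unsigned load for this is up to
the compiler; the sketch only fixes the source-level shape.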

(Regarding (1) you might ask why the +128 does not slow down the  
range check.  That gets into details of range check elimination, but  
the short answer is that the bias of 128 gets folded into the RCE  
optimizations also.  A little more detail:  RCE works by iteration  
space splitting, into pre/main/post loops usually, and the  
calculations which govern those are add/subtract/compare/min/max  
nests, into which the +128 merges nicely without adding much overhead.)
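As a rough picture of the loop splitting mentioned above, here is a
hand-written Java sketch. It is not what the JIT emits; the class and
method names are invented, and it assumes n >= 0 and a bias small
enough that i + bias and a.length - bias do not overflow. The checked
helper stands in for a range-checked access, and the plain a[k] in the
main loop stands in for an access whose check has been removed. Note
that the bias shows up only in the max/min expressions that compute the
loop limits.

```java
public class LoopSplitSketch {
    // Stand-in for a range-checked array access.
    static int checked(int[] a, int k) {
        if (k < 0 || k >= a.length) {
            throw new ArrayIndexOutOfBoundsException(k);
        }
        return a[k];
    }

    // Sums a[i + bias] for i in [0, n), split into pre/main/post loops.
    static long sumSplit(int[] a, int n, int bias) {
        // Main loop covers exactly those i with 0 <= i + bias < a.length.
        int lo = Math.min(n, Math.max(0, -bias));
        int hi = Math.max(lo, Math.min(n, a.length - bias));
        long s = 0;
        int i = 0;
        for (; i < lo; i++) s += checked(a, i + bias);  // pre loop
        for (; i < hi; i++) s += a[i + bias];           // main loop, index known in range
        for (; i < n; i++)  s += checked(a, i + bias);  // post loop
        return s;
    }

    public static void main(String[] args) {
        int[] a = { 1, 2, 3, 4, 5 };
        System.out.println(sumSplit(a, 3, 2)); // a[2] + a[3] + a[4] = 12
    }
}
```

If every index is in range, the pre and post loops run zero times and
all the work happens in the check-free main loop.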

Thanks for pointing out the +128 trick.  That's a good one!  I wish  
the JIT could do it automagically, but I don't see how, since it  
probably requires swapping the halves of a table structure, and our
optimizer is not nearly that heroic.
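To make the table swap concrete, here is a small sketch, assuming a
256-entry char table. The names BiasedTableSketch and swapHalves are
made up for this example and are not taken from GenerateSBCS.java. It
builds the biased table by hand and checks that table[myByte + 0x80]
on the swapped table agrees with table[myByte & 0xFF] on the original
for every byte value.

```java
public class BiasedTableSketch {
    // Returns a copy of the table with its two 128-entry halves swapped,
    // so that biased[b + 128] == original[b & 0xFF] for every byte b.
    static char[] swapHalves(char[] original) {
        if (original.length != 256) {
            throw new IllegalArgumentException("expected 256 entries");
        }
        char[] biased = new char[256];
        System.arraycopy(original, 128, biased, 0, 128); // bytes -128..-1
        System.arraycopy(original, 0, biased, 128, 128); // bytes 0..127
        return biased;
    }

    public static void main(String[] args) {
        char[] original = new char[256];
        for (int i = 0; i < 256; i++) {
            original[i] = (char) (0x100 + i); // arbitrary distinct entries
        }
        char[] biased = swapHalves(original);
        for (int b = Byte.MIN_VALUE; b <= Byte.MAX_VALUE; b++) {
            byte myByte = (byte) b;
            if (biased[myByte + 0x80] != original[myByte & 0xFF]) {
                throw new AssertionError("mismatch at " + b);
            }
        }
        System.out.println("both forms agree for all 256 byte values");
    }
}
```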

-- John