A question about bytecodes + unsigned load performance vs. add performance

Fri Jan 9 18:55:55 PST 2009

On 10.01.2009 02:05, John Rose wrote:
>
> Probably because (1) the bias of 128 can be folded into the address 
> mode of the instruction which implements cc[byte + 128], and (2) the 
> 0xFF mask would need to be applied immediately when the byte is
> loaded from memory, in order for the optimizer to fold (AndI 255 
> (LoadB p)) to (LoadUB p).  If the AndI gets separated from the LoadB 
> (perhaps via intervening Phi functions) then the optimizer cannot do 
> the peepholing.  But if the AndI/LoadB expression is placed in a 
> getUnsigned intrinsic method, then the compiler will be able to see
> both operations within the same small "window".
>
> (Regarding (1) you might ask why the +128 does not slow down the range 
> check.  That gets into details of range check elimination, but the 
> short answer is that the bias of 128 gets folded into the RCE 
> optimizations also.  A little more detail:  RCE works by iteration 
> space splitting, into pre/main/post loops usually, and the 
> calculations which govern those are add/subtract/compare/min/max 
> nests, into which the +128 merges nicely without adding much overhead.)
>
> Thanks for pointing out the +128 trick.  That's a good one!  I wish 
> the JIT could do it automagically, but I don't see how, since it 
> probably requires swapping the halves of a table structure, and our
> optimizer is not nearly that heroic.

The code for which the table swapping is done:

    byte[] byteBuf;
    char[] charBuf;
    int j = offset;
    for (int i = 0; i < byteBuf.length; i++, j++)
        charBuf[j] = decode(byteBuf[i]);
    ....
    public char decode(byte inByte)
            throws UnmappableCharacterException, MalformedInputException { // may be overridden
        return b2cMap[inByte + 0x80]; // bias maps -128..127 onto index 0..255
    }
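
To make the table swapping concrete, here is a minimal sketch of how a +128-biased table relates to a conventional table indexed by the unsigned byte value. The class name, the method names and the identity mapping are assumptions chosen for illustration only; they are not taken from the JDK charset code.

```java
// Hypothetical sketch: names and table contents are illustrative, not JDK code.
final class BiasedTableSketch {

    // Conventional layout: index 0..255 is the unsigned value of the byte.
    static char[] unsignedTable() {
        char[] table = new char[256];
        for (int i = 0; i < 256; i++) {
            table[i] = (char) i; // identity mapping, assumed only for the example
        }
        return table;
    }

    // Biased layout: the two 128-entry halves are swapped, so that
    // biased[b + 0x80] == unsigned[b & 0xFF] holds for every byte b.
    static char[] biasedTable(char[] unsigned) {
        char[] biased = new char[256];
        System.arraycopy(unsigned, 128, biased, 0, 128); // bytes -128..-1
        System.arraycopy(unsigned, 0, biased, 128, 128); // bytes 0..127
        return biased;
    }

    // Lookup with an explicit mask on the signed byte.
    static char decodeMasked(char[] unsigned, byte b) {
        return unsigned[b & 0xFF];
    }

    // Lookup with the +128 bias, as in decode() above.
    static char decodeBiased(char[] biased, byte b) {
        return biased[b + 0x80];
    }
}
```

Both lookups return the same char for every byte value; only the table layout and the index arithmetic differ.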

I wish the JIT could see that inByte comes out of a byte[], so that if
it is loaded as an unsigned byte directly by a native CPU opcode, there
is nothing left to mask with [.. & 0xFF], which should be faster than
+128 in any case.
Do you see any chance that the HotSpot optimizer could be enhanced in
that way? The loop we are talking about is the central loop in all
charset coders of the JVM.
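
For comparison, a minimal sketch of the masked variant, with the mask kept directly next to the array load in a small helper, which is the shape John describes as (AndI 255 (LoadB p)). The class and method names are hypothetical, and whether a given HotSpot build actually folds this into a single unsigned load is an assumption that would have to be checked in the generated code.

```java
// Hypothetical sketch: not JDK code, and no claim about what the JIT emits.
final class UnsignedLoadSketch {

    // The & 0xFF is applied immediately to the loaded byte, so the load
    // and the mask stay together in one small method.
    static int getUnsigned(byte[] buf, int i) {
        return buf[i] & 0xFF;
    }

    // Same loop as above, but indexing a conventional (unbiased) table.
    static void decode(byte[] src, char[] dst, int offset, char[] b2c) {
        for (int i = 0, j = offset; i < src.length; i++, j++) {
            dst[j] = b2c[getUnsigned(src, i)];
        }
    }
}
```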

-Ulf