Request for reviews (L): 6814842: Load shortening optimizations

Fri May 8 05:01:17 PDT 2009

On Fri, 2009-05-08 at 12:29 +0200, Ulf Zibis wrote:
> On 07.05.2009 15:55, Christian Thalinger wrote:
> > On Thu, 2009-05-07 at 15:40 +0200, Ulf Zibis wrote:
> >   
> >> E.g.:
> >>     char decode(byte b) {
> >>         return (char)(b & 0xFF);
> >>     }
> >>     
> >
> > This has been handled by HotSpot for a long time.
> >   
> 
> Hm, I can't share this experience.
> 
> Results from my benchmark on JDK7 b51:
> <https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/trunk/test/DecoderBenchmark.java?rev=672&view=markup>
> 
> time for (char)a: 1652 ms
> time for (char)(a & 0xFF): 2772 ms
> 
> IMHO (char)(a & 0xFF) should be faster, or at least as fast as (char)a, 
> because only basic LoadUB should be executed.

I already said that once: it's not the code that makes the second one
slower, it's the register allocation.  When running with:

-XX:LoopUnrollLimit=1

you get more accurate numbers regarding the generated code.

I looked at the code of loop3 and loop4 and they are actually identical,
except that loop4 uses a zero-extension move instead of a sign-extension
one, which is what we want.
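
To make the sign-extension vs. zero-extension difference concrete at the
Java level, here is a minimal sketch.  The class and method names
(ExtensionDemo, decodeSigned, decodeUnsigned) are made up for illustration
and are not taken from the benchmark; the sketch only shows the language
semantics of the two casts, not what machine code HotSpot emits for them.

```java
public class ExtensionDemo {
    // (char) b: the byte is first sign-extended to int, then narrowed
    // to 16 bits, so negative bytes end up in the range 0xFF80..0xFFFF.
    static char decodeSigned(byte b) {
        return (char) b;
    }

    // (char) (b & 0xFF): the mask clears everything but the low 8 bits,
    // which amounts to a zero-extension of the byte.
    static char decodeUnsigned(byte b) {
        return (char) (b & 0xFF);
    }

    public static void main(String[] args) {
        byte b = (byte) 0xE9;                        // -23 as a signed byte
        System.out.println((int) decodeSigned(b));   // prints 65513 (0xFFE9)
        System.out.println((int) decodeUnsigned(b)); // prints 233 (0x00E9)
    }
}
```

For bytes in the range 0..127 both methods return the same value; they
only differ for negative bytes.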

time for (char)a: 1350 ms
time for (char)(a & 0xFF): 1429 ms

time for inlined (char)a: 1493 ms
time for inlined (char)(a & 0xFF): 1494 ms

The differences in runtime might be related to other things
(cache, ...), since loop3/loop4 generate exactly the same code as
inline3/inline4.

-- Christian