Introductions

Fri Apr 25 15:50:27 PDT 2008

John Rose wrote:
> After a few hundred bytecodes, the inlining heuristic starts to get 
> scared.  Hand inlining is like that:  The JVM prefers to do this for you.
> 
> You can play with -XX:FreqInlineSize=N where N is larger than your 
> switch method.  Also maybe bump -XX:MaxInlineSize (default 35).
> 
> But it is better to use small methods and let the JVM pick inlinings.  
> You may have found a weakness in the profiling that causes the inliner 
> to fall down.

The challenge for both JRuby and JOni is that our switches alone are of 
a substantial size. In the JRuby interpreter, there are over 100 cases. 
Even without inlining any of them by hand, having even the simplest 
logic in those switches ends up pushing beyond the default inline size. 
But how else can it be implemented?

> That's a weak part of our heuristic.  Hmm.  It's especially bad when 
> there is a great variation in size (switch on constant vs. switch on 
> non-constant).  We know we need heuristics that can estimate the effects 
> of constant folding.  Branches (switches) based on parameter values are 
> a key case.

For the JRuby interpreter, there are probably "hottest" cases (newlines, 
calls, literals) and "coldest" cases (flip-flop, unusual special $ 
globals), but the "warm" cases will be pretty widely distributed. Of 
course, I haven't measured this...it might be worth getting real 
measurements or...is there a way to dump out what profiling HotSpot does 
on switches today?
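
Short of asking HotSpot for its own profile, one way to get "real 
measurements" would be to count case hits inside the interpreter itself. 
The sketch below is only an illustration: CaseHistogram, record and 
hottestFirst are hypothetical names, not anything in JRuby or JOni, and 
it measures what the interpreter sees, not what HotSpot's profiler 
records.

```java
import java.util.Arrays;

// Hypothetical helper: counts how often each switch case is executed.
final class CaseHistogram {
    private final long[] counts;

    CaseHistogram(int caseCount) {
        counts = new long[caseCount];
    }

    // Call once per dispatch, e.g. just before the big switch.
    void record(int caseId) {
        counts[caseId]++;
    }

    // Returns case ids ordered from hottest to coldest.
    // Arrays.sort on objects is stable, so ties keep ascending id order.
    Integer[] hottestFirst() {
        Integer[] ids = new Integer[counts.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        Arrays.sort(ids, (a, b) -> Long.compare(counts[b], counts[a]));
        return ids;
    }
}
```

The counting adds overhead to every dispatch, so it would only make 
sense in a measurement build, not in the normal interpreter loop.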

> That is reasonable.  It keeps the switch per se small, and (implicitly) 
> asks the JVM to inline the opcode methods that really matter.

The JRuby interpreter works the same way, with perhaps a handful of "really 
hot" cases directly inlined into the switch.
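
As a rough illustration of that shape (not JRuby's actual code), here is 
a minimal stack-machine sketch. The opcodes, the TinyInterp class and 
the fixed 16-slot stack are all made up for the example. One hot case 
(PUSH) lives directly in the switch; the rest are small methods that the 
JVM is free to inline or not.

```java
// Hypothetical mini-interpreter; opcode names and layout are illustrative only.
final class TinyInterp {
    static final int PUSH = 0, ADD = 1, MUL = 2, NEG = 3, HALT = 4;

    static int run(int[] code) {
        int[] stack = new int[16]; // fixed size keeps the sketch short
        int sp = 0;                // next free stack slot
        int pc = 0;                // index of the next opcode
        while (true) {
            switch (code[pc++]) {
                case PUSH:
                    // "Really hot" case kept inline in the switch body.
                    stack[sp++] = code[pc++];
                    break;
                case ADD:
                    sp = add(stack, sp);
                    break;
                case MUL:
                    sp = mul(stack, sp);
                    break;
                case NEG:
                    neg(stack, sp);
                    break;
                case HALT:
                    return stack[sp - 1];
                default:
                    throw new IllegalStateException("bad opcode at " + (pc - 1));
            }
        }
    }

    // Small opcode methods: the switch stays compact, and the inliner
    // can pick the ones that matter.
    private static int add(int[] s, int sp) {
        s[sp - 2] = s[sp - 2] + s[sp - 1];
        return sp - 1;
    }

    private static int mul(int[] s, int sp) {
        s[sp - 2] = s[sp - 2] * s[sp - 1];
        return sp - 1;
    }

    private static void neg(int[] s, int sp) {
        s[sp - 1] = -s[sp - 1];
    }
}
```

Running the program PUSH 2, PUSH 3, ADD, PUSH 4, MUL, HALT through 
TinyInterp.run computes (2 + 3) * 4. With 100+ cases the switch is still 
big, but each case body is only a call.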

> The PyPy experience shows that partial evaluation of your Big Switch can 
> take you a long way.  I'd like to make our switch optimizations work 
> better, even if you end up going all the way to bytecodes.

I think this is going to be a must. The cost of compiling and loading 
Java bytecodes is always going to be more than the cost of "hitting the 
ground running" by executing the language's own bytecodes or AST 
directly at first. And for cases like Ruby, where "eval" is a family 
favorite, we simply can't *afford* to compile everything. Ask the DLR 
guys about the cost of compiling everything and you'll find out why they 
backtracked this past year and started writing interpreter-generation 
into the DLR. We're not as bad as them (we don't have to also JIT to 
native code before executing) but it's a similar problem. And on our 
side, we've got the class and permgen bookkeeping to add back a little pain.

What bothers me about Marcin's numbers is how much better Harmony 
does...as much as 3x better performance on the big switches. We just 
recently got our interpreter to run faster than C Ruby's by eliminating 
some arg boxing, but if we could get a 3x improvement on switch-based 
interpretation we'd almost be beating the Ruby 1.9 *bytecode* execution.

> We've started a wiki for this purpose; see above.  It would be great if 
> you (or anyone else on the hotspot learning curve) would contribute to 
> it as you discover important facts.  I've added stuff, but since I've 
> been working on this for 10 years, it's hard to have perspective on what 
> newcomers need to know.  And, this is the best year by far for being a 
> newcomer!

And I'll say it again: Marcin, get on that wiki and just start dumping 
what you find. We'll all be better for it.

- Charlie