thinking about proper implementation of tail calls

Charles Oliver Nutter headius at headius.com
Fri Aug 31 00:49:59 PDT 2012


An interesting idea! Comments below.

On Thu, Aug 30, 2012 at 7:39 PM, John Rose <john.r.rose at oracle.com> wrote:
> I have been looking for a "killer application" for tail calls on the JVM.  Doug's example may turn out to be one.  Here's another candidate:  Overcoming class file size limits.
>
> Suppose you are compiling your favorite high-level language to the JVM, and you start running into the various size limits in class files.  (Usually they come from the 16-bit indexes in various places.)  What do you do, dear?  (H/T to Maurice Sendak, http://www.amazon.com/What-Do-You-Dear/dp/0064431134 .)
>
> You could growl a curse and try to emit smaller methods, by changing earlier parts of your compilation stack.  Or you could bail out and use an AST interpreter.  In fact, these are your only options at present.
>
> (To defend the Status Quo:  JVMs are tuned to work well on small methods, and so they have a synergistic bias toward the 16-bit limit.  Gargantuan methods make JITs unhappy, so maybe turning the problem back on language implementors is just a way of driving them away from the edge of the NP-hard canyon.)
>
> It would be great if there were an ASM module which could automatically render an over-sized "mega-method" into some JVM-compliant form.  What would this take?  A number of things:

I believe one is in progress right now, but I have not looked at it
and do not know how it approaches the problem.

JRuby has had to deal with this for some time, and we took the
dirt-simple approach: chain methods one to the next, splitting on root
AST elements. The side effects are unavoidable and sometimes extreme:

* We can only split methods that are long but flat; even a single bit
of control flow at the top level obscures chainability. This also
spares us from maintaining stack state across the chained methods.
* Rather than deal with the hassle of preserving JVM locals across the
chained calls, chaining triggers heap-scoped variables. This isn't
unavoidable, but it's messier to make the compiler track all the locals
to pass along, handle them on the other side, and so on.
* We can't accurately predict the eventual complexity of individual
top-level AST elements, so it's possible we'll still blow the size
budget in one of the chained bodies.
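The chaining approach above can be sketched roughly like this (a minimal illustration, not JRuby's actual generated code; all names are hypothetical): a long, flat body is split at top-level statements, each chunk calls into the next, and variables live in a heap-allocated scope rather than in JVM locals.

```java
import java.util.HashMap;
import java.util.Map;

public class ChainedMethod {
    // Heap-scoped variables shared across the chained chunks,
    // instead of JVM locals that would have to be passed along.
    static Object runChunk1(Map<String, Object> scope) {
        scope.put("a", 1);
        scope.put("b", 2);
        return runChunk2(scope); // chain to the next chunk
    }

    static Object runChunk2(Map<String, Object> scope) {
        int a = (Integer) scope.get("a");
        int b = (Integer) scope.get("b");
        scope.put("sum", a + b);
        return runChunk3(scope);
    }

    static Object runChunk3(Map<String, Object> scope) {
        return scope.get("sum"); // final chunk returns the result
    }

    public static Object run() {
        return runChunk1(new HashMap<>());
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Note the limitation described above: because each chunk is a plain (non-tail) call into the next, any top-level control flow spanning a chunk boundary would break this scheme.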

We have not (or I guess, *I* have not) seen fit to improve upon this
mostly because JRuby does have an AST interpreter, and there are
currently no cases I know of in the wild where hot application code
requires a method we can't compile within JVM classfile limits.
However...there are *definitely* many really big methods, and it's not
uncommon for us to sit down with a user having performance issues and
discover some behemoth the JIT just refuses to compile. To ameliorate
that, we instituted bytecode size limits in the JRuby JIT, since if a
method won't compile in the JVM's JIT...we're better off running it in
our interpreter.

> 1. The proposed mega-method would have to be broken into a control flow graph of basic blocks.
>
> 2. The basic blocks would have to be packed into one or more automatically-generated normal "mini-methods" (of JVM-compliant sizes).

This is essentially what the "new" JRuby runtime is intended to do. We
have a full intermediate representation with basic blocks, control
flow, and several evolving optimization passes we can run over that IR
(including early forms of inlining...all the way through to closures).
Our current hope is that in the JRuby version after 1.7 (JRuby 9000
is a name being tossed about) we'll do exactly this: package the basic
blocks into mini-methods and chain them together appropriately. But as
you point out, this is not without challenges on a non-tail-calling
VM...

> 3. Control transitions between basic blocks would have to be re-coded, if they cross from one mini-method to another.
> 3a. These transitions would not be allowed to blow the stack.

It's worth pointing out that for code made up of basic blocks with no
backward branches, there are fewer stack concerns. This is probably a
small percentage of the code we're interested in chopping up, though.
(And obviously, if you break a large method into too many stack
frames, you're eating a lot of stack resources too.)

An interesting thought comes to mind... currently, inlining depth is
limited to 9. If it were significantly higher for "mini-methods", we'd
be much better off; ignoring the interpreter for the moment, deeper
inlining would mean that for directly inlinable chains of
mini-methods, we'd be nearly guaranteed to consume no more stack space
than if the method were compiled as a single unit.
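For reference, HotSpot exposes this depth limit as a tunable flag, so one can experiment with raising it today (the jar name below is a placeholder; the default of 9 is the limit referred to above):

```shell
# Raise HotSpot's inlining depth limit (default 9) so longer chains
# of mini-methods have a chance of inlining back into one unit.
java -XX:MaxInlineLevel=15 -jar myapp.jar
```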

> 3b. The live data in local variables (as of the given transition) would have to be transmitted somehow.

In our case, we'd probably introduce a pass for this that turned basic
block transitions into load (as arguments) plus call (one basic block
to the next) plus store (into new set of locals) for only the
variables that are needed downstream. I suspect in a large number of
cases, the number of variables we *actually* need to propagate for
methods this large would be far smaller than the total number of
variables the behemoth method uses in toto.
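A sketch of what such a pass might emit (hypothetical names, hand-written here rather than compiler-generated): only the variables live at the block boundary travel as parameters, not the behemoth's full set of locals.

```java
public class LiveVarSplit {
    // The original method used more locals, but only `x` and `y` are
    // live at the boundary into the second basic block, so only they
    // are loaded as arguments for the cross-method transition.
    public static int block1() {
        int scratch = 99;       // dead after this block; not propagated
        int x = 10 + scratch - scratch;
        int y = 20;
        return block2(x, y);    // load live vars as args + call
    }

    static int block2(int x, int y) {
        // store into a fresh set of locals on this side
        int sum = x + y;
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(block1());
    }
}
```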

> 4. Something clever would have to be done to make the runtime stack walker see the originally named mega-method.

Good lord, this would be a huge boon for all of us. Perhaps the number
one complaint we JVM language implementers get from off-platform
immigrants is how positively *dreadful* the stack traces are for
non-Java JVM languages. In JRuby we've bent over backwards to hide
those horrible traces, even going so far as to implement our own
mixed-mode stack trace logic atop the JVM stack trace. But it's not
foolproof: Java traces that happen to span Ruby code will show JRuby
interpreter frames, bridge logic for arity adaptation and out-of-band
data like frames and scopes, and gobs of JRuby internals nobody ever
needs to see...and since we don't control normal Java exceptions, we
can't do our trace-rewriting magic.

Anything that would allow language authors to hook into the process of
stack trace generation would be greatly appreciated by us JVM language
authors, regardless of how it relates to tail calling and the "exploded"
methods you're describing here. :)

> Step 3b is non-trivial, and probably requires boxing on the heap, in general.  (Remember that you can have up to 2^16-1 locals, but only 2^8-1 parameters.)  In the case where there are only a few locals, they can just be passed in registers as parameters.

Again I'll point out that it's probably uncommon (but not impossible)
to see methods that actually need to propagate greater than 2^8-1
local variables through their entire exploded method graph. But your
point is valid...if you need to do it, you need to do it.

And of course, if you have to branch out of a basic block and still
have the locals' values available, you're stuck.

In our IR runtime, we're hoping that we can do small, localized
allocations of variable holders and expect that escape analysis will
make them disappear. That's how it's supposed to work in theory,
right? :)
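The hoped-for shape is something like this (a sketch with hypothetical names): a small holder object allocated right at the block boundary that never escapes the compiled region, which HotSpot's escape analysis can then scalar-replace, eliminating the allocation entirely.

```java
public class EscapingHolder {
    // A small, non-escaping holder for variables crossing a block
    // boundary. Because the allocation never leaves compute(), escape
    // analysis is free to scalar-replace it after inlining.
    static final class VarHolder {
        int a;
        int b;
    }

    public static int compute(int x) {
        VarHolder h = new VarHolder(); // small, localized allocation
        h.a = x;
        h.b = x * 3;
        return h.a + h.b;              // holder never escapes
    }

    public static void main(String[] args) {
        System.out.println(compute(5));
    }
}
```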

> Step 3a would seem to require tail calls.  Specifically, the act of leaving one mini-method M1 to go to a block in another mini-method M2 requires a hard tail call from M1 to M2, passing any live data values that the M2 block requires.
>
> The corresponding act of entering a given target block in M2 also requires some thought.  Unless you are lucky or clever, or unless you render one basic block to one mini-method, M2 will (logically) have multiple entry points, one for each basic block that is a successor of a basic block in some other mini-method M1.  Representing such multiple entry points is tricky.  Here are three ways to do it:
>
> A. Add an integer parameter (basic block selector index) to M2, and structure M2 as a Big Switch.  (This is a common technique; Rhino does this for example.)  Then M1 passes the selector as an extra parameter.
>
> B. Add multiple entry points to the class file format.  In particular, allow M2 to declare multiple entry points, perhaps with an entry point instruction which looks like a Big Switch.  Compile the rest of M2 as in A.  Link calls into M2 using compound symbolic references such as "M2.42", where 42 is the switch selector.  (Note that dot '.' is illegal in method names.)
>
> C. Structure M2 as in A, but run the call from M1 through invokedynamic, without the extra parameter.  When linking the call from M1, bind the extra parameter into the method handle for M2.  Ask your friendly neighborhood JVM implementor to optimize this method handle, at link time, into a form which is equivalent to the extra entry points in B.

This seems sound to me. The combination of using indy + MH to wire one
BB to the next eliminates the multiple entry-point problem, and tail
calling eliminates the stack effects of making those calls. If we can
also get EA to clean up all the little temporary variable holders we
need along the way, we're basically there (other than needing
stack-trace cleanup even more than we did before!)
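The option-C linkage can be sketched with plain method handles (the indy bootstrap would do the equivalent at link time; class and method names here are hypothetical): M2 is a Big Switch over a block selector, and MethodHandles.insertArguments pre-binds the selector so the caller sees a handle equivalent to an extra entry point like "M2.42".

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class BoundEntryPoint {
    // M2 structured as a Big Switch over basic-block selectors
    // (option A's shape).
    static int m2(int blockSelector, int x) {
        switch (blockSelector) {
            case 0:  return x + 1;
            case 42: return x * 2;
            default: throw new IllegalStateException("no such block");
        }
    }

    public static int callBlock42(int x) throws Throwable {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodHandle m2 = lookup.findStatic(BoundEntryPoint.class, "m2",
                MethodType.methodType(int.class, int.class, int.class));
        // Bind the selector at "link time": the resulting handle plays
        // the role of the extra entry point "M2.42" from option B.
        MethodHandle block42 = MethodHandles.insertArguments(m2, 0, 42);
        return (int) block42.invokeExact(x);
    }

    public static void main(String[] args) throws Throwable {
        System.out.println(callBlock42(21));
    }
}
```

In a real indy-based wiring, the bootstrap method would build and cache such a bound handle once per call site, so the selector cost disappears after linking.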

- Charlie
