thinking about proper implementation of tail calls

Thu Aug 30 17:39:18 PDT 2012

I haven't given up on "hard tail calls" aka "proper implementation of tail calls".  Other things are more pressing—as always, it seems.  Here's a quick update, so we can mull this over as we look for an opportune time to warm up the project.

Michael Haupt passed me this useful link to Guy Steele's 2011 blog entry about tail calls:
  http://www.eighty-twenty.org/index.cgi/tech/oo-tail-calls-20111001.html

The comments are very interesting, and contain other good links, including Matthias Felleisen's ECOOP 2004 keynote.

As I recall, Doug Lea noted at the 2010 JVM Language Summit that tail calls would seem to allow work-stealing algorithms to be implemented somewhat more cleanly or efficiently.  (How's that for tentative?)  A worker thread goes from task to task in a data-driven way, similar to the examples given (e.g., in Guy's article) of linked list traversal.  Currently you need an outer loop to "pump" the task sequence, and each task must return to the loop to allow the next task to run.  If you had tail calls, you would not need to force the task sequence into a form that the outer loop can understand, which would allow more flexibility in organizing the task sequence.  More freedom in shaping the code can translate into better-tuned algorithms.

However this effect may in practice be swamped by other performance issues, such as synchronization.  The main claim here is that there can come a point, when other overheads are minimized, that the lack of tail calls blocks the next optimization.

The point is analogous to the discussion about driver loops versus tail-calling iteration patterns.  The multiple threads add a dimension.  Tail calling gives a thread (more properly, a stack frame within a thread) a way to transition to the next task without passing through a trampoline (aka, a driver loop).  It also allows modest recursion (through a data structure of modest depth), without the overhead of creating multiple copies of the driver loop.  This latter point is probably the most interesting, since work-stealing task structure is likely to be "lumpy" (somewhat nested) and not always a clean linear list.

Here is a wiki page on tail calls that we collected at the 2010 Summit:
  http://wiki.jvmlangsummit.com/Why_Tailcalls

I have been looking for a "killer application" for tail calls on the JVM.  Doug's example may turn out to one.  Here's another candidate:  Overcoming class file size limits.

Suppose you are compiling your favorite high-level language to the JVM, and you start running into the various size limits in class files.  (Usually they come from the 16-bit indexes in various places.)  What do you do, dear?  (H/T to Maurice Sendak, http://www.amazon.com/What-Do-You-Dear/dp/0064431134 .)

You could growl a curse and try to emit smaller methods, by changing earlier parts of your compilation stack.  Or you could bail out and use an AST interpreter.  In fact, these are your only options at present.

(To defend the Status Quo:  JVMs are tuned to work well on small methods, and so they have a synergistic bias toward the 16-bit limit.  Gargantuan methods make JITs unhappy, so maybe turning the problem back on language implementors is just a way of driving them away the edge of the NP-hard canyon.)

It would be great if there were an ASM module which could automatically render an over-sized "mega-method" into some JVM-compliant form.  What would this take?  A number of things:

1. The proposed mega-method would have to be broken into a control flow graph of basic blocks.

2. The basic blocks would have to be packed into one or more automatically-generated normal "mini-methods" (of JVM-compliant sizes).

3. Control transitions between basic blocks would have to be re-coded, if they cross from one mini-method to another.
3a. These transitions would not be allowed to blow the stack.
3b. The live data in local variables (as of the given transition) would have to be transmitted somehow.

4. Something clever would have to be done to make the runtime stack walker see the originally named mega-method.

Step 3b is non-trivial, and probably requires boxing on the heap, in general.  (Remember that you can have up to 2^16-1 locals, but only 2^8-1 parameters.)  In the case where there are only a few locals, they can just be passed in registers as parameters.

Step 3a would seem to require tail calls.  Specifically, the act of leaving one mini-method M1 to go to a block in another mini-method M2 requires a hard tail call from M1 to M2, passing any live data values that the M2 block requires.

The corresponding act of entering a given target block in M2 also requires some thought.  Unless you are lucky or clever, or unless you render one basic block to one mini-method, M2 will (logically) have multiple entry points, one for each basic block that is a successor of a basic block in some other mini-methd M1.  Representing such multiple entry points is tricky.  Here are three ways to do it:

A. Add an integer parameter (basic block selector index) to M2, and structure M2 as a Big Switch.  (This is a common technique; Rhino does this for example.)  Then M1 passes the selector as an extra parameter.

B. Add multiple entry points to the class file format.  In particular, allow M2 to declare multiple entry points, perhaps with an entry point instruction which looks like a Big Switch.  Compile the rest of M2 as in A.  Link calls into M2 using compound symbolic references such as "M2.42", where 42 is the switch selector.  (Note that dot '.' is illegal in method names.)

C. Structure M2 as in A, but run the call from M1 through invokedynamic, without the extra parameter.  When linking the call from M1, bind the extra parameter into the method handle for M2.  Ask your friendly neighborhood JVM implementor to optimize this method handle, at link time, into a form which is equivalent to the extra entry points in B.

BTW, an early proposals for JVM support for closures had a similar flavor:  We might have created an instruction for building a closure which would (a) mention a BCI in the same method that provides the closure body, and (b) specify which locals are to be bound into the closure.  The attractive thing about such a design is that it does not require the overhead of building a fake method body (really, a mini-method) for each lambda.  The unattractive thing is that it requires lots of other new structure in the JVM spec.  I like where we fell out with using invokedynamic to render closures.  It might be nice, though, to allow some kind of light-weight private/anonymous method as a shorthand in class files.  I don't know what that would look like.  Maybe it would be a CONSTANT_NameAndTypeAndBCI.

— John

P.S.  The distinction between "soft" and "hard" tail calls is described here:
  https://blogs.oracle.com/jrose/entry/tail_calls_in_the_vm

I don't care much about soft tail calls (as an elective optimization) as much as hard tail calls (proper implementation, which makes a contract not to blow the application stack under defined conditions).