thinking about proper implementation of tail calls

Sat Sep 1 04:26:54 PDT 2012

* John Rose:

> As I recall, Doug Lea noted at the 2010 JVM Language Summit that
> tail calls would seem to allow work-stealing algorithms to be
> implemented somewhat more cleanly or efficiently.  (How's that for
> tentative?)  A worker thread goes from task to task in a data-driven
> way, similar to the examples given (e.g., in Guy's article) of
> linked list traversal.  Currently you need an outer loop to "pump"
> the task sequence, and each task must return to the loop to allow
> the next task to run.  If you had tail calls, you would not need to
> force the task sequence into a form that the outer loop can
> understand, which would allow more flexibility in organizing the
> task sequence.  More freedom in shaping the code can translate into
> better-tuned algorithms.

I would be surprised if tail calls give more flexibility.  Usually
it's the opposite, the mini-interpreter wins in terms of flexibility:

  <http://www.enyo.de/fw/notes/tail-calls.html>

The mini-interpreter turns direct calls into difficult-to-predict
indirect calls.  This could be a drawback.  But optimizing that is
probably not much more difficult than a general implementation of hard
tail calls.

> I have been looking for a "killer application" for tail calls on the
> JVM.  Doug's example may turn out to one.  Here's another candidate:
> Overcoming class file size limits.

I doubt it.

> Step 3b is non-trivial, and probably requires boxing on the heap, in
> general.  (Remember that you can have up to 2^16-1 locals, but only
> 2^8-1 parameters.)  In the case where there are only a few locals,
> they can just be passed in registers as parameters.

It also needs multiple return values or unboxed tuples, so that
updated state can be carried around without boxing.

> Step 3a would seem to require tail calls.  Specifically, the act of
> leaving one mini-method M1 to go to a block in another mini-method
> M2 requires a hard tail call from M1 to M2, passing any live data
> values that the M2 block requires.

Only if you can have closures which share mutable variables (sometimes
called "upvalues").  Otherwise, you still have marshal state updates
to the canonical locals, using a mini-interpreter, and the individual
state machine entries would have different function signatures.

Or you have to go full CPS and have an endless parameter list, but
that's going to run into the parameter count limit.  Copying this much
state around isn't going to be free, either.  And each entry in the
state machine would have to explicitly thread its full state, so it's
most useful for automatically generated code.