Tail calls in the JVM
John Rose
John.Rose at Sun.COM
Fri Feb 15 01:51:10 PST 2008
On Feb 14, 2008, at 2:09 PM, Arnold Schwaighofer wrote:
> On Thu, Feb 14, 2008 at 2:01 AM, John Rose <John.Rose at sun.com> wrote:
>> One disadvantage of lazy popping is the extra TDC argument
>> and (perhaps) the overhead of interpreting it.
>>
>> With eager popping, each caller takes down his own frame,
>> by compiling customized code.
>
> I think optimally it would be a mix of both (for maximum efficiency).
Interesting; probably true. There's still the question of which to
attack first.
>> One thing to worry about is whether there is enough stack
>> space in the tail-caller's caller to hold the outgoing arguments
>> to the tail-callee.
>
> Yes, you are right, but it does not fully solve the problem. We
> cannot just resize (enlarge) the stack at the callee site.
Yes; my thought was that lazy popping might make this easier.
As you say, the tail-caller has full information about incoming and
outgoing argument lists, so maybe he is the right guy to be
responsible for pushing any adapters.
In my experience with Scheme, a very common case is that the
tail-caller's outgoing arguments are the same size as the incoming.
(E.g., let-loop.) Another very common case is that the tail-caller
passes fewer outgoing arguments than there are fixed argument
registers (e.g., 6 on SPARC).
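To make that concrete, here is a rough Java rendering of the let-loop
shape (the method is hypothetical, and of course Java itself doesn't
emit tail calls today):

    // Hypothetical Java analogue of a Scheme let-loop: the tail call
    // passes exactly as many arguments as the method received, so the
    // caller's incoming argument slots could be reused in place for
    // the outgoing arguments before the frame is popped.
    static int sumTo(int i, int acc) {
        if (i == 0) return acc;
        return sumTo(i - 1, acc + i);  // tail call, same argument layout
    }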
> My idea for dealing with more arguments (well, actually Christian
> suggested using adapter frames; my idea was way more complicated:
> resizing outgoing areas at runtime. Remember my comment about
> adapter frames in the last email; they are a solution to many
> problems :) is to insert a dummy frame that contains a big area for
> outgoing arguments in cases where the callee has more arguments.
> This implies a check (does the dummy frame exist, and is it big
> enough?) at the callee site (another entry point combination or
> stub) for the general case (more arguments).
Yes. If there's only one dummy frame, you can check the return-pc.
It doesn't need to be very large at first, but it does need to be
arbitrarily expandable.
(Or, the dummy frame could be some sort of specialized interpreter
frame. Those are resizable.)
> E.g., if we have fun(arg1, arg2) tail calling fun2(arg1, arg2, arg3,
> arg), then before control is handed over to fun2, a dummy/adapter
> frame is created where the arguments can go; that is, if there isn't
> already a dummy frame (because maybe fun was also tail called).
Agreed.
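A hypothetical Java sketch of that shape (fun2's extra operands are
invented for illustration):

    // fun has two incoming argument slots; its tail call to fun2
    // needs four, so the runtime must find (or create) a dummy/adapter
    // frame with room for the larger outgoing argument list.
    static int fun(int arg1, int arg2) {
        return fun2(arg1, arg2, arg1 + arg2, arg1 * arg2);  // tail call, more args
    }
    static int fun2(int a, int b, int c, int d) {
        return a + b + c + d;
    }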
> So the original idea was (not taking the PD problem into account) to
> have two code paths:
>
> - fast case: callee has fewer or the same number of arguments;
> eagerly pop the stack and jump to the callee (in case of
> virtual/interface calls, jump to a transition stub)
Or the callee has N or fewer arguments (N = minimum args in the
calling sequence).
> - slower case: callee has more arguments; jump to a point in the
> callee (special entry point/stub) that checks whether the dummy
> frame already exists and has an appropriate size (if not, it puts
> it there); here we would use the lazy popping variant and create
> the adapter frame if needed.
I like it!
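Roughly, with made-up names, the choice the compiler would bake into
each tail-call site is something like (in reality it is decided at
compile time per call site, not by a runtime branch):

    // Sketch only (hypothetical names): the JIT picks one of these
    // two paths when it emits a tail-call site, by argument counts.
    static String pickTailCallPath(int calleeArgSlots, int callerArgSlots) {
        if (calleeArgSlots <= callerArgSlots) {
            // fast case: eagerly pop the caller's frame, then jump to
            // the callee (via a transition stub for virt/interface calls)
            return "eager-pop";
        } else {
            // slow case: enter the callee through a special entry point
            // that checks for an adapter frame of sufficient size,
            // creating or growing it if needed, and pops lazily
            return "lazy-pop-with-adapter";
        }
    }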
> as you say we would have the following cases:
>
> - PD statically known (maybe improved through a PD analysis, e.g.,
> all currently loaded methods with signature x are in the same PD)
> to be the same
>
> 1. statically known callee: jump to verified entry point
>
> 2. monomorphic callee: jump to unverified entry point
>
> 3. megamorphic callee: jump to transition stub
This is where CHA could be helpful. Consider the
no_finalizable_subclasses assertion in dependencies.hpp.
A similar no_cross_domain_overrides assert could reduce a
run-time check on the actual method being called to a compile-time
check on the superclass (e.g., abstract) method being called.
Your idea about a global invariant on signatures is nice too,
but it doesn't have infrastructure support. The dependencies
stuff is flexible and powerful, though it is limited to statements
quantified over supertypes (mainly superclasses).
> - PD might be different
> 4. megamorphic callee: jump to vtable/itable stub which checks PD
Good.
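As a minimal model of that stub's extra check
(java.security.ProtectionDomain here just stands in for whatever the
VM would actually compare):

    import java.security.ProtectionDomain;

    // Sketch of case 4: before removing the caller's frame, the
    // vtable/itable stub compares protection domains; if they differ,
    // it must fall back to a normal (stack-consuming) call so the
    // caller's frame stays visible to the security machinery.
    static boolean mayPopCallerFrame(ProtectionDomain callerPD,
                                     ProtectionDomain calleePD) {
        return callerPD == calleePD;  // identity as a conservative model
    }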
> So to summarize, we have three (!) dimensions:
>
> arguments (same/fewer vs. more), callee target (static, dynamic),
> PD known to be equal or not
Impressive (and a little scary).
These two are the easiest to do first and will go a long way:
> 1a.) eagerly pop the stack, jump to callee
> 2a.) eagerly pop the stack, jump to callee (unverified entry point)
> 1b.) jump to special entry point that pops the stack (lazily),
> checks for a dummy (adapter) frame, shuffles arguments
This one is less important than one would think, if N above is
reasonable.
The hard case is x86_32. More below, where you suggest an approach.
> 6a) always lazily pop the stack, since PD checks can defeat the
> tail call
> 6b)
These two can be merged for a prototype. I was going to say
that "all real programs" will have megamorphic calls that fit
case 3ab because of CHA or your signature rule.
But a script fragment running in a sandbox might tail-call
back and forth between script code (compiled in a safe PD)
and trusted system-level adapters (in the dynamic language's
runtime).
In Scheme, APPLY is a system-level "adapter" which a script
could call from an untrusted function, passing it another
untrusted function. Eventually it would be good to look
at the performance of this pattern; maybe APPLY
(and similar pure combinators) could be marked as trusted,
but with their effective call sites pointing to something else
(the first argument, in APPLY's case). Then the cross-PD check
would skip over the combinator (APPLY) and look directly at the
ultimate callee (the first argument). I don't see a compelling
play to make here immediately, though.
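A rough Java rendering of the APPLY pattern (IntBinaryOperator is
just a stand-in for an untrusted function value):

    import java.util.function.IntBinaryOperator;

    // APPLY as a trusted combinator: untrusted code hands it another
    // untrusted function, and APPLY tail-calls that function. The
    // cross-PD check would want to skip apply() itself and inspect
    // the ultimate callee f instead.
    static int apply(IntBinaryOperator f, int a, int b) {
        return f.applyAsInt(a, b);  // tail call to the real callee
    }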
> [on x86 we are running out of registers so probably use stack slots]
Yes. (See above for an approach to make those stack slots available
as part of everybody's calling sequence.)
> Some of these cases might be collapsed, as some are more general
> and would work for other cases (while being less efficient, of
> course).
Yes.
> int add2(int a, int b, int dummy, int dummy2, int dummy3, ...) {
>     return a + b;
> }
This is in fact the varargs save area that most RISCs have.
The place where the server compiler declares those is
varargs_C_out_slots_killed in the *.ad file. It makes every
caller stay out of the way of the last N argument slots,
whether it uses them or not.
On SPARC and x86_64, that AD file parameter is non-zero,
and matched to other values in the system, notably the interpreter.
I recommend developing on x86_64 or SPARC first,
since it will let you dodge cases 1b/2b/3b longer.
I think you'll want to put a non-zero value in
varargs_C_out_slots_killed on x86_32.
One tricky part is stretching outgoing non-compiled
argument lists to the minimum length, since
x86_32 overloads ESP as both the C and Java
stack pointers. But we have a framework for this
sort of thing already; see gen_i2c_adapter in
sharedRuntime_x86_32.cpp.
You'll probably be spending lots of time with code
like that.
> Moving of the dummy arguments might be optimized away by the
> compiler, which could probably recognize that the arguments (dummy,
> dummy2) have not changed during the function.
Telling the register allocator that they are "killed slots" has
the desired effect. Every caller just allocates and avoids them.
The interpreter (and maybe random other callers) needs to
take care to leave those bubbles in the stack when calling
compiled code.
> Gee, and we haven't talked about what to do in the case of
> interpreted-to-compiled calls and vice versa :)
The fourth dimension. :-( You'll note that every method has a
compiled and an interpreted entry point. The upside is that
the interpreter stack frames are self-identifying and flexible.
Send along the diffs whenever you've got something you want to show!
BTW, we're working out the process in real-time:
http://openjdk.java.net/guide/
Best,
-- John