Tailcalls in the jvm

Thu Feb 14 14:09:23 PST 2008

On Thu, Feb 14, 2008 at 2:01 AM, John Rose <John.Rose at sun.com> wrote:
> On Feb 13, 2008, at 2:29 PM, Arnold Schwaighofer wrote:
>
> > On Feb 12, 2008 10:44 AM, John Rose <John.Rose at sun.com> wrote:
> One key question is who does the frame-popping.

yes

> One way is for the tail-caller to eagerly pop his own frame, but pass
> a PD token either to the callee's special entry point, or to an
> intermediate
> "transition stub". Call this eager popping. It is the standard
> technique.
>
> The other way is for the tail-caller to try to keep his frame, but
> pass a PD token plus a take-down cookie ("TDC", e.g., FP or frame size)
> to the immediate callee. (Which is either a special method entry
> point or again a transition stub.) Call this lazy popping.
>
> The advantage of lazy popping is that you don't need to create
> frames when a TC must be defeated. All the frames stay "standard",
> and this interacts most simply with deoptimization, debugging,
> security checks, and anything else that walks the stack.
>
> One disadvantage of lazy popping is the extra TDC argument
> and (perhaps) overheads interpreting it.
>
> With eager popping, each caller takes down his own frame,
> by compiling customized code.

I think optimally it would be a mix of both (for maximum efficiency).

> Eager popping seems to be the more aggressive technique.
> One thing to worry about is whether there is enough stack
> space in the tail-caller's caller to hold the outgoing arguments
> to the tail-callee. In general there isn't. That consideration
> alone might lead us to favor lazy popping.

Yes you are right. but it does not fully solve the problem. we can not
just resize (enlargen) the stack at the callee site. The caller of the
'tail calling' caller does not know the size by which we are
adjusting. Changing the size of the outgoing area without the function
knowing it will lead to erroneous behavior :)

my idea (well actually Christian suggested using adapter frames - my
idea was way more complicated - resizing out going areas during
runtime, remember my comment about adapter frames in the last email -
they are a solution to many problems :) to deal with more arguments is
inserting a dummy frame that contains a big area for outgoing
arguments in cases where the callee has more arguments. This implies a
check (does the dummy frame exist and is it big enough) at the callee
site (another entry point combination or stub) for the general case
(more arguments).

e.g if we have a fun(arg1, arg2) tail calling a fun2(arg1, arg2, arg3,
arg) than before control is handed over to fun2 a dummy/adapter frame
is created where the arguments can go. that is if there isn't already
a dummy frame (because maybe fun was also tail called).

So the original idea was (not taking the PD problem into account) to
have two code paths:

- fast case: callee has less or the same number of arguments, eagerly
  pop the stack and jump to callee (in case of virt/interface calls
  jump to transition stub)

- slower case: callee has more arguments, jump to a point in callee
  (special entry point/stub) that checks whether the dummy frame
  already exists and has an appropiate size(if not it puts it there),
  here we would use the lazy popping variant and create the adapter
  frame if needed.

> You might think that lazy popping would make it impossible to
> collapse ("crush") stack frames into adapters which summarize
> PD contexts. That's not true, since HotSpot has adequate stack
> walking facilities. When you decide stack space is a getting
> tight after too many consecutive defeated tail calls, you trap
> into the VM code, look at the stack, and reformat it, like
> deoptimization does today.
>
> I'm not sure which one to try first, but I lean towards
> lazy popping, since it makes the stack look more "normal"
> to the reflective parts of the VM runtime.
>
> ---
> Another key question is what is the shape of a call site.
>
> HotSpot call sites are simple and general "inline caches":
> set #patchable_token, %inline_cache_reg
> call patchable_immediate_callee
>
> Two machine words are patchable here. The initial
> state is that the immediate callee is a linkage routine.
>
> The usual state is called "monomorphic", where the
> immediate callee is the "verified entry point" of the
> actual method, and the token is the expected class,
> which every VEP checks. VEPs are the same
> couple of instructions everywhere; they act like
> inlined transition stubs.
>
> This usual case includes most nominally virtual call sites.
>
> The fallback state is called "megamorphic", where
> the immediate callee is a transition stub that performs
> dynamic dispatch.
>
> A typical transition stub loads a the receiver's class,
> loads a pointer from the vtable, and jumps through
> the pointer.
>
> The "patchable token" is garbage except for monomorphic
> calls. This makes most state transitions atomic by virtue
> of an atomic patch to the immediate callee. (It's not the
> whole story; see the sources for more messy details.)
>
> The key routines are CompiledIC::set_to_monomorphic,
> CompiledIC::set_to_megamorphic and their callees.
>
> It seems to me the problem of cross-PD detection breaks
> into the same pieces:
> - unlinked calls, where the VM linker routine gets to edit the site
> as before
> - monomorphic calls, where the decision whether to pop is made at
> link time
> - megamorphic calls, where the decision can be folded into the
> vtable/itable transition stubs
>
> In the case of a monomorphic site, the VM linker routine can decide
> at link time whether to defeat the tail call or not, and patch the
> call site accordingly.
>
agree
> In the less-common case of a megamorphic call site, the transition
> stub can be responsible to compare protection domains. This comparison
> can be moderately expensive (I think) because it will be comparatively
> infrequent. This affects the choice of PD token. I recommend that the
> PD token be the caller's class (klassOop). Comparing two PD tokens
> requires comparing their protection domain objects. The comparison
> takes an extra indirection, but it probably makes the code for the call
> site simpler, since klassOops are easy to introduce into code.

okay
>
> > Callee targets which are statically known to be in the same PD (a
> > potential optimization could be a PD analysis akin to the class
> > hierarchy analysis) would use the normal entry point (verified or
> > unverified entry point) while the others would go to the unverified
> > tail call entry point.
>
> Yes, the VM call site linker could use CHA to pick a more
> direct and efficient direct callee, the monomorphic method's
> VEP.

I think you miss understood me there.  I was referring to different
tail call entry points the linker can choose from as you suggest
below. (ignoring the outgoing argument problem for a moment)

as you say we would have the following cases:

 - PD statically known (maybe improved through a PD analysis - e.g all
   currently loaded method's with signature x are in the same PD) to
   be the same

    1. statically known callee: jump to unverified entry point

    2. monomorphic callee: jump to verified entry point

    3. megamorphic callee: jump to transition stub

 - PD might be different
    4. megamorphic calee: jump to vtable/itable stub which checks PD

> ---
> Another key question is who is the immediate callee.
>
> There are at least two cases: A tail call which still needs to do a PD
> check, and an unchecked tail call. Should there be two more entry
> points?

yes. probably more.

>
> Either or both entry points could instead be transition stubs,
> like vtable or itable dispatch stubs. There will be more such
> stubs, too, as a result of the invokedynamic work.

okay

> I do think we should build a faster tailcall than what .NET has.
> Having some tailcalls always trap to the runtime is a very uneven
> performance profile!

Agreed :)

I was not suggesting to do what .net does (might do) merely comparing
to it.

> -- John
>

So to summarize we have three (!) dimensions.

arguments (same-less/more), callee target (static, dynamic), PD known
to be equal or not

same/or less      |                callee target
   // more        |
arguments         |    static    |  monomorphic  |  megamorphic
__________________|_______________________________________________
                  | 1a           | 2a            | 3a
                  |              |               |
PD equal          |      //      |     //        |      //
                  |         1b   |           2b  |          3b
__________________|_______________________________________________
                  | 4a           | 5a            | 6a
                  |              |               |     //
PD (might) differ |              |               |         6b

(ascii -art needs fixed sized font)

1a.) eagerly pop the stack jump to callee

move arguments to appropiate place
pop stack
jump verified-entry-point

1b.) jump to special entry point that pops the stack (lazy) checks for
a dummy (adapter frame), shuffels arguments

set tail-call-token-or-fp %tail-call-token-reg
jump tail-call-check-for-dummy-frame-verified-entry-point

2.a) eagerly pop the stack jump to callee (unverified entry point)

pop stack
set #patchable token %inline_cache_reg
jump unverified_entry_point

unverified_entry_point is the monomorphic unverified entry point

2.b) lazy pop the stack  jump to special entry point/stub that takes down stack

set tail-call-token-or-fp %tail-call-token-reg
jump tail-call-check-for-dummy-frame-unverfied-entry-point

3ab.) exists if we do a pd-analysis(do all currently loaded classe
      with that signature have the same pd) like 2a/b but jump to
      megamorphic stub which in case of 3b checks for dummy (adapter)
      frame otherwise like 6a

4a.) defeat - since we statically know they differ

5a.) defeat - ditto

6a)  always lazily pop the stack since pd checks can cause defeat

set pd_klass_loader %pd_reg
set tail-call-token-or-fp %tail-call-token-reg
jump tail-call-check-pd-vtable/itable-stub

6b)
set pd_klass_loader %pd_reg
set tail-call-token-or-fp %tail-call-token-reg
jump tail-call-check-pd-and-dummy-frame-vtable/itable-stub
_

[on x86 we are running out of registers so probably use stack slots]

some of these cases might be collapsed. as some are more general
and would work for other cases (while being less efficient of
course).

The reason why i think it makes sense to differentiate between the
less/equal argument case and the callee has more argument case is that
i believe that a language front-end might know in advance what the
maximum number of arguments will be (for most cases). and so could
fill every function with dummy arguments. resulting in more efficient
tail calls.

int add2(int a, int b, int dummy, int dummy2, int dummy3, ...)
  return a+b;

the generated frames than always would have the correct frame
size. (no dummy frames needed)

moving of the dummy arguments might be optimized 'away' by the
compiler which could probably recognize that the arguments (dummy,
dummy2) has not changed during the function.

gee - and we haven't talk about what to do in case of interpreted to
compiled and vice versa :)

better me stops (big mouth) talking and starts coding.

regards arnold