Tail calls in the JVM
John Rose
John.Rose at Sun.COM
Wed Feb 13 17:01:10 PST 2008
On Feb 13, 2008, at 2:29 PM, Arnold Schwaighofer wrote:
> On Feb 12, 2008 10:44 AM, John Rose <John.Rose at sun.com> wrote:
>> Basically, I propose the JVM should defeat a tail
>> call when it detects a cross-PD call.
>
> That is probably the path I'll take. But there is the problem of
> recognizing cross-PD calls for dynamic call targets
> (virt./interface/method handle calls), as you suggest above.
> A solution could be to add another function entry point, like the one
> for monomorphic calls. Let's call it the unverified tail call entry
> point. It checks that the current function's PD and the caller's PD
> are the same and jumps to the normal entry point if they match (fast
> path). If they do not match, the appropriate action is taken to deal
> with the situation (proceed as a non-tail call, set up an adapter
> frame, throw an exception).
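As a rough illustration, the check in the proposed entry point might look like the Java sketch below. The PD type, the method names, and the string results are all made up for illustration; the real check would live in compiled VM code, not Java.

```java
// Hypothetical sketch of the "unverified tail call entry point" check
// described above. PD stands in for the JVM's internal protection-domain
// token; nothing here is a real HotSpot API.
final class UnverifiedTailCallEntry {
    static final class PD {
        final String name;
        PD(String name) { this.name = name; }
    }

    // Fast path: caller and callee share a PD, so the tail call proceeds
    // (the caller's frame can be reused). Slow path: the tail call is
    // defeated and runs as a normal call (or via an adapter frame).
    static String enter(PD callerPd, PD calleePd) {
        if (callerPd == calleePd) {
            return "tail-call";   // jump to the normal entry point
        }
        return "defeated";        // push a fresh frame instead
    }

    public static void main(String[] args) {
        PD app = new PD("app");
        PD lib = new PD("lib");
        System.out.println(enter(app, app));
        System.out.println(enter(app, lib));
    }
}
```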
---
One key question is who does the frame-popping.
One way is for the tail-caller to eagerly pop his own frame, but pass
a PD token either to the callee's special entry point, or to an
intermediate
"transition stub". Call this eager popping. It is the standard
technique.
The other way is for the tail-caller to try to keep his frame, but
pass a PD token plus a take-down cookie ("TDC", e.g., FP or frame size)
to the immediate callee. (Which is either a special method entry
point or again a transition stub.) Call this lazy popping.
The advantage of lazy popping is that you don't need to create
frames when a TC must be defeated. All the frames stay "standard",
and this interacts most simply with deoptimization, debugging,
security checks, and anything else that walks the stack.
One disadvantage of lazy popping is the extra TDC argument
and (perhaps) overheads interpreting it.
With eager popping, each caller takes down his own frame,
by compiling customized code.
Eager popping seems to be the more aggressive technique.
One thing to worry about is whether there is enough stack
space in the tail-caller's caller to hold the outgoing arguments
to the tail-callee. In general there isn't. That consideration
alone might lead us to favor lazy popping.
You might think that lazy popping would make it impossible to
collapse ("crush") stack frames into adapters which summarize
PD contexts. That's not true, since HotSpot has adequate stack
walking facilities. When you decide stack space is getting
tight after too many consecutive defeated tail calls, you trap
into the VM code, look at the stack, and reformat it, like
deoptimization does today.
I'm not sure which one to try first, but I lean towards
lazy popping, since it makes the stack look more "normal"
to the reflective parts of the VM runtime.
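Here is a toy model of the two popping disciplines, just to make the contrast concrete. A frame is only a name and a size, and the "TDC" is the caller's frame size; this is purely illustrative and bears no relation to HotSpot's real frame layout.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy contrast of eager vs. lazy frame popping for tail calls.
final class PoppingModel {
    record Frame(String method, int size) {}

    // Eager popping: the tail-caller takes down its own frame before the
    // call, so stack depth stays constant across the tail call.
    static void eagerTailCall(Deque<Frame> stack, Frame callee) {
        stack.pop();
        stack.push(callee);
    }

    // Lazy popping: the caller's frame stays put; the callee carries a
    // take-down cookie ("TDC") so the frame can be removed later.
    static int lazyTailCall(Deque<Frame> stack, Frame callee) {
        int tdc = stack.peek().size();  // cookie: size of the deferred frame
        stack.push(callee);
        return tdc;
    }

    // The later "crush" step: walk the stack and reclaim the deferred
    // frame, much as deoptimization-style stack reformatting would.
    static int crush(Deque<Frame> stack) {
        Frame top = stack.pop();
        Frame deferred = stack.pop();   // the frame the cookie described
        stack.push(top);
        return deferred.size();         // bytes reclaimed
    }
}
```

The point of the model: with eager popping the stack never grows across a tail call, while with lazy popping it grows until a crush pass reformats it, but every intermediate state looks like an ordinary stack to anything that walks it.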
---
Another key question is what is the shape of a call site.
HotSpot call sites are simple and general "inline caches":
set #patchable_token, %inline_cache_reg
call patchable_immediate_callee
Two machine words are patchable here. The initial
state is that the immediate callee is a linkage routine.
The usual state is called "monomorphic", where the
immediate callee is the "verified entry point" of the
actual method, and the token is the expected class,
which every VEP checks. VEPs are the same
couple of instructions everywhere; they act like
inlined transition stubs.
This usual case includes most nominally virtual call sites.
The fallback state is called "megamorphic", where
the immediate callee is a transition stub that performs
dynamic dispatch.
A typical transition stub loads the receiver's class,
loads a pointer from the vtable, and jumps through
the pointer.
The "patchable token" is garbage except for monomorphic
calls. This makes most state transitions atomic by virtue
of an atomic patch to the immediate callee. (It's not the
whole story; see the sources for more messy details.)
The key routines are CompiledIC::set_to_monomorphic,
CompiledIC::set_to_megamorphic and their callees.
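For readers who don't live in the HotSpot sources, the state machine above can be sketched in Java as follows. The enum, field names, and return strings are stand-ins for illustration; in the VM the "state" is encoded in the two patchable machine words, not in an object.

```java
// Sketch of the inline-cache state transitions described above
// (cf. CompiledIC::set_to_monomorphic / set_to_megamorphic).
final class InlineCache {
    enum State { UNLINKED, MONOMORPHIC, MEGAMORPHIC }

    State state = State.UNLINKED;
    Class<?> expectedClass;  // the "patchable token"; garbage unless monomorphic

    String dispatch(Object receiver) {
        switch (state) {
            case UNLINKED:
                // First call: the linkage routine patches the site.
                state = State.MONOMORPHIC;
                expectedClass = receiver.getClass();
                return "linked:" + expectedClass.getSimpleName();
            case MONOMORPHIC:
                // The VEP check: does the receiver match the expected class?
                if (receiver.getClass() == expectedClass) {
                    return "fast";
                }
                // Miss: fall back to full dynamic dispatch.
                state = State.MEGAMORPHIC;
                expectedClass = null;   // token becomes garbage
                return "megamorphic";
            default:
                // Transition stub: load class, load vtable slot, jump.
                return "vtable";
        }
    }
}
```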
It seems to me the problem of cross-PD detection breaks
into the same pieces:
- unlinked calls, where the VM linker routine gets to edit the site
as before
- monomorphic calls, where the decision whether to pop is made at
link time
- megamorphic calls, where the decision can be folded into the
vtable/itable transition stubs
In the case of a monomorphic site, the VM linker routine can decide
at link time whether to defeat the tail call or not, and patch the
call site accordingly.
In the less-common case of a megamorphic call site, the transition
stub can be responsible for comparing protection domains. This comparison
can be moderately expensive (I think) because it will be comparatively
infrequent. This affects the choice of PD token. I recommend that the
PD token be the caller's class (klassOop). Comparing two PD tokens
requires comparing their protection domain objects. The comparison
takes an extra indirection, but it probably makes the code for the call
site simpler, since klassOops are easy to introduce into code.
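In Java-level terms, the recommended scheme amounts to something like the sketch below: pass the caller's class as the token, then pay one extra indirection per token to reach the protection domain objects before a pointer compare. The helper name is made up; the VM-level version would compare klassOops and their PD fields directly.

```java
import java.security.ProtectionDomain;

// Sketch of the recommended PD-token scheme: the token is the caller's
// class, and the check dereferences each token once to compare the
// underlying protection domain objects.
final class PdCheck {
    // Returns true when the tail call may proceed without being defeated.
    static boolean samePd(Class<?> callerToken, Class<?> calleeClass) {
        ProtectionDomain a = callerToken.getProtectionDomain();
        ProtectionDomain b = calleeClass.getProtectionDomain();
        return a == b;  // one indirection per token, then a pointer compare
    }
}
```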
> Callee targets which are statically known to be in the same PD (a
> potential optimization could be a PD analysis akin to the class
> hierarchy analysis) would use the normal entry point (verified or
> unverified entry point) while the others would go to the unverified
> tail call entry point.
Yes, the VM call site linker could use CHA to pick a more
direct and efficient direct callee, the monomorphic method's
VEP.
---
Another key question is who is the immediate callee.
There are at least two cases: A tail call which still needs to do a PD
check, and an unchecked tail call. Should there be two more entry
points?
Either or both entry points could instead be transition stubs,
like vtable or itable dispatch stubs. There will be more such
stubs, too, as a result of the invokedynamic work.
> Pro: fast also for the dynamic-callee case (if there is an efficient
> way to compare the two PDs - maybe the caller passes its PD in a
> register) in the matching-PD case
> Contra: might be more complex than the solution below
Maybe.
> Another solution would be to always call a JVM runtime function
> (more overhead, always a slow path) to perform the tail call in cases
> where the call target is dynamic. I think this is what the .net
> runtime does/used to do? (JIT_TailCall,
> http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=98236)
>
> Pro: less complex
> Contra: slower for dynamic call target case
I think it's simple enough to pass the PD (and TDC if necessary)
as extra arguments.
I do think we should build a faster tailcall than what .NET has.
Having some tailcalls always trap to the runtime is a very uneven
performance profile!
-- John
More information about the mlvm-dev mailing list