The Great Startup Problem

Mon Sep 1 16:13:41 UTC 2014

On Mon, Sep 1, 2014 at 2:07 AM, Vladimir Ivanov
<vladimir.x.ivanov at oracle.com> wrote:
> Stack usage won't be constant though. Each compiled LF being executed
> consumes 1 stack frame, so for a method handle chain of N elements, its
> invocation consumes ~N stack frames.
>
> Is it acceptable and solves the problem for you?

This is acceptable for JRuby. Our worst-case Ruby method handle chain
will include at most:

* Two CatchExceptions for pre/post logic (heap frames, etc). Perf of
CatchException compared to literal Java try/catch is important here.
* Up to two permute arguments for differing call site/target argument ordering.
* Varargs negotiation (may be a couple handles)
* GWT
* SwitchPoint
* For Ruby to Java calls, each argument plus the return value must be
filtered to convert to/from Ruby types or apply an IRubyObject wrapper

This is worst case, mind you. Most calls in the system will be
arity-matched, eliminating the permutes. Most calls will have three or
fewer arguments, eliminating varargs. Many calls will be optimized to
no longer need a heap frame, eliminating the try/finally. The absolute
minimum for any call would be SwitchPoint plus GWT.

Of course I'm not counting DMHs here, since they're either the call we
want to make or they're leaf logic.
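
For reference, that minimum chain looks something like the sketch below: a GWT in front of the DMH, wrapped in a SwitchPoint. The names MinimalChain, fastPath, slowPath and isString are made up for illustration; a real call site would test the receiver's class and fall back to a relinking path rather than a fixed method.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.SwitchPoint;

public class MinimalChain {
    // Hypothetical cached target and fallback (slow path / relink).
    static String fastPath(Object self) { return "fast:" + self; }
    static String slowPath(Object self) { return "slow:" + self; }

    // Hypothetical guard; stands in for a receiver type check.
    static boolean isString(Object self) { return self instanceof String; }

    public static MethodHandle build(SwitchPoint sp) {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            MethodType sig = MethodType.methodType(String.class, Object.class);
            MethodHandle target = lookup.findStatic(MinimalChain.class, "fastPath", sig);
            MethodHandle fallback = lookup.findStatic(MinimalChain.class, "slowPath", sig);
            MethodHandle test = lookup.findStatic(MinimalChain.class, "isString",
                    MethodType.methodType(boolean.class, Object.class));

            // GWT: take the target if the guard passes, otherwise the fallback.
            MethodHandle gwt = MethodHandles.guardWithTest(test, target, fallback);

            // SwitchPoint: use the GWT until invalidated, then always fall back.
            return sp.guardWithTest(gwt, fallback);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Helper so callers do not have to deal with Throwable.
    public static String call(MethodHandle mh, Object arg) {
        try {
            return (String) mh.invoke(arg);
        } catch (RuntimeException | Error e) {
            throw e;
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        SwitchPoint sp = new SwitchPoint();
        MethodHandle site = build(sp);
        System.out.println(call(site, "x")); // guard passes
        System.out.println(call(site, 42));  // guard fails
        SwitchPoint.invalidateAll(new SwitchPoint[] { sp });
        System.out.println(call(site, "x")); // switch point invalidated
    }
}
```

Going by Vladimir's numbers, even this case would cost two extra frames per call on top of the target itself.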

> We discussed an idea to generate custom bytecodes (single method) for the
> whole method handle chain (and have only 1 extra stack frame per MH
> invocation), but it defeats memory footprint reduction we are trying to
> achieve with LambdaForm sharing.

Funny thing...because indy slows our startup and increases our warmup
time, we're using our old binding logic by default. And surprise
surprise, our old binding logic does exactly this...one small
generated invoker class per method. I'm sure you're right that this
approach defeats the sharing and memory reduction we'd like to see
from LFs, but it works *really* well if you're ok with the extra class
and metaspace data in memory.

So there's one question: is the cost of a bytecoded adapter shim for
each method object really that high? Yes, if you're spinning new MHs
constantly or doing a million different adaptations of a given method.
But if you're just lazily creating an invoker shim once per method,
that really doesn't seem like a big deal.

My indy binding logic also has a dozen different flags for tweaking. I
can easily modify it to avoid doing all that pre/post logic and
argument permutation in the MH chain and just bind directly to the
generated invoker. Best (or worst) of both worlds? I just really don't
want to have to do that...I want everything from call site to target
method body to be in the MH chain.

For JRuby 9000, all try/finally logic will be within the target
method, so at least that part of the MH chain goes away.

Here's another idea...

We've been using my InvokeBinder library heavily in JRuby. It provides
a Java API/DSL for creating MH chains lazily from the top down:

MethodHandle mh = Binder.from(String.class, Object.class, Float.class)
        .tryFinally(finallyLogic)
        .permute(1, 0)
        .append("Hello")
        .drop(1)
        .invokeStatic(MyClass.class, "someMethod");

The adaptations are gathered within the Binder instance, played
forward as you add adaptations and played backward at binding time to
make the appropriate MethodHandles and MethodHandle calls.

Duncan talked about how he was able to improve MH chain size and
performance by applying certain transformations in a different order,
among other things. InvokeBinder *could* be doing a lot more to
optimize the MH chain. For example, the above case never uses the
Object value passed in (it is permuted to position 1 and later
dropped), but that fact is obscured by the intervening append.
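
To make that concrete, here is a sketch of the same adaptations written by hand against java.lang.invoke, next to one possible reduced form. The tryFinally step is left out to keep it short, someMethod is a hypothetical (Float, String) target, and the reduced version only illustrates the kind of rewrite I mean; InvokeBinder does not do this today.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class BinderByHand {
    // Hypothetical target; takes what is left after the drop: (Float, String).
    static String someMethod(Float f, String s) { return s + " " + f; }

    static MethodHandle target() {
        try {
            return MethodHandles.lookup().findStatic(BinderByHand.class, "someMethod",
                    MethodType.methodType(String.class, Float.class, String.class));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // The Binder steps played backward, one handle per adaptation.
    public static MethodHandle literal() {
        MethodHandle mh = target();                             // (Float, String)String
        mh = MethodHandles.dropArguments(mh, 1, Object.class);  // drop(1): (Float, Object, String)
        mh = MethodHandles.insertArguments(mh, 2, "Hello");     // append("Hello"): (Float, Object)
        return MethodHandles.permuteArguments(mh,               // permute(1, 0): (Object, Float)
                MethodType.methodType(String.class, Object.class, Float.class), 1, 0);
    }

    // Same signature and behavior, but the Object is dropped up front,
    // so the permute disappears entirely.
    public static MethodHandle reduced() {
        MethodHandle mh = MethodHandles.insertArguments(target(), 1, "Hello"); // (Float)
        return MethodHandles.dropArguments(mh, 0, Object.class);               // (Object, Float)
    }

    // Helper so callers do not have to deal with Throwable.
    public static String call(MethodHandle mh, Object o, Float f) {
        try {
            return (String) mh.invoke(o, f);
        } catch (RuntimeException | Error e) {
            throw e;
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }
}
```

Both handles have type (Object, Float)String and ignore the Object, but the reduced one gets there with two adaptations instead of three.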

InvokeBinder is basically doing with MHs what MHs do with LFs. Perhaps
what we really need is a more holistic view of MH + LF operations
*together* so we can boil the whole thing down (even across MH lines)
before we start interpreting or compiling it?

- Charlie