The Great Startup Problem
Charles Oliver Nutter
headius at headius.com
Sun Aug 24 18:25:49 UTC 2014
On Sun, Aug 24, 2014 at 12:02 PM, Per Bothner <per at bothner.com> wrote:
> On 08/24/2014 03:46 AM, Marcus Lagergren wrote:
>> This is mostly invokedynamic related. Basically, an indy callsite
>> requires a lot of implicit class and byte code generation, that is
>> the source of the overhead we are mostly discussing. While tiered
>> compilation adds non determinism, it is usually (IMHO) bearable…
Indy aggravates the situation...it's easily an order of magnitude more
overhead at boot time.
I am also talking about startup time without indy, however. I'll try
to be more specific about our boot time overhead later in this reply.
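To make the per-call-site cost concrete, here's a minimal Java sketch (class and method names are mine, not JRuby's) of the machinery an indy call site exercises: the JVM runs a bootstrap method the first time the site is hit, and linking the resulting method handles is where the hidden class generation and bytecode spinning happen.

```java
import java.lang.invoke.CallSite;
import java.lang.invoke.ConstantCallSite;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class IndyBootstrapSketch {
    // A typical bootstrap method: the JVM calls this once per indy call
    // site. Each lookup and handle adaptation can force method-handle
    // infrastructure (LambdaForm classes, spun bytecode) to be
    // generated and loaded -- that's the boot-time cost in question.
    public static CallSite bootstrap(MethodHandles.Lookup lookup,
                                     String name,
                                     MethodType type) throws Exception {
        MethodHandle target =
                lookup.findStatic(IndyBootstrapSketch.class, name, type);
        return new ConstantCallSite(target);
    }

    static String greet(String who) {
        return "hello, " + who;
    }

    public static void main(String[] args) throws Throwable {
        // Simulate what happens at link time for one indy call site.
        CallSite cs = bootstrap(MethodHandles.lookup(), "greet",
                MethodType.methodType(String.class, String.class));
        System.out.println(cs.dynamicInvoker().invoke("indy"));
    }
}
```

A dynamic-language runtime does this once per call site, and a nontrivial app has a lot of call sites.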
> (1) Kawa shows you can have dynamic languages on the JVM that both
> run fast and have fast start-up.
Like Clojure, I'd only consider Kawa to be *somewhat* dynamic. Most
function calls can be statically dispatched, no? I think it's a poor
comparison to languages that have fully dynamic method lookup at all
(or most) sites.
> (2) Other dynamic languages (Ruby, JavaScript, PHP) have had more problems,
> possibly because they are "too dynamic". Or perhaps just their kind of
> "dynamicism" is a poor match for the JVM.
They're not "too dynamic"...they're "pervasively dynamic". But this is
a red herring...I don't believe Ruby's dynamism is the source of our
startup time issues.
> (3) "Too dynamic" does not inherently mean a flaw in either the JVM *or*
> the language, just a mis-match. (Though I'm of the school that believes
> "more staticness" is better for programmer productivity and software quality
> - as well as performance. Finding the right tradeoff is hard.)
I believe development tasks will require not just a balance of
dynamism and staticism, but a range of languages along that spectrum.
There is no one true language, and no one true balance between dynamic
and static.
I think this is birdwalking away from the original problem, though.
> (4) Invokedynamic was a noble experiment to alleviate (2), but so far it
> does not seem to have solved the problems.
Conceptually, invokedynamic has proven itself incredibly capable. In
reality, the implementation has been harder and taken longer than we
expected. We're also butting up against a JVM that has been optimized
around Java for years...it's hard to teach that old dog new tricks.
> (5) It is reasonable to continue to seek improvements in invokedynamic,
> but in terms of resource prioritization other enhancements in the Java
> platform (value types, tagged values, reified generics, continuations,
> removing class size limitations, etc etc) are more valuable.
Many of which will probably use invokedynamic in some form under the
covers. Getting invokedynamic solid, fast, and predictable should be
priority one for JVM hackers right now.
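Java 8 itself is already an example of this: javac compiles every lambda expression to a single invokedynamic instruction bootstrapped by LambdaMetafactory, so lambda-heavy Java code depends on the same linkage machinery. A minimal illustration:

```java
import java.util.function.Supplier;

public class LambdaIndy {
    public static void main(String[] args) {
        // javac emits an invokedynamic instruction here; its bootstrap
        // (LambdaMetafactory.metafactory) spins the Supplier
        // implementation class at link time, then the call site
        // behaves like a constant.
        Supplier<String> s = () -> "linked via invokedynamic";
        System.out.println(s.get());
    }
}
```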
> (6) That of course does not preclude an "aha": If we made modest change xyz,
> that could be a big help. I just don't think Oracle or the community should
> spend too much time on "fixing" invokedynamic.
I disagree wholeheartedly! Invokedynamic is by far the best tool we
have going forward to extend the JVM and languages that run atop it.
It's going through growing pains, though.
I wanted to describe JRuby's boot time, so people don't think this is
a problem of a "too dynamic" language, or solely an invokedynamic
issue.
As with most other JVM languages, JRuby is almost entirely written in
Java. So our entire runtime needs to warm up before we get decent
performance. This includes:
* A very complicated parser. Ruby's grammar has been designed to
accommodate programmers rather than parsers, and it has thousands of
productions and state transitions. Note that all Ruby applications
boot from source every time they start up.
* An AST-based interpreter (JRuby 1.7). The AST nodes call each other,
and nested nodes deepen the stack. This is not as efficient,
memory-wise, as a flat instruction-based interpreter (IR in JRuby
9000), but it has excellent inlining characteristics. A CallNode
typically will call an ArgsNode to process args, a BlockArg node to
process captured closures, etc. So the AST kinda-sorta trace JITs in
the small. It's worth noting that JRuby's AST interpreter, once warm,
is much faster at running Ruby code than cold, compiled Ruby (JVM
bytecode) in the JVM interpreter.
* A traditional CFG-based IR compiler (JRuby 9000). We have been
working to reduce the overhead of the new compiler, since it is
additional overhead compared to JRuby 1.7. We're getting there.
* An IR-based interpreter (JRuby 9000). The IR interpreter uses one
large frame for the interpreter and small frames for instruction
bodies. We have been working to manually inline just enough logic to
give the IR interpreter overhead similar to or lower than the AST
interpreter's. This may involve the introduction of
superinstructions, or we may get things "good enough" and rely on the
JVM bytecode JIT to take us the rest of the way.
* A JVM bytecode compiler, from either AST or IR. The latter is much
simpler, but this is still another phase of execution to warm up.
* All the core classes of Ruby. The String class in JRuby 9000 is
nearly 6600 lines of code. The Array class is nearly 4500. And so on.
Ruby provides an extremely rich set of core functionality at boot
time, without requires or imports.
* These classes also have to be wired up at boot time, which involves
binding into method tables some 2500 method objects across 200 class
objects, most of which happens for every boot. Obviously this could be
made lazier, but it's also not the lion's share of boot overhead.
* Portions of JRuby written in Ruby utilizing Java integration. JRuby
wants to be approachable to Ruby developers, and so we have a
Ruby-based kernel. Parts of this kernel implement JRuby functionality
using our Java integration layer, which must be bootstrapped
reflectively. So Java reflection comes into play for every boot, for
some dozens of classes and their supertypes.
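The AST-interpreter bullet above can be sketched roughly like this: a toy model in Java whose class names are illustrative, not JRuby's actual node types. Nested nodes call each other's interpret(), so each level of the Ruby AST adds a Java stack frame, but the small bodies give the JIT good inlining targets once hot.

```java
// Toy model of AST-walking interpretation (hypothetical names, not
// JRuby's real node classes).
abstract class Node {
    abstract Object interpret();
}

class LiteralNode extends Node {
    final Object value;
    LiteralNode(Object value) { this.value = value; }
    Object interpret() { return value; }
}

class ArgsNode extends Node {
    final Node[] args;
    ArgsNode(Node... args) { this.args = args; }
    Object interpret() {
        Object[] values = new Object[args.length];
        for (int i = 0; i < args.length; i++) {
            // Recursive interpret() calls: each nested node deepens
            // the Java stack, unlike a flat instruction loop.
            values[i] = args[i].interpret();
        }
        return values;
    }
}

class CallNode extends Node {
    final String name;
    final ArgsNode args;
    CallNode(String name, ArgsNode args) { this.name = name; this.args = args; }
    Object interpret() {
        Object[] argv = (Object[]) args.interpret();
        // A real CallNode would dispatch through a method table;
        // here we just report what would be called.
        return name + "/" + argv.length;
    }
}

public class AstSketch {
    public static void main(String[] a) {
        Node call = new CallNode("puts", new ArgsNode(new LiteralNode(42)));
        System.out.println(call.interpret());
    }
}
```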
Now of course much of this is magnified with booting a large
application. Tom Enebo ran some numbers and determined that running
unit tests for a bare Rails app (which involves booting Rails, loading
the build framework code, loading the test framework code, loading the
application, and loading the tests) defines another 4600 methods,
representing hundreds of .rb files and thousands of AST nodes. Most of
this never JITs and runs cold every time.
Of course, when code does JIT, then we're back to cold bytecode
performance, indy binding overhead, LF overhead, and so on. The
fastest settings for JRuby today are: no jit, no indy, tier 1 only.
And the startup difference between "--dev" mode and high-perf mode (or
even middle-ground lazy-jit-with-indy-plus-tiered-compiler) is
absolutely staggering.
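For reference, those fastest settings correspond to flags roughly like the following (from memory; check your JRuby version's --help output, since exact property names differ between 1.7 and 9000):

```shell
# Ask JRuby for its fastest-startup configuration in one flag...
jruby --dev -e 'puts "hello"'

# ...which is roughly equivalent to setting the pieces by hand:
# disable the JRuby JIT, disable invokedynamic, and stop HotSpot
# tiered compilation at tier 1 (client-style compiles only).
jruby -X-C -Xcompile.invokedynamic=false \
      -J-XX:TieredStopAtLevel=1 -e 'puts "hello"'
```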
- Charlie