The Great Startup Problem

Fri Aug 29 03:48:43 UTC 2014

On Aug 22, 2014, at 1:08 PM, Charles Oliver Nutter <headius at headius.com> wrote:

> Marcus coaxed me into making a post about our indy issues. Our indy
> issues mostly surround startup and warmup time, so I'm making this a
> general post about startup and warmup.

This is a vigorous and interesting discussion.  I will make some piecemeal replies to
specific points, but first, as a HotSpot team member, I'll comment on the startup
problem and then set out our position and vision for indy and dynamic languages.

Achilles had two heels, and so does Java:  Startup and warmup.  Many teams of many
people have worked on these issues for years, with hard-won progress.  Compared
with C programs (say, "/bin/awk"), the JVM takes a long time to get going.

Fundamentally this comes from the decision to package programs as bytecodes
(JARs, etc.) rather than machine instructions plus mappable data (dylibs, etc.).
As everyone on this list knows and appreciates, Java bytecodes are a portable,
stable, and compact intermediate representation for both code and data structure.
You can do an amazing variety of great things with them, and there is no credible
proposal (IMO) to replace them wholesale with some other representation.

As an inherent property, bytecodes cannot be executed directly, requiring
additional machinery to execute: an interpreter, JIT compiler, and/or AOT engine.
Making this machinery look small is an ongoing challenge for us JVM implementors.
I say "ongoing" (not just "historic") because software complexity scales up with time.

Our basic move is organizing cold stuff differently from hot stuff.
The C2 JIT spends lots of effort on the hot paths in hot methods.
On the cold side, Java's class data sharing feature tries to organize large
amounts of start-up bytecode in a form that can be quickly loaded and discarded.
The most useful distinctions between hot and cold data and code usually
require some sort of online measurement, which is one reason Java is
competitive with C++, which generally lacks behavior-dependent optimizations.

Building on these basic considerations, an ideal Java implementation would
quickly get through the start-up process quickly by executing cold code and/or
loading pre-configured data structures, with the result that data which must be
configured by every JVM on startup, or by every invocation of a given application,
is quickly assembled.  This would be done in some cases by loading a
pre-composed bitwise data image, or perhaps more efficiently by "decompressing"
an operationally encoded description of the initial data image by executing
some sort of code.  In fact, that is what Java interpreters are pretty good at,
when augmented by a pre-composed "class data archive" that maps in
a pre-parsed version of classes from "rt.jar".  (Sometimes this image is
found in a file called "classes.jsa".)

But if more of the initial state of a Java application could be demand-paged from
an image file (like the initial data of C programs) there might be less latency,
since there would certainly be fewer order constraints between the different
loading events (at the page and cache line level), allowing prefetch machinery
to hide latency.  This is an important area of growth for Java.

Second, an ideal Java implementation would quickly discover the particular
on-line characteristics of the application under execution, and produce
tailor-made machine code for that execution, with the result that the
application's critical inner loops would execute at full performance,
almost immediately.  This would be done in some cases by capturing
profile information from early phases of the application and applying them
to the creation of customized code.  In fact, the profiling interpreter and
tiered C1 JIT are pretty good at gathering such information, and
feeding it to the C2 JIT.

What's missing?  Probably a certain amount of prediction of behavior
based on previous runs of the application, combined with a way of
loading data and executing code that benefits from that prediction,
even before the application has gotten that far.  To me that sounds
like a static compiler of some sort, though an open-world kind that
allows the compiler's model of the world to change during execution.

Maybe another missing bit is a slightly more static initialization model
for Java, under which the assembly of initial data can be predicted
off-line with greater accuracy.  For starters, we should have array
initializers that are more distinct from random bytecode execution,
plus a little more discipline about the effects of <clinit> methods,
separating those effects into "yes we can statically model this"
vs. "do you really want to do this crazy thing?".

One wrong answer would be to do more JIT stuff, earlier and oftener.
But JIT optimizations are inherently throughput-oriented; they do
nothing do reduce start-up or spin-up overheads; they often worsen
those overheads.  (Example:  Iteration range splitting which makes
the middle of a loop run without array range checks, at the expense
of extra calculations at the beginning and end of the loop.)

So what we need is a new set of optimizations for application start-up,
akin to those developed for managing static data and relocations in
shared libraries.

One thing to note about Java's very dynamic startup semantics is that
it's not always bad.  It is not always the right idea to load a statically
composed bit-image of data or code, if that image is large compared
with its true information content (e.g., a character range table).
If the data is going to be traversed sequentially, sometimes it is
better to expand it from a less complex representation.
Compressed file systems exploit this insight, when they win.

The encoding of Java lambdas uses invokedynamic to make
a decisively more compact encoding, relative to inner classes,
which can be expanded in local memory faster (in some cases)
than it can be loaded from cold bits in a JAR file.  This is one
of the reasons Java 8 lambdas (expanded on the fly) beat inner
classes (precompiled into a JAR).

The lesson here is that if you are going to precompile, consider
precompiling something compact, and use a load-time translator
as needed.  This insight has to be traded off against the complexity
of the translation process, and also against the low-level advantage
of a pre-fetch-capable load format (such as a brute image of an
array of bytes).

That brings us to invokedynamic.  This is designed to be the swiss
army knife of bytecode instructions, able to do the job of ten others,
yet fitting in one compact carrying case.  When indy works right,
a specialized call site is encoded in the natural number of tokens
(constant pool entries), expanded quickly by its bootstrap method
to a method handle, where the method handle has a simple and
natural structure, and is compiled almost as quickly (pending only
last touches of profile data) to optimal machine code.

As a test case, an indy call site is able to emulate any other concrete
JVM instruction that can be described as N pops and 0 or 1 pushes,
and in that case should quickly compile to the same machine code
as if that concrete instruction were given instead.  (Why not throw
out the other instructions then?  Well, they can be viewed as
shorthand for the equivalent indy instructions, if we wish.  This
is suggestive of possible shapes for Bytecode 2.0, but who
knows when that will be.)

Conversely, when indy goes wrong, it is probably because there
are an unnatural number of bytecode constructs involved (bad
factoring) or the bootstrap method is doing something complicated
(perhaps a cache is needed somewhere) or the resulting method
handle is overly complex (too many "lambda forms" under the
covers) or the compiler is stumbling over something and failing
to inline and optimize (often shallow inlining or polluted profiles
or surprise box/unbox events).

Put another way, when indy goes wrong, it is because the relatively
simple programming model it offers is belied by internal simulation
overheads.  Language programmers know about simulation overheads
and can manage them explicitly, but indy provides an attractive way
to refactor them, as long as it does not add too many more overheads.

Is it shocking that a JVM mechanism should have simulation overheads?
No, it is not.  What is painful at this moment in time is that those overheads
have not been reduced as quickly as we hoped.  There is a combination
of reasons for this, which are not technically interesting.  It is still our
position that those overheads are not inherent to the design, but rather
implementation artifacts.  For example, for the sake of JDK 8 lambdas,
we were able to remove enough overheads to make lambdas worthy
successors to inner classes.

There is a range of proposals to overcome more of those overheads
by tuning or replacing the implementation.  They all boil down to
increased sharing of common structure early in the execution pipeline,
with inlining and customization later allowing fully optimized JIT code.
Fredrik's "rabbit out of the hat" technique radically simplifies the
common structures, by relying on very strong box/unbox and varargs
optimizations later in the pipeline.  The current architecture avoids
reliance on those optimizations (which are hard to get 100% correct),
at the expense of more redundant copies of the IR and the bytecodes.

(The metaphor of "pipeline" is apt, although a code pipeline has a
different "fluid" running through each segment.  You can characterize
a code pipeline by its phases:  Loader, linker, interpreter, JIT.
Method handle IR adds more phases, including the lambda form
interpreter and bytecode renderer.)

The current architecture makes multiple copies of each general shape
of method handle, to distinguish different combinations of primitives vs.
references.  Internally, these shapes are represented in an IR which
shows up as "lambda forms".

The current story is that the cost of making a method handle is way
too high:  We generate bit of IR for any non-trivial method handle,
and those bits get composed and shoved down the code pipeline.
This leads to a large load of code, some of it useful and some not.
Also, the IR is retained in an executable form, causing heap bloat at scale.

These lambda forms are obnoxious in the heap and on the execution
stack not because they are representing small distinctions between code
shape, but rather because they are often generated one per method
handle.  I intended this to be a temporary state, before 8 FCS, alas.
We will be able to improve this state, soon, but it has been a long wait.

One of the reasons it has been difficult to make the improvement is that,
paradoxically, the existing over-production of lambda forms has some
advantages, akin to the advantages of over-inlining.  If your instruction
processing pipeline can handle a 10x overload of redundant instructions,
sometimes you win, because profiling and prediction tactics work better
on single-use instructions, as opposed to shared-use instructions.
This is true both at the bytecode and machine-code levels.

Our major challenge is to preserve existing performance peaks
while reducing the volume of IR and bytecodes going down the
pipeline.  The challenge is tied to the choice of IR, but not in
such as radical way that we need an IR change.  The real
challenge, as we see it, is twofold.

First, as method handles are built, recognize emergent similarities
of structure, and reuse similar bits of IR to build similar method
handles.  Put negatively, don't build full custom IR for each
method handle.  Put positively, build generic IR if you can get
away with it, or at least cache repeatedly built patterns, as possible.
You know you have won when people stop noticing IR dregs
in their heaps.

Second, as method handles are executed (and/or as their IR or
bytecodes are executed), collect profile data that will be useful
to the eventual full-custom object code at any or all of the relevant
invocation sites—or nowhere, if those sites don't exist.

These goals are in tension with each other, regardless of what
you call your IR.  Exploring this tension has been a key component
of our recent engineering.  In basic terms, profile information should
be attached as early as possible to the particular method handles
bound at an invokedynamic site, while each chunk IR should be
cagily shared among as many method handles as possible.

(The previous requirement is expressed in terms of indy call
sites, but it should extend to other kinds of method handle
call sites.  An indy call site is, in effect, an invokeExact of a
compile-time constant method handle, but we also have
optimizations for non-constant method handle invocations,
both exact and inexact.  Still, the focus is on indy.)

We can use the "pipeline" metaphor to observe that overly long
pipelines have drawbacks.  If code must be dynamically translated
into several different states before full throughput, it is possible
that some intermediate state will introduce an irregularity that
later phases will not be able to normalize or optimize.  This is
the case, for example, with the lambda form interpreter.  This
exists (a) to work around some bootstrap issues, and (b) to allow
programs to warm a little before committing to bytecode.
(A third advantage is an executable documentation of semantics.)
It appears that neither benefit is strong enough to overcome the
disadvantage of having the JIT stumble over the lambda form
interpreter, so we are likely to remove that phase.

It does not follow, however, that the other pipeline phases for
lambda forms should also be removed.  In fact, neither bytecodes
nor bundles of some lower-level IR are a replacement for lambda
forms, insofar as they fill a special niche, the need for an easily
created representation of a stand-alone, ad hoc JVM method.
We need a way to build small bits of behavior that can be
assembled, method-wise, into larger bits of behavior.
There may be better ways to build such things than lambda
forms, but any proposal to eliminate lambda forms needs to
provide an up-front design for a way to compose, analyze,
cache, and load ad hoc methods.

In the near future, we expect the overhead of a method handle to
be more like the overhead of a closure and less like the overhead
of an inner class.  This means that you will win if you "compress"
your call sites using indy instead of open-coding them with many
bytecodes.  The warm-up will process the compact form, leading
to less net data motion for loading, while the hot form of the code
will be indistinguishable from the open-coded version.

And we expect to get better at capturing the call-site specific types,
values, and branch behavior of the contents of an invokedynamic site.

Such work is intended to replicate the hot-path effects of open-coded
bytecodes, without their startup time bulk.  It worked for JDK 8 lambdas.
We intend it to work for our other customers.

For example, one of our goals is that taking a generic method handle,
shrinking it with "asType", and then binding it to an indy site, will
end up having a customization effect similar to spinning customized
bytecodes for the internals of the generic method handle.  This will
get us close to Fredrik's "rabbit" design, even if we don't handle
arity polymorphism any time soon.  Note that value types are pushing
us in this direction also, since values effectively require the creation
of many more customizations, in a system that cannot interchange
boxed and unboxed representations.  An early part of this work is
carefully defining more optimizer-friendly semantics for primitive
boxes, to be extended to value boxes.

Moving forward, we expect invokedynamic to continue to provide a
powerful way to reduce the complexity of bytecode shapes of advanced
languages (including Java), without compromising on performance.

I hope this helps.  Please let me know if something is unclear or seems
wrong; these are certainly slippery topics.

Best wishes,
— John
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.openjdk.java.net/pipermail/mlvm-dev/attachments/20140828/c1fc2167/attachment-0001.html>