The Great Startup Problem

Marcus Lagergren marcus.lagergren at oracle.com
Fri Aug 29 11:35:10 UTC 2014


I think this is an excellent summary, John. 

I also think that a scary point Charlie and I were trying to make in this thread is that we have <= 12 months or so to slim down the existing mechanisms into something whose startup works for the existing indy solutions before 9 is frozen. I think, given the time constraints, that there is no way to do this other than by working with the current compiler infrastructure. We already had some nice ideas in Santa Clara this August. I don’t think 12 months is enough to replace our entire indy architecture with something else entirely, or even to decide what that would be. For the sake of this discussion I am ONLY wearing my “lambda forms scale poorly, we need to fix that for nine” hat, which is a very, very pragmatic hat indeed. I am ONLY wearing this hat. Again: ONLY THE PRAGMATIC HAT. It contains very little architectural vision. 

I’d totally welcome future dynamic language threads that discuss completely new architectures, but then it would be good if we could split this into more threads. I’m just trying to ship a product for Java 9 that works. While I am not a PM, 12 months is an incredibly short time to replace the entire world. I don’t think it has ever been done and I don’t think it CAN be done, even if code already exists. Sorry. Pragmatic hat.

Another point, which I know Charlie has brought up earlier, is that we also cannot abandon those who still target the JVM and seek value in a common platform. At least not before bytecode 2.0, whatever that may entail. Interesting discussion, but again - a different thread, I think. This means bytecode land. This means indy. 

So before we soar up into the cloud-free architectural sky (in other threads), can we use this one to discuss things like local escape analysis around indys (maybe with cheating, i.e. knowing that they are indys), local array explosion, indy call site layout, and what we can appropriate of Fredrik Öhrström’s magic machine code hats? If this is not what you were after, Charlie, do tell me. I believe you were. 

I want to make it absolutely clear, with no possibility of doubt, that I am merely pragmatic right now, which is all I feel I can find the time to be in the current timeframe. I need a near-term solution. Now. I am wearing blinkers. I am looking straight ahead. The cruel master sits heavy on my saddle with a backpack full of metaspace. I can’t, sadly, afford to live in any other world than this at the moment. Right now, I first want to get rid of the lambda form scaling problems. Period. I am trying to ship product. After that we can discuss where our wonderful architectures will take us.

So if possible - let’s do one thing here: discuss what we can do for 9, in the time frame that is left, to make indys go faster for the use cases that have cropped up. With the current architecture. With lambda forms. With the nearly identical problems that both Charlie and we face. Let a thousand other threads start, discussing the rest, though. That is also great and exactly what this group is for, but I need some help with C1/C2 and invokedynamic at the moment! I need machine code rabbits! I think that machine code rabbits were also what Charlie asked for.

Regards
Marcus

(who is really happy to see MLVM-dev vibrant again, lately)

On 29 Aug 2014, at 05:48, John Rose <john.r.rose at oracle.com> wrote:

> On Aug 22, 2014, at 1:08 PM, Charles Oliver Nutter <headius at headius.com> wrote:
> 
>> Marcus coaxed me into making a post about our indy issues. Our indy
>> issues mostly surround startup and warmup time, so I'm making this a
>> general post about startup and warmup.
> 
> This is a vigorous and interesting discussion.  I will make some piecemeal replies to
> specific points, but first, as a HotSpot team member, I'll comment on the startup
> problem and then set out our position and vision for indy and dynamic languages.
> 
> Achilles had two heels, and so does Java:  Startup and warmup.  Many teams of many
> people have worked on these issues for years, with hard-won progress.  Compared
> with C programs (say, "/bin/awk"), the JVM takes a long time to get going.
> 
> Fundamentally this comes from the decision to package programs as bytecodes
> (JARs, etc.) rather than machine instructions plus mappable data (dylibs, etc.).
> As everyone on this list knows and appreciates, Java bytecodes are a portable,
> stable, and compact intermediate representation for both code and data structure.
> You can do an amazing variety of great things with them, and there is no credible
> proposal (IMO) to replace them wholesale with some other representation.
> 
> As an inherent property, bytecodes cannot be executed directly; they require
> additional machinery: an interpreter, a JIT compiler, and/or an AOT engine.
> Making this machinery look small is an ongoing challenge for us JVM implementors.
> I say "ongoing" (not just "historic") because software complexity scales up with time.
> 
> Our basic move is organizing cold stuff differently from hot stuff.
> The C2 JIT spends lots of effort on the hot paths in hot methods.
> On the cold side, Java's class data sharing feature tries to organize large
> amounts of start-up bytecode in a form that can be quickly loaded and discarded.
> The most useful distinctions between hot and cold data and code usually
> require some sort of online measurement, which is one reason Java is
> competitive with C++, which generally lacks behavior-dependent optimizations.
> 
> Building on these basic considerations, an ideal Java implementation would
> get through the start-up process quickly by executing cold code and/or
> loading pre-configured data structures, with the result that data which must be
> configured by every JVM on startup, or by every invocation of a given application,
> is quickly assembled.  This would be done in some cases by loading a
> pre-composed bitwise data image, or perhaps more efficiently by "decompressing"
> an operationally encoded description of the initial data image by executing
> some sort of code.  In fact, that is what Java interpreters are pretty good at,
> when augmented by a pre-composed "class data archive" that maps in
> a pre-parsed version of classes from "rt.jar".  (Sometimes this image is
> found in a file called "classes.jsa".)
> 
> But if more of the initial state of a Java application could be demand-paged from
> an image file (like the initial data of C programs) there might be less latency,
> since there would certainly be fewer order constraints between the different
> loading events (at the page and cache line level), allowing prefetch machinery
> to hide latency.  This is an important area of growth for Java.
> 
> Second, an ideal Java implementation would quickly discover the particular
> on-line characteristics of the application under execution, and produce
> tailor-made machine code for that execution, with the result that the
> application's critical inner loops would execute at full performance,
> almost immediately.  This would be done in some cases by capturing
> profile information from early phases of the application and applying them
> to the creation of customized code.  In fact, the profiling interpreter and
> tiered C1 JIT are pretty good at gathering such information, and
> feeding it to the C2 JIT.
> 
> What's missing?  Probably a certain amount of prediction of behavior
> based on previous runs of the application, combined with a way of
> loading data and executing code that benefits from that prediction,
> even before the application has gotten that far.  To me that sounds
> like a static compiler of some sort, though an open-world kind that
> allows the compiler's model of the world to change during execution.
> 
> Maybe another missing bit is a slightly more static initialization model
> for Java, under which the assembly of initial data can be predicted
> off-line with greater accuracy.  For starters, we should have array
> initializers that are more distinct from random bytecode execution,
> plus a little more discipline about the effects of <clinit> methods,
> separating those effects into "yes we can statically model this"
> vs. "do you really want to do this crazy thing?".
> 
> One wrong answer would be to do more JIT stuff, earlier and oftener.
> But JIT optimizations are inherently throughput-oriented; they do
> nothing to reduce start-up or spin-up overheads; they often worsen
> those overheads.  (Example:  Iteration range splitting which makes
> the middle of a loop run without array range checks, at the expense
> of extra calculations at the beginning and end of the loop.)
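> 
> A hand-written sketch of that example, assuming a simple array loop;
> this is conceptual, not the literal C2 transformation:
> 
>     // Original loop: every a[i] carries a bounds check.
>     for (int i = start; i < end; i++) sum += a[i];
> 
>     // After iteration range splitting, conceptually:
>     int lo = Math.min(Math.max(start, 0), end);
>     int hi = Math.max(Math.min(end, a.length), lo);
>     for (int i = start; i < lo; i++) sum += a[i];  // pre-loop, checked
>     for (int i = lo; i < hi; i++)    sum += a[i];  // main loop, checks elided
>     for (int i = hi; i < end; i++)   sum += a[i];  // post-loop, checked
> 
> The extra computations of lo and hi are pure start-up and spin-up cost.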
> 
> So what we need is a new set of optimizations for application start-up,
> akin to those developed for managing static data and relocations in
> shared libraries.
> 
> One thing to note about Java's very dynamic startup semantics is that
> it's not always bad.  It is not always the right idea to load a statically
> composed bit-image of data or code, if that image is large compared
> with its true information content (e.g., a character range table).
> If the data is going to be traversed sequentially, sometimes it is
> better to expand it from a less complex representation.
> Compressed file systems exploit this insight, when they win.
> 
> The encoding of Java lambdas uses invokedynamic to make
> a decisively more compact encoding, relative to inner classes,
> which can be expanded in local memory faster (in some cases)
> than it can be loaded from cold bits in a JAR file.  This is one
> of the reasons Java 8 lambdas (expanded on the fly) beat inner
> classes (precompiled into a JAR).
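> 
> For readers who have not looked inside: what the compiler's single
> invokedynamic instruction does for a lambda can be reproduced by hand
> with the public LambdaMetafactory API.  A minimal runnable sketch
> (class and method names invented for the example):
> 
>     import java.lang.invoke.*;
> 
>     public class LambdaIndyDemo {
>         static void impl() { System.out.println("hello"); }
> 
>         public static void main(String[] args) throws Throwable {
>             MethodHandles.Lookup lookup = MethodHandles.lookup();
>             // In effect, what linking the indy site for
>             // `Runnable r = LambdaIndyDemo::impl;` does:
>             CallSite site = LambdaMetafactory.metafactory(
>                     lookup,
>                     "run",                                  // SAM method name
>                     MethodType.methodType(Runnable.class),  // factory type
>                     MethodType.methodType(void.class),      // erased SAM type
>                     lookup.findStatic(LambdaIndyDemo.class, "impl",
>                             MethodType.methodType(void.class)),
>                     MethodType.methodType(void.class));     // instantiated type
>             Runnable r = (Runnable) site.getTarget().invokeExact();
>             r.run();                                        // prints "hello"
>         }
>     }
> 
> The whole call site costs a handful of constant pool entries, versus
> an entire precompiled inner-class file loaded from the JAR.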
> 
> The lesson here is that if you are going to precompile, consider
> precompiling something compact, and use a load-time translator
> as needed.  This insight has to be traded off against the complexity
> of the translation process, and also against the low-level advantage
> of a pre-fetch-capable load format (such as a brute image of an
> array of bytes).
> 
> That brings us to invokedynamic.  This is designed to be the swiss
> army knife of bytecode instructions, able to do the job of ten others,
> yet fitting in one compact carrying case.  When indy works right,
> a specialized call site is encoded in the natural number of tokens
> (constant pool entries), expanded quickly by its bootstrap method
> to a method handle, where the method handle has a simple and
> natural structure, and is compiled almost as quickly (pending only
> last touches of profile data) to optimal machine code.
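> 
> The "works right" case, as a minimal sketch of a bootstrap method
> (names invented): resolve one target, link it permanently, and let
> the JIT treat the call site as a constant:
> 
>     import java.lang.invoke.*;
> 
>     public class SimpleBsm {
>         public static CallSite bootstrap(MethodHandles.Lookup lookup,
>                                          String name, MethodType type)
>                 throws ReflectiveOperationException {
>             // One findVirtual, one ConstantCallSite: the natural
>             // number of moving parts for a specialized call site.
>             MethodHandle target = lookup.findVirtual(String.class, "length",
>                     MethodType.methodType(int.class));
>             return new ConstantCallSite(target.asType(type));
>         }
>     }
> 
> A ConstantCallSite's target never changes, so the JIT can inline
> straight through it.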
> 
> As a test case, an indy call site is able to emulate any other concrete
> JVM instruction that can be described as N pops and 0 or 1 pushes,
> and in that case should quickly compile to the same machine code
> as if that concrete instruction were given instead.  (Why not throw
> out the other instructions then?  Well, they can be viewed as
> shorthand for the equivalent indy instructions, if we wish.  This
> is suggestive of possible shapes for Bytecode 2.0, but who
> knows when that will be.)
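> 
> A sketch of that test case (names invented): an indy site of type
> (int,int)int linked by the bootstrap below is semantically `iadd`,
> two pops and one push, and once inlined it should compile to the
> same single machine add:
> 
>     import java.lang.invoke.*;
> 
>     public class IaddDemo {
>         static int iadd(int a, int b) { return a + b; }
> 
>         public static CallSite bootstrap(MethodHandles.Lookup lookup,
>                                          String name, MethodType type)
>                 throws ReflectiveOperationException {
>             // type is (int,int)int at the emulated call site
>             return new ConstantCallSite(
>                     lookup.findStatic(IaddDemo.class, "iadd", type));
>         }
>     }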
> 
> Conversely, when indy goes wrong, it is probably because there
> are an unnatural number of bytecode constructs involved (bad
> factoring) or the bootstrap method is doing something complicated
> (perhaps a cache is needed somewhere) or the resulting method
> handle is overly complex (too many "lambda forms" under the
> covers) or the compiler is stumbling over something and failing
> to inline and optimize (often shallow inlining or polluted profiles
> or surprise box/unbox events).
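> 
> To illustrate the "perhaps a cache is needed" remark, a hypothetical
> bootstrap that memoizes the handles it builds, so that many similar
> sites share one method handle graph instead of re-spinning lambda
> forms for each.  The cache key and lookup target are invented for
> the sketch:
> 
>     import java.lang.invoke.*;
>     import java.util.concurrent.ConcurrentHashMap;
> 
>     public class CachingBsm {
>         private static final ConcurrentHashMap<String, MethodHandle> CACHE =
>                 new ConcurrentHashMap<>();
> 
>         public static CallSite bootstrap(MethodHandles.Lookup lookup,
>                                          String name, MethodType type) {
>             // Key on name plus erased shape, so sites that differ only
>             // in reference types share one handle.
>             MethodHandle mh = CACHE.computeIfAbsent(name + type.erase(), k -> {
>                 try {
>                     // Build the (possibly complicated) handle once.
>                     return lookup.findStatic(CachingBsm.class, name, type.erase());
>                 } catch (ReflectiveOperationException e) {
>                     throw new IllegalStateException(e);
>                 }
>             });
>             return new ConstantCallSite(mh.asType(type));
>         }
>     }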
> 
> Put another way, when indy goes wrong, it is because the relatively
> simple programming model it offers is belied by internal simulation
> overheads.  Language programmers know about simulation overheads
> and can manage them explicitly, but indy provides an attractive way
> to refactor them, as long as it does not add too many more overheads.
> 
> Is it shocking that a JVM mechanism should have simulation overheads?
> No, it is not.  What is painful at this moment in time is that those overheads
> have not been reduced as quickly as we hoped.  There is a combination
> of reasons for this, which are not technically interesting.  It is still our
> position that those overheads are not inherent to the design, but rather
> implementation artifacts.  For example, for the sake of JDK 8 lambdas,
> we were able to remove enough overheads to make lambdas worthy
> successors to inner classes.
> 
> There is a range of proposals to overcome more of those overheads
> by tuning or replacing the implementation.  They all boil down to
> increased sharing of common structure early in the execution pipeline,
> with inlining and customization later allowing fully optimized JIT code.
> Fredrik's "rabbit out of the hat" technique radically simplifies the
> common structures, by relying on very strong box/unbox and varargs
> optimizations later in the pipeline.  The current architecture avoids
> reliance on those optimizations (which are hard to get 100% correct),
> at the expense of more redundant copies of the IR and the bytecodes.
> 
> (The metaphor of "pipeline" is apt, although a code pipeline has a
> different "fluid" running through each segment.  You can characterize
> a code pipeline by its phases:  Loader, linker, interpreter, JIT.
> Method handle IR adds more phases, including the lambda form
> interpreter and bytecode renderer.)
> 
> The current architecture makes multiple copies of each general shape
> of method handle, to distinguish different combinations of primitives vs.
> references.  Internally, these shapes are represented in an IR which
> shows up as "lambda forms".
> 
> The current story is that the cost of making a method handle is way
> too high:  We generate bits of IR for any non-trivial method handle,
> and those bits get composed and shoved down the code pipeline.
> This leads to a large load of code, some of it useful and some not.
> Also, the IR is retained in an executable form, causing heap bloat at scale.
> 
> These lambda forms are obnoxious in the heap and on the execution
> stack not because they are representing small distinctions between code
> shape, but rather because they are often generated one per method
> handle.  I intended this to be a temporary state, before 8 FCS, alas.
> We will be able to improve this state, soon, but it has been a long wait.
> 
> One of the reasons it has been difficult to make the improvement is that,
> paradoxically, the existing over-production of lambda forms has some
> advantages, akin to the advantages of over-inlining.  If your instruction
> processing pipeline can handle a 10x overload of redundant instructions,
> sometimes you win, because profiling and prediction tactics work better
> on single-use instructions, as opposed to shared-use instructions.
> This is true both at the bytecode and machine-code levels.
> 
> Our major challenge is to preserve existing performance peaks
> while reducing the volume of IR and bytecodes going down the
> pipeline.  The challenge is tied to the choice of IR, but not in
> such a radical way that we need an IR change.  The real
> challenge, as we see it, is twofold.
> 
> First, as method handles are built, recognize emergent similarities
> of structure, and reuse similar bits of IR to build similar method
> handles.  Put negatively, don't build full custom IR for each
> method handle.  Put positively, build generic IR if you can get
> away with it, or at least cache repeatedly built patterns, as possible.
> You know you have won when people stop noticing IR dregs
> in their heaps.
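> 
> That sharing-versus-precision tension is visible even in the public
> MethodType API, which makes for an easy sketch:
> 
>     import java.lang.invoke.*;
> 
>     public class Shapes {
>         public static void main(String[] args) {
>             MethodType precise = MethodType.methodType(
>                     String.class, String.class, int.class);
>             // Erase references to Object: one shape serves every site
>             // that differs only in reference types.
>             System.out.println(precise.erase());    // (Object,int)Object
>             // Fully generic: primitives box too, maximizing sharing at
>             // the price of box/unbox traffic.
>             System.out.println(precise.generic());  // (Object,Object)Object
>         }
>     }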
> 
> Second, as method handles are executed (and/or as their IR or
> bytecodes are executed), collect profile data that will be useful
> to the eventual full-custom object code at any or all of the relevant
> invocation sites—or nowhere, if those sites don't exist.
> 
> These goals are in tension with each other, regardless of what
> you call your IR.  Exploring this tension has been a key component
> of our recent engineering.  In basic terms, profile information should
> be attached as early as possible to the particular method handles
> bound at an invokedynamic site, while each chunk of IR should be
> cagily shared among as many method handles as possible.
> 
> (The previous requirement is expressed in terms of indy call
> sites, but it should extend to other kinds of method handle
> call sites.  An indy call site is, in effect, an invokeExact of a
> compile-time constant method handle, but we also have
> optimizations for non-constant method handle invocations,
> both exact and inexact.  Still, the focus is on indy.)
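> 
> For reference, the exact/inexact distinction in runnable form:
> 
>     import java.lang.invoke.*;
> 
>     public class Invokes {
>         public static void main(String[] args) throws Throwable {
>             MethodHandle max = MethodHandles.publicLookup().findStatic(
>                     Math.class, "max",
>                     MethodType.methodType(int.class, int.class, int.class));
>             // Exact: the symbolic signature (int,int)int must match the
>             // handle's type precisely; this is what an indy site does
>             // with its constant target.
>             int a = (int) max.invokeExact(1, 2);
>             // Inexact: asType conversions (boxing here) are applied
>             // on the fly.
>             Object b = max.invoke(Integer.valueOf(1), 2);
>             System.out.println(a + " " + b);  // 2 2
>         }
>     }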
> 
> We can use the "pipeline" metaphor to observe that overly long
> pipelines have drawbacks.  If code must be dynamically translated
> into several different states before full throughput, it is possible
> that some intermediate state will introduce an irregularity that
> later phases will not be able to normalize or optimize.  This is
> the case, for example, with the lambda form interpreter.  This
> exists (a) to work around some bootstrap issues, and (b) to allow
> programs to warm a little before committing to bytecode.
> (A third advantage is executable documentation of the semantics.)
> It appears that neither benefit is strong enough to overcome the
> disadvantage of having the JIT stumble over the lambda form
> interpreter, so we are likely to remove that phase.
> 
> It does not follow, however, that the other pipeline phases for
> lambda forms should also be removed.  In fact, neither bytecodes
> nor bundles of some lower-level IR are a replacement for lambda
> forms, insofar as they fill a special niche, the need for an easily
> created representation of a stand-alone, ad hoc JVM method.
> We need a way to build small bits of behavior that can be
> assembled, method-wise, into larger bits of behavior.
> There may be better ways to build such things than lambda
> forms, but any proposal to eliminate lambda forms needs to
> provide an up-front design for a way to compose, analyze,
> cache, and load ad hoc methods.
> 
> In the near future, we expect the overhead of a method handle to
> be more like the overhead of a closure and less like the overhead
> of an inner class.  This means that you will win if you "compress"
> your call sites using indy instead of open-coding them with many
> bytecodes.  The warm-up will process the compact form, leading
> to less net data motion for loading, while the hot form of the code
> will be indistinguishable from the open-coded version.
> 
> And we expect to get better at capturing the call-site specific types,
> values, and branch behavior of the contents of an invokedynamic site.
> 
> Such work is intended to replicate the hot-path effects of open-coded
> bytecodes, without their startup time bulk.  It worked for JDK 8 lambdas.
> We intend it to work for our other customers.
> 
> For example, one of our goals is that taking a generic method handle,
> shrinking it with "asType", and then binding it to an indy site, will
> end up having a customization effect similar to spinning customized
> bytecodes for the internals of the generic method handle.  This will
> get us close to Fredrik's "rabbit" design, even if we don't handle
> arity polymorphism any time soon.  Note that value types are pushing
> us in this direction also, since values effectively require the creation
> of many more customizations, in a system that cannot interchange
> boxed and unboxed representations.  An early part of this work is
> carefully defining more optimizer-friendly semantics for primitive
> boxes, to be extended to value boxes.
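> 
> The asType move, as a minimal sketch with an invented generic target:
> 
>     import java.lang.invoke.*;
> 
>     public class Shrink {
>         // Deliberately generic implementation, one copy for all types.
>         static Object first(Object a, Object b) { return a; }
> 
>         public static void main(String[] args) throws Throwable {
>             MethodHandle generic = MethodHandles.lookup().findStatic(
>                     Shrink.class, "first",
>                     MethodType.methodType(Object.class, Object.class,
>                                           Object.class));
>             // Shrink the generic handle to the precise shape a call
>             // site wants; the goal above is that binding it at an indy
>             // site customizes as well as spinning dedicated bytecode.
>             MethodHandle precise = generic.asType(MethodType.methodType(
>                     String.class, String.class, Integer.class));
>             String s = (String) precise.invokeExact("hi", Integer.valueOf(1));
>             System.out.println(s);  // hi
>         }
>     }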
> 
> Moving forward, we expect invokedynamic to continue to provide a
> powerful way to reduce the complexity of bytecode shapes of advanced
> languages (including Java), without compromising on performance.
> 
> I hope this helps.  Please let me know if something is unclear or seems
> wrong; these are certainly slippery topics.
> 
> Best wishes,
> — John
