abandon all U-types, welcome to L-world (or, what I learned in Burlington)

Brian Goetz brian.goetz at oracle.com
Wed Nov 22 13:48:24 UTC 2017


What's the L-world story for array subtyping?  For any R-type, R[] <: 
Object[].  If everything is an L type and everything is <: Object, are 
arrays of Q-types/primitives also subtypes of Object[]?

We didn't have a story for this in QU-world either, but at least in 
QU-world it was believable that QFoo[] <! Object[].  But that seems much 
less tenable when there's no syntactic difference between L-uses and 
Q-uses.  (And even less so when we might migrate code from L to Q.)
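
For reference, today's behavior can be observed directly: reference arrays are covariant (and pay for it with a store check), while primitive arrays are not subtypes of Object[].  The sketch below only demonstrates current Java semantics; it does not say what L-world should do for arrays of Q-types.  The class and method names are made up for illustration.

```java
final class ArrayCovarianceToday {
    /** True when the runtime value can be viewed as an Object[]. */
    static boolean isObjectArray(Object candidate) {
        return candidate instanceof Object[];
    }

    /** Stores through an Object[] view and reports whether the runtime store check rejected it. */
    static boolean storeRejected(Object[] array, Object element) {
        try {
            array[0] = element;
            return false;
        } catch (ArrayStoreException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Integer[] boxed = {1, 2, 3};
        Object[] view = boxed;  // legal today: Integer[] <: Object[]
        System.out.println(isObjectArray(view));           // reference array
        System.out.println(isObjectArray(new int[] {1}));  // primitive array
        System.out.println(storeRejected(view, "oops"));   // String into an Integer[]
    }
}
```

If Q-typed arrays were made subtypes of Object[], every store through an Object[] view would presumably need a comparable runtime check against the flattened element type.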



On 11/19/2017 4:40 PM, John Rose wrote:
> We just had a 50-hour week of face-to-face meetings by the
> Valhalla VM team.  We learned a lot and surprised ourselves
> by coming to a consensus that a promising design for value
> types uses mainly the same legacy L-type descriptors, makes
> relatively little use of Q-type descriptors, and does not appear
> to need a third descriptor "kind" or "mode", such as U for
> universal, or R for reference-only.
>
> First a few highlights out of many.  Fred Parain explained to us how
> he has prototyped a thread-local analog of Java heaps to store value
> structs in a form convenient to the interpreter.  Tobias Hartmann
> and Roland Westrelin (of Red Hat) explained what the compiler
> prefers to see, which is (obviously) the scalarized components
> of each value.  The three of them have worked out detailed
> rules for calling between interpreted and compiled code.
>
> It seems to me that other implementations of the JVM (looking
> at you, IBM) will tend in similar directions, so although our
> results are strongly informed by our own prototyping, we think
> it is likely that they will apply to other, independent JVM
> implementations.  (Or are there platforms where the interpreter
> will scalarize aggressively and the optimizer will prefer to
> keep everything in structs?  Not.)
>
> Karen Kinnear and the Oracle Valhalla lead, David Simms, were
> there to make sure we solved the important problems and asked
> the hard questions.  As a special appearance, one of our spec.
> gurus, Dan Smith, was there to help us make rigorous sense out
> of our intuitions and hacks.
>
> Since we were short on language experts, we just worked in
> the mode (my personal favorite) of pretending that the JVM
> is the most important thing, and the Java language designers
> will just have to figure out how to use it.  Of course, that's an
> oversimplification; the JLS and JVMS inform each other very
> strongly, but it was freeing to temporarily take current thoughts
> about JLS extensions as a given and vary the JVM to find
> the sweet spot that would be simple to implement and supportive
> to what we think we know about the Valhalla Java of the future.
>
> We had some long conversations about carrier types: L, Q, U,
> and more, and that's what I want to write about here.  We also
> made significant progress in the design of crackable lambdas,
> template classes, and current and future versions of condy.
> We talked to Ron Pressler about kick starting Loom fibers.
> But it is L-types I want to talk about here; the above is just a
> sketch of the past week's environment.
>
> Logically speaking, we have two things we want to do, and
> that unfolds to a choice between three "worlds" of up to four
> distinct kinds: L/Q/U/R.  L is always present because it is
> a legacy model for reference types.  Q is always present
> because we know we need (at least sometimes) to make
> a syntactic distinction between flattened values and legacy
> objects.
>
> (Why not just always look inside the classfile? Because
> the verifier cannot be expected to load a class for every
> type it sees, so needs a descriptor kind character from
> time to time.)
>
> The U kind came a year or two ago when we realized
> that any-generics (and/or templates) and interfaces both
> require a disjoint-union type that is neither Q nor L, but
> can keep track of Q payloads (value instances) and L
> payloads (nullable references to object instances),
> without mixing them up.  In other words, Q-types and
> legacy L-types are parallel class-based constructs,
> and neither conveniently "sits on top" of the other; they
> need a common supertype to carry them without confusion.
>
> Before I describe the three logically possible "worlds",
> I'll add one more letter, R.  An R-type is exactly a legacy
> L-type, a nullable reference.  Why use a separate letter?
> Answer:  For the same reason we introduced the other
> kind letters, to preserve all the necessary distinctions
> among different kinds of payloads and carrier types,
> and also to talk about the explicit encoding of descriptors.
>
> There are three worlds we could design to hold both legacy
> R-types (today's L-types) and Q-types:  U-world, L-world,
> and R-world.  They might be notated respectively as U/QL,
> L/Q, and U/QR.
>
> The "U-world" is what I have been mentally preparing for
> for many months.  It is the design where L-types, marked
> as such in bytecode type descriptors, are always legacy
> object references or null, and Q-types, also marked as
> such in bytecodes, are always new value types.  To
> carry runtime payloads which may dynamically vary
> between the two modes, we need a third mode, U-types,
> which carry the two kinds of payloads (I hesitate to say
> "values" because I want to include reference values also).
>
> A U-type is a disjoint union between corresponding,
> similarly named Q-types and L-types.
>
> (Mathematically, a _disjoint union_ of C = A |_| B is no more
> and no less than the sum of all elements or points comprised
> by the two constituent sets A and B.  The disjoint union has
> nothing more: no points not in A or B.  It has nothing less:
> every point of C is from either A or B, but never both.
> If A and B somehow look like they have a non-empty
> intersection, then C is adjusted so as to keep straight
> which elements are from A and which are from B.)
>
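
The disjoint-union idea above can be sketched as a tagged pair, so that a point appearing in both A and B yields two distinct elements of C.  This is only an illustration of the mathematical notion, not a proposal for how a JVM carrier type would be represented; the names Origin, Tagged, fromA and fromB are hypothetical.

```java
import java.util.Objects;

final class DisjointUnionSketch {
    /** Records which constituent set an element came from. */
    enum Origin { A, B }

    /** One element of C = A |_| B: the original point plus its origin tag. */
    static final class Tagged {
        final Origin origin;
        final Object point;

        Tagged(Origin origin, Object point) {
            this.origin = origin;
            this.point = point;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof Tagged)) {
                return false;
            }
            Tagged that = (Tagged) other;
            // Two elements are the same only if both the tag and the point agree.
            return origin == that.origin && Objects.equals(point, that.point);
        }

        @Override
        public int hashCode() {
            return Objects.hash(origin, point);
        }
    }

    static Tagged fromA(Object point) {
        return new Tagged(Origin.A, point);
    }

    static Tagged fromB(Object point) {
        return new Tagged(Origin.B, point);
    }
}
```

With A = {1, 2} and B = {2, 3}, a set of tagged elements has four members rather than three, because the shared point 2 is kept apart by its tag.
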
> The "R-world" is a copy of the "U-world", except that the
> new world has no L-types at all, or rather they are renamed
> as R-types.  In this world, bridges would be required
> between legacy bytecodes (which use L's) and Valhalla
> bytecodes (which use R's for the same concept).
>
> We are pretty sure we don't want to live in R-world, but
> it helps to think about it, since it makes the maximum
> distinctions between legacy APIs and upgraded Valhalla
> APIs.  Any bridge from R-world to legacy code will
> presumably come after a clear decision has been made
> to allow the legacy code to see, under the name of L-types,
> the R's from the new world, plus whatever Q's are also
> allowed over the bridge to interoperate with the old code.
>
> The U-world has similar need for bridges, but less extreme.
> We know we will need some bridges to upgrade legacy
> classes like List to use U-types (List<int>, List<ComplexDouble>).
> The L-types of U-world just mix without effort into the legacy
> L-types of legacy classes, since the same letter is used.
>
> The third logical choice, and the one we are now looking
> at very seriously, is "L-world".  (Break out the "abandon
> all hope" and "Niflheim" jokes!)  In L-world, we identify
> (some would say conflate or confuse) the necessary
> U-type which unites R-types and Q-types with the legacy
> syntax "L".  The Q-type syntax is *maybe* needed, but
> in any case does not appear in a parallel position of
> importance with the dominant L-type syntax.  The R-type
> syntax seems even less important; we haven't thought
> of a use for it.  But it is in reserve, in case we need
> R-type descriptors for some corner case.
>
> The distinction between value types and object types
> is still fundamental, as is the distinction between flat
> and non-flat data.  The classfile which defines any
> given type unambiguously declares whether it is an
> object or value type.  But in L-world, the L-type
> descriptors can carry both payloads.  That's the
> key decision before us.
>
> (For brevity I'll say R-type/R-value when I mean a
> legacy nullable reference type/value, and Q-type/Q-value
> for value type/instance.  This doesn't mean that we
> will need Q's and R's in the final bytecode syntax.
> But they are useful concepts.)
>
> There are many implications from the decision to
> put L-types at the top:
>
> * The type L-Object ("Ljava/lang/Object;") carries both
> Q-values and R-values.  Thus, we don't need a
> new top-type.  (There are objectionable properties of
> L-Object which need remediation, but this was always
> true, and is not a showstopper for L-world.)
>
> * Likewise, legacy interface types like L-Comparable
> are immediately useful (without bridges) for carrying
> value instances as well as object instances (and null).
>
> * It is possible, in some cases, that standard and user-written
> collection classes will work correctly, without recompilation,
> with value types.  (This is a big claim, and valuable if true.
> Read on.)
>
> * All basic operations that the JVM applies to R-types must
> extend immediately and pervasively to Q-types, since it
> applies them to L-type values (which may be either,
> dynamically).
>
> * Today, simple movement of R-types is really cheap, just
> a machine pointer move.  That needs to be true for L-types
> in L-world, or else we will get systematic performance hits
> for legacy code, and new code will go slow too.
>
> * There are a number of object-specific operations which
> the JVM applies to L-types.  The most common is "acmp"
> (the "==" operator for references).  Those operations must
> be enhanced to do something useful with values, with a
> possible runtime cost to detect the distinction between
> an L-type carrying a Q-value and an L-type carrying a
> legacy R-value.  The performance and usable semantics
> of these object operations will make L-world either
> a programmer's paradise or a…  well you know.
>
> * There is no need for boxes, and they turn out to be
> undesirable.  Legacy types like java.lang.Integer must
> be given a golden watch and a pension, somehow.
> That's easy for the JVM but hard for the language,
> which mandates that "(Object)(int)x" produces an
> Integer rather than an "int".  It seemed a good idea
> at the time.
>
> * There is no need for a new "universal" carrier type,
> since L-types do the whole job.  Before the L-world
> discussion, my thought had been that we want a 128-bit
> U-type and a 64-bit legacy L/R-type.  Somebody burst
> my bubble this week, by saying that if we do that,
> we may find that interpreter speeds for U-type generics
> will risk a built-in performance barrier just from the
> larger standard carrier type.  If we JVM folks can agree
> that U-types should be 64 bits (by all available means)
> then it is just a simple step to rename U to L.  This is
> the rabbit hole that took our conversation down to L-world.
>
> * In L-world, the "acmp" instruction needs a very fast way
> to detect Q-values.  This *may* require a tag bit on the 64-bit
> root value.  That in turn will affect GC dynamics.  There is
> a delicate balance here—but we think there is a way through.
>
> * We probably need extra interpreter profiling to track whether
> a given L-value has ever been a Q-value or an R-value,
> dynamically.  Today we do null tracking on some instructions.
> This probably needs to be upgraded to null/Q/R tracking,
> and perhaps on additional instructions such as "acmp".
>
> * There are a number of ways to assign semantics to
> an object-like L operation when it encounters a Q-value.
> This will require additional mails, but I think we have
> identified about a half dozen models, of which one or two
> seem to be very promising:  both providing useful semantics
> and being amenable to optimization.
>
> * One residual use for Q-types is in the declaration of
> instance fields.  In order to avoid loading *all* classfiles
> of types mentioned in field declarations, a classfile which
> declares a flattened field will need to include enough
> information to allow the classfile loader to load *only* the
> types of fields marked as requiring flattening.  There are
> at least two ways to do this:  Use a Q-type descriptor
> syntax *only* for field declarations, as today.  Or,
> require the ACC_VALUE bit on field declarations which
> are supposed to be flattened.
>
> * As we were able to dispense with boxes, we may also
> dispense with non-flattened value types.  In that case,
> the translation strategy might emit an ACC_VALUE bit
> or Q-type on a field if and only if the classfile for the
> field's type defines it with ACC_VALUE.  The JVM will
> have to support non-flattened values in L-Object fields,
> of course.
>
> * If the system uses a thread-local store for value structures
> (to avoid heap traffic), a store barrier will have to quickly
> detect Q-types that are inside the thread and reallocate
> them to the heap, when they are first stored to the heap
> (e.g., as an element of an L-Object array).
>
> * The Q-type modifier *might* be useful in some settings
> to guarantee, in a verifiable way, that a given value is
> *not* an R-type, *not* null, and *not* modifiable; TBD.
>
> * The R-type modifier *might* be useful in some settings
> to guarantee, in a verifiable way, that a given value is
> *not* a Q-type, and *does* have an object identity or
> is null.  This is also TBD.
>
> * For best compatibility with legacy code, combined with
> diagnosability of anti-value algorithms like IdentityHashMap,
> the "acmp" instruction should return false unconditionally
> if either operand is a Q-value (punting to the following
> Object.equals call), and other object-like operations
> such as identityHashCode and monitorenter must throw
> errors in the JVM.  (In the language, errors and warnings
> will be appropriate.)
>
> * New operations are needed for substitutability checks
> which generalize reference equality and hashcode.
> These can be system methods, and do not need to be
> loaded onto either new or old bytecodes.
>
> * We will almost certainly need to make primitives
> retroactively values.  This means "int" all along has
> really been Q-int (in the JVM) and is a real subtype
> of L-Object.
>
> * Covariant array subtyping only works for R-types.
> So both int[] and DoubleComplex[] are *not* subtypes
> of Object[], even though int and DoubleComplex *are*
> subtypes of Object.
>
> * From some points of view (legacy code), Q-values
> are masked invaders coming into the home of code
> which expected to work only on R-values.  Changing
> L-descriptors to encompass Q-values opens such
> code to potentially risky new behaviors.  Is it safe?
> Shouldn't we just have boxes to mediate values
> in such settings?  It depends on the code, really.
>
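
The "acmp returns false, punt to equals" rule from the list above can be modeled in plain Java to see how a legacy comparison idiom would behave.  This is a rough model under stated assumptions, not the JVM change itself: the ValueLike marker is a hypothetical stand-in for a class declared with ACC_VALUE, Point is a toy value class, and the acmp method merely simulates the proposed instruction semantics.

```java
final class AcmpSketch {
    /** Hypothetical stand-in for "this class was declared with ACC_VALUE". */
    interface ValueLike {}

    /** A toy value class; equality is defined purely by its fields. */
    static final class Point implements ValueLike {
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }

        @Override
        public boolean equals(Object other) {
            return other instanceof Point
                    && ((Point) other).x == x
                    && ((Point) other).y == y;
        }

        @Override
        public int hashCode() {
            return 31 * x + y;
        }
    }

    /** Models the proposed rule: false if either operand is a Q-value, else reference identity. */
    static boolean acmp(Object a, Object b) {
        if (a instanceof ValueLike || b instanceof ValueLike) {
            return false;
        }
        return a == b;
    }

    /** The common legacy idiom: identity fast path, then Object.equals. */
    static boolean legacyEquals(Object a, Object b) {
        return acmp(a, b) || (a != null && a.equals(b));
    }
}
```

Under this model a value never compares identical even to itself, so the idiom always falls through to equals for values, while references keep their existing fast path.  Code that relies on identity alone (IdentityHashMap-style) would see every value as distinct, which is the diagnosability point made in the list.
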
> There's more, but this is enough for one message.
>
> The L-world is very attractive:  No bridges or boxes,
> legacy code is value-enabled, and we get all the
> flattening we need.
>
> We need to do some experiments:  Can we afford
> the extra Q-checks on acmp and storage to the heap?
> Will legacy algorithms really work on masked but not
> boxed values?  Do other JVM implementations experience
> similar trade-offs, or is this only a HotSpot-centric set
> of compromises?  Can we really avoid all those new
> descriptors and bridges!!??
>
> Let's talk!
>
> — John
>
> P.S. Dan, you should send out your notes on U-types.



More information about the valhalla-spec-observers mailing list