Finding the spirit of L-World

Remi Forax forax at univ-mlv.fr
Tue Jan 8 08:38:52 UTC 2019



----- Original Message -----
> From: "Brian Goetz" <brian.goetz at oracle.com>
> To: "valhalla-spec-experts" <valhalla-spec-experts at openjdk.java.net>
> Sent: Monday, January 7, 2019 18:21:26
> Subject: Finding the spirit of L-World

> I’ve been processing the discussions at the Burlington meeting.  While I think
> we made a lot of progress, I think we fell into a few wishful-thinking traps
> with regard to the object model that we are exposing to users.  What follows is
> what I think is the natural conclusion of the L-World design — which is a model
> I think users can love, but requires us to go a little farther in what the VM
> does to support it.
> 
> 
> 
> 
> # Finding the Spirit of L-world
> 
> L-World is, at heart, an attempt to unify reference objects and
> values; they're unified under a common top type (`Object`), a common
> basic type descriptor (`L`), and a common set of bytecodes (`aload` et
> al.)  The war cry for L-World should be, therefore, "Everything is an
> Object". And users will be thrilled to see such a unification --
> assuming we can live up to the high expectations that such a promise
> sets.
> 
> By unifying references and values under a common type descriptor and
> supertype, we gain significant benefits for _migration_ -- that
> migrating a reference class to a value class does not break the ways
> existing code refers to it.

No more than we get with ACC_VALUE (see below).

> 
> By unifying under a common set of bytecodes, we gain significant
> benefits for _specialization_; the method body output by the compiler
> can apply equally to reference and value parameterizations, and all
> specialization can be applied on the constant pool only.
> 
> If our war cry is "Everything is an Object", we need to ask ourselves
> what behaviors users should reasonably expect of all objects -- and
> ensure that values and references alike conform to those behaviors.


yes

> 
> ## Object model
> 
> In Q-world, we struggled with the fact that there was no true top
> type, but most code was written as if `Object` were the top type. This
> was trying to square a circle; the options for introducing a new  top
> type in Q-world were not good (an `Any` superclass provided the
> desired unification but a woefully confusing cost model; an
> `Objectible` interface shared between `Object` and values would set
> off a snail stampede to migrate libraries to use `Objectible` as the
> new fake top), but having multiple roots would have further
> exacerbated the pain of the existing bipartite type system.
> 
> L-world offers us an out; it makes `Object` a true top type (save for
> primitives -- but see "Poxing", below), so existing code that deals
> with `Object` can immediately accept values (save for totality -- but
> see "Totality", below) without requiring disruptive migration.
> 
> A sensible rationalization of the object model for L-World would be to
> have special subclasses of `Object` for references and values:
> 
> ```
> class Object { ... }
> class RefObject extends Object { ... }
> class ValObject extends Object { ... }
> ```
> 
> We would enforce that `RefObject` is only extended by classes that do
> not have the `ACC_VALUE` bit, that `ValObject` is only extended by
> classes that do have the `ACC_VALUE` bit, and that classes that claim
> to extend `Object` are implicitly reparented according to their
> `ACC_VALUE` bit.  (Actually, in this scheme, we can ditch the
> `ACC_VALUE` bit entirely; at load time, we just look at the
> superclass, and if it's `ValObject`, it's a value class, otherwise
> it's a reference class.)


Introducing ValObject as a superclass and having the ACC_VALUE bit are two representations of the same thing for the VM.
In both cases, we need to load the class to know whether it's a value type or not, and in Java class loading is delayed, which works well for reference types but not for value types: if you have a class with a field whose type is a value type, you need to know that the field is a value type in order to flatten it. With your proposal, the VM doesn't know whether a field contains a value type until it's too late. Or are you suggesting shape-shifting objects (Skrull objects)?

> 
> Bringing ref-ness and val-ness into the type system in this way has
> many benefits:
> 
> - It reinforces the user's understanding of the relationship between
>   values and references.
> - It allows us to declare methods or fields that accept any object,
>   reference objects only, or value objects only, using existing
>   concepts.
> - It provides a place to declare ref-specific or val-specific methods,
>   and ref-specific or value-specific implementations of `Object`
>   methods.  (For example, we could implement `Object::wait` as a final
>   throwing method in `ValObject`, if that's the behavior we want).
> - It allows us to express ref-ness or val-ness as generic type
>   bounds, as in `<T extends RefObject>`.
> 
> We can pull the same move with nullability, by declaring an interface
> `Nullable`:
> 
> ```
> interface Nullable { }
> ```
> 
> which is implemented by `RefObject`, and, if we support value classes
> being declared as nullable, would be implemented by those value
> classes as well.  Again, this allows us to use `Nullable` as a
> parameter type or field type, or as a type bound (`<T extends
> Nullable>`).
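
To make the shape of this concrete, here is a rough sketch of how such declarations could be used at the source level. `RefObject`, `ValObject` and `Nullable` are the proposed types; the stub classes and method names below are made up purely for illustration and only stand in for the real thing.

```
// Stand-ins for the proposed java.lang types, just so the sketch compiles on its own.
interface Nullable { }
abstract class RefObject implements Nullable { }   // every reference class would extend this
abstract class ValObject { }                       // every value class would extend this

class Declarations {
    void acceptAnything(Object o)          { }     // refs and values alike
    void acceptReferencesOnly(RefObject r) { }     // identity, locking, nullability available
    void acceptValuesOnly(ValObject v)     { }     // guaranteed identity-free

    // ref-ness and nullability expressed as generic type bounds
    static <T extends RefObject> T requireRef(T t) { return t; }
    static <T extends Nullable>  T absent()        { return null; }  // null only for Nullable-bounded T
}
```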

Same issue here: you want to know whether something is nullable when you verify the bytecode, but at that point the class may not be loaded yet, so you don't know whether it implements Nullable.

> 
> ## Totality
> 
> The biggest pain point in the LW1 model is that we're saying that
> everything is an `Object`, but we've had to distort the rules of
> `Object` operations in ways that users might find confusing.  LW1 says
> that equality comparison, identity hash code, locking, and
> `Object::wait` are effectively partial, but existing code that deals
> in `Object` may be surprised to find this out.  Additionally, arrays
> of reference objects are covariant with `Object`, but arrays of value
> objects are currently not.

For the last point, we decided for now to try not to allow an array of values to be seen as an array of Object, but I think that's a mistake because, as you said, it makes the code irregular.

> 
> #### Equality
> 
> The biggest and most important challenge is assigning sensible total
> semantics to equality on `Object`; the LW1 equality semantics are
> sound, but not intuitive.  There's no way we can explain why for
> values, you don't get `v == v` in a way that people will say "oh, that
> makes sense."  If everything is an object, `==` should be a reasonable
> equality relation on objects.  This leads us to a somewhat painful
> shift in the semantics of equality, but once we accept that pain, I
> think things look a lot better.
> 
> Users will expect (100% reasonably) the following to work:
> 
> ```
> Point p1, p2;
> 
> p1 == p1  // true
> 
> p2 = p1
> p1 == p2  // true
> 
> Object o1 = p1, o2 = p2;
> 
> o1 == o1  // true
> o1 == o2  // true
> ```
> 
> In LW1, if we map `==` to `ACMP`, they do not, and this will violate
> both user intuition and the spirit of "everything is an object".  (If
> everything is an object, then when we assign `o1 = p1`, this is just a
> widening conversion, not a boxing conversion -- it's the same
> underlying object, just with a new static type, so it should behave
> the same.)
> 
> The crux of the matter is that interfaces, and `Object` (which for
> purposes of this document should be considered an honorary interface)
> can hold either a reference or a value, but we've not yet upgraded our
> notion of interfaces to reflect this kind-polymorphism.  This is what
> we have to put on a sounder footing in order to not have users fall
> into the chasm of anomalies.  To start with:
> 
>  - A class is either a ref class or a value class.
>  - `C implements I` means that instances of `C` are instances of `I`.
>  - Interfaces are polymorphic over value and ref classes.
> 
> Now we need to define equality.  The terminology is messy, as so many
> of the terms we might want to use (object, value, instance) already
> have associations. For now, we'll describe a _substitutability_
> predicate on two instances:
> 
>  - Two refs are substitutable if they refer to the same object
>    identity.
>  - Two primitives are substitutable if they are `==` (modulo special
>    pleading for `NaN` -- see `Float::equals` and `Double::equals`).
>  - Two values `a` and `b` are substitutable if they are of the same
>    type, and for each of the fields `f` of that type, `a.f` and `b.f`
>    are substitutable.
> 
> We then say that for any two objects, `a == b` iff a and b are
> substitutable.
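
To make the predicate concrete, here is a rough sketch of the described semantics written as ordinary Java over reflection. It is only meant to illustrate what the new `acmp` would compute, not how a VM would implement it; `isValueClass` is a hypothetical stand-in for the value-ness test.

```
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

class Substitutability {
    // Hypothetical stand-in; the real answer would come from the class's value-ness.
    static boolean isValueClass(Class<?> c) { return false; }

    static boolean isSubstitutable(Object a, Object b) throws IllegalAccessException {
        if (a == null || b == null) return a == b;        // null is only substitutable with null
        if (!isValueClass(a.getClass())) return a == b;   // two refs: same identity
        if (a.getClass() != b.getClass()) return false;   // two values: must be the same type...
        for (Field f : a.getClass().getDeclaredFields()) {
            if (Modifier.isStatic(f.getModifiers())) continue;
            f.setAccessible(true);
            Object fa = f.get(a), fb = f.get(b);
            // Primitive fields are boxed by Field::get; wrapper equals() gives the
            // Float/Double "special pleading" for NaN.  Otherwise recurse: a field whose
            // static type is an interface may itself hold either a ref or a value.
            boolean fieldsMatch = f.getType().isPrimitive() ? fa.equals(fb)
                                                            : isSubstitutable(fa, fb);
            if (!fieldsMatch) return false;               // ...and field-wise substitutable
        }
        return true;
    }
}
```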
> 
> This is an "everything is an object" story that users can love!
> Everything is an object, equality is total and intuitive on objects,
> interfaces play nicely -- and there are no pesky boxes (except for
> primitives, but see below.)  The new concept here is that interfaces
> abstract over refs and values, and therefore operations that we want
> to be total on interfaces -- like equality -- have to take this seam
> into account.
> 
> The costs come in two lumps.  The first is that if we're comparing two
> objects, we first have to determine whether they are refs or values,
> and do something different for each.  We already paid this cost in
> LW1,  but here comes the bigger cost: if a value class has fields
> whose static types are interfaces, the comparison may have to recur on
> substitutability. This is horrifying for a VM engineer, but for users,
> this is just a day at the office -- `equals` comparisons routinely
> recur.  (For values known to (recursively) have no interface fields
> and no floating point fields, the VM can optimize comparison to a flat
> bitwise comparison.)
> 
> This model eliminates the equality anomalies, and provides users with
> an intuitive and sound basis for "same instance".


Currently, you cannot use == on value types, i.e. point1 == point2 doesn't compile; if you want a unified equality, you have to use equals.
Users are already used to this weird semantics because it is the semantics of the wrapper objects, java.lang.Integer, etc.
We don't have to retrofit == for the LIFE pattern because even if == returns false, there is an equals call just after the == test.

I still prefer to stick with the simple concept that == tests whether the addresses are equal (and is false for a non-nullable value type) rather than trying to elevate == to a kind of equals-but-not-exactly-equals, because it introduces another kind of "equality" (substitutability) which, as you said, is an extension of primitive equality, but to me it's just yet another new semantics.

> 
> One might ask whether we really need to push this into `acmp`, or
> whether we can leave `acmp` alone and provide a new API point for
> substitutability, and have the compiler generate invocations of that.
> While the latter is OK for new code, doing so would cause old code to
> behave differently than new when operating on values (or interfaces
> that may hold values), and may cause it to change its behavior on
> recompile.  If we're changing what `Object` means, and what `aload`
> can operate on, we should update `acmp` accordingly.
> 
> #### `==` and `equals()`
> 
> Code that knows what type it is dealing with generally uses either
> `==` or `equals()`, but not both; generic code (such as `HashMap`)
> generally uses the idiom `a == b || a.equals(b)`.  Such code _could_
> fall back to just using `equals()`; this idiom arose as an
> optimization to avoid the virtual method invocation, but the first
> part can be dropped with no semantic loss.
> 
> As the cost of `==` gets higher, this optimization (as optimizations
> often do!) may begin to bite back; the `equals()` implementation often
> includes an `==` check as well.  There are lots of things we can do
> here, but it is probably best to wait to see what the actual
> performance impact is before doing anything.
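
For reference, a small sketch of the idiom being discussed, roughly as generic code like `HashMap` writes it, next to the plain `equals()` form it could fall back to (null handling elided, as in the text; the method names are made up):

```
class KeyComparison {
    static <K> boolean keysMatch(K a, K b) {
        return a == b || a.equals(b);   // identity fast path, then semantic equality
    }

    static <K> boolean keysMatchWithoutAcmp(K a, K b) {
        return a.equals(b);             // equals() alone; the == part was only an optimization
    }
}
```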
> 
> #### Identity hash code
> 
> Because values have no identity, in LW1 `System::identityHashCode`
> throws `UnsupportedOperationException`.  However, this is
> unnecessarily harsh; for values, `identityHashCode` could simply
> return `hashCode`.  This would enable classes like `IdentityHashMap`
> (used by serialization frameworks) to accept values without
> modification, with reasonable semantics -- two objects would be deemed
> the same if they are `==`.  (For serialization, this means that equal
> values would be interned in the stream, which is probably what is
> wanted.)
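
A minimal sketch of the proposed totalization, assuming a hypothetical `isValueClass` test; this is not what `System::identityHashCode` does today.

```
class IdentityHashes {
    static boolean isValueClass(Class<?> c) { return false; }  // hypothetical stand-in

    static int proposedIdentityHashCode(Object o) {
        return isValueClass(o.getClass())
                ? o.hashCode()                    // values: delegate to the value's own hash
                : System.identityHashCode(o);     // refs: unchanged identity hash
    }
}
```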

Those are not reasonable semantics: if you serialize to JSON, for example, you don't want to share value types as objects, you want to flatten them.
Serialization protocols will need to be updated to work with value types, so throwing UnsupportedOperationException tells users that they have to use an updated version of the serialization library they are using.

> 
> #### Locking
> 
> Locking is a difficult one.  On the one hand, it's bad form to lock on
> an object that hasn't explicitly invited you to participate in its
> locking protocol.  On the other hand, there is likely code out there
> that does things like lock on client objects, which might expect at
> least exclusion with other code that locks the same object, and a
> _happens-before_ edge between the release and the acquire.  Having
> locking all of a sudden throw `IllegalMonitorStateException` would
> break such code; while we may secretly root for such code to be
> broken, the reality is that such code is likely at the heart of large
> legacy systems that are difficult to modify.  So we may well be forced
> into totalizing locking in some way.  (Totalizing locking also means
> totalizing the `Object` methods related to locking, `wait`, `notify`,
> and `notifyAll`.)
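
The kind of legacy code being worried about here looks roughly like the sketch below (a hypothetical class, for illustration only): it locks on an object supplied by a caller and relies on exclusion plus the happens-before edge, so handing it a value would make `monitorenter` throw under the LW1 rules.

```
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class LegacyCache {
    private final Map<Object, Object> map = new HashMap<>();

    Object computeIfAbsent(Object clientKey, Function<Object, Object> fn) {
        synchronized (clientKey) {        // locking on a client-supplied object
            return map.computeIfAbsent(clientKey, fn);
        }
    }
}
```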

Same issue as above: whatever clever emulation you try to come up with, you have less knowledge than the maintainers of the code, so it's better to ask them to fix the issue than to ship a trick that will never fully work. In terms of compatibility, it's far easier to say "my library doesn't work with value types" than "my library maybe works with value types".

> 
> There are a spectrum of interpretations for totalizing locking, each
> with different tradeoffs:
> 
> - Treat locking on a value as an entirely local operation, providing
>   no exclusion and no happens-before edge.  Existing code will
>   continue to run when provided with values, but may produce
>   unexpected results.
> - Alternately, treat locking on a value as providing no exclusion,
>   but with acquire and release semantics.  Wait and notify would
>   still throw.
> - Treat locking on a value as acquiring a fat lock (say, a global
>   value lock, a per-type value lock, etc.)  This gives us exclusion
>   and visibility, with a small risk of deadlock in situations where
>   multiple such locks are held, and a sensible semantics for wait
>   and notify (single notify would have to be promoted to `notifyAll`).
> - Treat locking on a value as acquiring a proxy lock which is
>   inflated by the runtime, which assigns a unique lock to each
>   distinguishable value.
> - Put lock-related methods on `ValObject`, whose defaults do one of
>   the above, and allow implementations to override them.
> 
> While nearly all of these options are horrifying, the goal here is
> not to do something _good_, but merely to do something _good enough_
> to avoid crushing legacy code.

It's engineering 101: there are two kinds of "good enough", the one in an application, which you want because you control the data, and the one in a library, which you don't want because you don't control the data.

> 
> #### Array covariance
> 
> Currently, for any class `C`, `C[] <: Object[]`.  This makes
> `Object[]` the "top array type".  If everything is an object, then an
> array of anything should also be an array of `Object`.
> 
> There are two paths to delivering on this vision: extend traditional
> array covariance to value arrays (potentially making `aaload` sites
> megamorphic), or move in the direction of "Arrays 2.0" and define a
> specializable generic type `Array<T>` where the legacy arrays
> implement  `Array<T>`, and require clients to migrate from `T[]` to
> `Array<T>` before specializing their generic classes.
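
For illustration, the use-site consequence of extending covariance to value arrays, assuming a hypothetical value class `Point` (a plain class stands in for it below): a `Point[]` could flow into `Object[]`-typed code, with the usual covariance caveat that a bad store only fails at runtime.

```
class Point {                              // stand-in for a hypothetical value class
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
}

class ArrayCovariance {
    static Object firstElement(Object[] array) {
        return array[0];                   // aaload on a possibly flattened value array
    }

    public static void main(String[] args) {
        Point[] points = { new Point(1, 2), new Point(3, 4) };
        Object first = firstElement(points);   // Point[] <: Object[] under the proposal
        Object[] asObjects = points;
        asObjects[0] = "not a point";          // the usual covariance hole: ArrayStoreException
    }
}
```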
> 
> ## Poxing
> 
> The Model 3 specializer focused on specializing generics over
> primitives, not values (because we hadn't implemented values yet).
> Many of the complexities we ran into in that exploration stemmed from
> the accidental asymmetries between primitives and objects, including
> irregularities in the bytecode set (single vs double slot, `if_icmpeq`
> vs `dcmp` + `if`).  Having unified references and values, it would be
> really nice to unify primitives as well.
> 
> While we can't exactly do that easily, because of the intrusion into
> the bytecode set, we may be able to come close, using a modified
> boxing conversion.  The problem with the existing boxing conversion is
> that `Integer` is a heavy box with identity -- which means boxing is
> expensive.  There are two possible paths by which we could mitigate
> this pain:
> 
> - Migrate `Integer` to be a value type;
> - Create an alternate box for `int`, which is a value class (`ValInt`)
> 
> If we can box primitives to values, then we need not unify primitives
> with objects -- we just insert boxing conversions in the places we
> already do, and interpret specializations like `List<int>` to mean
> "List of int's box".
> 
> Migrating `Integer` to be a value may seem the obvious move, but it is
> fraught with compatibility constraints -- there is tons of legacy code
> that does things like locking on `Integer` or depending on its
> strange accidental identity.  Perhaps if we could totalize locking and
> remove the public box constructors, we could get there -- but this
> is not a slam-dunk.

<troll>
no good enough solution here?
</troll>

> 
> The alternative is creating a value box for primitives (a "pox") and
> adjusting the compiler's boxing behavior (when boxing to `Object` or an
> interface, prefer the pox to the box).  This too has some
> compatibility concerns, such as code that deals in `Object` that
> assumes that primitives are always boxed to legacy boxes.  We may be
> able to finesse this by a trick -- to teach `instanceof` and
> `checkcast` of the relationship between boxes and poxes, so that code
> like:
> 
> ```
> if (o instanceof Integer) {
>    Integer i = (Integer) o;
>    // use i
> }
> ```
> 
> would work on both `Integer` and `int`'s pox (by saying "yes" in
> `instanceof` and doing the conversion in `checkcast`.)  This move,
> while somewhat risky, could allow us to relegate the legacy boxes to
> legacy, and eventually deprecate them.   (We could then have methods
> and interfaces on the poxes, and lift them to the primitives via
> poxing, so that `int` could be seen to implement `Comparable<int>` and
> you could call `compareTo()` on ints.)  While this would not be a true
> unification, it would come much closer than we are now.
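
What the trick is meant to preserve, sketched with today's code: an `Object`-consuming method keeps working whether it is handed a legacy `Integer` box or the hypothetical `int` pox, because `instanceof` would answer yes and `checkcast` would perform the conversion. The class and method names are made up.

```
class PoxFriendlyCode {
    static int twiceIfInt(Object o) {
        if (o instanceof Integer) {   // would also answer "yes" for int's pox
            int i = (Integer) o;      // checkcast would convert the pox to an Integer (then unbox)
            return i * 2;
        }
        return 0;
    }
}
```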
> 
> Clearly, both alternatives are risky and require more investigation
> -- but both have promising payoffs.


It's something I've been contemplating for some time now: make int.box a generic specialization of java.lang.Integer, so the instanceof/cast trick will work.

 
> 
> ## Migration
> 
> In both Q-world and L-world, we took care to ensure that for a value
> class `C`, the descriptor `LC;` describes a subtype of `Object`.  This
> is a key part of the story for migrating reference types to values,
> since clients of `C` will describe it with `LC;` and we don't want to
> require a flag day on migration.  In Q-world, `LC;` is the (nullable)
> box for `C`; in L-world, it is a nullable `C`.
> 
> This is enough that we can migrate a value-based class to a value and
> _existing binary clients_ will not break, even if they stuff a null
> into an `LC;`.  However, there are other migration compatibility
> concerns which we need to take up (which I'll do in a separate
> document.)
> 
> ## Generics
> 
> In Q-world, because values and references were so different,
> specializable generic classes had to be compiled with additional
> constraints.  For a specializable type variable `T`, we enforced:
> 
> - Cannot compare a `T` to `null`
> - Cannot assign `null` to a `T`
> - Cannot assign a `T` to `Object`
> - Cannot assign a `T[]` to `Object[]`
> - Cannot lock on a `T`
> - Cannot `==` on a `T`
> 
> In L-world, the need for most of these can go away.  Because
> everything is an object, we can assign values to `Object`, and
> `acmp_null` should work on all objects, so comparing with `null` is
> OK.  If we have array covariance, the array assignment restriction
> goes away.  If we totalize locking and equality, those restrictions go
> away.  The only restriction that remains is the assignment to `null`.
> But now the VM can express the difference between nullable values  and
> non-nullable values, and we can express this in the source type system
> with `Nullable`.  So all the Q-world restrictions go away, and they
> are replaced by an indication that a given type variable (or perhaps
> an entire generic class) is erased or reifiable, and we treat erased
> type variables as if they have an implicit `Nullable` bound.  Then the
> compile-time null-assignment restriction reduces to "does `T` have a
> `Nullable` bound", and the restriction against instantiating an erased
> generic class with a Q-type reduces to a simple bounds violation.
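
A sketch of where this lands, using only today's syntax: an erased type variable carries a (here explicit) `Nullable` bound and may be assigned null, while a reified/specializable type variable has no such bound, so the null assignment is rejected and instantiating it with a non-nullable value class is just an ordinary bounds check. `Nullable` and the erased/reified split are the proposed concepts, not existing Java.

```
interface Nullable { }                     // the proposed marker, as above

class ErasedBox<T extends Nullable> {      // an erased type variable: implicitly Nullable-bounded
    T element = null;                      // allowed, T is known to be nullable
}

class SpecializableBox<T> {                // imagine T may be instantiated with a value class
    T element;                             // fine: default value, not null
    // T other = null;                     // would be rejected: T has no Nullable bound
}
```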
> 
> (There's lots more to cover on generics -- again, separate document.)

Rémi

