Updated SoV, take 3

Wed Jul 27 20:22:09 UTC 2022

I got through half of it, maybe more, so far.

Several of my suggestions are of a similar form, "I would also point out X
here and now", because in those places I suspect a nontrivial number of
readers may have "but wait a minute" reactions that will be distracting.

Of course, I am happy if this is the end of "primitive classes". :-)

On Tue, Jul 26, 2022 at 11:18 AM Brian Goetz <brian.goetz at oracle.com> wrote:

> Yet another attempt at updating SoV to reflect the current thinking.
> Please review.
>
>
> # State of Valhalla
> ## Part 2: The Language Model {.subtitle}
>
> #### Brian Goetz {.author}
> #### July 2022 {.date}
>
> > _This is the second of three documents describing the current State of
>   Valhalla.  The first is [The Road to Valhalla](01-background); the
>   third is [The JVM Model](03-vm-model)._
>
> This document describes the directions for the Java _language_ charted by
> Project Valhalla.  (In this document, we use "currently" to describe the
> language as it stands today, without value classes.)
>
> Valhalla started with the goal of providing user-programmable classes
> which can
> be flat and dense in memory.  Numerics are one of the motivating use cases;
> adding new primitive types directly to the language has a very high
> barrier.  As
> we learned from [Growing a Language][growing] there are infinitely many
> numeric
> types we might want to add to Java, but the proper way to do that is via
> libraries, not as a language feature.
>
> ## Primitives and objects today
>
> Java currently has eight built-in primitive types.  Primitives represent pure
> _values_; any `int` value of "3" is equivalent to, and indistinguishable from,
> any other `int` value of "3".  Because primitives are "just their bits" with no
> ancillary state such as object identity, they are _freely copyable_; whether
> there is one copy of the `int` value "3", or millions, doesn't matter to the
> execution of the program.  With the exception of the unusual treatment of exotic
> floating point values such as `NaN`, the `==` operator on primitives performs a
> _substitutability test_ -- it asks "are these two values the same value".
>

I've said this before, but I think both "substitutability" and "sameness"
just lead to more questions, and I'm not sure why we don't appeal to
distinguishability instead.
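
To make the distinguishability framing concrete, here is a small sketch in current Java (no Valhalla features). The helper name `sameValue` is mine, not something from the draft; it compares via `Double.doubleToLongBits`, which is the comparison `Double::equals` is specified to use. It shows the two places where `==` on `double` disagrees with "can the program tell these apart":

```java
final class Distinguishability {
    /** Compares two doubles the way Double::equals does: by their long bit patterns. */
    static boolean sameValue(double a, double b) {
        return Double.doubleToLongBits(a) == Double.doubleToLongBits(b);
    }

    public static void main(String[] args) {
        double nan = Double.NaN;
        System.out.println(nan == nan);              // false: == treats NaN as unequal to itself
        System.out.println(sameValue(nan, nan));     // true:  nothing distinguishes the two operands
        System.out.println(0.0 == -0.0);             // true:  == treats the two zeros as equal
        System.out.println(sameValue(0.0, -0.0));    // false: the zeros have different bit patterns
        System.out.println(1 / 0.0 == 1 / -0.0);     // false: the zeros are observably different
    }
}
```

For every other primitive value, `==` and "indistinguishable" coincide, which is why I think the latter is the simpler thing to appeal to.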

> Java also has _objects_, and each object has a unique _object identity_.
> This
> means that each object must live in exactly one place (at any given time),
> and
> this has consequences for how the JVM lays out objects in memory.  Objects
> in
> Java are not manipulated or accessed directly, but instead through _object
> references_.  Object references are also a kind of value -- they encode the
> identity of the object to which they refer,
>

Do we really want to invoke identity here? That surprises me. That suggests
that a `ValueClass.ref` instance will have identity too.
Isn't it really only about the object being addressable or locatable (some
term like that)?

> and the `==` operator on object
> references also performs a substitutability test, asking "do these two
> references refer to the same object."  Accordingly, object _references_ (like
> other values) can be freely copied, but the objects they refer to cannot.
>
> This dichotomy -- that the universe of values consists of primitives and
> object
> references -- has long been at the core of Java's design.  JVMS 2.2 (Data
> Types)
> opens with:
>
> > There are two kinds of values that can be stored in variables, passed as
> > arguments, returned by methods, and operated upon: primitive values and
> > reference values.
>
> Primitives and objects currently differ in almost every conceivable way:
>
> | Primitives                                 | Objects                            |
> | ------------------------------------------ | ---------------------------------- |
> | No identity (pure values)                  | Identity                           |
> | `==` compares values                       | `==` compares object identity      |
> | Built-in                                   | Declared in classes                |
> | No members (fields, methods, constructors) | Members (including mutable fields) |
> | No supertypes or subtypes                  | Class and interface inheritance    |
> | Accessed directly                          | Accessed via object references     |
> | Not nullable                               | Nullable                           |
> | Default value is zero                      | Default value is null              |
> | Arrays are monomorphic                     | Arrays are covariant               |
> | May tear under race                        | Initialization safety guarantees   |
> | Have reference companions (boxes)          | Don't need reference companions    |
>
> Primitives embody a number of tradeoffs aimed at maximizing the performance and
> usability of the primitive types.  Reference types default to `null`, meaning
> "referring to no object", and must be initialized before use; primitives default
> to a usable zero value (which for most primitives is the additive identity) and
> therefore may be used without initialization.  (If primitives were nullable like
> references, not only would this be less convenient in many situations, but they
> would likely consume additional memory footprint to accommodate the possibility
> of nullity, as most primitives already use all their bit patterns.)  Similarly,
> reference types provide initialization safety guarantees for final fields even
> under a certain category of data races (this is where we get the "immutable
> objects are always thread-safe" rule from); primitives allow tearing under race
> for larger-than-32-bit values.  We could characterize the design principles
> behind these tradeoffs as "make objects safer, make primitives faster."
>
> The following figure illustrates the current universe of Java's types.  The
> upper left quadrant is the built-in primitives; the rest of the space is
> reference types.  In the upper-right, we have the abstract reference types
> --
> abstract classes, interfaces, and `Object` (which, though concrete, acts
> more
> like an interface than a concrete class).  The built-in primitives have
> wrappers
> or boxes, which are reference types.
>
> <figure>
>   <a href="field-type-zoo.pdf" title="Click for PDF">
>     <img src="field-type-zoo-old.png" alt="Current universe of Java field
> types"/>
>   </a>
> </figure>
>
> Valhalla aims to unify primitives and objects such that both are declared
> with
> classes, but maintains the special runtime characteristics -- flatness and
> density -- that primitives currently enjoy.
>
> ### Primitives and boxes today
>
> The built-in primitives are best understood as _pairs_ of types: the primitive
> type (`int`) and its reference companion type (`Integer`), with built-in
> conversions between the two.  The two types have different characteristics that
> make each more or less appropriate for a given situation.  Primitives are
> optimized for efficient storage and access: they are monomorphic, not nullable,
> tolerate uninitialized (zero) values, and larger primitive types (`long`,
> `double`) may tear under racy access.  The box types add back the affordances of
> references -- nullity, polymorphism, interoperation with generics, and
> initialization safety -- but at a cost.
>
> Valhalla generalizes this primitive-box relationship, in a way that is more
> regular and extensible and reduces the "boxing tax".
>
> ## Eliminating unwanted object identity
>
> Many impediments to optimization stem from _unwanted object identity_. For
> many
> classes, not only is identity not directly useful, it can be a source of
> bugs.
> For example, due to caching, `Integer` can be accidentally compared
> correctly
> with `==` just often enough that people keep doing it.  Similarly,
> [value-based
> classes][valuebased] such as `Optional` have no need for identity, but pay
> the
> costs of having identity anyway.
>
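
For readers who haven't been bitten by the `Integer` case, a short sketch in current Java may help here. The class and method names are mine. `Integer.valueOf` is documented to cache values from -128 to 127, so identity comparison of boxes "works" in that range; outside it, the result depends on the implementation and its settings, which is why the sketch does not state an expected result for the large value:

```java
final class BoxIdentity {
    /** Identity comparison of two boxes obtained for the same int value. */
    static boolean sameBox(int value) {
        Integer a = Integer.valueOf(value);
        Integer b = Integer.valueOf(value);
        return a == b;   // compares object identity, not the numeric value
    }

    /** State comparison of two boxes obtained for the same int value. */
    static boolean equalBoxes(int value) {
        return Integer.valueOf(value).equals(Integer.valueOf(value));
    }

    public static void main(String[] args) {
        System.out.println(sameBox(127));         // true: within the guaranteed cache range
        System.out.println(sameBox(100_000));     // implementation-dependent; commonly false
        System.out.println(equalBoxes(100_000));  // true: equals compares the wrapped value
    }
}
```
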
> Valhalla allows classes to explicitly disavow identity by declaring them as
> _value classes_.  The instances of a value class are called _value
> objects_.
>
> ```
> value class Point implements Serializable {
>     int x;
>     int y;
>
>     Point(int x, int y) {
>         this.x = x;
>         this.y = y;
>     }
>
>     Point scale(int s) {
>         return new Point(s*x, s*y);
>     }
> }
> ```
>
> This says that a `Point` is a class whose instances have no identity.  As a
> consequence, it must give up the things that depend on identity; the class and
> its fields are implicitly final.  Additionally, operations that depended on
> identity must either be adjusted (`==` on value objects compares state, not
> identity) or disallowed (it is illegal to lock on a value object).
>

Just for broad understandability, you might want to address here "but then
how could a reference 'identify' what object it's pointing to?"

> Value classes can still have most of the affordances of classes -- fields,
> methods, constructors, type parameters, superclasses (with some
> restrictions),
> nested classes, class literals, interfaces, etc.  The classes they can
> extend
> are restricted: `Object` or abstract classes with no instance fields, empty
> no-arg constructor bodies, no other constructors, no instance
> initializers, no
> synchronized methods, and whose superclasses all meet this same set of
> conditions.  (`Number` is an example of such an abstract class.)
>
> Because `Point` has value semantics, `==` compares by state rather than
> identity.  This means that value objects, like primitives, are _freely
> copyable_; we can explode them into their fields and re-aggregate them into
> another value object, and we cannot tell the difference.
>

It feels like if this wants to rest some stuff on "comparing by state" it
ought to explain here what that means? Or, I guess at least a forward
reference.
It seems pretty important to understand that it means shallow fieldwise
delegation back to `==` again, meaning that fields of identity types are
still identity-compared.
In many contexts "value semantics" and "comparing by state" tend to only
make sense if done recursively/deeply.
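
A sketch of what I mean, emulated in current Java with an ordinary final class (so these are still identity objects; `Tagged` and `sameState` are hypothetical names of mine, and `sameState` only stands in for what `==` on a value object would do under the rules given later in the Equality section):

```java
final class ShallowEquality {
    /** Stand-in for a value class with one primitive field and one identity-typed field. */
    static final class Tagged {
        final int id;
        final int[] data;   // arrays are identity objects

        Tagged(int id, int[] data) {
            this.id = id;
            this.data = data;
        }
    }

    /** Emulates fieldwise comparison: every field is compared with ==, never with equals. */
    static boolean sameState(Tagged a, Tagged b) {
        return a.id == b.id && a.data == b.data;
    }

    public static void main(String[] args) {
        int[] shared = {1, 2, 3};
        // Same primitive, same array reference: the states match.
        System.out.println(sameState(new Tagged(7, shared), new Tagged(7, shared)));
        // Same primitive, equal contents but distinct arrays: the states do not match.
        System.out.println(sameState(new Tagged(7, new int[] {1, 2, 3}),
                                     new Tagged(7, new int[] {1, 2, 3})));
    }
}
```

The second comparison is the one I expect to surprise people who hear "value semantics" and assume a deep comparison.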

> So far we've addressed the first two lines in our table of differences;
> rather
> than all objects having identity, classes can opt into, or out of, object
> identity for their instances.  By allowing classes to exclude unwanted
> identity,
> we free the runtime to make better layout and compilation decisions.
>
> ### Example: immutable cursors
>
> Collections today use `Iterator`, which stores iteration state in mutable
> fields, to facilitate traversal through the collection.  While heroic
> optimizations such as _escape analysis_ can sometimes eliminate the cost
> associated with iterators, such optimizations are fragile and hard to rely on.
> Value objects offer an iteration approach that is more reliably optimized:
> immutable cursors.  (Without value objects, immutable cursors would be
> prohibitively expensive for iteration.)
>
> ```
> value class ArrayCursor<T> {
>     T[] array;
>     int offset;
>
>     public ArrayCursor(T[] array, int offset) {
>         this.array = array;
>         this.offset = offset;
>     }
>
>     public ArrayCursor(T[] array) {
>         this(array, 0);
>     }
>
>     public boolean hasNext() {
>         return offset < array.length;
>     }
>
>     public T next() {
>         return array[offset];
>     }
>
>     public ArrayCursor<T> advance() {
>         return new ArrayCursor<>(array, offset+1);
>     }
> }
> ```
>
> In looking at this code, we might mistakenly assume it will be
> inefficient, as
> each loop iteration appears to allocate a new cursor:
>
> ```
> for (ArrayCursor<T> c = new ArrayCursor<>(array);
>      c.hasNext();
>      c = c.advance()) {
>     // use c.next();
> }
> ```
>
> In reality, we should expect that _no_ cursors are actually allocated
> here.  An
> `ArrayCursor` is just its two fields, and the runtime is free to scalarize
> the
> object into its fields and hoist them into registers.  The calling
> convention
> for `advance` is optimized so that both receiver and return value are
> scalarized.  Even without inlining `advance`, no allocation will take
> place,
> just some shuffling of the values in registers.  And if `advance` is
> inlined,
> the client code will compile down to having a single register increment and
> compare in the loop header.
>
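
In case it helps readers try the idiom before value classes exist, here is the same cursor sketched in current Java as an ordinary final class with final fields. The names `CursorSketch` and `join` are mine. Because this is still an identity class, each `advance()` nominally allocates, and whether that allocation is removed is up to escape analysis; the no-allocation claim above applies only to the value-class version.

```java
final class CursorSketch {
    /** Immutable cursor over an array; advance() returns a new cursor instead of mutating. */
    static final class ArrayCursor<T> {
        private final T[] array;
        private final int offset;

        ArrayCursor(T[] array) {
            this(array, 0);
        }

        ArrayCursor(T[] array, int offset) {
            this.array = array;
            this.offset = offset;
        }

        boolean hasNext() {
            return offset < array.length;
        }

        /** Returns the element at the current position without moving the cursor. */
        T next() {
            return array[offset];
        }

        ArrayCursor<T> advance() {
            return new ArrayCursor<>(array, offset + 1);
        }
    }

    /** Concatenates the string forms of all elements, traversing with the cursor. */
    static <T> String join(T[] items) {
        StringBuilder sb = new StringBuilder();
        for (ArrayCursor<T> c = new ArrayCursor<>(items); c.hasNext(); c = c.advance()) {
            sb.append(c.next());
        }
        return sb.toString();
    }
}
```
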
> ### Migration
>
> The JDK (as well as other libraries) has many [value-based
> classes][valuebased]
> such as `Optional` and `LocalDateTime`.  Value-based classes adhere to the
> semantic restrictions of value classes, but are still identity classes --
> even
> though they don't want to be.  Value-based classes can be migrated to true
> value
> classes simply by redeclaring them as value classes, which is both source-
> and
> binary-compatible.
>

This gave me a slight "huh, then what's the catch?" reaction. It might make
more sense by adding the fact right away that any errant usages (that don't
adhere to the VBC requirements) will start failing at runtime, and might
cause compilation warnings?

> We plan to migrate many value-based classes in the JDK to value classes.
> Additionally, the primitive wrappers can be migrated to value classes as
> well,
> making the conversion between `int` and `Integer` cheaper; see "Migrating
> the
> legacy primitives" below.  (In some cases, this may be _behaviorally_
> incompatible for code that synchronizes on the primitive wrappers.  [JEP
> 390][jep390] has supported both compile-time and runtime warnings for
> synchronizing on primitive wrappers since Java 16.)
>

Putting this in parens under the topic of the primitive wrappers feels like
"pulling a fast one". Like it's pretending that this incompatibility
problem is somehow unique to those 8 classes, hoping people won't notice
"wait a minute, *any* class hopeful of future migration would have the same
desire to opt into such warnings in advance." (And for more than just
synchronization.) I get that there is no current plan to solve that
problem, but we could be more up-front about that?

(Cross-reference my earlier agitations about this in a thread called "we
need help migrating from bucket 1 to 2...", maybe a couple months ago.)

> <figure>
>   <a href="field-type-zoo.pdf" title="Click for PDF">
>     <img src="field-type-zoo-mid.png" alt="Java field types adding value
> classes"/>
>   </a>
> </figure>
>
> ### Identity-sensitive operations
>
> Certain operations are currently defined in terms of object identity.  As
> we've
> already seen, some of these, like equality, can be sensibly extended to
> cover
> all instances.  Others, like synchronization, will become partial.
> Identity-sensitive operations include:
>
>   - **Equality.**  We extend `==` on references to include references to
> value
>     objects.  Where it currently has a meaning, the new definition
> coincides
>     with that meaning.
>
>   - **System::identityHashCode.**  The main use of `identityHashCode` is
> in the
>     implementation of data structures such as `IdentityHashMap`.  We can
> extend
>     `identityHashCode` in the same way we extend equality -- deriving a
> hash on
>     value objects from the hash of all the fields.
>
>   - **Synchronization.**  This becomes a partial operation.  If we can
>     statically detect that a synchronization will fail at runtime (including
>     declaring a `synchronized` method in a value class), we can issue a
>     compilation error; if not, an attempt to lock on a value object results in
>     `IllegalMonitorStateException`.  This is justifiable because it is
>     intrinsically imprudent to lock on an object whose locking protocol you do
>     not clearly understand; locking on an arbitrary `Object` or interface
>     instance is doing exactly that.
>
>   - **Weak, soft, and phantom references.**  Capturing an exotic reference
> to a
>     value object becomes a partial operation, as these are intrinsically
> tied to
>     reachability (and hence to identity).  However, we will likely make
>     enhancements to `WeakHashMap` to support mixed identity and value
> keys.
>
> ### Value classes and records
>
> While records have a lot in common with value classes -- they are final and
> their fields are final -- they are still identity classes.  Records embody
> a
> tradeoff: give up on decoupling the API from the representation, and in
> return
> get various syntactic and semantic benefits.  Value classes embody another
> tradeoff: give up identity, and get various semantic and performance
> benefits.
> If we are willing to give up both, we can get both sets of benefits, by
> declaring a _value record_.
>
> ```
> value record NameAndScore(String name, int score) { }
> ```
>
> Value records combine the data-carrier idiom of records with the improved
> scalarization and flattening benefits of value classes.
>
> In theory, it would be possible to apply `value` to certain enums as well, but
> this is not currently possible because the `java.lang.Enum` base class that
> enums extend does not meet the requirements for superclasses of value classes (it
> has fields and non-empty constructors).
>
> ### Value and reference companion types
>
> Value classes are generalizations of primitives.  Since primitives have a
> reference companion type, value classes actually give rise to _pairs_ of
> types:
> a value type and a reference type.  We've seen the reference type already;
> for
> the value class `ArrayCursor`, the reference type is called `ArrayCursor`,
> just
> as with identity classes.  The full name for the reference type is
> `ArrayCursor.ref`; `ArrayCursor` is just a convenient alias for that.
> (This
> aliasing is what allows value-based classes to be compatibly migrated to
> value
> classes.)
>

It's more than just that: it's what unifies all classes together! They all
define a reference type, always with the same name as the class. That's
nice, unchanging solid ground under our feet while all the Valhalla shifts
are going on.

It would make more sense to me if `ArrayCursor.ref` were the alias to
`ArrayCursor`, and it would be appropriate for the reader to wonder "why do
we even need that alias?".

> The value type is called `ArrayCursor.val`, and the two types have the
> same conversions between them as primitives do today with their boxes.  The
> default value of the value type is the one for which all fields take on
> their
> default value; the default value of the reference type is, like all
> reference
> types, null.  We will refer to the value type of a value class as the
> _value
> companion type_.
>

... because it acts as a companion to the reference type you've always
known.
(At least, *I* still really don't want people to think that both the value
type and the reference types are "companions" to the class that defined
them.)

> Just as with today's primitives and their boxes, the reference and value
> companion types of a value class differ in their support for nullity,
> polymorphism, treatment of uninitialized variables, and safety guarantees
> under
> race.  Value companion types, like primitive types, are monomorphic,
> non-nullable, tolerate uninitialized (zero) values, and (under some
> circumstances) may tear under racy access.  Reference types are
> polymorphic,
> nullable, and offer the initialization safety guarantees for final fields
> that
> we have come to expect from identity objects.
>
> Unlike with today's primitives, the "boxing" and "unboxing" conversions between
> the reference and value companion types are not nearly as heavy or wasteful,
> because of the lack of identity.  A variable of type `Point.val` holds a "bare"
> value object; a variable of type `Point.ref` holds a _reference to_ a value
> object.  For many use cases, the reference type will offer good enough
> performance; in some cases, it may be desirable to additionally give up the
> affordances of reference-ness to make further flatness and footprint gains.  See
> [Performance Model](05-performance-model) for more details on the specific
> tradeoffs.
>
> In our diagram, these new types show up as another entity that straddles
> the
> line between primitives and identity-free references, alongside the legacy
> primitives:
>
> ** UPDATE DIAGRAM **
>
> <figure>
>   <a href="field-type-zoo.pdf" title="Click for PDF">
>     <img src="field-type-zoo-new.png" alt="Java field types with extended
> primitives"/>
>   </a>
> </figure>
>
> ### Member access
>
> Both the reference and value companion types have the same members.
>

Maybe worth acknowledging "(even those, like `wait()` inherited from
`Object`, that don't make sense and will fail at runtime, for simplicity's
sake)".

> Unlike
> today's primitives, value companion types can be used as receivers to
> access
> fields and invoke methods (subject to the usual accessibility
> constraints):
>
> ```
> Point.val p = new Point(1, 2);
> assert p.x == 1;
>
> p = p.scale(2);
> assert p.x == 2;
> ```
>

I think it is worth acknowledging that this does lead to `5.toString()`
becoming valid and functioning code, which happens just for consistency and
not because it was a goal in itself.

> ### Polymorphism
>
> An identity class `C` that extends `D` sets up a subtyping (is-a)
> relationship
> between `C` and `D`.  For value classes, the same thing happens between its
>  _reference type_ and the declared supertypes.  (Reference types are
>  polymorphic; value types are not.)  This means that if we declare:
>
> ```
> value class UnsignedShort extends Number
>                           implements Comparable<UnsignedShort> {
>    ...
> }
> ```
>
> then `UnsignedShort` is a subtype of `Number` and
> `Comparable<UnsignedShort>`,
> and we can ask questions about subtyping using `instanceof` or pattern
> matching.
> What happens if we ask such a question of the value companion type?
>
> ```
> UnsignedShort.val us = ...
> if (us instanceof Number) { ... }
> ```
>
> Since subtyping is defined only on reference types, the `instanceof` operator
> (and corresponding type patterns) will behave as if both sides were lifted to
> the appropriate reference type (boxed), and then we can appeal to subtyping.
> (This may trigger fears of expensive boxing conversions, but in reality no
> actual allocation will happen.)
>
> We introduce a new relationship between types based on `extends` /
> `implements`
> clauses, which we'll call "extends": we define `A extends B` as meaning `A
> <: B`
> when A is a reference type, and `A.ref <: B` when A is a value companion
> type.
> The `instanceof` relation, reflection, and pattern matching are updated to
> use
> "extends".
>
> ### Array covariance
>
> Arrays of reference types are _covariant_; this means that if `A <: B`,
> then
> `A[] <: B[]`.  This allows `Object[]` to be the "top array type" -- but
> only for
> arrays of references.  Arrays of primitives are currently left out of this
> story.   We unify the treatment of arrays by defining array covariance
> over the
> new "extends" relationship; if A _extends_ B, then `A[] <: B[]`.  This
> means
> that for a value class P, `P.val[] <: P.ref[] <: Object[]`; when we
> migrate the
> primitive types to be value classes, then `Object[]` is finally the top
> type for
> all arrays.  (When the built-in primitives are migrated to value classes,
> this
> means `int[] <: Integer[] <: Object[]` too.)
>

I think it's worth addressing that this does mean there will be `Integer[]`
and `Object[]` instances that can't store null, failing at runtime, but
that this is consistent with the existing quirks of array covariance.
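
For reference, this is the existing quirk I have in mind, in current Java. The class and method names are mine. An `Object[]` variable can point at an `Integer[]`, and a store that the static types allow is then rejected at run time. I am not assuming which exception a null store into a migrated `int[]` would raise; the sketch only shows today's behavior.

```java
final class CovarianceQuirk {
    /** Returns true if storing the element into slot 0 is rejected at run time. */
    static boolean storeFails(Object[] array, Object element) {
        try {
            array[0] = element;
            return false;
        } catch (ArrayStoreException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Object[] objs = new Integer[1];               // legal because Integer[] <: Object[]
        System.out.println(storeFails(objs, 42));     // false: an Integer fits
        System.out.println(storeFails(objs, "text")); // true: a String does not
    }
}
```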

> ### Equality
>
> For values, as with primitives, `==` compares by state rather than by
> identity.
> Two value objects are `==` if they are of the same type and their fields
> are
> pairwise equal, where equality is defined by `==` for primitives (except
> `float`
> and `double`, which are compared with `Float::equals` and `Double::equals`
> to
> avoid anomalies), `==` for references to identity objects, and recursively
> with
> `==` for references to value objects.  In no case is a value object ever
> `==` to
> an identity object.
>
> When comparing two object _references_ with `==`, they are equal if they
> are
> both null, or if they are both references to the same identity object, or
> they
> are both references to value objects that are `==`.  (When comparing a
> value
> type with a reference type, we treat this as if we convert the value to a
> reference, and proceed as per comparing references.)  This means that the
> following will succeed:
>
> ```
> Point.val p = new Point(3, 4);
> Point pr = p;
> assert p == pr;
> ```
>
> The base implementation of `Object::equals` delegates to `==`, which is a
> suitable default for both reference and value classes.
>

This is where you could appeal to the idea that `==` has always meant
"strictly indistinguishable by any means" and this preserves that meaning
(modulo float/double weirdness).

> ### Serialization
>
> If a value class implements `Serializable`, this is also really a statement
> about the reference type.  Just as with other aspects described here,
> serialization of value companions can be defined by converting to the
> corresponding reference type and serializing that, and reversing the
> process at
> deserialization time.
>

It's nonobvious to me why the reference type is being elevated as the
primary one here, except that of course a method like `writeObject` is only
going to be fed the reference type. I would have expected just that
serializability applies equally to both types in the same way, much like
invoking some method on both types.

> Serialization currently uses object identity to preserve the topology of an
> object graph.  This generalizes cleanly to objects without identity,
> because
> `==` on value objects treats two identical copies of a value object as
> equal.
> So any observations we make about graph topology prior to serialization
> with
> `==` are consistent with those after deserialization.
>
> ## Refining the value companion
>
> Value classes have several options for refining the behavior of the value
> companion type and how they are exposed to clients.
>
> ### Classes with no good default value
>
> For a value class `C`, the default value of `C.ref` is the same as any other
> reference type: `null`.  For the value companion type `C.val`, the default value
> is the one where all of its fields are initialized to their default value (0 for
> numbers, false for boolean, null for references).
>
> The built-in primitives reflect the design assumption that zero is a
> reasonable
> default.  The choice to use a zero default for uninitialized variables was
> one
> of the central tradeoffs in the design of the built-in primitives.  It
> gives us
> a usable initial value (most of the time), and requires less storage
> footprint
> than a representation that supports null (`int` uses all 2^32 of its bit
> patterns, so a nullable `int` would have to either make some 32 bit signed
> integers unrepresentable, or use a 33rd bit).  This was a reasonable
> tradeoff
> for the built-in primitives, and is also a reasonable tradeoff for many
> other
> potential value classes (such as complex numbers, 2D points, half-floats,
> etc).
>

You might not want to go into the following. But I hope that users will
understand that the numeric types really do clear a pretty high bar here.
They are fortunate that for the *two* most popular reduction operations
over those types, zero happens to be the correct identity for one of them,
and absolutely destructive to the other (i.e., making it at least easy to
detect the bug). If not for *both* of those facts we would have more and
worse bugs in the world.
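
A tiny illustration in current Java, with names of my own choosing. The accumulators are array elements so that they take the default value without any explicit initialization (a local variable would have to be assigned before use):

```java
final class ZeroDefault {
    /** Folds the values with +, starting from a default-initialized accumulator. */
    static int sumFromDefault(int... values) {
        int[] acc = new int[1];   // acc[0] is 0 by default
        for (int v : values) {
            acc[0] += v;
        }
        return acc[0];
    }

    /** Folds the values with *, starting from a default-initialized accumulator. */
    static int productFromDefault(int... values) {
        int[] acc = new int[1];   // acc[0] is 0 by default, which is not the identity for *
        for (int v : values) {
            acc[0] *= v;
        }
        return acc[0];
    }

    public static void main(String[] args) {
        System.out.println(sumFromDefault(2, 3, 4));      // 9: zero is the additive identity
        System.out.println(productFromDefault(2, 3, 4));  // 0: the missing initialization is obvious
    }
}
```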

> But for other potential value classes, such as `LocalDate`, there simply _is_ no
> reasonable default.  If we choose to represent a date as the number of days
> since some epoch, there will invariably be bugs that stem from uninitialized
> dates; we've all been mistakenly told by computers that something that never
> happened actually happened on or near 1 January 1970.  Even if we could choose
> a default other than the zero representation, an uninitialized date is still
> likely to be an error -- there simply is no good default date value.
>
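
The 1970 effect is easy to reproduce with today's `java.time`, if you want a concrete hook for readers. This sketch assumes the hypothetical days-since-epoch representation described above (it says nothing about how `LocalDate` actually stores its state), and the names are mine:

```java
import java.time.LocalDate;

final class EpochDefault {
    /** Interprets a never-assigned day count, which defaults to 0, as a date. */
    static LocalDate fromUninitializedDays() {
        long[] daysSinceEpoch = new long[1];   // element defaults to 0; nothing ever sets it
        return LocalDate.ofEpochDay(daysSinceEpoch[0]);
    }

    public static void main(String[] args) {
        System.out.println(fromUninitializedDays());   // 1970-01-01
    }
}
```
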
> For this reason, value classes have the choice of _encapsulating_ their
> value
> companion type.  If the class is willing to tolerate an uninitialized
> (zero)
> value, it can freely share its `.val` companion with the world; if
> uninitialized
> values are dangerous (such as for `LocalDate`), the value companion can be
> encapsulated to the class or package, and clients can use the reference
> companion.  Encapsulation is accomplished using ordinary access control.
> By
> default, the value companion is `private` to the value class (it need not
> be
> declared explicitly); a class that wishes to share its value companion more
> broadly can do so by declaring it explicitly:
>
> ```
> public value record Complex(double real, double imag) {
>     public value companion Complex.val;
> }
> ```
>

I think you should add that the name `Complex.val` can't be changed here,
much like you can't change the name of a constructor even though it *looks*
like you could.

-- 
Kevin Bourrillion | Java Librarian | Google, Inc. | kevinb at google.com