Value type companions, encapsulated

Sun Jul 3 03:24:19 UTC 2022

In this message Brian wrote out the major features
of an emerging design for value classes:

> From: Brian Goetz <brian.goetz at oracle.com>
> To: … <valhalla-spec-experts at openjdk.java.net>
> Subject: Re: User model stacking: current status
> Date: Thu, 23 Jun 2022 15:01:24 -0400

I think controlling the complexity by having a separate
nested declaration of the value companion type will
work very well.

So what exactly does a private value companion do?
What is it you can and cannot do with this type?
What problems are prevented by privatizing it?
How and when is privatization enforced?
What other problems are created by those new rules?

I have been pulling on this thread for a few days
now, and I think I have some answers.

http://cr.openjdk.java.net/~jrose/values/encapsulating-val.md
http://cr.openjdk.java.net/~jrose/values/encapsulating-val.html

(The Hitchhiker’s Guide suddenly comes to mind.  Don’t panic!)

I expect I will be editing these files as we go.
For reference here is a verbatim copy of the MD file
as it stands right now (minus the header):

## Background

_(We will start with background information.  The **[new stuff comes
afterward]**.  Impatient readers can find a very quick **[summary of
restrictions]** at the end.)_

[new stuff comes afterward]: <#privatization-to-the-rescue>
[summary of restrictions]: <#summary-of-restrictions>

### Affordances of `C.ref`

Every class or interface `C` comes with a companion type, the
reference type `C.ref` derived from `C` which describes any variable
(argument, return value, array element, etc.) whose values are either
null or of a concrete class derived from `C`.  We are not in the habit
of distinguishing `C.ref` from `C`, but the distinction is there.  For
example, if we call `Object::getClass` on a variable of type `C.ref`
we might not get `C.class`; we might even get a null pointer
exception!
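
As a small illustration in today's Java (a sketch only, with `Number` standing in for the class `C` of the text and `RefTypeDemo` a made-up name), the dynamic class behind a reference can differ from the static type, and a null reference has no class at all:

```java
// Illustration only: Number plays the role of C, and the parameter type plays C.ref.
final class RefTypeDemo {
    /** Returns the dynamic class of the object the reference points to. */
    static Class<?> dynamicClass(Number ref) {
        return ref.getClass();   // throws NullPointerException if ref is null
    }

    public static void main(String[] args) {
        Number n = Integer.valueOf(42);   // static type Number, dynamic class Integer
        System.out.println(dynamicClass(n) == Number.class);   // false
        System.out.println(dynamicClass(n) == Integer.class);  // true
        try {
            dynamicClass(null);
        } catch (NullPointerException e) {
            System.out.println("a null reference has no class");
        }
    }
}
```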

We are so very used to working with reference types (for short,
_ref-types_) that we sometimes forget all that they do for us
in addition to their linkage to specific classes:

  - `C.ref` gives a starting point for accessing `C`'s members.
  - `C.ref` provides abstraction:  `C` or a subtype might not be loaded 
yet.
  - `C.ref` provides the standard uninitialized value `null`.
  - `C.ref` can link `C` objects into graphs, even circular ones.
  - `C.ref` has a known size, one "machine word", carefully tuned by the 
JVM.
  - `C.ref` allows a single large object to be shared from many 
locations.
  - `C.ref` with an identity class can centralize access to mutable 
state.
  - `C.ref` values uniformly convert to and from general types like 
`Object`.
  - `C.ref` variable types can be reflected using `Class` mirror 
objects.
  - `C.ref` is safe for publication if the fields of `C` are `final`.

When I store a bunch of `C` objects into an object array or list, sort
it, and then share it with another thread, I am using several of the
above properties; if the other thread down-casts the items to `C.ref`
and works on them it relies on those properties.

If I implement `C` as a doubly-linked list data structure or
(alternatively) a value-based class with tree structure, I am using
yet more of the above properties of references.

If my `C` object has a lot of state and I pass out many pointers to
it, and perhaps compute and cache interesting values in its mutable
fields, I am again relying on the special properties of references,
as well as of identity classes (if fields are mutable).

By the way, in the JVM, variables of type `C.ref` (some of them at
least) are associated not with `C` simple, but with the so-called
_L-descriptor_ spelled `LC;`.  When we talk about `C.ref` we are
usually talking about those L-descriptors in the JVM, as well.

I don't need to think much about this portfolio of properties as I go
about my work.  But if they were to somehow fail, I would notice bugs
in my code sooner or later.

One of the big consequences of this overall design is that I can write
a class `C` which has full control over its instance states.  If it is
mutable, I can make its fields private and ensure that mutations occur
only under appropriate locking conditions.  Or if I declare it as a
value-based class, I can ensure that its constructor only allows
legitimate instances to be constructed.  Under those conditions, I
know that every single instance of my class will have been examined
and accepted by the class constructor, and/or whatever factory and
mutator methods I have created for it.  If I did my job right, not
even a race condition can create an invalid state in one of my
objects.

Any instance state of `C` which has been reached without being
produced from a constructor, factory, mutator, or constant of `C` can
be called _non-constructed_.  Of course, inside a class any state
whatever can be constructed, subject to the types of fields and so on.
But the author of the class gets to decide which states are
legitimate, and the decisions are enforced by access control at the
boundaries of the encapsulation.

So if I code my class right, using access control to keep bad states
away from my clients, my class's external API will have no
non-constructed states.
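
Here is a minimal sketch of that discipline in present-day Java. The class name `NonZero` and its methods are hypothetical, chosen only for illustration. Every instance a client can reach has passed through the factory, and an uninitialized reference variable holds `null` rather than a forbidden zero:

```java
// Hypothetical example class; the invariant is that value is never zero.
final class NonZero {
    private final int value;

    private NonZero(int value) { this.value = value; }

    /** The only way for clients to obtain an instance. */
    static NonZero of(int value) {
        if (value == 0) throw new IllegalArgumentException("zero is not allowed");
        return new NonZero(value);
    }

    int value() { return value; }

    /** Relies on the invariant: this division can never divide by zero. */
    int divide(int dividend) { return dividend / value; }

    public static void main(String[] args) {
        System.out.println(NonZero.of(4).divide(12));   // prints 3
        NonZero[] slots = new NonZero[1];               // reference array
        System.out.println(slots[0]);                   // prints null, not a zero-valued NonZero
    }
}
```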

### Costs of `C.ref`

In that case why have value types at all, if references are so
powerful?  The answer is that reference-based abstraction pays for its
benefits with particular costs, costs that Java programmers do not
always wish to pay:

   - A reference (usually) requires storage for a pointer to the object.
   - A reference (usually) requires storage for a header embedded inside 
the object.
   - Access to an object's fields (usually) requires extra cycles to 
chase the pointer.
   - The GC expends effort administering a singular "home location" for 
every object.
   - Cache line invalidation near that home location can cause useless 
memory traffic.
   - A reference must be able to represent `null`; tightly-packed types 
like `int` and `long` would need to add an extra bit somewhere to cover 
this.

The major alternative to references, as provided by Valhalla, is flat
objects, where object fields are laid out immediately in their
containers, in place of a pointer which points to them stored
elsewhere.  Neither alternative is always better than the other, which
is why Java has both `int` and `Integer` types and their arrays, and
why Valhalla will offer a corresponding choice for value classes.
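
The existing `int`/`Integer` pair already shows the two behaviors side by side. The following sketch (class and method names are invented for this note) contrasts the default element of a flat array with that of a reference array:

```java
// Illustration with today's types: int[] is flat, Integer[] holds references.
final class FlatVersusRef {
    /** Default element of a flat primitive array: the all-zero bit pattern. */
    static int flatDefault() {
        return new int[1][0];
    }

    /** Default element of a reference array: null. */
    static Integer refDefault() {
        return new Integer[1][0];
    }

    public static void main(String[] args) {
        System.out.println(flatDefault());  // prints 0
        System.out.println(refDefault());   // prints null
    }
}
```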

### Alternative affordances of `C.val`

Now, instances of a value class can be laid out flat in their
containing variables.  But they can also be "boxed" in the heap, for
classic reference-based access.  Therefore, a value class `C` has not
one but _two_ companion types associated with it, not only the reference
companion `C.ref` but also the value companion `C.val`.  Only value
classes have value companions, naturally.  The companion `C.val` is
called a value type (or _val-type_ for short), by contrast with any
reference type, whether `Object.ref` or `C.ref`.

The two companion types are closely related and perform some of the
same jobs:

  - `C.ref` and `C.val` both give a starting point for accessing `C`'s 
members.
  - `C.ref` and `C.val` can link `C` objects into acyclic graphs.
  - `C.ref` and `C.val` values uniformly convert to and from general 
types like `Object`.
  - `C.ref` and `C.val` variable types can be reflected using `Class` 
mirror objects.

For these jobs, it usually doesn't matter which type companion does
the work.

Despite the similarities, many properties of a value companion type
are subtly different from any reference type:

  - `C.val` is non-abstract:  You must load its class file before making 
a variable.
  - `C.val` cannot nest except by reference; `C` cannot declare a 
`C.val` field.
  - `C.val` does not represent the value `null`.
  - `C.val` is routinely flattenable, avoiding headers and indirection 
pointers.
  - `C.val` has configurable size, depending on `C`'s non-static fields.
  - `C.val` heap variables (fields, array elements) are initialized to 
all-zeroes.
  - `C.val` might not be safe for publication (even though its fields 
are `final`).

The JVM distinguishes `C.val` by giving it a different descriptor, a
so-called _Q-descriptor_ of the form `QC;`, and it also provides a
so-called _secondary mirror_ `C.val.class` which is similar to the
built-in primitive mirrors like `int.class`.

As the Valhalla performance model notes, flattening may be expected
but is not fully guaranteed.  A `C.val` stored in an `Object`
container is likely to be boxed on the heap, for example.  But `C.val`
objects created as bytecode temporaries, arguments, and return values
are likely to be flattened into machine registers, and `C.val` fields
and array elements (at least below certain size thresholds) are also
likely to be flattened into heap words.

As a special feature, `C.ref` is potentially flattenable if `C` is a
value class.  There are additional terms and conditions for flattening
`C.ref`, however.  If `C` is not yet loaded, nothing can be done:
Remember that reference types have full abstraction as one of their
powers, and this means building data structures that can refer to them
even before they are loaded.  But a class file can request that the JVM
"peek" at a class to see if it is a value class, and if this request
is acted on early enough (at the JVM's discretion), then the JVM can
choose to lay out some or all `C.ref` values as flattened `C.val`
values _plus_ a boolean or other sentinel value which indicates the
`null` state.

### Pitfalls of `C.val`

The advantages of value companion types imply some complementary
disadvantages.  Hopefully they are rarely significant, but they
must sometimes be confronted.

  - `C.val` might need to load a class file which is somehow unloadable.
  - `C.val` will fail to load if its instance layout directly _or
    indirectly_ includes a `C.val` field _or subfield_.
  - `C.val` will throw an exception if you try to assign a `null` to it.
  - `C.val` may have surprising costs for multi-word footprint and
    assignment (and so might `C.ref` if that is flattened).
  - `C.val` is initialized to its all-zero value, which might be
    non-constructed.
  - `C.val` might allow data races on its components, creating values
    which are non-constructed.

The footprint issue shows up most strongly if you have many copies of
the same `C.val` value; each copy will duplicate all the fields, as
opposed to many copies of the same `C.ref` reference, which are likely to
all point to a single heap location with one copy of all the fields.

Flat value size can also affect methods like `Arrays.sort`, which
perform many assignments of the base type, and must move all fields on
each assignment.  If a `C.val` array has many words per element, then
the costs of moving those words around may dominate a sort request.
For array sorting there are ways to reduce such costs transparently,
but it is still a "law of physics" that editing a whole data structure
will have costs proportional to the size of the edited portions of the
data structure, and `C.ref` arrays will often be somewhat more compact
than `C.val` arrays.  Programmers and library authors will have to use
their heads when deciding between the new alternatives given by value
classes.

But the last two pitfalls are hardest to deal with, because they both
have to do with non-constructed states.  These states are the all-zero
state with the second-to-last pitfall, and (with the last pitfall) the
state obtained by mixing two previous states by means of a pair of
racing writes to the same mutable `C.val` variable in the heap.
Unlike reference types, value types can be manipulated to create these
non-constructed states even in well-designed classes.
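
To make the racing-writes case concrete, here is a deterministic, single-threaded replay of one unlucky interleaving. It is only a model: `TearingSketch` and its two fields are made up to stand in for a flattened two-field `C.val` variable whose halves are stored separately.

```java
/** A mutable two-field "flat" slot, standing in for a non-atomic C.val variable. */
final class TearingSketch {
    long re, im;   // the two fields of the flattened value, side by side

    // Storing a whole value is modeled as two separate field stores.
    void storeFirstHalf(long re)  { this.re = re; }
    void storeSecondHalf(long im) { this.im = im; }

    /** Replays one interleaving of two writers without using real threads. */
    static TearingSketch interleavedWrites() {
        TearingSketch slot = new TearingSketch();   // starts in the all-zero state (0, 0)
        // Writer A intends to store (1, 1); writer B intends to store (2, 2).
        slot.storeFirstHalf(1);    // A, first half
        slot.storeFirstHalf(2);    // B, first half
        slot.storeSecondHalf(2);   // B, second half
        slot.storeSecondHalf(1);   // A, second half
        return slot;               // holds (2, 1), which neither writer constructed
    }

    public static void main(String[] args) {
        TearingSketch s = interleavedWrites();
        System.out.println("(" + s.re + ", " + s.im + ")");   // prints (2, 1)
    }
}
```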

Now, it may be that a constructor (or factory) might be perfectly able
to create one of the above non-constructed states as well, no strings
attached.  In that case, the class author is enforcing few or no
invariants on the states of the value class.  Many numeric classes,
like complex numbers, are like this: Initialization to all-zeroes is
no problem, and races between components are acceptable, compared to
the costs of excluding races.

> (The reader may recall that early JVMs accepted races on the high
and low halves of 64-bit integers as well; this is no longer a
widespread issue, but bigger value types like complex raise the same
issue again, and we need to provide class authors the same solution,
if it fits their class.)

There are also some classes for which there are no good defaults, or
for which a good default is definitely not the all-zero bit pattern.
Authors of such types will often wish to make that bit pattern
inaccessible to their clients and provide some factory or constant
that gives the real default.  We expect that such types will choose
the `C.ref` companion, and rely on the extra null checks to ensure
correct initialization.

Other classes may need to avoid other non-constructed values that may
arise from data races, perhaps for reasons of reliability or security.
This is a subtle trade-off; very few class authors begin by asking
themselves about the consequences of data races on mutable members,
and even fewer will ask about _races on whole instances_ of value
types, especially given that fields in value types are always
immutable.  For this reason, we will set safety as the default, so
that a class (like complex numbers) which is willing to tolerate data
races must declare its tolerance explicitly.  Only then will the JVM
drop the internal costs of race exclusion.

Whether to tolerate the all-zero bit pattern is a simpler decision.
Still, it turns out to be useful to give a common single point of
declarative control to handle _all_ non-constructed states, both
the default value of `C.val` and its mysterious data races.

## Privatization to the rescue

_(Here are the important details about the encapsulation of value
types.  The impatient reader may enjoy the very quick **[summary of
restrictions]** at the end of this document.)_

In order to hide non-constructed states, the value companion `C.val`
may be _privatized_ by the author of the class `C`.  A privatized
value companion is effectively withdrawn from clients and kept private
to its own class (and to nestmates).  Inside the class, the value
companion can be used freely, fully under control of the class author.

But untrusted clients are prevented from building uninitialized fields
or arrays of type `C.val`.  This prevents such clients from creating
(either accidentally or purposefully) non-constructed values of type
`C.val`.  How privatization is declared and enforced is discussed in
the rest of this document.

> (To review, for those who skipped ahead, non-constructed values are
those not created under control of the class `C` by constructors or
other accessible API points.  A non-constructed value may be either an
uninitialized variable of `C.val`, or the result of a data race on a
shared mutable variable of type `C.val`.  The class itself can work
internally with such values all day long, but we exclude external
access to them by default.)

### Atomicity as well

As a second tactic, a value class `C` may select whether or not the
JVM enforces atomicity of all occurrences of its value companion
`C.val`.  A non-atomic value companion is subject to data races, and
if it is not privatized, external code may misuse `C.val` variables
(in arrays or mutable fields) to create non-constructed values via
data races.

A value companion which is atomic is not subject to data races.  This
will be the default if the class `C` does not explicitly request
non-atomicity.  This gives safety by default and limits
non-constructed states to only the all-zero initial value.  The
techniques to support this are similar to the techniques for
implementing non-tearing of variables which are declared `volatile`;
it is as if every variable of an atomic value type has some (not
all) of the costs of volatility.

The JVM is likely to flatten such an atomic value only up to the
largest available atomically settable memory unit, usually 128 bits.
Values larger than that are likely to be boxed, or perhaps treated
with some other expensive transactional technique.  Containers that
are immutable can still be fully flattened, since they are not subject
to data races.
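
As an analogy only (this is not a description of what the JVM will do internally), a library author today can get a similar effect by packing a small multi-field value into one atomically settable unit. The sketch below, with the made-up class `PackedPair`, keeps two `int` fields in a single 64-bit word so that readers never see a mix of two writes:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration: two int fields share one atomically updated long.
final class PackedPair {
    private final AtomicLong bits = new AtomicLong();   // starts as (0, 0)

    static long pack(int x, int y) {
        // x goes in the high 32 bits; y is masked so its sign bits do not spill over.
        return ((long) x << 32) | (y & 0xFFFF_FFFFL);
    }

    static int unpackX(long packed) { return (int) (packed >> 32); }
    static int unpackY(long packed) { return (int) packed; }

    /** Both fields change in one atomic store. */
    void set(int x, int y) { bits.set(pack(x, y)); }

    /** One atomic load yields a consistent pair. */
    long get() { return bits.get(); }

    public static void main(String[] args) {
        PackedPair p = new PackedPair();
        p.set(-3, 7);
        long snapshot = p.get();
        System.out.println(unpackX(snapshot) + ", " + unpackY(snapshot));   // prints -3, 7
    }
}
```

A value with more payload than the widest atomic unit cannot be handled this way, which mirrors the boxing fallback described above.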

The behavior of an atomic `C.val` is aligned with that of `C.ref`.  A
reference to a value class `C` _never_ admits data races on `C`'s
fields.  The reason for this is simple: A `C.ref` value is a `C.val`
instance boxed on the heap in a single immutable box-class field of
type `C.val`.  (Actually, the JVM may partially or wholly flatten the
representation of `C.ref` if it can get away with it; full flattening
is likely for JVM locals and stack values, but any such secret
flattening is undetectable by the user.)  Since it is `final` all the
way down (to `C`'s fields) any `C.ref` value is safely published
without any possibility of data races.  Therefore, an extra
declaration of non-atomicity in `C` affects only the value companion
`C.val`.

It seems that there are use cases which justify all four combinations
of both choices (privatization and declared non-atomicity), although
it is natural to try to boil down the size of the matrix.

   - `C.val` private & atomic is the default and safest configuration,
   hiding all non-constructed values outside of `C` and all data races
   even inside of `C`.  There are some runtime costs.

   - `C.val` public & non-atomic is the opposite, with fewer runtime
   costs.  It must be explicitly declared.  It is desirable for
   numerics like complex numbers, where all possible bitwise states are
   meaningful.  It is analogous to the situation of a naturally
   non-atomic primitive like `long`.

   - `C.val` public & atomic allows everybody to see the all-zero
   initial value but no other non-constructed states.  This is
   analogous to the situation of a naturally atomic primitive like
   `int`.

   - `C.val` private & non-atomic allows `C` complete control over the
   visibility of non-constructed states, but `C` also has the ability
   to work internally on arrays of non-atomic elements.  `C` should
   take care not to leak internally-created flat arrays to untrusted
   clients, lest they use data races to hammer non-constructed values
   into those arrays.

It is logically possible, but there does not seem to be a need, for
allowing a single class `C` to work with both kinds of arrays, atomic
and non-atomic.  (In principle, the dynamic typing of Java arrays
would support this, as long as each array was configured at its
creation.)  The effect of this can be simulated by wrapping a
non-atomic class `C` in another wrapper class `WC` which is atomic.
Then `C.val[]` arrays are non-atomic and `WC.val[]` arrays are atomic,
yet each kind of array can have the same "payload", a repeated
sequence of the fields of `C`.

## Privatization in code

For source code and bytecode, privatization is enforced by performing
access checks on names.

### Privatization rules in the language

We will stipulate that a value class `C` _always_ has a value
companion type `C.val`, even if it is never declared or used.  And we
give the author of `C` some control over how clients may use the type
`C.val`, in a manner roughly similar to nested member classes like
`C.M`.

Specifically, the declaration of `C` always selects an access mode for
its value companion `C.val` from one of the following three choices:

   - `C.val` is declared private
   - `C.val` is declared public
   - `C.val` is declared, but neither public nor private

If `C.val` is declared private, then only nestmates of `C` may access
`C.val`.  If it is neither public nor private, only classes in the
same runtime package as `C` may access it.  If it is declared public,
then any class that can access `C` may also access `C.val`.

As an independent choice, the declaration of `C` may select an atomicity
for its value companion `C.val` from one of the following two choices:

   - `C.val` is explicitly declared non-atomic
   - `C.val` is not explicitly declared non-atomic, and is thus atomic

If there is no explicit access declaration for `C.val` in the code of
`C`, then `C.val` is declared private and atomic.  That is, we set the
default to the safest and most restrictive choice.

In source code, these declarations are applied to explicit occurrences
of the type name `C.val`.  The access modification of `C.val` is also
transferred to the implicitly declared name `C.default`.

The syntax looks like this:

```
class C {
   //only one of the following lines may be specified
   //the first line is the default
   private value companion C.val;  //nestmates only
   value companion C.val;          //package-mates only
   public value companion C.val;   //all may access
   // the non-atomic modifier may be present:
   private non-atomic value companion C.val;
   public non-atomic value companion C.val;
   non-atomic value companion C.val;
}
```

When a type name `C.val` or an expression `C.default` is
used by a class `X`, there are two access checks that occur.  First,
access from `X` to the class `C` is checked according to the usual
rules of Java.  If access to `C` is permitted, a second check is done
if the companion is not declared `public`.  If the companion is
declared `private`, then `X` and `C` must be nestmates, or else access
will fail.  If the companion is neither `public` nor `private`, then
`X` and `C` must be in the same package, or else access will fail.
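
The two-step check can be modeled as a small decision function. This is a sketch in ordinary Java, not compiler or JVM code; the enum, the method name, and the boolean inputs are all assumptions introduced for illustration.

```java
// Hypothetical model of the two access checks for C.val or C.default used from X.
final class CompanionAccessCheck {
    enum CompanionAccess { PRIVATE, PACKAGE, PUBLIC }

    static boolean mayUseCompanion(boolean classAccessible,
                                   CompanionAccess access,
                                   boolean nestmates,
                                   boolean samePackage) {
        if (!classAccessible) return false;     // first check: ordinary access from X to C
        switch (access) {                       // second check: the companion itself
            case PUBLIC:  return true;          // anyone who can see C may use C.val
            case PRIVATE: return nestmates;     // X and C must be nestmates
            default:      return samePackage;   // PACKAGE: X and C must share a package
        }
    }

    public static void main(String[] args) {
        // A package-mate that is not a nestmate, against a private companion:
        System.out.println(mayUseCompanion(true, CompanionAccess.PRIVATE, false, true));  // false
    }
}
```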

### Example privatized value companion

Here is an example of a class which refuses to construct its default
value, and which prevents clients from seeing that state:

```
class C {
   int neverzero;
   public C(int x) {
     if (x == 0)  throw new IllegalArgumentException();
     neverzero = x;
   }
   public void print() { System.out.println(this); }

   private value companion C.val;  //privatized (also the default)

   // some valid uses of C.val follow:
   public C.val[] flatArray() { return new C.val[]{ this }; }
   private static C.ref nonConstructedZero() {
     return (new C.val[1])[0];  //OK:  C.val private but available
   }
   public static C.ref box(C.val val) { return val; }  //OK param type
   public C.val unbox() { return this; }  //OK return type

   // valid use of private C.default, with Lookup negotiation
   public static
   C.ref defaultValue(java.lang.invoke.MethodHandles.Lookup lookup) {
     if (!lookup.in(C.class).hasFullPrivilegeAccess())
       return null;     //…or throw
     return C.default;  //OK: default for me and maybe also for thee
   }
}

// non-nestmate client:
class D {
   static void passByValue(C x) {
     C.ref ref = C.box(x);   //OK, although x is null-checked
     if (false)  C.box((C.ref) null);  //would throw NPE
     assert ref == x;
   }

   static Object useValue(C x) {
     x.unbox().print();   //OK, invoke method on C.val expression
     var xv = x.unbox();  //OK, although C.val is non-denotable
     xv.print();          //OK
     //> C.val xv = x.unbox();  //ERROR: C.val is private
     return xv;  //OK, originally from legitimate method of C
   }

   static Object arrays(C x) {
     var a = x.flatArray();
     //> C.val[] va = a;  //ERROR: C.val is private
     Arrays.toString(a);  //OK
     C.ref[] a2 = a;      //covariant array assignment
     C.ref[] na = new C.ref[1];
     //> na = new C.val[1];  //ERROR: C.val is private
     return a[0];  //constructed values only
   }
}
```

The above code shows how a privatized value companion can and cannot
be used.  The type name may never be mentioned.  Apart from that
restriction, client code can work with the value companion type as it
appears in parameters, return values, local variables, and array
elements.  In this, a privatized companion behaves like other
non-denotable types in Java.

> **Rationale:** Note that a companion type is _not_ a real class.
Therefore it cannot appeal, precisely, to the existing provisions (in
JLS or JVMS) for enforcing class accessibility.  But because it is a
type, and today _nearly all types are classes_ (and interfaces), users
have a right to expect that encapsulation of companion types will
"feel like" encapsulation of type names.  More precisely, users will
hope to re-use their knowledge about how type name access works when
reasoning about companion types.  We aim to accommodate that hope.  If
it works, users won't have to think very often about the class-vs-type
distinction.  That is why the above design emulates pre-existing
usage patterns for non-denotable types.

### Privatization in translation

When a value class is compiled to a class file, some metadata is
included to record the explicit declaration or implicit status of the
value companion.

The access selection of `C`'s value companion (public, package,
private) is encoded in the `value_flags` field of the `ValueClass`
attribute of the class information in the class file of `C`.

The `value_flags` field (16 bits) has the following legitimate values:

   - zero: `C.val` default access, non-atomic
   - `ACC_PUBLIC`: `C.val` public access, non-atomic
   - `ACC_PRIVATE`: `C.val` private access, non-atomic
   - `ACC_VOLATILE`: `C.val` default access, atomic
   - `ACC_VOLATILE|ACC_PUBLIC`: `C.val` public access, atomic
   - `ACC_VOLATILE|ACC_PRIVATE`: `C.val` private access, atomic

Other values are rejected when the class file is loaded.
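
For concreteness, here is a sketch of how a tool might validate and describe the six legitimate encodings listed above. The `ValueClass` attribute is a proposal, so this is not a real class-file parser. The class `ValueFlags` and its methods are hypothetical, and it assumes the flag bits keep the numeric values they have for fields today.

```java
import java.lang.reflect.Modifier;

// Hypothetical decoder for the proposed value_flags field.
final class ValueFlags {
    // Assumption: same bit values as the existing access flags.
    static final int ACC_PUBLIC   = Modifier.PUBLIC;    // 0x0001
    static final int ACC_PRIVATE  = Modifier.PRIVATE;   // 0x0002
    static final int ACC_VOLATILE = Modifier.VOLATILE;  // 0x0040, meaning "atomic" here

    /** True only for the six combinations the text lists as legitimate. */
    static boolean isLegitimate(int valueFlags) {
        int access = valueFlags & ~ACC_VOLATILE;   // strip the atomicity bit
        return access == 0 || access == ACC_PUBLIC || access == ACC_PRIVATE;
    }

    static String describe(int valueFlags) {
        if (!isLegitimate(valueFlags)) {
            throw new IllegalArgumentException(
                "bad value_flags: 0x" + Integer.toHexString(valueFlags));
        }
        String access = (valueFlags & ACC_PUBLIC) != 0 ? "public"
                      : (valueFlags & ACC_PRIVATE) != 0 ? "private"
                      : "default";
        String atomicity = (valueFlags & ACC_VOLATILE) != 0 ? "atomic" : "non-atomic";
        return access + " access, " + atomicity;
    }

    public static void main(String[] args) {
        System.out.println(describe(ACC_VOLATILE | ACC_PRIVATE));  // private access, atomic
    }
}
```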

(**JVM ISSUE #0:** Can we kill the `ACC_VALUE` modifier bit?  Do we
really care that `jlr.Modifiers` kind-of wants to own the reflection
of the contextual modifier `value`?  Who are the customers of this
modifier bit, as a bit?  John doesn't care about it personally, and
thinks that if we are going to have an attribute we can get rid of the
flag bit.  One implementation issue with killing `ACC_VALUE` is that
class attributes are processed very late during class loading, while
class modifiers are processed very early.  It may be easier to do some
kinds of structural checks on the fly during class loading even before
class attributes are processed.  Yet this also seems like a poor
reason to use a modifier bit.)

(**JVM ISSUE #1:** What if the attribute is missing; do we reject the
class file or do we infer `value_flags=ACC_PRIVATE|ACC_VOLATILE`?
Let's just reject the file.)

(**JVM ISSUE #2:** Is this `ValueClass` attribute really a good place
to store the "atomic" bit as well?  This attribute is a green-field
for VM design, as opposed to the brown-field of modifier bits.  The
above language assumes the atomic bit belongs in there as well.)

A use of a value companion `C.val`, in any source file, is generally
translated to a use of a Q-descriptor `QC;`:

   - a field declaration of `C.val` translates to a field-info with a 
Q-descriptor
   - a method or constructor declaration that mentions `C.val` mentions 
a corresponding Q-descriptor in its method descriptor
   - a use of a field resolves a `CONSTANT_Fieldref` with a Q-descriptor 
component
   - a use of a method or constructor uses a `CONSTANT_Methodref` (or 
`CONSTANT_InterfaceMethodref`) with a Q-descriptor component
   - a `CONSTANT_Class` entry may contain a Q-descriptor or an array 
type whose element type is a Q-descriptor
   - a verifier type record may refer to `CONSTANT_Class` which contains 
a Q-descriptor

Privatization is enforced for these uses only as much as is needed to
ensure that classes cannot create uninitialized values, fields, and
arrays.

If an access from bytecode to a privatized Q-descriptor fails, an
exception is thrown; its type is `IllegalAccessError`, a subtype of
`IncompatibleClassChangeError`.  Generally speaking such an exception
diagnoses an attempt by bytecode to make an access that would have
been prevented by the static compiler, if the Java source program had
been compiled together as a whole.

When a field of Q-descriptor type is declared in a class file, the
descriptor is resolved early, before the class is linked, and that
resolution includes an access check which will fail unless the class
being loaded has access to `C.val`, as determined by loading `C` and
inspecting its `ValueClass` attribute.  These checks prevent untrusted
clients of `C` from creating non-constructed zero values, in any of
their fields.

The timing of these checks, on fields, is aligned with the internal
logic of the JVM which consults the class file of `C` to answer other
related questions about field types: (a) whether `C` is in fact a
value class, and (b) what is the layout of `C.val`, in case the JVM
wishes to flatten the value in a containing field.  The third check,
(c) whether the `C.val` companion is accessible, happens at the same time.  This is
early during class loading for non-static fields, and during
class preparation for static fields.

Privatization is _not enforced_ for non-field Q-descriptors that
occur in method and constructor signatures, and in state descriptions
for the verifier.  This is because mere use of Q-descriptors to
describe pre-existing values cannot (by itself) expose non-constructed
values, when those values are on stack or in locals.

> This can happen invisibly at the source-code level as well.  An API
might be designed to return values of a privatized type from its
methods or fields, and/or accept values of a privatized type into its
methods, constructors, or fields.  In general, the bytecode for a
client of such an API will work with a mix of Q-descriptor and
L-descriptor values.

The verifier's type system uses field descriptor types, and thus can
"see" both Q-descriptors and L-descriptors.  Clients of a class with a
privatized companion are likely to work mostly with L-descriptor
values but may also have Q-descriptor values in locals and on stack.

When feeding an L-descriptor value to an API point that accepts a
Q-descriptor, the verifier needs help to keep the types straight.  In
such cases, the bytecode compiler issues `checkcast` instructions to
adjust types to keep the verifier happy, and in this case the operand
of the checkcast would be of the form `CONSTANT_Class["QC;"]`.

(**JVM ISSUE #3:** The Q/L distinction in the verifier helps the
interpreter avoid extra dynamic null checks around `putfield`,
`putstatic`, and the `invoke` instructions.  This distinction requires
an explicit bytecode to fix up Q/L mismatches; the `checkcast`
bytecode serves this purpose.  That means checkcast requires the
ability to work with privatized types.  It requires us to make the
dynamic permission check when other bytecodes try to use the
privatized type.  All this seems acceptable, but we could try to make
a different design in which `CONSTANT_Class` resolution fails immediately
if it contains an inaccessible Q-descriptor.  That design might
require a new bytecode which does what `checkcast` does today on a
Q-descriptor.)

Meanwhile, arrays are rich sources of non-constructed zero values.
They appear in bytecode as follows:

   - A `C.val[]` array construction uses `anewarray` with a 
`CONSTANT_Class` type for the Q-descriptor; this is new to Valhalla.
   - Such an array construction may also use `multianewarray` with an 
appropriate array type.
   - An array element is read from heap to stack by `aaload`; the 
verifier type of the stacked value is copied from the verifier type of 
the array itself.
   - An array element is written from stack to heap by `aastore`; the 
verifier type of the stored value is merely constrained to the type 
`Object`.

Note that there are no static type annotations on array access
instructions.  The practical impact of this is that, if an array of a
privatized type `C.val` is passed outside of `C`, then any values in
that array become accessible outside of `C`.  Moreover, if `C.val` is
non-atomic, clients may be able to inflict data races on the array.

Thus, the best point of control over misuse of arrays is their
_creation_, not their _access_.  Array creation is controlled by
`CONSTANT_Class` constant pool entries and their access checking.
When an `anewarray` or `multianewarray` tries to create an array,
the `CONSTANT_Class` constant pool entry it uses must be consulted
to see if the element type is privatized and inaccessible to the
current class, and `IllegalAccessError` thrown if that is the case.

All this leads to special rules for resolving an entry of the form
`CONSTANT_Class["QC;"]`.  When resolving such a constant, the class
file for `C` is loaded, and `C` is access checked against the current
class.  (This is just what happens when `CONSTANT_Class["C"]` gets
resolved.)  Next, the `ValueClass` attribute for `C` is examined; it
must exist, and if it indicates privatization of `C.val`, then access
is checked for `C.val` against the current class.

If that access to a privatized companion would fail, no exception is
thrown, but the constant pool entry is resolved into a special
restricted state.  Thus, a resolved constant pool entry of the form
`CONSTANT_Class["QC;"]` can have the following states:

   - Error, because `C` is inaccessible or doesn't exist or is not a 
value class.
   - Full resolution, so `C.val` is ready for general use in the current 
class.
   - Restricted resolution, so `C.val` is ready for restricted use in 
the current class.

That last state happens when `C` is accessible but `C.val` is not.
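
The resolution rules above can be modeled as a small decision function.
This is only a sketch, not JVM code: the class `QClassResolution`, the
enum `State`, and the four boolean inputs are hypothetical names that
stand in for what the JVM would learn from loading `C` and reading its
`ValueClass` attribute.

```java
// Hypothetical model of resolving CONSTANT_Class["QC;"]; not real JVM code.
class QClassResolution {
    enum State { ERROR, FULL, RESTRICTED }

    /**
     * @param classAccessible     C exists and passes the ordinary class access check
     * @param isValueClass        C carries a ValueClass attribute
     * @param companionPrivatized the attribute marks C.val as privatized
     * @param companionAccessible the current class passes the C.val access check
     */
    static State resolve(boolean classAccessible, boolean isValueClass,
                         boolean companionPrivatized, boolean companionAccessible) {
        if (!classAccessible || !isValueClass) {
            return State.ERROR;        // inaccessible, missing, or not a value class
        }
        if (companionPrivatized && !companionAccessible) {
            return State.RESTRICTED;   // no exception; the restriction is recorded
        }
        return State.FULL;             // C.val is ready for general use
    }
}
```

Note that the companion check only matters when the companion was
privatized; a public companion of an accessible class always resolves
fully in this model.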

Likewise, a constant pool entry of the form `CONSTANT_Class["[QC;"]`
(or a similar form with more leading array brackets) can have three
states, error, full resolution, and restricted resolution.

Pre-Valhalla `CONSTANT_Class` entries which do not mention
Q-descriptors have only two resolved states, error and full
resolution.

As required above, the `checkcast` bytecode treats full resolution and
restricted resolution states the same.

But when the `anewarray` or `multianewarray` instruction is executed,
it throws an access error if its `CONSTANT_Class` is not
fully resolved (either it is an error or is restricted).  This is how
the JVM prevents creation of arrays whose component type is an
inaccessible value companion type, even if the class file does
not correspond to correct Java source code.

Here are all the classfile constructs that could refer to a
`CONSTANT_Class` constant in the restricted state, and whether they
respect it (throwing `IllegalAccessError`):

   - `checkcast` ignores the restriction and proceeds
   - `instanceof` ignores the restriction (consistent with `checkcast`)
   - `anewarray` and `multianewarray` respect the restriction and throw
   - `ldc` throws (consistent with `C.val.class` in source code)
   - bootstrap arguments throw (consistent with `ldc`)
   - verifier types ignore the restriction and continue checking
   - **(FIXME: There must be more than this.)**
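
The list above amounts to a lookup table from use site to behavior.
Here is a sketch of that table, assuming hypothetical names
(`RestrictedUses`, `Use`, `State`, `proceeds`); it covers only the
constructs listed so far and inherits the same FIXME.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical model of how each use site treats a restricted CONSTANT_Class.
class RestrictedUses {
    enum State { ERROR, FULL, RESTRICTED }
    enum Use {
        CHECKCAST, INSTANCEOF, ANEWARRAY, MULTIANEWARRAY,
        LDC, BOOTSTRAP_ARGUMENT, VERIFIER_TYPE
    }

    // Use sites that respect the restriction by throwing IllegalAccessError.
    private static final Set<Use> RESPECTS_RESTRICTION = EnumSet.of(
        Use.ANEWARRAY, Use.MULTIANEWARRAY, Use.LDC, Use.BOOTSTRAP_ARGUMENT);

    /** Returns true if the use site may proceed with a constant in the given state. */
    static boolean proceeds(Use use, State state) {
        switch (state) {
            case FULL:
                return true;
            case RESTRICTED:
                return !RESPECTS_RESTRICTION.contains(use);
            default:
                return false;   // an erroneous constant never proceeds
        }
    }
}
```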

Q-descriptors not in `CONSTANT_Class` constants are naturally immune
to privatization restrictions.  In particular, `CONSTANT_MethodType`
constants can successfully refer to mirrors of privatized companions.

Uses of `CONSTANT_Class` constants which forbid Q-descriptors and
their arrays are also naturally immune, since they will never
encounter a constant resolved in the restricted state.  These include
`new`, `aconst_init`, the class sub-operands of `CONSTANT_Methodref`
and its friends, exception catch-types, and various attributes like
`NestHost` and `InnerClasses`: All of the above are allowed to refer
only to proper classes, and not to their value companions or arrays.

Nevertheless, an `aconst_init` bytecode must throw an access error when
applied to a class with an inaccessible privatized value companion.
This is worth noting because the constant pool entry for `aconst_init`
does _not_ mention a Q-descriptor, unlike the array construction
bytecodes.

> Perhaps regular class constants of the form `CONSTANT_Class["C"]`
would also benefit slightly from a restricted state, which would be
significant _only_ to the `aconst_init` bytecode, and ignored by all
the above "naturally immune" usages.  If a JVM implementation takes
this option, the same access check would be performed and recorded for
both `CONSTANT_Class["C"]` and `CONSTANT_Class["QC;"]`, but would be
respected only by `aconst_init` (for the former) and by `anewarray` and
the other cases noted above (for the latter but _not_ the former).  On the other
hand, the particular issue would become moot if `aconst_init`, like
`withfield`, were restricted to the nest of its class, because then
privatization would not matter.

The net effect of these rules, so far, is that neither source code nor
class files can directly make uninitialized variables of type `C.val`,
if the code or class file was not granted access to `C.val` via `C`.
Specifically, fields of type `C.val` cannot be declared nor can arrays
of type `C.val[]` be constructed.

This includes class files as correctly derived from valid source code
or as "spun" by dodgy compilers or even as derived validly from old
source code that has changed (and revoked some access).

> Remember that new nestmates can be injected at runtime via the
`Lookup` API, which checks access and then loads new code that enjoys
the same access.  The level of access depends in detail on the
selection of `ClassOption.NESTMATE` (for nestmate injection) or not
(for package-mate injection).  The JVM uses common rules for these
injected nestmates or package-mates and for normally compiled ones.

There are no restrictions on the use of `C.ref`, beyond the basic
access restrictions imposed by the language and JVM on the name `C`.
Access checks for regular references to classes and interfaces are
unchanged throughout all of the above.

There are more holes to be plugged, however.  It will turn out that
arrays are once again a problem.  But first let's examine how
reflection interacts with companion types and access control.

## Privatization and APIs

Beyond the language there are libraries that must take account of the
privatization of value companions.  We start on the shared boundary
between language and libraries, with reflection.

### Reflecting privatization

Every companion type is reflected by a Java class mirror of type
`java.lang.Class`.  A Java class mirror _also_ represents the class
underlying the type.  The distinction between the concept of class and
companion type is relatively uninteresting, except for a value class
`C`, which has two companion types and thus two mirrors.

In Java source code the expression `C.class` obtains the mirror for
both `C` and its companion `C.ref`.  The expression `C.val.class`
obtains the mirror for the value companion, if `C` is a value class.
Both expressions check access to `C` as a whole, and `C.val.class`
_also_ checks access to the value companion (if it was privatized).

But it is a generally recognized fact that Java class mirrors are less
secure than the Java class types that the mirrors represent.  It is
easy to write code that obtains a mirror on a class `C` without
directly mentioning the name `C` in source code.  One can use
reflective lookup to get such mirrors, and without even trying one may
also "stumble upon" mirrors to inaccessible classes and companion
types.  Here are some simple examples:

```
Class<?> lookup() throws ClassNotFoundException {
   var name = "java.util.Arrays$ArrayList";
   //or name = "java.lang.AbstractStringBuilder";
   //> java.lang.invoke.MethodHandles.lookup().findClass(name);  //ERROR
   return Class.forName(name);  //OK!
}
Class<?> stumble1() {
   //> return java.util.Arrays.ArrayList.class;  //ERROR
   return java.util.Arrays.asList().getClass();  //OK!
}
Class<?> stumble2() {
   //> return java.lang.AbstractStringBuilder.class;  //ERROR
   return StringBuilder.class.getSuperclass();  //OK!
}
Class<?> stumble3() {
   //> return C.val.class;  //ERROR if C.val is privatized
   return C.ref.class.asValueType();  //OK!
}
```

Therefore, access checking class names is not and cannot be the whole
story for protecting classes and their companion types from reflective
misuse.  If a mirror is obtained that refers to an inaccessible
non-public class or privatized companion, the mirror will "defend
itself" against illegal access by checking whether the caller has
appropriate permissions.  The same goes for method, constructor, and
field mirrors derived from the class mirror: You can reflect a method
but when you try to call it all of the access checks (including the
check against the class) are enforced against you, the caller of the
reflective API.
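
This self-defense can already be observed today with a non-public
class.  The sketch below (the class `MirrorDefense` and method
`callSize` are illustrative names, not part of any proposal) reflects
the public method `size` through two different mirrors and reports
whether core reflection let the call through.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.List;

class MirrorDefense {
    /** Reflects size() via the given mirror and reports whether the call was allowed. */
    static String callSize(Class<?> mirror, List<?> target) {
        try {
            Method size = mirror.getMethod("size");   // reflecting the method succeeds
            return "size=" + size.invoke(target);     // invoking it checks the caller
        } catch (IllegalAccessException e) {
            return "denied";                          // the declaring class is not accessible
        } catch (NoSuchMethodException | InvocationTargetException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Called with the mirror from `Arrays.asList("a", "b").getClass()`, the
invocation is denied, because the declaring class is the non-public
`java.util.Arrays$ArrayList`.  Called with `List.class` on the same
object, the call goes through.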

> The checking of the caller has two possible shapes. Either a caller
sensitive method looks directly at its caller, or the call is
delegated through an API that requires negotiation with a
`MethodHandles.Lookup` object that was previously checked against a
caller.
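
The second shape, negotiation through a `Lookup` object, can be
sketched with the existing method `Lookup::accessClass`.  The names
`LookupNegotiation`, `Hidden`, and `permits` are made up for this
example; the point is only that the answer depends on which `Lookup`
is presented, not on who happens to call `permits`.

```java
import java.lang.invoke.MethodHandles;

class LookupNegotiation {
    static class Hidden {}   // non-public nested class, used as the protected target

    /** A full-power lookup created inside this class. */
    static MethodHandles.Lookup ownLookup() {
        return MethodHandles.lookup();
    }

    static Class<?> hiddenMirror() {
        return Hidden.class;
    }

    /** Asks the lookup whether it is allowed to access the class behind the mirror. */
    static boolean permits(MethodHandles.Lookup lookup, Class<?> mirror) {
        try {
            lookup.accessClass(mirror);
            return true;
        } catch (IllegalAccessException e) {
            return false;
        }
    }
}
```

Here `permits(ownLookup(), hiddenMirror())` succeeds, while the same
request made with `MethodHandles.publicLookup()` is refused.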

Now, if a class `C` is accessible but its value companion `C.val` is
privatized, all of `C`'s public methods and other API points are
accessible (via both companion types), but access is limited to those
very specific operations that could create non-constructed instances
(via a variable of companion type `C.val`).  And this boils down
to a limitation on array creation.  If you cannot use either source
code or reflection to create an array of type `C.val[]`, then you
cannot create the conditions necessary to build non-constructed
instances.

Reflective APIs should be available to report the declared properties
of value companions.  It is enough to add the following two methods:

   - `Class::isNonAtomic` is true only of mirrors of value companions
   which have been declared non-atomic.  On some JVM implementations it
   *may* additionally be true of `long.class` and/or `double.class`.

   - `Class::getModifiers`, when applied to a mirror of a value
   companion, will return a modifier bit-mask that reflects the
   declared access.  (This is compatible with the current behavior of
   HotSpot for primitive mirrors, which appear as if they were somehow
   declared `public`, with `abstract` and `final` thrown in to boot.)

(Note that most reflective access checking should take care to work
with the reference mirror, not the value mirror, as the modifier bits
of the two mirrors might differ.)
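
The existing behavior for primitive mirrors is easy to inspect.  This
small helper (the name `MirrorModifiers` is just for illustration)
prints the modifier mask of any mirror; a value companion mirror would
be read the same way once `getModifiers` reports its declared access.

```java
import java.lang.reflect.Modifier;

class MirrorModifiers {
    /** Renders the modifier bit-mask of a mirror as source-like keywords. */
    static String describe(Class<?> mirror) {
        return Modifier.toString(mirror.getModifiers());
    }

    /** True if the mirror's own modifier mask has the public bit set. */
    static boolean isPublicMirror(Class<?> mirror) {
        return Modifier.isPublic(mirror.getModifiers());
    }
}
```

For example, `describe(int.class)` includes `public` and `final`, as
described above for HotSpot.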

### Privatization and arrays

There are a number of standard API points for creating Java array
objects.  When they create arrays containing uninitialized elements,
then a non-constructed default value can appear.  Even when they
create properly initialized arrays, if the type is declared
non-atomic, then non-constructed values can be created by races.

   - `java.lang.reflect.Array::newInstance` takes an element mirror and 
length and builds an array.  The elements of the returned array are 
initialized to the default value of the selected element type.
   - `java.util.Arrays::copyOf` and `copyOfRange` can extend the length 
of an existing array to include new uninitialized elements.
   - A special overloading of `java.util.Arrays::copyOf` can request a 
different type of the new array copy.
   - `java.util.Collection::toArray` (an interface method) may extend 
the length of an existing array, but does not add uninitialized 
elements.
   - `java.lang.invoke.MethodHandles.arrayConstructor` creates a method 
handle that creates uninitialized arrays of a given type, as if by the 
`anewarray` bytecode.
   - The serialization API contains an operator for materializing arrays 
of arbitrary type from the wire format.
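
To make the concern concrete, here is how three of these API points
hand out default values today, using only existing types.  The class
and method names (`DefaultElementSources`, `viaNewInstance`, and so on)
are illustrative; for a privatized `C.val` the value read back would be
`C.default` rather than `0` or `null`.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.reflect.Array;
import java.util.Arrays;

class DefaultElementSources {
    /** Array::newInstance: every element starts as the element type's default. */
    static Object viaNewInstance(Class<?> elementType) {
        return Array.get(Array.newInstance(elementType, 1), 0);
    }

    /** Arrays::copyOf: lengthening pads the copy with default elements. */
    static <T> T viaCopyOf(T[] original) {
        T[] longer = Arrays.copyOf(original, original.length + 1);
        return longer[original.length];
    }

    /** MethodHandles::arrayConstructor: the handle creates an uninitialized array. */
    static Object viaArrayConstructor(Class<?> arrayType) {
        try {
            MethodHandle constructor = MethodHandles.arrayConstructor(arrayType);
            Object array = constructor.invoke(1);
            return Array.get(array, 0);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }
}
```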

The basic policy for all these API points is to conservatively limit
the creation of arrays of type `C.val[]` if `C.val` is not public.

   - `java.lang.reflect.Array::newInstance` will throw
     `IllegalArgumentException` if the element type is privatized.
     (See below for a possible caller-sensitive enhancement.)

   - `java.util.Arrays::copyOf` and `copyOfRange` will throw instead of
     creating uninitialized elements, if the element type is
     privatized.  If only previously existing array elements are
     copied, there is no check, and this is a common use case (e.g., in
     `ArrayList::toArray`).

   - The special overloading of `java.util.Arrays::copyOf` will refuse
     to create an array of any non-atomic privatized type.  (This
     refusal protects against non-constructed values arising from data
     races.)  It also incorporates the restrictions of its sibling
     methods, against creating uninitialized elements (even of an
     atomic type).

   - `java.lang.invoke.MethodHandles.arrayConstructor` will refuse to
     create a factory method handle if the element type is privatized.

   - `java.util.Collection::toArray` needs implementation review; as it
     is built on top of the previous API points, it may possibly fail
     if asked to lengthen an array of privatized type.  Note that many
     implementations of `toArray` use `Arrays.copyOf` in a safe manner, which
     does _not_ create uninitialized elements.

   - `java.util.stream.Stream::toArray`, the various `List::toArray`,
     and other clients of `Arrays::copyOf` or `Array::newInstance` need
     implementation review.  Where a generic API is involved, the
     assumption is often that non-flat reference arrays are being
     created, and in that case no outage is possible, since reference
     companion arrays can always be freely created.  For specialized
     generics with flat types, additional implementation work is
     required, in general, to ensure that flat arrays can be created by
     parties with the right to do so.

   - The serialization API should restrict its array creation operator.
     Serialization methods should not attempt to serialize flat arrays
     either.  It is enough to serialize arrays of the reference type.
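
The `copyOf` policy can be sketched as a wrapper around the existing
method.  Since no current JDK can report privatization, the check is
passed in as a hypothetical predicate `isPrivatized`; the class name
`GuardedCopy` and the choice of `IllegalArgumentException` are likewise
assumptions, not something the proposal has settled.

```java
import java.util.Arrays;
import java.util.function.Predicate;

class GuardedCopy {
    /**
     * Copies like Arrays.copyOf, but refuses to create uninitialized elements
     * when the (hypothetical) predicate reports the element type as privatized.
     */
    static <T> T[] copyOf(T[] original, int newLength, Predicate<Class<?>> isPrivatized) {
        Class<?> elementType = original.getClass().getComponentType();
        if (newLength > original.length && isPrivatized.test(elementType)) {
            throw new IllegalArgumentException(
                "copy would expose default values of " + elementType.getName());
        }
        // Truncating or same-length copies only move existing elements: no check needed.
        return Arrays.copyOf(original, newLength);
    }
}
```

The common case of copying exactly the existing elements never
consults the predicate's verdict, which matches the "no check" rule
above.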

**API ISSUE #1:** Should we relax construction rules for zero-length
arrays?  This would add complexity but might be a friendly move for
some use cases.  A zero-length array cannot expose non-constructed
values.  It may, however, serve as a misleading "witness" that some
code has gained permission to work with flat arrays.  It's safer to
disallow even zero-length arrays.

**API ISSUE #2:** What about public value companions of non-public
inaccessible classes?  In source code, we do not allow arrays of
private classes to be made, or of their public value companions.
Should we be more permissive in this case?  We could specify that
where a value companion has to be checked against a client, its
original class gets checked as well; this would exclude some use cases
allowed by the above language, which only takes effect if the
companion is privatized.  An extra check for a public companion seems
like busy-work and a source of unnecessary surprises, though.  Let's
not.

There are probably legitimate use cases for arrays of privatized
types, with which the new restrictions on the above API points would
interfere.  So as a backup, we will make API adjustments to work with
privatized array types, with an extra handshake to perform the access
check (via either caller sensitivity or negotiation with an instance
of `MethodHandles.Lookup`).

   - `java.lang.reflect.Array::newInstance` should probably be made
      caller sensitive, so it can refrain from throwing if a privatized
      element type is accessible to the caller.  (Alternatively, a new
      caller-sensitive API point could be made, such as
      `Array::newFlatInstance`.  But a new API point seems unnecessary
      in this case, and caller-sensitivity is common practice in this
      method's package.)  Note that, as is typical of core reflection
      API points, _many uses_ of `newInstance` will not benefit from
      the caller sensitivity.

   - `java.util.Arrays::copyOf` and `copyOfRange` may be joined by
     additional "companion friendly" methods of a similar character
     which fill new array elements with some other specified fill
     value, and/or which cyclically replicate the contents of the
     original array, and/or which call a functional interface to
     provide missing elements.  The details of this are a matter for
     library designers to decide.  Adding caller sensitivity to
     these API points is probably the wrong move.

   - `java.lang.invoke.MethodHandles::arrayConstructor` will be joined
     by a method of the same name on `MethodHandles.Lookup` which
     performs a companion check before allowing the array constructor
     method handle to be returned.  It will _not check the class_, just
     the companion.  Note that the use of caller sensitivity in the
     `Lookup` API is concentrated on the factory method
     `MethodHandles::lookup`, which is the starting point for
     `Lookup`-based negotiation.
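
A `Lookup`-gated array constructor might look roughly like the sketch
below.  It is only an approximation: the proposed method would check
the companion and _not_ the class, but today the only available check
is `Lookup::accessClass`, so that is used here as a stand-in.  The
class name `GatedArrayConstructor` and the `Optional` return type are
choices made for this example.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.util.Optional;

class GatedArrayConstructor {
    /** Hands out an array constructor only if the lookup may access the element type. */
    static Optional<MethodHandle> arrayConstructor(MethodHandles.Lookup lookup,
                                                   Class<?> arrayType) {
        if (!arrayType.isArray()) {
            throw new IllegalArgumentException("not an array type: " + arrayType.getName());
        }
        Class<?> element = arrayType;
        while (element.isArray()) {
            element = element.getComponentType();   // strip all array dimensions
        }
        try {
            if (!element.isPrimitive()) {
                lookup.accessClass(element);        // stand-in for the companion check
            }
        } catch (IllegalAccessException e) {
            return Optional.empty();                // refused: no handle is created
        }
        return Optional.of(MethodHandles.arrayConstructor(arrayType));
    }
}
```

The ungated `MethodHandles.arrayConstructor` performs no such check,
which is why it has to refuse privatized element types outright.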

### Miscellaneous privatization checks

Besides newly-created or extended arrays, there are a few API points
in `java.lang.invoke` which expose default values of reflectively
determined types.  Like the array creation methods, they must simply
refuse to expose default values of privatized value companions.

   - `MethodHandles::zero` and `MethodHandles::empty` will simply
   refuse to produce a result of a privatized `C.val` type.  Clients
   with a legitimate need to produce such default values can use
   `MethodHandles::filterReturnValue` and/or `MethodHandles::constant`
   to create equivalent handles, assuming they already possess the
   default value.

   - `MethodHandles::explicitCastArguments` will refuse to convert from
   a nullable reference to a privatized `C.val` type.  Clients with a
   legitimate need to convert nulls to privatized values can use
   conditional combinators to do this "the hard way".

   - The method `Lookup::accessCompanion` will be defined analogously
   to `Lookup::accessClass`.  If `Lookup::accessClass` is applied to a
   companion, it will check both the class and the companion, whereas
   `Lookup::accessCompanion` will look only at the possible
   privatization of the companion.  (Thus it can simply refer to
   `Reflection::verifyCompanionType`.)
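
The workaround for `zero` and `empty` can be sketched with existing
combinators.  This version uses `MethodHandles::constant` together with
`MethodHandles::dropArguments` (rather than `filterReturnValue`); the
names `DefaultValueHandles`, `zeroFrom`, `emptyFrom`, and `call` are
invented for the example, and the caller is assumed to already hold
the default value legitimately.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

class DefaultValueHandles {
    /** Like MethodHandles.zero(type), but built from a value the caller already holds. */
    static MethodHandle zeroFrom(Class<?> type, Object knownDefault) {
        return MethodHandles.constant(type, knownDefault);
    }

    /** Like MethodHandles.empty(type), but built from a value the caller already holds. */
    static MethodHandle emptyFrom(MethodType type, Object knownDefault) {
        MethodHandle constant = MethodHandles.constant(type.returnType(), knownDefault);
        return MethodHandles.dropArguments(constant, 0, type.parameterList());
    }

    /** Invokes a handle with the given arguments, wrapping any failure. */
    static Object call(MethodHandle handle, Object... args) {
        try {
            return handle.invokeWithArguments(args);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }
}
```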

To support reflective checks against array elements which may be
privatized companion types, an internal method of the form
`jdk.internal.reflect.Reflection::verifyCompanionType` may be defined.
It will pass any reference type (regardless of class accessibility)
and for a value companion it will check access of the companion (but
not the class itself).

### Building companion-safe APIs

The method `Lookup::arrayConstructor` gives enough of a "hook" to
create all kinds of safe but friendly APIs in privileged JDK code.
The methods in `java.util` could make use of this privileged API to
quickly adapt their internal code to create arrays in cases they are
refused by the existing methods `Array.newInstance` and
`Arrays.copyOf`.

For example, a checked method `MethodHandles.Lookup::defaultValue(C)`
may be added to provide the default value `C.default` if its companion
`C.val` is accessible.  It will operate as if it first creates a
one-element array of the desired type, and then loads the element.
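
The "one-element array" formulation translates almost directly into
code.  In this sketch the class `CheckedDefault` and its static method
are hypothetical, and `Lookup::accessClass` again stands in for the
companion check that does not exist yet; with existing types the result
is a boxed zero or `null` rather than `C.default`.

```java
import java.lang.invoke.MethodHandles;
import java.lang.reflect.Array;

class CheckedDefault {
    /** Returns the default value of the type, if the lookup is allowed to see it. */
    static Object defaultValue(MethodHandles.Lookup lookup, Class<?> type)
            throws IllegalAccessException {
        if (type == void.class) {
            throw new IllegalArgumentException("void has no default value");
        }
        if (!type.isPrimitive()) {
            lookup.accessClass(type);               // stand-in for the C.val access check
        }
        Object oneElement = Array.newInstance(type, 1);
        return Array.get(oneElement, 0);            // load the single, untouched element
    }
}
```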

Or, a caller-sensitive method `Class::defaultValue` or `Class::newArray`
could be added which checks the caller and returns the requested result.
All such methods can be built on top of `MethodHandles.Lookup`.

In general, a library API may be designed to preserve some aspect of
companion safety, as it allows untrusted code to work with arrays of
privatized value type, while preventing non-constructed values of that
type from being materialized.  Each such safe and friendly API has to
make a choice about how to prevent clients from creating
non-constructed states, or perhaps how to allow clients to gain
privilege to do so.  Some points are worth remembering:

  - An unprivileged client must not obtain `C.default` if `C.val` is 
privatized.
  - An unprivileged client must not obtain a non-empty `C.val[]` array 
if `C.val` is privatized and non-atomic.
  - It's safe to build new (non-empty, mutable) arrays from (non-empty, 
mutable) old arrays, if the default is not injected.
  - If a new array is somehow frozen or wrapped so as to be effectively 
immutable, it is safe as long as it does not expose `C.default` values.
  - If a value companion is `public`, there is no need for any 
restriction.
  - Also, unrestricted use can be gated by a `Lookup` object or caller 
sensitivity.
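
As one illustration of "the default is not injected", here is a sketch
of a copy routine that fills new slots from the client instead.  The
names `CompanionFriendlyCopy`, `copyOfWithFill`, and
`copyOfWithGenerator` are invented here.  Note that this version leans
on `Arrays.copyOf` internally, so default elements exist briefly before
being overwritten; a real implementation for flat arrays would need
privileged internals to avoid even that.

```java
import java.util.Arrays;
import java.util.function.IntFunction;

class CompanionFriendlyCopy {
    /** Lengthens or truncates, filling new slots with an explicit value. */
    static <T> T[] copyOfWithFill(T[] original, int newLength, T fill) {
        T[] copy = Arrays.copyOf(original, newLength);
        if (newLength > original.length) {
            Arrays.fill(copy, original.length, newLength, fill);
        }
        return copy;
    }

    /** Lengthens or truncates, asking a generator for each missing element. */
    static <T> T[] copyOfWithGenerator(T[] original, int newLength,
                                       IntFunction<? extends T> generator) {
        T[] copy = Arrays.copyOf(original, newLength);
        for (int i = original.length; i < newLength; i++) {
            copy[i] = generator.apply(i);
        }
        return copy;
    }
}
```

Every element of the result is either an element of the old array or a
value the client already possessed, so no new non-constructed value is
handed out.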

> In the presence of a reconstruction capability, either in the
language or in a library API or as provided by a single class,
avoiding non-constructable objects includes allowing legitimate
reconstruction requests; each legitimate reconstruction request must
somehow preserve the intentions of the class's designer.
Reconstruction should act as if field values had been legitimately
(from `C`'s API) extracted, transformed, and then again legitimately
(to `C`'s API) rebuilt into an instance of `C`.  Serialization is an
example of reconstruction, since field values can be edited in the
wire format.  Proposed `with` expressions for records are another
example of reconstruction.  The `withfield` bytecode is the primitive
reconstruction operator, and must be restricted to nestmates of `C`
since it can perform all physically possible field updates.
Reconstruction operations defined outside of `C` must be designed with
great care if they use elevated privileges beyond what `C` provides
directly.
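
A well-behaved reconstruction, in this sense, goes out through the
class's accessors and back in through its constructor.  The toy class
`CheckedRange` below is made up to show the shape; it is an ordinary
identity class, not a value class, since the point is only the
extract, transform, rebuild discipline.

```java
final class CheckedRange {
    private final int lo;
    private final int hi;

    CheckedRange(int lo, int hi) {
        if (lo > hi) {
            throw new IllegalArgumentException("lo must not exceed hi");
        }
        this.lo = lo;
        this.hi = hi;
    }

    int lo() { return lo; }
    int hi() { return hi; }

    /** Reconstruction: extract via accessors, transform, rebuild via the constructor. */
    CheckedRange withLo(int newLo) {
        return new CheckedRange(newLo, hi());
    }
}
```

Because `withLo` rebuilds through the constructor, a request such as
`new CheckedRange(1, 5).withLo(9)` is rejected instead of producing a
state the class's author never allowed.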

## Summary of user model

A value class `C` has a value companion `C.val` which denotes the
null-hostile (zero-initialized) fully flattenable value type for `C`.

Like other type members of `C`, `C.val` can be declared with an access
modifier (`public` or `private` or neither).  It is therefore quite
possible that clients of `C` might be prevented from using the
companion type.

The operations on `C.val` are almost the same as the operations on
plain `C` (`C.ref`), so a private `C.val` is usually not a burden.

Operations which are unique to `C.val`, and which therefore may
be restricted to you, are:

   - declaring a field of type `C.val`
   - making an array with element type `C.val`
   - getting the default flat value `C.default`
   - asking for the mirror `C.val.class`

Library routines which create empty flattenable arrays of `C.val`
might not work as expected, when `C.val` is not public.  You'll have
to find a workaround, such as:

   - use a plain `C` reference array to hold your data
   - use a different API point which is friendly to privatized `C.val` 
types
   - ask `C` politely to build such an array for you
   - crack into `C` with a reflective API and build your own

If you look closely at the code for `C`, you might notice that it
uses its private type `C.val` in its public API.  This is allowed.
Just be aware that null values will not flow through such API points.
When you get a `C.val` value into your own code, you can work on it
perfectly freely with the type `C` (which is `C.ref`).

If a value companion `C.val` is declared `public`, the class has
declared that it is willing to encounter its own default value
`C.default` coming from untrusted code.  If it is declared `private`,
only the class's own nest can work with `C.default`.  If the value
companion is neither public nor private, the class has declared that
it is willing to encounter its own default within its own package.

If a class has declared its companion non-atomic, it is willing to
encounter states arising from data races (across multiple fields) in
the same places it is willing to encounter its default value.

### Summary of restrictions

From the implementation point of view, the salient task is restricting
clients from illegitimately obtaining non-constructed values of `C`,
if the author of `C` has asked for such restrictions.  (Recall that a
_non-constructed value_ of `C` is one obtained without using `C`'s
constructor or other public API.)  Here are the generally enforced
restrictions regarding a privatized type `C.val`:

  - You cannot mention the name `C.val` or `C.default` in code.
  - You cannot create and load bytecodes which would implement such a 
mention.
  - You cannot obtain `C.default` from a mirror of `C` or `C.val`.
  - You cannot create a new `C.val[]` array from a mirror of `C` or 
`C.val`.
  - You cannot lengthen an existing `C.val[]` array to contain 
uninitialized elements.
  - You cannot copy an existing array as a new `C.val[]` array, if 
`C.val` is declared non-atomic.

Even so, let us suppose you are an accident-prone client of `C`.
Ignoring the above restrictions, you might go about obtaining a
non-constructed value of `C` in several ways, and there is an
answer from the system in each case that stops you:

  - You can mention `C.val` or `C.default` directly in code, in 
various ways.
  - After obtaining the mirror `C.val.class` (by one of several means), 
you can call `Class::defaultValue`, `MethodHandles::zero`, or a similar 
API point.
  - If you can declare a field of type `C.val` directly you can extract 
an initial value (or a data-race result, if `C.val` is non-atomic).
  - If you can indirectly create an array of type `C.val`, you can 
extract an initial value (or a data-race result, if `C.val` is 
non-atomic).

And there are a number of ways you might attempt to indirectly create
an array of type `C.val[]`:

  - Indirectly create it from a mirror using `Array::newInstance` or 
`Arrays::copyOf` or `MethodHandles::arrayConstructor` or another similar 
API point.
  - Create it from a pre-existing array of the same type using 
`Object::clone` or `Arrays::copyOf` or another similar API point.
  - Specify such an array on a serialization wire format and deserialize 
it.

Using `C.val` or `C.default` directly is blocked if `C` privatizes its
value companion, unless you are coding a nestmate or package-mate of
`C`.  These checks are applied both at compile time and when the JVM
resolves names, so they apply equally to source code and bytecodes
created by any means whatsoever.

There are no realistic restrictions on obtaining a mirror to a
companion type `C.val`.  (Accidental and casual direct use of
`C.val.class` is prevented by access restrictions on the type name
`C.val`.  But there are many ways to get around this limitation.)
Therefore any method or API which could violate the above generally
enforced restrictions must perform an appropriate dynamic access check
on behalf of its mirror argument.

Such a dynamic access check can be made negotiable by an appeal to
caller sensitivity or a `Lookup` check, so a correctly configured call
can avoid the restriction.  For some simple methods (perhaps
`Arrays::copyOf` or `MethodHandles::zero`) there is no negotiation.
Depending on the use case, access failure can be worked around via a
"negotiable" API point like `Lookup::arrayConstructor`.
