Rules for larval value object construction

Sat Dec 2 07:35:43 UTC 2023

----- Original Message -----
> From: "John Rose" <john.r.rose at oracle.com>
> To: "daniel smith" <daniel.smith at oracle.com>
> Cc: "valhalla-spec-experts" <valhalla-spec-experts at openjdk.java.net>
> Sent: Friday, December 1, 2023 11:01:42 PM
> Subject: Re: Rules for larval value object construction

> Thanks for the update Dan. I am very encouraged with the increased simplicity
> and clarity of our new primitive notions.
> 
> The original concept of value classes factors into final classes with strictly
> initialized finals and no non-final instance fields, plus identical
> restrictions on all their supers. By fiat they also are given identity masking
> and, optionally, implicit zero-construction and, optionally again, relaxed
> field consistency.  This constellation of features unlocks a new set of
> optimizations.

For the identity class, I do not think we can avoid an opt-in mecanism given that the semantics of calling the super constructor early or late is different
    /*abstract or not*/ class A {
      A() {
        System.out.println("side effect " + this);
      }
    }
    final class B extends A {
      final int value;

      B(int value) {
        super();  // early super call (inserted by the compiler)
        this.value = value;
        super();  // or late super call (inserted by the compiler)
      }
    }

If we use an opt-in mechanism, I would prefer it to be a class-wide opt-in. 
Because, we have already the concept of stricter final fields, final fields of hidden classes and records are not modifiable by reflection and I do not really see the point of mixing early initialized final fields and late initialized final fields.

So let say we have a __strict__ keyword on classes. A __strict__ class must have all its final fields initialized before calling the super-constructor. For a constructor of a __strict__ class, the compiler insert the call to super() at the end. If people want to use "this" after the initilization of the fields, they are required to write the super call explicitly.

At VM level, we have a modifier bit on the class saying the class is __strict__.

All value classes are __strict__ by default, and their super classes must be strict too. Record are modified to be __strict__ by default. All non-mutable classes of the packages java.lang/java.util like java.lang.Enum, java.lang.String are modified to be __strict__.
At some point, javac/IDEs should warn users to declare the class __strict__ if there is at least one final field to raise awareness.

Later, like we have done with ACC_SUPER, we can mandate all classes to be __strict__ for a specific version of the class file.

Obviously, we can discuss to take a shortcut a jump to mandating all classes to be __strict__ without introducing a keyword only using the class file version, but my fear is that people will remove the "final" keyword if the compiler starts to ask all final fields to be initialized before the super() call, but i may be wrong. 

regards,
Rémi

> 
>> On Nov 28, 2023, at 5:03 PM, Dan Smith <daniel.smith at oracle.com> wrote:
>> 
>> Here's an update on how these ideas about construction have panned out as we've
>> dug more into them. Some of the details are still in flux, but I'll try to
>> highlight what's settled and what's still being explored.
>> 
>>> On Sep 19, 2023, at 5:24 PM, Dan Smith <daniel.smith at oracle.com> wrote:
>>> 
>>> We've spent some time investigating the idea of larval value object
>>> construction, and are enthusiastic about this change. Here are some details.
>>> 
>>> 1) In the JVM, we no longer need the <vnew> special methods or the 'aconst_init'
>>> and 'withfield' instructions. Drop them.
>> 
>> Yep, check.
>> 
>>> 2) Value objects are constructed just like identity objects, via field mutation.
>>> However, the "object" being mutated is in a larval state and we want to give
>>> implementations a lot of flexibility in how these larval objects are
>>> implemented. Thus, the language and JVM must ensure that a larval object is
>>> never observable—it can be written to but not read.
>> 
>> Yes. This is the big idea: constrain access to the mutable larval instance, only
>> allow programs to observe the instance once we no longer need
>> mutation/identity.
>> 
>>> 3) New language concept/modifier: a 'regulated' constructor. (Trying a name out
>>> here, haven't put a ton of thought into naming. Feel free to bikeshed.) The
>>> 'regulated' keyword can be applied to any constructor in any class.
>>> 
>>> 4) A 'regulated' constructor promises not to make any use of 'this'. An error
>>> occurs if this promise is violated in any of a number of different ways: direct
>>> reference, instance field access (except for assignment targets), instance
>>> method invocation, implicit use as an enclosing instance, any of these
>>> occurring in an initializer expression/block, and, importantly, a 'super()' or
>>> 'this()' call to a different non-regulated constructor. These checks are very
>>> similar to the longstanding rules for "pre-construction contexts" clarified by
>>> JEP 447.
>> 
>> After taking a serious look at the design of object construction, we've refined
>> this approach. Rather than putting a blanket restriction on 'this' references,
>> we want to encourage object state to be initialized *before* the 'super()'
>> call, guaranteeing that it will be properly initialized before the language &
>> verification rules allow any use of 'this'. Then we don't need to impose any
>> restrictions on 'this'.
>> 
>> So, for example, the fields of a value class must be initialized early.
>> (Bikeshed: the fields and the class that declares them require "strict
>> initialization", or something like that. Note, FWIW, that the JVM has an unused
>> ACC_STRICT flag lying around...) We can relax the language rules to allow
>> *writes*, but not *reads*, to instance fields in a JEP 447 "pre-construction
>> context".
>> 
>> value class Foo {
>>    int x; int y;
>> 
>>    public Foo(int z) {
>>        this.x = z / 5;
>>        // can't use 'this' here: if (x > 0) ...
>>        this.y = z % 5;
>>        super();
>>        if (this.x == this.y) ... // no problem, can use 'this'
>>    }
>> 
>> }
>> 
>> The concept of writing to fields before the 'super()' call may seem
>> uncomfortable at first, but it has been allowed by the JVM for a long time.
>> Enclosing instances of inner classes, in particular, have always been "strictly
>> initialized", with javac generating the bytecode to set the field before it
>> calls 'super()'.
>> 
>> In a simple constructor without a 'super()' call, the implicit call no longer
>> happens right at the start, allowing the fields to be properly initialized.
>> 
>> value class Foo {
>>    int x; int y;
>> 
>>    public Foo(int z) {
>>        this.x = z / 5;
>>        // can't use 'this' here: if (x > 0) ...
>>        this.y = z % 5;
>>        // implicit: super();
>>    }
>> }
>> 
>> One open question we have about implicit 'super()': if there are extra
>> statements after all fields are initialized, do we go ahead with the 'super()'
>> call as soon as possible? Or do we have a blanket rule that says 'super()'
>> always goes last? There are pros and cons in both directions.
>> 
>> 'this'-calling constructors are similar, except that they're not allowed to
>> write to final fields, because that's the responsibility of the delegated-to
>> constructor. But they can freely use 'this' after the 'this()' call.
>> 
>> In corpus analysis, we've found that, for final fields whose initialization
>> doesn't depend on 'this', the timing of super constructor execution can be
>> proven, using some pretty simple heuristics, to be irrelevant in the
>> overwhelming majority of cases. (E.g., maybe the subclass just assigns a
>> parameter to the field; or maybe the superclass constructor is empty.)
>> 
>> Can we generalize this strict initialization capability to 'final' fields in
>> other classes? Sure! Allowing assignments to final fields (and perhaps fields
>> in general) before an explicit 'super()' is straightforward. For compatibility
>> reasons, it gets trickier to adjust the timing of implicit 'super()' calls, but
>> there may be something we can do, maybe automatically or maybe with an opt in.
>> 
>> And note the benefits on the other side: a 'final' field that is guaranteed
>> never to be observed to mutate (again, maybe call it a "strict final field")
>> can be optimized in ways that our existing "mostly final" fields cannot. As one
>> example, flattened strict-final fields will never experience read/write races,
>> so need not have atomic encodings.
>> 
>>> 5) The no-arg constructor of 'Object' is 'regulated'. The default constructor of
>>> any other class (the one you get if you declare no constructor) is implicitly
>>> 'regulated' if the superclass's no-arg constructor is 'regulated'. (A slight
>>> incompatibility here that I think we can tolerate: if you have an implicit
>>> constructor but also an initializer that depends on 'this', an error occurs.)
>> 
>> In general, there's no longer a "regulated" property to worry
>> about/infer/advertise. If you safely initialize your fields before calling
>> 'super()', you don't care what your superclass constructor does. Strict
>> initialization is a local, implementation-level property of the class.
>> 
>>> 6) Every constructor of a value class is implicitly 'regulated'. (Similar to the
>>> rule that says every instance field of a value class is 'final'.)
>> 
>> Value classes must be "strictly initialized"—i.e., the (always-final) fields of
>> value classes must be initialized early.
>> 
>> For superclasses of (concrete) value classes, we rely on the existing notion of
>> identity/value/unrestricted abstract classes: value classes cannot extend
>> identity classes; and in the value/unrestricted cases, all fields (if any) must
>> be final and early-initialized. (Still working out details for how to
>> infer/declare which category an abstract class belongs to.)
>> 
>>> 7) In the class file, there's an ACC_REGULATED flag for methods, only allowed to
>>> be applied to <init> methods.
>>> 
>>> 8) At class loading, an ACC_VALUE class requires that its constructors all be
>>> ACC_REGULATED.
>>> 
>>> 9) Verification ensures that 'this' is never read from or passed out of an
>>> ACC_REGULATED <init> method. Specifically, in such a method, the type of 'this'
>>> is 'uninitializedThis' even after the invokespecial super/this call.
>>> Verification also ensures that the target of an invokespecial super/this call
>>> is ACC_REGULATED. (See detailed rules below.)
>> 
>> Still working out JVM details, but for runtime safety, what we need is some way
>> to assert that a final field must be early-initialized, and then a mechanism to
>> reject attempts to do 'putfield' after the 'super()' call.
>> 
>> Tentatively:
>> 
>> - Final instance fields could be marked ACC_STRICT. Or maybe we could infer this
>> property based on context (like ACC_VALUE on the class).
>> 
>> - A linkage error already rejects 'putfield' outside of a final field's
>> constructor. So we just need to manage 'putfield' within <init>. And the
>> verifier already distinguishes between "early" and "late" putfields (because
>> the types involved are different). It would be trivial to add a verifier
>> condition that the target of a "late" putfield must not be ACC_STRICT.
>> 
>> - Alternatively, we have some ideas about tracking a "larval" state on objects,
>> and dynamically rejecting a 'putfield' to an ACC_STRICT field of a non-larval
>> object.
>> 
>>> 10) The point at which control returns to a invokespecial of an <init> method
>>> that *doesn't* represent a super/this call is the point at which a larval
>>> object becomes promoted to a "real" value object and the verification type can
>>> be 'LValueClass;'. No spec changes here, other than conveying that concept.
>> 
>> As suggested by the above rules, this approach moves the larval-->adult
>> transition a bit earlier, to the entry point of the Object.<init> constructor.
>> This is expressed with the existing verification type system, which transitions
>> from 'uninitializedThis' to 'LFoo;'.
>> 
>>> Some benefits of this approach:
>>> - Garbage collecting special new opcodes and the <vnew> methods, a big
>>> simplification
>>> - Full binary compatibility when a class is refactored from identity to new, or
>>> vice versa
>>> - (I think) ability to translate an identity class to a value class by simply
>>> flipping a flag bit
>>> - Surfacing of a useful general-purpose concept: constructors that can be
>>> counted on not to leak 'this'
>>> - Fewer restrictions on value objects' superclass constructors: allow code,
>>> access control, and checked exceptions
>>> - Support for instance fields in superclasses of value objects (!)*
>>> 
>>> (*We can logically allow for superclass fields, anyway. It's possible we'll
>>> decide there are implementation constraints that prevent implementing this
>>> immediately.)
>> 
>> Basically the same list. Early initialization does require some change to the
>> <init> bytecode, so migration is not as simple (at the bytecode level) as
>> adding 'ACC_VALUE'. But since most value class candidates don't need an
>> explicit super() call, at the source level migration may be as easy as adding
>> the 'value' modifier.
>> 
>> The (very useful) general-purpose concept here is "strict construction" or
>> "strict-final fields".
>> 
>> This approach asks *even less* of value objects' superclasses: if they have
>> fields, those will need to be early-initialized. Beyond that, the constructors
>> can do whatever they want.
>> 
>>> One risk is that ACC_REGULATED methods must not be instrumented in ways that
>>> make use of 'this'. This restriction applies to the constructor of Object. What
>>> are the chances this breaks somebody's tooling?
>> 
>> This is one motivation for the shift in strategy: it actually seems fairly
>> common for instrumentation users to want to add some this-dependent code to
>> Object.<init>. (There are examples of this in our own documentation, in
>> fact...)
>> 
>> If a value class is strictly constructed, there is no problem with executing
>> arbitrary instrumentation code into the Object.<init> constructor. That code
>> should run just fine.