[lworld] RFR: 8361352: [lworld] Weird interaction between OSR and value class instances

Fri Aug 22 14:16:16 UTC 2025

First, credit to @TobiHartmann for the diagnostic, and a lot of the solution.

# Diagnostic

According to [Strict Field Initialization JEP](https://openjdk.org/jeps/8350458), when a strict field is being initialized, it is not quite immutable, but observally immutable: at first, the field can be only set (during the early larval phase), then it can be only read (late larval or initialized phase), so the last set is the actual value one can ever observe.

The interesting part is that in early larval phase, a field can be subject to some side effects. When applied to a value object, that means that until it reaches the unrestricted state, it is not yet immutable. While being not theoretically necessary, avoiding scalarization and keeping the value object behind a reference is a convenient way to make sure that side effects are correctly applied. This strategy means that we shouldn't scalarized before reaching the unrestricted state. Normally, in C2, finding out what is early larval or not is the job of bytecode parsing, but in OSR compilation, everything about the StartOSR is not parsed, and thus some objects are soundly assumed that they might be larval, when they actually aren't. In the reported example, that leads to drastic performance difference between OSR and non OSR compilation: the second one is able to eliminate allocations since it knows more precisely when the value object can be scalarized.

In the original example:

public value class MyNumber {
    private long d0;
    private MyNumber(long d0) { this.d0 = d0; }
    public MyNumber add(long v) { return new MyNumber(d0 + v); }

    private static void loop() {
        MyNumber dec = new MyNumber(123);
        for (int i = 0; i < 1_000_000_000; ++i) {
            dec = dec.add(i);
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; ++i) {
            loop();
        }
    }
}

OSR happens in the loop in `loop`, but here `dec` is not detected to be unrestricted (so immutable, so scalarizable), so the allocation in inlined `add` still needs to happen because we need the buffer for the new `dec`. The first iteration traps at the exit of the loop (unstable if), OSR happens again, followed by a non-OSR compilation, finding correctly that `dec` can be scalarized in the loop, making the third iteration fast.

# Solution

Overall, the solution requires to improve our detection of early larval values. Since we keep parsing as-is, let's do that in `ciTypeFlow`. The logic is quite simple: `ciInlineKlass` can be wrapped in `ciWrapper` to mark when they are known to be non-null. We extend `ciWrapper` to also be able to tell when a value is early larval (which also implies non-null since early larval means that `new` happened).

Then, it's almost a state machine! How is "early larval" introduced:
- `new` bytecode
- receiver of a constructor

How larval is removed:
- call to constructor

Other ways to obtain values are safe (I've experimented and it seems the verifier was keeping me from doing something bad):
- returned values are not early larval
- parameters are not early larval
- receiver of everything but constructor are not early larval

And other guarantees:
- calling twice the constructor is not legal
- merging paths with inconsistent state (early larval vs. unrestricted) is also disallowed

Let's detail how to remove the early larval state on a `ciInlineKlass`. When invoking a method on a receiver, we know that the receiver is not early larval after: either it's a constructor (which initialize the object), or it's a regular method, and the verifier makes sure the receiver was already not early larval before. The tricky part is to remove the larval state on all the copies of the reference: when doing something such as

new SomeValueClass
dup
dup
astore 1
invokespecial <init>

we still have a copy of the same reference in local 1 and on the stack. All of them now refers to a non-larval value. To take that into account, one must scan all the cells (locals and stack), and remove the early larval wrapper to all the types sharing the same ident  (see `void ciBaseObject::set_ident(uint id)` and `void ciObjectFactory::init_ident_of(ciBaseObject* obj)`). That is enough since early larval values cannot be contained in another object (since we can't make use of them before reaching unrestricted state, see also [Flexible Constructor Bodies](https://openjdk.org/jeps/513)).

Another trick is about the super constructor invocation in a constructor: is that different from a constructor invocation outside of a constructor. Well yes, but actually no:
- outside a constructor, a constructor invocation leaves the object initialized
- in a constructor, a constructor invocation doesn't make the object initialized because the all constructor is not done. But, calling the base class' constructor ensures that `Object.<init>` is being called, climbing up the hierarchy. Once `Object`'s constructor has been reached, the object is late larval.

In both case, after a constructor invocation, the object is unrestricted (late larval or initialized), and in the case of a value object, immutable, so scalarizable, which is what matters here.

The case of an abstract value class needs a bit of hack. An abstract value class is an `InstanceKlass` and a `ciInstanceKlass`, so we should take care of not doing the aforementioned logic only for `ciInlineKlass`, but also for `ciInstanceKlass` that are not identity classes.

# Tests

Testing is not quite easy. First, there is a microbenchmark.

Running

make test TEST="micro:org.openjdk.bench.valhalla.loops.osr.LarvalDetectionAboveOSR" CONF_NAME=linux-x64

before the fix would give

Iteration   1: 1874.498 ms/op
Iteration   2: 1662.409 ms/op
Iteration   3: 204.573 ms/op
Iteration   4: 201.177 ms/op
Iteration   5: 201.526 ms/op
Iteration   6: 200.718 ms/op
Iteration   7: 198.127 ms/op
Iteration   8: 200.535 ms/op
Iteration   9: 201.079 ms/op
Iteration  10: 203.341 ms/op

and now

Iteration   1: 212.941 ms/op
Iteration   2: 225.214 ms/op
Iteration   3: 204.319 ms/op
Iteration   4: 199.884 ms/op
Iteration   5: 201.875 ms/op
Iteration   6: 202.331 ms/op
Iteration   7: 200.829 ms/op
Iteration   8: 200.794 ms/op
Iteration   9: 200.428 ms/op
Iteration  10: 203.278 ms/op

That's better! For this microbench, we need to do everything that we shouldn't usually do: no warmup, single measure... Indeed, as soon as we do more than OSR, the effect disappear.

Otherwise, it's nice to see that the allocation is not there anymore in OSR. That could be an IR framework test, but alas, IR framework doesn't support OSR so easily. The test is thus running a subprocess and parsing the output quite basically. This should be ported to an IR test once the IR framework supports OSR.

Thanks,
Marc

-------------

Commit messages:
 - cleanup
 - abstract value class
 - Adjust microbench + parameters are never early larval
 - Less printing for testing
 - Trying
 - working....

Changes: https://git.openjdk.org/valhalla/pull/1531/files
  Webrev: https://webrevs.openjdk.org/?repo=valhalla&pr=1531&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8361352
  Stats: 424 lines in 12 files changed: 386 ins; 18 del; 20 mod
  Patch: https://git.openjdk.org/valhalla/pull/1531.diff
  Fetch: git fetch https://git.openjdk.org/valhalla.git pull/1531/head:pull/1531

PR: https://git.openjdk.org/valhalla/pull/1531