Recursively comparing

Wed Feb 28 10:56:24 UTC 2024

Thanks, John, for the usual clear, precise and informative exegesis.

I have been lurking on valhalla-dev for over 5 years now precisely so I 
can catch 'public service announcements' like this one that pop up every 
now and again from John, Brian and others. Likewise on other lists. I am 
tempted to propose a new OpenJDK mail list, oracle.openjdk.org (note 
that the 'oracle' component implies revelation not company affiliation) 
where the rules of engagement specify a single weekly post from a 
suitably knowledgeable JDK or JVM contributor explaining the rationale 
and accompanying history for why some specific aspect of OpenJDK works 
the way it does/is the way it is.

Probably not workable. Nevertheless it is immensely valuable for 
knowledge & understanding like this to be shared.

regards,

Andrew Dinn
-----------
Red Hat Distinguished Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill

On 28/02/2024 06:28, John Rose wrote:
> This is an interesting discussion.  I’m not surprised
> that people are surprised by ==/acmp semantics.
> We surprised ourselves, at first.
> 
> On 27 Feb 2024, at 17:04, John Bossons wrote:
> 
>> Hi Remy,
>>
>> Even for a value object field in which the stored VO value is a reference?
> 
> Yes, that is Remy’s point.  It doesn’t matter how the virtual
> machine concretely stores the value.  The semantics are derived
> from virtual data, not physical data.
> 
> The VM makes optimization decisions below the “virtual metal”.
> Users should expect that these decisions have no effect on program
> semantics (and so program outputs), other than perhaps speed
> or footprint.  How would users like it if their methods with
> floating point computations got slightly different results,
> after the JIT kicked in?  Answer: It would be a bug, and they
> wouldn’t.
> 
> It’s the same with the VM’s choice of how to implement the
> virtual semantics of values, using a variety of techniques,
> including “buffering” as a separate node in the heap.
> “Buffering” is like what we have called “boxing” in the past.
> But it’s different: The identity of a “box” is sadly exposed
> in VM semantics, while a “buffer”’s identity is suppressed.
> 
> If a value is not flattened in its container, it may instead
> be buffered on the heap, with a pointer left behind in the
> container.  (It might also be lifted to registers after
> inlining and escape analysis; this is a different kind of
> flattening we call “scalarization”.)  But the meaning of
> ==/acmp is the same regardless of details of physical
> representation of its operands.
> 
>> As I understand the spec, it's up to the JVM to choose how to handle a
>> value object (even a null-restricted VO), depending on its size. If a VO
>> field value is a VO, it's optional as to whether the JVM stores that field
>> value as a reference or as a value. What the JVM does will depend on the VO
>> size, right?
> 
> Sort of right, but also an oversimplification.  There are many
> factors besides size.  Another factor can be whether the container
> is final or not, and yet another is whether it is volatile or not.
> As a rule of thumb, users should always expect the VM to surprise them,
> if they decide to open the hood and peek below the virtual metal.
> 
>> If a VO field value is stored as a reference, calling == on
>> that reference will behave the same way as calling == on an identity object
>> reference.
> 
> I am at a loss to understand the grounds of your confident assertion
> here.  In any case, it is false.  Consider the consequences of such
> a design:
> 
> A. The VM will unpredictably choose to buffer, flatten, or scalarize
> values.
> 
> B. These decisions are partially dependent on heap layout policy, which
> is stable over time (at this current date) but also dependent on JIT
> activity, which varies over time.
> 
> C. Therefore, from run to run, or even within one run, a value which
> compares equal to itself at one point will compare unequal to
> itself later on, or vice versa.  We think users would not welcome
> this kind of uncertainty in Java code.
> 
> <history>
> 
> Full disclosure:  About 10 years ago, I thought we could get away
> with it.  We’d have to tell users not to fully trust the results
> of ==/acmp.  Instead they must learn to always follow up == with
> a call to Object::equals.  I thought this might be tolerable
> because users already do this as a matter of habit.
> 
> The basic rule would have been, if two values compare ==,
> you know they must be the same value, but if they don’t,
> they still might be the same value.  You must call equals
> if you need the accurate result.  So doing the extra call
> restores predictability, of the expression as a whole.
> 
> I called this the Heisenbox Model, because you would always
> have a kind of “uncertainty principle” about whether a given
> value would be equal to itself, or whether it would suddenly
> turn into two separately buffered copies, whose pointers would
> be suddenly different.
> 
> This can happen, for example, if JIT code deoptimizes to the
> interpreter, and the two copies of the same value are buffered
> separately, perhaps because they are in two stack frames
> related by inlining.  Reminder: You will be surprised if
> you peek inside the VM.  It is the VM’s responsibility to
> keep an evenhanded pretense of consistent behavior, whatever
> its internal gymnastics.
> 
> So, changing execution paths to looking at values from
> a new angle might cause a deopt, and change the result
> of your comparison.  Observation collapses the
> configuration to a random outcome… voilà Heisenboxes.
> (Apologies to the real physicists.)
> 
> FTR, my first public writing on value classes in 2012
> despaired of assigning any kind of predictable value
> to ==/acmp, which anticipated Heisenboxes as well:
> https://cr.openjdk.org/~jrose/oblog/value-types-in-the-vm.html
> I am glad most of it has proven wrong, including
> the bit about ==.  My colleagues finally beat it
> out of me. The last gasp was in 2016:
> https://bugs.openjdk.org/browse/JDK-8163133
> 
> One reason I thought we could get away with Heisenboxes is
> there is a similar behavior, historically, in Common Lisp.
> The EQ predicate distinguishes certain numeric values that
> are numerically indistinguishable, if they happen to be
> buffered separately.  The EQV predicate fixes that, at a
> higher cost.  (And EQ and EQV are the same on Lisp’s version
> of identity objects, which is basically all objects except
> numbers.)  So, relative to Lisp, our decision is that values
> are always compared by value, like EQV, and never by buffer
> identity like EQ.  Dropping EQ from the model takes out a
> bunch of shifty uncertain behavior; now you can trust ==/acmp
> to do the same thing all the time for the same inputs.
> 
> </history>
> 
>> If the VO is stored as a value (a set of values which could
>> include another VO. field), then the == will be recursive. Am I incorrect?
> 
> Regardless of how the VM stores the VO, the == will be recursive.
> 
> Some find the following corollary to this principle surprising:
> 
> Neither the type nor the implementation of the variable containing
> the value affects the equality comparison.  Therefore, if you have
> two Object pointers compared with ==/acmp, and they are both values
> (of the same class), there will be a recursive descent.  It doesn’t
> matter that you thought you have mere Object pointers.  The VM
> does the same work for value comparison regardless of context.
> 
> This also means that if a value contains an Object field, then
> the recursive ==/acmp on a pair of such values will, in fact,
> test the corresponding Object references (as if by ==/acmp)
> and recurse if and only if the two objects are both values
> of the same class.
> 
> Finally, this also means that if one has a value with two
> Object fields, an ==/acmp operation might possibly perform
> an unbounded recursive descent of a tree.  At this point,
> a class designer might have to think hard about what is
> the desired behavior.  One way to break the recursive
> descent is to use an intermediate heap node that is an
> identity object, such as an array.  That is in fact how
> recursive structures are often written.  But for a pure
> immutable fixed-arity tree class, refactoring to a value
> will involve consideration of the cost of ==/acmp.  The
> design of such classes (such as HAMTs) is rarely attempted,
> and by experts when done at all.  I am not worried about
> such experts being overwhelmed by the odd scaling of
> recursive descent in their exquisitely tuned classes.
> 
> So we have been thinking about all this diligently and
> carefully for a decade, and JEP 401 is the result.
> 
> Some might think it mandates too much recursion for ==/acmp.
> Users should prepare, however, to be surprised by what the VM
> will do to make such things fast.  Remember the old
> conventional wisdom about bytecode-based VMs and pervasive
> virtual calls:  Java was supposed to be slow because of
> such things… until it wasn’t slow anymore.  One promising
> idea (IMO) for reducing ==/acmp costs is folding them with
> equivalent tests in Object::equals:
> https://bugs.openjdk.org/browse/JDK-8255024
> 
> So VM internals are surprising.  That’s why (a) we are
> required, and (b) we are able, to design high-level VM
> semantics, with no uncertainties. Even if they might
> seem “too expensive” at first blush.
> 
> I hope these perspectives will be useful.  They have
> been hard-won.
> 
> — John