Valhalla basic concepts / terminology

Fri May 22 22:13:18 UTC 2020

I like this discussion!  Smart questions and solid answers all the way through.

Weaving in my $0.02…

On May 22, 2020, at 12:36 PM, Brian Goetz <brian.goetz at oracle.com> wrote:
> 
> Hi Kevin!
>> 
>> 	• There are two kinds of objects/instances; the notions "object" and "instance" apply equally to both kinds. These are "inline objects" and "identity objects". Statements like "it's an instance, so that means it's on the heap" and "you can lock on any object" become invalid, but statements like "42 is an instance of `int`" are valid.
> 
> Correct.  
> 
> From a pedagogical perspective, it's not clear whether we are better off framing it as a partitioning (there are two kinds, red and blue) or that some objects have a special property (in addition to their state, some objects have a special hidden property, its identity.)   
> 
> We have been going down the former path, but I am starting to think the latter path is more helpful; rather than cleaving the world of objects in two, instead highlight how some (many!) objects are "special”.  

This mirrors our voyage through various early versions of the Q+L design,
which surfaced (in the VM, if that’s surfacing) the two colors, on equal footing.
Then we went to L-world, which submerged the differences under L, with
Q-types peeking out only when absolutely necessary. This taught us precisely
how similar the two kinds are, and how they can be handled under the L-type
rubric.  At this point we said goodbye to designs which would include
things like Q-Object as the top of all Q’s disjoint from L-Object.  Now we
have a Q-XOR-L design, where every name is either one or the other.

Retaining the previous insights, we now know that there are abstract
types (and quasi-abstract Object and maybe others) which admit *values*
of both colors, dynamically distinguished.  After a little more OO analysis,
we realized that identity objects are a sub-type of all possible objects,
because they have extra operations (synch, side effecting state).  The
inline objects are… objects without those operations.  Which in OO
terms is a super-type, not a disjoint type.

In classic OO discourse, you partition the universe of objects into a
bunch of concrete classes, *and* you group those same objects
by their super-classes (and interfaces, etc.).  So the objects (instances,
values even) are always only of one concrete class (what Object::getClass
returns) but abstractly they match to various other types.  And the OO accounts
confuse us when classes (per se) are used to build both classifications:
The disjoint union of values, and the overlapping classifications of
types.
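
A minimal sketch of those two classifications in today's Java, using
only standard library classes.  The class and method names
(Classifications, concreteClassOf, matchingTypes) are made up for
illustration: getClass() reports the one concrete class of an instance,
while Class::isInstance shows the overlapping types it also matches.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.RandomAccess;

class Classifications {
    // The single concrete class of an instance: the disjoint partition.
    static Class<?> concreteClassOf(Object o) {
        return o.getClass();
    }

    // Counts how many of the given types the instance matches:
    // the overlapping, abstract classification.
    static int matchingTypes(Object o, Class<?>... types) {
        int count = 0;
        for (Class<?> t : types) {
            if (t.isInstance(o)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Object o = new ArrayList<String>();
        // Exactly one concrete class.
        System.out.println(concreteClassOf(o)); // class java.util.ArrayList
        // The same instance matches all four of these types.
        System.out.println(matchingTypes(o,
                Object.class, Collection.class, List.class, RandomAccess.class)); // 4
    }
}
```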

And when I say “type” (what an overloaded term that is!) I’m talking
mainly about variables, and the value-sets that they are configured
to permit.

In the world of instances, you have identity and inline.  In the world
of variables (types), you can have variables which are exactly tied to
a particular identity or inline class, or which can hold either.
(With variables which might refer to identity objects, null is also
a potential value.  A day may come for types like String! but this
is not that day.)  What about a variable which can hold (a reference
to) an instance of *any* class that is an identity class?  Yes, we want
that because there are special operations on such classes, as noted
above.  We currently inject this type as a magic interface.  What
about a variable which can hold an instance of any class that is
an inline class?  That doesn’t make sense from the POV of types
which support operations common to some set of instances,
because inline classes are simpler (less colorful?) than identity
classes.  Right now, to me, having an InlineObject type makes
as much sense as having a not-List type, whose value set
is everything that *doesn’t* act like a list.  OO hierarchies
are built up additively, guided by functions and contracts;
they are not built on exclusions.  (You can have disjoint
unions, as with sealed classes, so we *could* define some
sort of sealing condition imposed on concrete classes, as
{Object…} = DISJOINT_UNION({null}, {inline…},
{identity…}).  But the use cases we know about aren’t
asking for it, and it adds more complexity to the story
of Java’s type hierarchies without visible benefit.)
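
The additive view can be roughly modeled in plain Java with an ordinary
marker interface.  This is only a sketch under loud assumptions: the
real identity type is injected by the compiler and JVM, which is not
reproduced here, and every name below (HasIdentity, Counter, Pt) is
hypothetical.  In today's Java both classes are still identity classes;
Pt merely stands in for an inline class.

```java
class AdditiveMarkerSketch {
    // Hypothetical stand-in for the injected identity marker type.
    interface HasIdentity {}

    // Stands in for an identity class: it has side-effecting state,
    // so it carries the marker.
    static final class Counter implements HasIdentity {
        int count;
    }

    // Stands in for an inline class: immutable, and it simply lacks the
    // marker.  There is no "InlineObject" counterpart to implement.
    static final class Pt {
        final int x, y;
        Pt(int x, int y) { this.x = x; this.y = y; }
    }

    // A variable of type Object admits both kinds; the distinction is
    // made dynamically.
    static boolean supportsIdentityOps(Object o) {
        return o instanceof HasIdentity;
    }

    // A variable of the marker type admits only the kind that has the
    // extra operations (modeled here as mutation).
    static void bump(HasIdentity h) {
        if (h instanceof Counter) {
            ((Counter) h).count++;
        }
    }

    public static void main(String[] args) {
        Counter c = new Counter();
        bump(c);
        System.out.println(c.count);                           // 1
        System.out.println(supportsIdentityOps(c));            // true
        System.out.println(supportsIdentityOps(new Pt(1, 2))); // false
    }
}
```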

> 
>> 		• (do we intend to use the term "object", or use the term "instance", or define the two differently somehow?)
> To the extent we can avoid redefining these things, I think it is easier to just leave these terms in place.

Yes.  AFAICT  “object” and “instance” are synonyms in Java.  We played
around with having the verbal distinction do work for us; maybe an
“object” is really (was all along) an identity object, while an “instance”
could be either.  But the existing usages of those terms don’t happen
to favor any such scheme.  So we coined new terms.

>> 	• Identity objects get the benefits of identity, at the cost that you may only store references to them. They will ~always go onto the heap (modulo invisible vm tricks).
> Yes.  Again, pedagogically, I am not sure whether the heap association is helpful or hurtful; on the one hand, C programmers will understand the notion of "pointer to heap node", but on the other, this is focusing on implementation details, not concepts.  

+1  The old phrase “on the heap” tempts the reader into all sorts of
conclusions which are outside the spec., as strictly read.  Usually
it’s enough to say that “there is an unending supply of objects (unless
GC frowns on us) of various (concrete) classes”.  Where they go,
on a stack or on a heap, in a box or in a tree, in a house or with a
mouse, is the JVM’s secret.  All that said, it’s still helpful to give
students cartoonish views of what the JVM is doing.  And they
aren’t specs.

> One of the oddest things here is that you can have references to all objects, but only can pass/store inline objects directly -- it's like a 2x2 matrix with one corner blacked out.

I don’t get this.  The JVM can store a Point (an inline object)
internally using either flat words in the enclosing object, or
an invisible reference to the flat words somewhere else
(a box), and the user can’t tell what’s going on.  You can’t
*specify* flatness; you can only rewrite the spec. so that
it doesn’t *forbid* flatness.

So all objects support references (and we have the top type
Object which embodies this fact).  Some objects, because
they lack identity, are easier to store efficiently in some
contexts (e.g., in a strongly typed array element instead
of an Object[] array element).

> 
>> 	• Inline objects forgo the benefits of identity to give you the option to store either a reference to a heap object or the data itself inline.
> Correct.
>> 		• (Users choose by e.g. writing either `Foo.val` or `Foo.ref`, though one would be the default)
> Yes.  It is worth noting here that we would like for the actual incidence of `.ref` and `.val` in real code to be almost negligible.  Maurizio likens them to "raw types", in the sense that we need them to complete the type system, and there are cases where they are unavoidable, but the other 99.9% of the time, you just say "Point”. 

Yes, that’s a great aspiration!

> 
>> 	• 
>> 	• We can also sort concrete classes into just two groups: "inline classes" and "identity classes", each of which begets only its own kind of objects/instances.
> Yes.  All the instances of a class C are either identity objects, or inline objects.  

+1 If a variable of this “class C” can store either kind of object, then C
is probably an interface or a suitable abstract class or Object.  Or a
suitable template, maybe, with specializations of both colors.

>> We don't say "value types" anymore because the term "value" (as in "value set") applies to all types.
> Yes.  The appeal of "value" comes from "pass by value", but there is too much baggage associated with the word value.  

+1 IIRC the existing spec. documents insist that primitives and references
are values, but objects are not values.  It would be very hard to refactor
this use of the word “value”, and would lead to obscurity in the spec.
I think.  A similar point goes for the term “reference”.

> The choice of inline is not perfect; it's a strange word to most people, but it comes with the intuition that an inline object's layout will be "inlined" into containing objects/arrays.  But, it doesn't mean that its methods will always be inlined (though that is more likely as they are final and the VM will have sharp type information.)  

+1 I like the intuition.  It will be abused (like “on the heap” gets abused)
but to techies it is a familiar enough term (like “on the heap” in fact).

All the non-strange words are tied to their existing definitions, and
unluckily for us (though we tried) we couldn’t tweak the definitions
to work for us.  So, new strange words.

> 
>> 	• A concrete class is either an "identity class" or an "inline class". But a compile-time type is distinguished not by "inline vs identity" but by "inline vs reference".
> Yeah, this is the other hard one.  In fact, it took us years to realize that the key distinction is not reference vs primitive/inline, but _identity_ vs inline.  

Yep.  I tried to summarize those years above.

> Here's the scorecard: 
> 
> Object is a reference type.  
> For an identity or abstract class C, C is a reference type.  
> For an interface I, I is a reference type.
> For an inline class V, V is an inline type.  
> Primitives are inline types.  
> 
> A reference type always holds a reference to an object (which might be inline or identity), or null.  

+1 Though we have to be careful with that word “type”.
And not all of those statements make sense when you use
the word “class” instead of type.  Maybe there’s some
invisible markup in these statements:

TYPE[Object] is a reference type.  
For an identity or abstract class C, TYPE[C] is a reference type.  
For an interface I, TYPE[I] is a reference type.
For an inline class V, TYPE[V] is an inline type.  
For a primitive P, TYPE[P] is an inline type.  

There is no useful thing called a “reference class” or
“value class” since all classes admit references and
values which are references.

There is a useful thing called an “identity class”, which,
it turns out, coincides with a type called IdentityClass.
It is defined by various rules, the effect of which
is to ensure that any instance is either of an inline class,
or else has been “painted” with the color IdentityClass
to highlight (in an OO way) its willingness to do certain
operations (which are not always methods, so it’s not
just OO stuff; it’s OO plus magic types and operations).

> 
>> 		• must hold a "reference" (or null)
>> 			• Condition: the type (or, for a type variable, its bound): is neither an interface nor "almost-interface"; or is a subtype of IdentityObject; or is an inline class that specifies ref-default; or bears an explicit `.ref`.
>> 			• this is probably what the term "reference type" needs to apply to now. For example it is currently "reference types" that my nullness analysis project is concerned with and I think it would remain that way.
>> 			• key: it's always a reference to an instance (well, unless it's null), but that might be either kind of instance.
>> 		• must hold an inline object
>> 			• Condition: it's a subtype of InlineObject (perhaps by being an `inline class` itself that is a val-default.... or by being primitive?); or bears an explicit `.val`.
>> 			• this is probably what the term "inline type" should refer to.
>> 		• might hold either?
>> 			• or can this not happen because you would be forced to write `.ref/.val`?
> 
> I'm not sure I follow what this section is asking?

I think he’s trying to give the word “reference” a new meaning,
which we gave up on.

> 
>> 	• Primitive types are inline classes, full stop.
>> 		• It's just that for compatibility reasons they get to have custom-built reference projections instead of only the general-purpose `Foo.val` treatment.
> 
> That's where we hope to get, but we will have to break a few eggs to get there.  

Under current rules, java.lang.Integer is an identity class, because,
well, it’s a legacy class, and people can say “new Integer(42)” in their
code.  Under new rules we may have to make it abstract somehow,
so it can have subs of both colors.  One dodge that appeals to me is
to make Integer be a template, so that it can serve both old and new
purposes.  There would be two specializations, one for the old functions
and one for the new.  The new specialization would be a supertype
of “inline class int”.  Reflective code which looked only at classes
(and not species) would see ints (inline instances) and Integers
(both kinds of instances).  The species would reveal more information,
but not (perhaps) via Object::getClass.

> 
> Egg #1: synchronization on wrappers.  Today, you can (but should not) synchronize on a j.l.Integer; to achieve this goal, this would throw.  

+1 (We tried to totalize synchronization on values but it was
nauseatingly unproductive.)

> Egg #1a: possibly, depending on where we land for weak refs, a similar thing will happen for WR<Integer>

+1 This is a current Hard Issue.  Remember when acmp was a
hard issue?  Progress!

> Egg #2: equality.  Today, equality on wrappers is identity based, and the primitive cache makes some small wrappers == to each other; to get to this goal, == would actually be equality on the contained number.  This is arguably better, but different from how it works now.

Or we could allow legacy code to add identity Integers to the
mix, even while most instances were inline ints.  See above
for some brainstorming.
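
For reference, here is how Egg #2 looks in today's Java, as a small
sketch (the class and method names are made up).  It relies only on the
documented guarantee that Integer.valueOf caches values from -128 to
127; outside that range the result of == depends on the JVM's cache
configuration, so the sketch does not claim an output for it.

```java
class WrapperEquality {
    // Identity comparison of two references: what == means on wrappers today.
    static boolean sameIdentity(Integer a, Integer b) {
        return a == b;
    }

    // Comparison of the contained numbers.
    static boolean sameNumber(Integer a, Integer b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        Integer small1 = Integer.valueOf(127);
        Integer small2 = Integer.valueOf(127);
        Integer big1 = Integer.valueOf(1000);
        Integer big2 = Integer.valueOf(1000);

        // Small values come from the cache, so the references are identical.
        System.out.println(sameIdentity(small1, small2)); // true
        // Numeric equality holds regardless of caching.
        System.out.println(sameNumber(big1, big2));       // true
        // Identity of larger values is not specified; it depends on the cache.
        System.out.println(sameIdentity(big1, big2));
    }
}
```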

Egg #2a: System.identityHashCode has to evolve along with
acmp.
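
A tiny sketch of the two hashes involved, in today's Java (class and
method names are made up).  Integer.hashCode() is specified to return
the contained int, while System.identityHashCode is tied to the object
itself, which is the part that would need a new definition for
instances without identity.

```java
class IdentityHashSketch {
    // Hash derived from the contained number, as specified by Integer.hashCode().
    static int valueHash(Integer i) {
        return i.hashCode();
    }

    // Hash derived from the object's identity rather than its state.
    static int identityHash(Object o) {
        return System.identityHashCode(o);
    }

    public static void main(String[] args) {
        Integer i = Integer.valueOf(1000);
        System.out.println(valueHash(i));       // 1000
        // Stable for a given object, but the number itself is JVM-specific.
        System.out.println(identityHash(i) == identityHash(i)); // true
        System.out.println(identityHash(null)); // 0
    }
}
```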

Egg #3: Visible class hierarchy.  “class int” has to find its home
near “class Integer”.  The mirror int.class needs a makeover.
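
To make Egg #3 concrete, this sketch shows how unrelated the two
mirrors are under current reflection rules.  It uses only existing
java.lang.Class methods; the class name MirrorSketch and the helper
related() are made up.

```java
class MirrorSketch {
    // True if a value of type 'sub' is assignable to 'sup' according to
    // reflection (no boxing conversion is considered).
    static boolean related(Class<?> sup, Class<?> sub) {
        return sup.isAssignableFrom(sub);
    }

    public static void main(String[] args) {
        // int.class and Integer.TYPE are the same primitive mirror.
        System.out.println(int.class == Integer.TYPE);         // true
        System.out.println(int.class.isPrimitive());           // true
        System.out.println(Integer.class.isPrimitive());       // false
        // The primitive mirror sits outside the class hierarchy today.
        System.out.println(int.class.getSuperclass());         // null
        System.out.println(Integer.class.getSuperclass());     // class java.lang.Number
        System.out.println(related(Integer.class, int.class)); // false
    }
}
```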

— John