We have to talk about "primitive".
John Rose
john.r.rose at oracle.com
Thu Dec 16 03:15:32 UTC 2021
On 15 Dec 2021, at 10:42, Kevin Bourrillion wrote:
> …
> The main problem I think we can't escape is that we'll still need some word
> that means only the eight predefined types. (For the sake of argument let's
> assume we can pick one and lean hard on it, whether that's "predefined",
> "built-in", "elemental", "leaf type", or whatever.)
As others have said, we’ll pick a term for this. The idea of calling
out a “leaf” in a data graph is compelling to me. As you say,
people are going to wonder what is the foundation of the whole scheme.
(No it’s not objects all the way down, at least that’s not what we
are aiming for.)
(But—spoiler alert—the division between leaf/scalar/basic type and
composite/class type is *less important in daily practice* than the ad
hoc mental models programmers make about which types they choose to view
as composite and which are indivisible. Typical example: Most
programmers choose to regard `String` as a sort of nullable primitive.
I’ll pick up that thread later.)
I like the term “basic type”, and (as we already discussed) I like
“scalar” also, because “scalar” correctly suggests something
about how it’s processed in hardware.
Here’s a point I think is also important and has not been discussed
much yet: A concept like “basic type” (or “scalar type”) should
include references as well as Java’s eight current primitive types.
Like an `int` or other basic primitive, a reference is copied by value,
processed efficiently (probably in a hardware register), and is a
“leaf item” with respect to a single object layout or method type
signature. Also, like `int`, a reference has its own special operators
in the language and special bytecodes in the JVM. Like `int`, it has a
default value `null` (instead of `0`).
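As a small illustration in today's Java (a sketch only; the class and method names are made up for this example), both an `int` and a reference get a default value and are copied by value when passed to a method:

```java
public class BasicTypeDefaults {
    // Fields get the default value of their type when not explicitly initialized.
    static int defaultInt;      // default is 0
    static Object defaultRef;   // default is null

    // Both parameters are copied by value; reassigning them here
    // does not change the caller's variables.
    static void reassign(int n, Object ref) {
        n = 42;
        ref = "something else";
    }

    // Returns the caller-side values after the call, to show they are unchanged.
    static String callerViewAfterReassign() {
        int n = 7;
        Object ref = "original";
        reassign(n, ref);
        return n + " " + ref;
    }

    public static void main(String[] args) {
        System.out.println(defaultInt);
        System.out.println(defaultRef);
        System.out.println(callerViewAfterReassign());
    }
}
```

The reference itself behaves like a leaf here; only what sits at its far end can differ from the `int` case.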
The main difference of a reference from an `int` is the fact that it has
a far end: You can often (not always) find other values by indirecting
the reference and loading a field or calling a method or querying a
super type. (Because it has a far end, it also has a nominal subtype to
classify what might be at the far end. But I’m speaking here about
references per se, apart from their subtypes.) Despite their “far
end”, people treat some reference types, like `String`, as if they
were leaves; you stop at the `String` and don’t bother thinking about
its fields. Users don’t care that there’s an array somewhere on the
other end, unless they are engineering the string class itself. So a
reference has a far end, unlike an `int`, but, like an `int`, a
reference *often* is treated like an unstructured value, in code.
Bottom line: There are a handful of built-in basic types. These are
used to compose classes. They are the primitives and the references.
When we consider a reference apart from its class (say, as `jl.Object`),
it can be comfortably called a *basic type*, and then that handful of
built-in basic types consists of the (basic) primitives and references.
OK, that’s enough on that. Whether “reference” is a basic type is
less important than how we choose to extend (or not extend) the reach of
the term “primitive”.
For historic reasons we use the word ~~fruit~~ *primitive* to mean a
basic type other than a reference. Now that we have user-defined
`int`-like things, we have to decide whether and how to connect the old
word to the new things. Since user-defined `int`-like things are (we
think) very like `int` in many ways, a term like “extended
primitive” makes sense.
This is how I get to the terms “basic primitive” and “extended
primitive”. Or “scalar primitive” and “extended primitive”.
As I read your messages, you would prefer to keep the term
“primitive” narrow, because of the possible confusion of telling
users “hey, what you think of as primitives are now the ~~heirloom~~
basic primitives.” Personally, I think users will say, to our
unveiling “extended primitives”, something like this:
>> Well, that’s not exactly what the dictionary says primitive means,
>> if you can make new composite ones. But I do know that Java has
>> non-reference types and calls them “primitive”. And I also know
>> it would be really cool to define new types that work like `int`,
>> such as `UnsignedInt` or `HalfFloat` or the like. I get why they
>> don’t want to build all such types into the language; in fact maybe
>> I’d like to try my hand someday at defining my own. So,
>> “extended primitive”. It’s on: The Java primitives are now an
>> open-ended set just like the Java objects.
In other words, in saying “extended primitive” (and also “basic
primitive”) we lean away from the dictionary definition of
“primitive” and into the Java definition. That feels like a
non-confusing choice to me.
>
> Definitely, our trying to minimize their specialness is virtuous.
Yep. We also call this “healing the rift”, sometimes.
> …
> So we have to attempt to shift users' understanding of "primitive" while at
> the same time injecting a new term to mean exactly what primitive used to
> mean. That's the old Indiana Jones switch and I don't have to tell you how
> that turned out for him.
So, no, it’s not the Indy switch, at all. Users know what ~~fruit~~
primitives are in Java, and they will have no problem with adding new
~~imported exotic apples~~ extended primitives to the familiar set of
primitive types. And in exchange for this infusion of wonderful new
types, they will learn a new term for the old types, which is ~~pears~~
basic primitives (or scalar primitives).
>
> It would be difficult to pull off in a world where we were just pushing
> some new server and the whole world gets the new model at once. But in this
> universe where every version of Java ever made all have to coexist, it's
> looking to me like a guaranteed source of never-ending confusion.
>
> I also think it robs us of our ability to smoothly portray the real changes
> of Valhalla. We want to be able to say "elements are still elements! now we
> have molecules too".
There are two kinds of users w.r.t. the question of “what’s a
primitive” and you can’t please both. You and I want to please
different kinds. The user I want to please is one who thinks of “Java
primitive” as a kind of non-nullable scalar number (or boolean or
char). The user you want to please thinks of “Java primitive” as
“all leaves in the Big Graph”. The latter user will be disappointed
if we say “Java primitives” can be non-leaves. The former user will
be delighted. The latter user sees a `String` and wants to crack out
its underlying array, in a Gollum-like quest for the roots of the
mountains. The former user treats a `String` as a primitive. There are
more of the former than the latter; we should cater to them. It’s the
former who I was channeling above, concluding with “The Java
primitives are now an open-ended set just like the Java objects.”
> Pedagogically that is always preferable to "elements
> aren't really what you thought they were". Okay, the real comparison is a
> little more nuanced than that, but I'll get to that now.
>
> An alternative that seems to work fine, in my mental model at least, is:
>
> - Primitive types are examples of value types, and have always been.
> - Java never supported any other kinds of value types before, so we
>   didn't distinguish the terms before.
> - Everything you associate with primitive types remains true.
> - But most of those traits really come from their value-type-ness.
>
> (I plan to make the above shifts to my model document already.)
The term “value” can be applied to composites in B3 alone, to
composites in B2 alone, or to both. (Or neither.) All the basic types,
including references, are values as well.
This is a big choice, where to “spend” the term “value”.
Our choice will be informed and supported by our account about what *we
mean* by the term “value”.
If the word value means “a primitive thing that can be stored in a
register”, then we can’t extend it. So that won’t fly.
For us the word value means something like that but adjusted, “a thing
that is freely copyable and can be stored in one or more registers”.
But look how that affects B2 and B3:
B3 are values, obviously; there is no reference to confuse their free
copying. (There is also no reference to help us adjoin `null` to the
value set, and no reference to help us perform safe publication.)
B2 are references to… well, values as well. They might be on the
heap, or they might be elsewhere; we don’t care because the freely
copyable values are not also accompanied by object identity.
Both B1 and B2 *references* (per se) are, confusingly, also values,
since basic types (and/or references) are freely copyable.
But a B2 reference is a value, which refers to another value. (Proof
they are distinct values: One is possibly null, the other isn’t.)
And like a user using `String`, the value-ness of a B2 reference can be
treated as a single, simple, atomic thing, without further reference to
substructure. In particular, because it’s not B1, there’s no
possibility of state under the B2 reference; there’s just the value
you care about.
I think, because the term value applies in so many places (including B1
references), it will be tricky to use it as a classification (like
“pear”) instead of an assertion of use (like “fruit”).
But if we do use the term “value” to classify types, distinguishing
them from B1 types, I think the correct choice is to apply the term to
B2, as “value object” vs. “identity object”.
The value-ness of B3 (as loose aggregates) and B1 (as references) is
going to add a bit of confusion. Dan did a round of naming where he
used the term “pure object” as the opposite of “identity
object”; now we are at “value object” vs. “identity object”, I
think.
>
> - Now we have user-defined value types too.
> - The way we user-define a type is with a class, so a value type is
>   defined by a "value class" (sorry B2).
> - The primitive types will now each get a value class.
> - These 8 classes will look as much like user-defined types as Object does.
> - They, like Object, will have a "cheat" in their source code that no
>   one else gets to use. (Object's is that there is no implied
>   `extends Object` or `super();`; these need no fields because the data
>   they store is magically handled by the VM. These feel like similar cheats.)
I don’t disagree with any of the above, but I think the value classes
live in B2 not in B3. The B3 types are derived from the B2 types, by
“dumping out” the class fields. Note that every single B3 type
(non-reference) has a unique companion B2 type (reference). The
semantic difference between those types is like the semantic difference
between `int` and `Integer`. Narrow but useful.
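For reference, here is that existing `int`/`Integer` difference in today's Java, as a minimal sketch (the class and method names are illustrative only): the reference type admits `null`, and the non-reference type does not.

```java
public class IntVersusInteger {
    static Integer boxedDefault;   // a reference type: default value is null
    static int plainDefault;       // a non-reference type: default value is 0

    // Returns true if converting a null Integer to int fails,
    // which it does because unboxing calls intValue() on the reference.
    static boolean unboxingNullFails() {
        Integer boxed = null;
        try {
            int unboxed = boxed;   // unboxing conversion
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(boxedDefault);
        System.out.println(plainDefault);
        System.out.println(unboxingNullFails());
    }
}
```

A B3/B2 pair would presumably differ along the same narrow line, though that is a reading of the proposal, not a description of any shipped feature.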
Separate question: Does the declaring form for a B3/B2 type pair
“look like” a B2-only declaration, but with an added mode switch?
Or does it “look like” a B3-declaration, something that’s not a
full-on class-that-defines-objects? We could go either way on that.
Either way, one declaration will define two related types.
Suppose we have this B2-only class declaration syntax:
```
__ByValue class NamedInt { String name; int value; … }
```
Then a B2-tilted syntax for a B3/B2 pair might look like:
```
__ByValue __AlsoPrimitive class Point { double x, y; … }
```
And a B3-tilted syntax for the same pair might look like:
```
__ExtendedPrimitive Point { double x, y; … }
```
(F.D.: I think the B3-tilted syntax is less likely to succeed.)
Either way, you can draw out a B3 type from the first and a B2 type from
the second.
As a sort of mental experiment, you can also imagine a “two headed”
declaration syntax that would provide independent specification of the
names of both types:
```
__PrimitiveType int & /*int is B3*/
__PrimitiveBox class Integer /*int.ref=Integer is B2*/
extends Comparable<Integer> {
… one body with two heads …
}
```
Why do that? Well, it makes it clear that a one-headed declaration
could in principle start with either the B3 or the B2 end of the stick.
Also it helps us think, a little, about retrofitting the very odd legacy
wrapper names.
>
> Then mopping up the rest:
>
> - Existing classes probably need a term like "reference classes" (in the
>   model I'm going to circulate that doubles down on values-are-not-objects,
>   then this wants to be "object classes", even though that feels weird at
>   first).
> - I think the term for bucket 2 classes really ought to center on
>   identitylessness, e.g. "noid", "noident", "idfree", or something. Anything
>   else is getting away from the essential meaning of the bucket; plus, we
>   want people to call bucket 1 classes "identity classes", don't we?
If we spend the good word “value” on B3, we must then find a word
like “noid” for B2. But since I think “value-ness” is centered
in B2 from the start, I’d rather find a one-off term for B3! (And
that’s “primitive” as argued above.)
But let’s grant, for a moment, that we don’t want “value” for
B2. What term characterizes B2 types? As you say, they are objects but
they don’t have identity, so “noid”, etc. That’s a true
description. But it’s not the main point of B2 types. The point of
B2 types is not that we dislike object identity (we like it a lot in
many cases!). The point of B2 types is they can be regarded as tidy
bundles of field values, and/or tidy abstractions (like `String`) of
simple values, without confounding state changes. After looking at this
from many angles, I prefer to say that, while B2 has the *negative*
characteristic of being identity-free, it has the *positive*
characteristic of being *freely copyable*. The “freely” is so free
that copying often happens outside of the JVM heap. In fact, a B2 type
is a value.
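Today's Java cannot declare a B2 class, so the following is only a rough approximation of the "tidy bundle of field values" idea, reusing the `NamedInt` name from the syntax sketch above: an ordinary final class with final fields and state-based `equals`. Unlike a real B2 type, its instances still have identity, which the `==` comparison exposes.

```java
import java.util.Objects;

public final class NamedInt {
    private final String name;
    private final int value;

    public NamedInt(String name, int value) {
        this.name = name;
        this.value = value;
    }

    public String name() { return name; }
    public int value() { return value; }

    // Equality is defined by field values only, never by identity.
    @Override
    public boolean equals(Object other) {
        if (!(other instanceof NamedInt)) return false;
        NamedInt that = (NamedInt) other;
        return value == that.value && Objects.equals(name, that.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, value);
    }

    public static void main(String[] args) {
        NamedInt a = new NamedInt("x", 1);
        NamedInt b = new NamedInt("x", 1);
        System.out.println(a.equals(b));  // same state, so equal as values
        System.out.println(a == b);       // distinct identities in today's Java
    }
}
```

The leftover identity is exactly what B2 would remove; with it gone, a copy is indistinguishable from the original.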
Maybe there’s a different way of characterizing the *positive* nature
of B2, but I think it comes down to, “B2 types are plain values”.
Until I get an even better account for B2’s special power (one that
doesn’t begin with the word “not” or “no” or “doesn’t”),
I’m going to be very happy to declare B2 types as “value classes”
and work with their instances as “value objects”.
So, while I see why you want to avoid the paradox of “extended
primitives”, and your very correct identification of “values” in
B3, I prefer to talk about B3 as primitives (primitive values) and B2 as
value objects.
BTW, I agree that B3 values should not be objects; maybe we can call
them instances, although instance/class/object are terms that usually
appear together. Obviously both B1 and B2 contain
instances/classes/objects.
BTW again, I updated my own Zoo of Field Types diagram here, and you
might wish to give it a look, since it’s relevant to this discussion:
http://cr.openjdk.java.net/~jrose/values/type-kinds-venn.pdf
(that’s cr.openjdk.java.net/~jrose/values/type-kinds-venn.pdf if the
URL police got the previous line)
> Footnote: for a more concrete manifestation of this problem: I am sure we
> cannot possibly get away with Class.isPrimitive() being true for these
> classes. Right?
Yeah, `Class::isPrimitive` is a query on types, not classes. In other
words, the `Class` mirror, for this call, is serving to reflect a type,
for example one of `int.class` or `Integer.class`. If we apply the term
“primitive” to classes, then we will need a not-so-good name, like
`Class::isPrimitiveClass`. However, if we choose to make extended
primitives reflect very similarly to basic primitives, then we can
choose to have `Class::isPrimitive` return true *for their
non-reference types*.
There is no reference type for which `Class::isPrimitive` is true.
Despite my fondness for the concept of “basic types” there is no
`Class::isBasicType`. There could be, in the future, though I don’t
think it pulls its weight. We could also have
`Class::isBasicPrimitive`. Or we could choose to break less code by
keeping `Class::isPrimitive` true only for nine mirrors, and define
`Class::isReferenceType` and/or `Class::isNonReferenceType` to provide
the query for ~~fruit~~ basic or extended primitive types.
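For concreteness, this is how `Class::isPrimitive` behaves in current Java; the nine mirrors are the eight primitive types plus `void`. The class name and helper method below are just for this sketch.

```java
public class PrimitiveMirrors {
    // The nine mirrors for which isPrimitive() is true today.
    static final Class<?>[] NINE = {
        boolean.class, byte.class, short.class, char.class,
        int.class, long.class, float.class, double.class, void.class
    };

    // Counts how many of the given mirrors report isPrimitive().
    static int countPrimitive(Class<?>... mirrors) {
        int count = 0;
        for (Class<?> mirror : mirrors) {
            if (mirror.isPrimitive()) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countPrimitive(NINE));
        System.out.println(int.class.isPrimitive());      // non-reference type
        System.out.println(Integer.class.isPrimitive());  // reference type
        System.out.println(String.class.isPrimitive());   // reference type
        System.out.println(int[].class.isPrimitive());    // arrays are reference types
    }
}
```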