Question about universal type variables

Brian Goetz brian.goetz at oracle.com
Mon Aug 8 17:25:10 UTC 2022


Let's try and separate the various things going on, and then we can see 
if there are attractive fictions we want to spin.  First, let's talk 
about kinds of polymorphism.  Cardelli and Wegner's "On understanding 
types, data abstraction, and polymorphism" (1985) divides polymorphism 
into a hierarchy (though these distinctions predate this paper):

Polymorphism
     Universal
         Parametric
         Inclusion
     Ad Hoc
         Overloading
         Coercion

Inclusion polymorphism is subtyping; the value set of String is included 
in the value set of Object.
Coercion polymorphism is conversions; we can use a `short` where an 
`int` is called for, because we can summon an `int` with the same value 
as the `short` at will.
Overloading refers to the fact that we can declare `f(int)` and 
`f(long)` so that at the use site, `f` appears to take multiple types.
(Pattern matching fits into ad-hoc polymorphism, but in Java it gets 
funneled through the other forms first.  Union types are another form of 
ad-hoc polymorphism.)
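For concreteness, here is how each form shows up in a few lines of Java
(the class and method names are just illustrative):

    class PolymorphismDemo {
        // Overloading (ad hoc): f appears to take multiple types at the use site.
        static void f(int x)  { }
        static void f(long x) { }

        // Parametric: one declaration abstracted over a type variable T.
        static <T> T identity(T t) { return t; }

        public static void main(String[] args) {
            Object o = "hello";  // Inclusion: String's value set is included in Object's
            short s = 5;
            int i = s;           // Coercion: we summon an int with the same value as the short
            f(i);                // resolves to f(int)
            f(5L);               // resolves to f(long)
        }
    }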

The special behavior of `null` could be explained by multiple paths:
  - Subtyping, with the Null type as a bottom type for all reference 
types, or
  - Ad-hoc, where it is understood that `null` is in the value set of 
every reference type, and treating an unbounded T as an (infinite) union 
type.
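Whichever explanation we pick, it has to account for the same observable
behavior we have today, e.g.:

    class NullDemo {
        static <T> T firstOrNull(java.util.List<T> xs) {
            // Legal today for an unbounded T: null is in the value set
            // of every reference type, hence of every type T ranges over.
            return xs.isEmpty() ? null : xs.get(0);
        }
    }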

I think the latter path for explaining null is more useful in general, 
and it is probably closer to what the JLS actually says - this is how 
interfaces "inherit" members like Object::equals.  (I think it also 
offers a more useful way to talk about non-nullable reference types, but 
I'll come back to that.)

Java exhibits all of these forms of polymorphism.   Parametric and 
inclusion are on prominent display, but the others are there too, and 
coercion is actually quite relevant, both in general and to the point 
you are making, which is about how the user thinks about what it means 
to instantiate a generic class with `int`.  I think we will need all of 
these tools to get to "everything is an Object" (which I think we agree 
is a desirable unification.)

A `String` is an Object through inclusion.  An `int` is an Object 
through coercion; if we have an `int` and we need an `Object`, we can 
box the `int` to `Integer`.  Today we do this only in assignment 
contexts, but going forward we will do this in other contexts, such as 
member access (e.g., `1.toString()`), equality, array covariance, and 
serialization.  
We heal the multiple rifts through a combination of subtyping and coercion.
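To make the asymmetry concrete (the commented-out line is the proposed
future behavior, not legal Java today):

    class BoxingDemo {
        public static void main(String[] args) {
            int i = 42;
            Object o = i;                              // today: boxing in an assignment context
            String s = Integer.valueOf(i).toString(); // today: member access via an explicit box
            // String s2 = 1.toString();              // future: member access directly on an int
        }
    }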

So, in the world of universal type variables, what is a T?  I claim it 
is a union over the set of all types that conform to T's bound. Today 
this includes only reference types, but once we extend bounds 
conformance to admit types that are convertible to T's bound, this 
includes value types as well.

This union offers a rational explanation for why we can say 
`t.toString()` -- because `toString()` is a member of every type in the 
union (when you enumerate the members of a union type, you take the 
_intersection_ of the members in all the types in the union).  We leave 
it to the implementation as to how to actually dispatch `toString()`, 
which will be different depending on whether we specialize `Foo<T>` or 
not.  It also offers a rational explanation of why `T` has `null` in its 
value set today -- and why we are going to adjust this to generate 
unchecked warnings tomorrow -- because now we'll be intersecting in some 
types that don't have null.  The same is true for `synchronized` -- 
which has nothing to do with reference vs value, but with identity -- 
and again, we're now adding new types to the union that don't have a 
previously universal property.
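A small sketch of what this buys and what it costs (today's rules; the
comments note what the union model predicts will change):

    class UnionDemo<T> {
        T value = null;          // fine today; expect an unchecked warning once the
                                 // union admits types without null in their value set

        String describe(T t) {
            return t.toString(); // fine: toString() is a member of every type in the union
        }

        void lock(T t) {
            synchronized (t) { } // fine today; suspect once identity-free types join the union
        }
    }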

The union model is based on the "stand in" model -- T can stand for some 
unknown type, so you can at most do things on a T that you can do on 
*all* the unknown types.  (Even when we get to specialized generics, we 
might still not allow all such operations, such as `new T[n]`; the union 
offers an upper bound on what is sensible, but languages can be more 
restrictive.)
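For example, `new T[n]` is already rejected today, and the usual erased
workaround shows why the union only bounds what is sensible:

    class ArrayDemo<T> {
        // T[] a = new T[10];            // does not compile: generic array creation

        @SuppressWarnings("unchecked")
        T[] makeArray(int n) {
            return (T[]) new Object[n];  // today's workaround; unsound if the array
                                         // ever escapes at a more specific type
        }
    }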

The best way I've found to think about types like `String!` in Java is 
as _refinement types_.  (See Liquid Haskell 
(https://ucsd-progsys.github.io/liquidhaskell-tutorial/), or Clojure 
Spec (https://clojure.org/guides/spec)).  A refinement type takes a type 
and a _predicate_ which refines its value set, such as "even integer", 
and can contain arbitrary predicative logic.  The compiler then attempts 
to prove the desired properties (easier in functional languages).  In 
other words, the type `String!` takes as its base the reference type 
`String`, along with a predicate `s -> s != null`.  Taking away the 
null doesn't change the reference-ness of the type; it just restricts 
its value set.
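A minimal sketch of the idea in today's Java, with the predicate checked
dynamically (a real refinement type would discharge the predicate
statically; the Refined class here is hypothetical):

    import java.util.function.Predicate;

    final class Refined<T> {
        final T value;
        Refined(T value, Predicate<T> refinement) {
            if (!refinement.test(value))
                throw new IllegalArgumentException("refinement violated");
            this.value = value;
        }
    }

    // `String!` ~ new Refined<>(s, v -> v != null)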

Interestingly, the languages that have the most direct claim to 
modifiers like `!` and `?` treat them as _cardinalities_, such as X# and 
to a lesser degree XSL.  In X#, where "everything is a sequence", 
cardinality modifiers are: refinement types!  They constrain the length 
of the sequence (imagine a refinement type on List which said "size() > 
3".)

We're clearly never going to plunk for arbitrary predicative logic in 
our type system, nor for the theorem provers that come with it, but ad-hoc 
predicates like "not null", "has identity" and "is reference" are 
already swimming under the surface of the type system we have, and we'll 
see more like this when we get to specialization (where we will model 
specialized instantiations as refinements rather than substitution.)


OK, with that as background, let's dive into your mail.

> I'm sure the theoretic argument is fine as far as it goes, but it's 
> not much help for the end user. My issue is with the user model we 
> present to the world; what "useful fictions" are we securing for them, 
> that enable them to read and write code with confidence?

One locus of potential fiction is what we mean by "is" in "everything is 
an Object".  If a T is an Object, do we really just mean "things that 
are subtypes of Object", or do we mean "things that can be bounded by 
Object" (which includes value types via conversion/coercion, rather than 
via subtyping.)  I think ultimately the latter is more helpful, because 
when someone says `ArrayList<long>`, what they really want is an 
ArrayList that is backed by a long[], with all the non-nullability, 
flatness, and tearability that long already has.  `ArrayList<T>` can be 
thought of as something that "has Ts" in it; if we are substituting in 
T=long, we will want all the properties of long because that allows for 
greater compositionality of semantics.
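That is, the user is hoping for something morally equivalent to this
hand-specialized sketch (hypothetical; the point of specialization is
that we derive it from ArrayList<T> rather than writing it by hand):

    class LongArrayList {
        private long[] elements = new long[10]; // flat, non-nullable storage
        private int size = 0;

        void add(long value) {
            if (size == elements.length)
                elements = java.util.Arrays.copyOf(elements, size * 2);
            elements[size++] = value;           // no box is ever allocated
        }

        long get(int index) {
            java.util.Objects.checkIndex(index, size);
            return elements[index];
        }
    }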

> *Some "T always a reference type" advantages:*
>
> * With subtype polymorphism, the user enjoys a solid understanding 
> that "reference types are polymorphic, value types are monomorphic". 
> As I'd put it: you can never have a value (say as a field) without 
> statically knowing its exact type, because its exact type governs the 
> shape and interpretation of the bits actually making up the value. 
> Don't know the exact type --> you need a reference. But 
> parametric polymorphism (thanks for laying out these terms in the JEP 
> draft, Dan) feels very similar! I'd expect the user to consult the 
> same intuitions we just drilled into them about subtype polymorphism. 
> It would be nice if the same simple rule held there too.

I think this tries to flip around "reference types are polymorphic" into 
"polymorphic types are references."   T is polymorphic, users will get 
that without trouble.  But does it have to be inclusion polymorphism?  I 
think it is an ad-hoc union between coercion (value types) and inclusion 
(reference types).

If we push towards the fiction of "they're all reference types", then 
Foo<long> really means Foo<Long>, with all the nullability and 
tearability differences between long and Long.
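Those differences are observable with plain fields today:

    class Box<T> { T value; }

    class FictionDemo {
        static Box<Long> boxed = new Box<>(); // boxed.value starts out null
        static long bare;                     // starts out 0L, can never be null, and
                                              // (per JLS 17.7) a non-volatile long may
                                              // tear, unlike a reference to a Long
    }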

> * When my class gets used as `MyClass<int>`, I would get to reason 
> like so:
>     * When that code runs on some JVM that doesn't do specialization 
> yet, then my class gets used directly, so those `int`s are really 
> `Integer`s; of course they are, because T is a reference type. (I 
> expect I can't tear a value this way.)

I would say it differently: in this world, `long` *erases to* `Object`, 
just as `String` does.  Erasing `String` to `Object` means `String` 
picks up some properties of Object that it doesn't have on its own, 
such as the chance for heap pollution.  Similarly, when we erase `long` 
to `Object`, we pick up some of these properties too, including the 
additional chance of null pollution, as well as some atomicity we 
didn't ask for.  But that's 
because of the erasure, not for any intrinsic property of type 
variables.  And the compiler will try to claw back some of that 
nullability with unchecked warnings anyway, just as we try to claw back 
some of the vectors for heap pollution. The nullity of T is the same 
erasure-driven pollution we already know and tolerate.
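The classic heap-pollution example, for comparison -- the nullity story
for T is the same shape:

    import java.util.ArrayList;
    import java.util.List;

    class PollutionDemo {
        public static void main(String[] args) {
            List<String> strings = new ArrayList<>();
            List raw = strings;          // unchecked: the compiler warns, then trusts us
            raw.add(42);                 // heap pollution: an Integer in a List<String>
            String s = strings.get(0);   // the ClassCastException surfaces here
        }
    }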

>     * When that code runs on some JVM that has specialization, then 
> different "species" of my class are being forked off from my template, 
> each one physically /replacing/ T with some value type. So /those/ are 
> value types, but once again T is still a reference type. (And here I 
> do expect tearing risk, for non-atomic types.)

When I specialize `Foo<long>`, any T-valued fields or arrays or method 
parameters really are long, with all the characteristics of long.  
Treating them as references (which have properties long doesn't have) 
seems more confusing.  "Placeholder, which collapses to its 
instantiation" feels more natural here?

> * If Java were ever to have non-nullable reference types, I suspect 
> it might immediately expose this whole type variable issue as having 
> been, at its essence, never really about ref-vs-val in the first 
> place. What it's really about is that there used to be one value in 
> the union of every Object type's value set, and now there isn't anymore.

Agree -- it was always about the union of types / intersection of 
properties of those types.  Null used to be in that intersection, but 
now things got more complicated -- but doesn't this argue against the 
reference interpretation, and towards the placeholder/union interpretation?

> * The best way a user can prepare their generic class for becoming 
> "universal" in the future is to adopt aftermarket nullness analysis 
> (such as I'm working on standardizing the semantics for in JSpecify). 
> They'll mark type parameters like `V extends @Nullable Object`, and 
> methods like `Map.get` will return `@Nullable V`. That will shake out 
> any obstacles up front. Then once V becomes a UTP, they'd just change 
> that `V` to `V.ref`, and they could presumably drop the `@Nullable` 
> too because `.ref` implies it (why else would it be used?). So the 
> language feature you're introducing for ref-vs-val universality is 
> immediately doing double duty, capturing nullness information for 
> reference types too.
>
> This would probably mean rethinking the `T.ref` syntax to something 
> that more closely evokes "T or null" (the fact this would, for an 
> <int> species, have to box to `Integer` in the process seems intuitive 
> enough).

Open to finding a better way to spell "T or null"; I think the path to 
this involves having this conversation converge :)
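For reference, the preparation pattern described above looks like this
in JSpecify terms (a sketch; the annotation package name is an
assumption based on current JSpecify releases):

    import org.jspecify.annotations.Nullable;

    interface SimpleMap<K, V extends @Nullable Object> {
        @Nullable V get(K key);  // spells out that get() returns "V or null"
    }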
