Wildcards and raw types: story so far

Wed Apr 20 21:55:16 UTC 2016

Valhalla's treatment of wildcards and raw types has taken a somewhat 
circuitous path.  Here's a brief history.  This is mostly at the level 
of compiler translation and classfile representation; the surface syntax 
and language type system are only briefly touched on.

Notation:
  - R(T) = D indicates that the compiler represents a compile-time type 
T using the runtime type descriptor D
  - Class[C] represents a CONSTANT_Class constant pool entry (or its 
descriptor equivalent, "LC;")
  - Foo<raw> -- the language-level type "raw Foo", written this way for 
clarity.

For example, prior to Valhalla, R(Foo<raw>) = R(Foo<?>) = R(Foo<String>) 
= Class[Foo].
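
As a quick illustration of this pre-Valhalla mapping, here is a small 
sketch using java.util.ArrayList in place of Foo (the class name 
ErasureDemo is ours, purely for illustration).  All three compile-time 
types end up with the same runtime class:

```java
import java.util.ArrayList;
import java.util.List;

public class ErasureDemo {
    // Returns true if a parameterized, a wildcard, and a raw use of the
    // same generic class all report the same runtime class.
    @SuppressWarnings("rawtypes")
    static boolean sameRuntimeClass() {
        List<String> strings = new ArrayList<String>();
        List<?> wildcard = new ArrayList<Integer>();
        List raw = new ArrayList();
        return strings.getClass() == wildcard.getClass()
            && wildcard.getClass() == raw.getClass();
    }

    public static void main(String[] args) {
        System.out.println(sameRuntimeClass());   // prints "true"
    }
}
```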

Model 1
-------

Model 1 had no support for wildcards at all.  The argument was that 
List<int> and List<String> were totally different types, and mapped to 
totally different runtime classes.  This approach is not absurd on its 
surface; C++ and C# do this.

In this model, the existing wildcard and raw types were frozen at their 
current meaning: Foo<?> is interpreted as Foo<? extends Object> (as it 
always has been), so we could continue to use wildcards / raw types in 
combination with erasure, but didn't extend them beyond that point.

This approach had one significant advantage: it was possible to build a 
specialization prototype on the VM we actually had -- which was no small 
thing.  But the disadvantages soon became obvious:

  - It was a poor match for existing generic code, which is full of 
sloppy "cast through raw" idioms to get around limitations (sometimes of 
the code itself, sometimes of the type system).  In particular, 
attempting to port Collections was pretty much a failure.

  - It was unpopular.  Despite wildcards being one of the Most Hated 
things about Java, apparently the only thing hated more was threatening 
to take away the wildcards.

  - It was confusing.  People expect Foo<?> to mean "any instantiation 
of Foo", but that's not what it meant.
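
For readers who haven't run into it, the "cast through raw" idiom 
mentioned in the first bullet looks roughly like the sketch below (the 
class and method names are made up for illustration).  It only works 
because both parameterizations erase to the same runtime class, which is 
exactly the property that distinct specialized classes would not share:

```java
import java.util.ArrayList;
import java.util.List;

public class CastThroughRaw {
    // Reinterprets a List<Object> as a List<String> by going through the
    // raw type.  The compiler rejects the direct cast, but accepts the
    // two-step version (with an unchecked warning).
    @SuppressWarnings({"rawtypes", "unchecked"})
    static List<String> asStrings(List<Object> objects) {
        return (List<String>) (List) objects;
    }

    public static void main(String[] args) {
        List<Object> objects = new ArrayList<>();
        objects.add("hello");
        List<String> strings = asStrings(objects);
        System.out.println(strings.get(0).length());   // prints "5"
    }
}
```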

For binary compatibility, we were tied to maintaining the same R-mapping 
for existing types (Foo<String>, Foo<raw>, Foo<?>), but we could use a 
different mapping for new types.  For specialized types, we used (for 
simplicity of prototyping) a name-mangling scheme, where R(Foo<int>) = 
Class[Foo${0=I}].
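
To make the mangling concrete, here is a minimal sketch of a helper that 
produces names in the style of the Foo${0=I} example.  The helper itself 
is hypothetical, and the comma-separated form for more than one type 
variable is our assumption rather than something taken from the 
prototype:

```java
public class MangleDemo {
    // Hypothetical helper: builds a specialized class name from a class
    // name and one JVM descriptor per type variable position.
    // ASSUMPTION: multiple entries are joined with commas.
    static String mangle(String className, String... descriptors) {
        StringBuilder sb = new StringBuilder(className).append("${");
        for (int i = 0; i < descriptors.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(i).append('=').append(descriptors[i]);
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        System.out.println(mangle("Foo", "I"));   // prints "Foo${0=I}"
    }
}
```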

Model 2
-------

Model 2 built on the Model 1 translation approach, but added support for 
some new wildcards.  The existing types Foo<?> and Foo<raw> remained 
frozen at their current meaning; a new wildcard Foo<any> was added.

This approach simulated the wildcard type Foo<any> with an interface.  
So for a class Foo<any T>, in addition to generating the classfile 
Foo.class, it also generated an interface Foo$any.class, with 
R(Foo<any>) = Class[Foo$any].

Wildcards exist as a top type for all possible parameterizations of a 
generic type, so for a class Foo<any T> extends Bar<T>, we need:

     Foo<T>   <: Foo<any>  for all T
     Foo<any> <: Bar<any>

The classfiles generated by the compiler reflected these relationships 
(mostly).

Each member of Foo<any T> had a corresponding member in Foo$any.  For 
methods, we took the "anyrasure" of the method signature, where:

     anyrasure(T) = erasure(T)
     anyrasure(Foo<T>) = Foo<any>
     anyrasure(T[]) = Arrayish<any>

Arrayish is a new type that is injected as a supertype of existing array 
types.  (This was not implemented with Model 2; just a plan on paper at 
the time.)
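
The three anyrasure rules can be sketched as a function over a toy 
representation of types.  Everything in this sketch (the Type stand-ins, 
the string output) is illustrative only, and it assumes unbounded type 
variables, so that erasure(T) is Object:

```java
public class AnyrasureDemo {
    // Minimal stand-ins for compile-time types; names are illustrative.
    interface Type {}

    static final class TypeVar implements Type {         // T
        final String name;
        TypeVar(String name) { this.name = name; }
    }

    static final class Parameterized implements Type {   // Foo<T>
        final String className;
        Parameterized(String className) { this.className = className; }
    }

    static final class ArrayOf implements Type {         // T[]
        final Type component;
        ArrayOf(Type component) { this.component = component; }
    }

    // Applies the three rules above and returns the result as text.
    // ASSUMPTION: type variables are unbounded, so they erase to Object.
    static String anyrasure(Type t) {
        if (t instanceof TypeVar) return "Object";
        if (t instanceof Parameterized) return ((Parameterized) t).className + "<any>";
        if (t instanceof ArrayOf) return "Arrayish<any>";
        throw new IllegalArgumentException("unsupported type");
    }

    public static void main(String[] args) {
        Type tv = new TypeVar("T");
        System.out.println(anyrasure(tv));                         // Object
        System.out.println(anyrasure(new Parameterized("Foo")));   // Foo<any>
        System.out.println(anyrasure(new ArrayOf(tv)));            // Arrayish<any>
    }
}
```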

For fields, Foo$any acquired getters (and for non-final fields, setters) 
whose signatures were similarly transformed through anyrasure.

So, for a class:

     class Foo<any T> extends Bar<T> {
         T t;

         T a() { ... }
         Foo<T> b() { ... }
         T[] c() { ... }
     }

we would generate:

     interface Foo$any extends Bar$any {
         Object get$t();
         void set$t(Object o);
         Object a();
         Foo$any b();
         Arrayish$any c();
     }

     class Foo implements Foo$any {
         T t;
         bridge Object get$t() { return maybeBox(t); }
         bridge void set$t(Object o) { t = maybeUnbox(o); }

         T a() { ... }
         bridge Object a() { return maybeBox(a()); }

         Foo<T> b() { ... }
         bridge Foo$any b() { return (Foo$any) b(); }

         T[] c() { ... }
         bridge Arrayish$any c() { return (Arrayish$any) c(); }
     }

The bridge methods implement the corresponding members in Foo$any.  This 
approach worked well enough that we could anyfy Collections and Streams 
acceptably well.
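
The translation above can't be written directly in today's Java, but its 
effect for T=int can be approximated by hand.  In this sketch FooAny 
stands in for Foo$any and FooOfInt for the specialized class; since Java 
source can't overload on return type, the specialized a() is renamed 
aInt().  All of these names are ours:

```java
public class Model2Demo {
    // Stand-in for the generated Foo$any interface (anyrased members).
    interface FooAny {
        Object getT();
        void setT(Object o);
        Object a();
    }

    // Hand-written approximation of a specialization Foo<int>.
    static final class FooOfInt implements FooAny {
        int t;
        FooOfInt(int t) { this.t = t; }

        // The specialized "T a()": works on int with no boxing.
        int aInt() { return t + 1; }

        // Bridges: box and unbox at the wildcard boundary.
        @Override public Object getT() { return t; }               // maybeBox(t)
        @Override public void setT(Object o) { t = (Integer) o; }  // maybeUnbox(o)
        @Override public Object a() { return aInt(); }             // maybeBox(a())
    }

    // Code written against the wildcard type sees only boxed values.
    static Object callThroughWildcard(FooAny foo) {
        foo.setT(41);
        return foo.a();
    }

    public static void main(String[] args) {
        System.out.println(callThroughWildcard(new FooOfInt(0)));   // prints "42"
    }
}
```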

In the happy cases, this worked well:

  - Subtyping relationships in the language are properly mirrored as 
subclassing relationships in the JVM, so that checkcast, instanceof, 
reflection, and verification "just work".
  - Methods without avars in their signature require no boxing when 
invoked through a wildcard/raw receiver (though they still pay the itable 
overhead).

However, representing wildcards as interfaces had a number of drawbacks 
when we got to the less happy cases:

Impersonation.  There's nothing to stop someone from just implementing 
Foo$any, thereby impersonating some instantiation of Foo without 
necessarily obeying Foo's invariants.  (This is even worse if Foo is a 
final class.)
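
A minimal sketch of the impersonation problem, with FooAny standing in 
for Foo$any and an invented invariant (a() never returns null) chosen 
just for the example:

```java
public class ImpersonationDemo {
    interface FooAny { Object a(); }   // stand-in for Foo$any

    // The "real" class: final, and a() never returns null (its invariant).
    static final class Foo implements FooAny {
        @Override public Object a() { return "real"; }
    }

    // Nothing prevents an unrelated class from implementing the interface,
    // even though Foo itself is final.
    static final class Impostor implements FooAny {
        @Override public Object a() { return null; }   // breaks the invariant
    }

    // A client holding a "Foo<any>" cannot assume it is really a Foo.
    static boolean isReallyFoo(FooAny f) { return f instanceof Foo; }

    public static void main(String[] args) {
        System.out.println(isReallyFoo(new Foo()));        // prints "true"
        System.out.println(isReallyFoo(new Impostor()));   // prints "false"
    }
}
```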

Nonpublic members.  Interface methods are public; classes can have 
methods of any accessibility.  (Private members are even worse than 
protected/public as private members are not inherited; modeling them as 
interface members could create strange shadowing artifacts.  Similarly, 
virtualizing fields risks certain shadowing anomalies.)

Non-any superclasses.  In the following case:

     class Bar { }
     class Foo<any T> extends Bar { }

We can lift the members of Bar onto Foo$any, but we won't be able to 
model the subtyping relationship that Foo<T> <: Bar for all T.  This 
flows into array subtyping; the user will reasonably expect that 
Foo<any>[] <: Bar[], but we don't have a way to model this.
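
The gap is easy to see with ordinary classes and interfaces.  In this 
sketch FooAny stands in for Foo$any and FooErased for Class[Foo]; both 
names are ours:

```java
public class NonAnySuperDemo {
    static class Bar {}
    interface FooAny {}                                       // stand-in for Foo$any
    static class FooErased extends Bar implements FooAny {}   // stand-in for Class[Foo]

    // The class itself is a subtype of Bar, as expected.
    static boolean classIsSubtypeOfBar() {
        return Bar.class.isAssignableFrom(FooErased.class);
    }

    // The wildcard interface cannot be a subtype of the class Bar...
    static boolean wildcardIsSubtypeOfBar() {
        return Bar.class.isAssignableFrom(FooAny.class);
    }

    // ...and so arrays of it are not subtypes of Bar[] either.
    static boolean wildcardArrayIsSubtypeOfBarArray() {
        return Bar[].class.isAssignableFrom(FooAny[].class);
    }

    public static void main(String[] args) {
        System.out.println(classIsSubtypeOfBar());                // prints "true"
        System.out.println(wildcardIsSubtypeOfBar());             // prints "false"
        System.out.println(wildcardArrayIsSubtypeOfBarArray());   // prints "false"
    }
}
```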

Multiple avars.  If a class has multiple avars, then there are 
theoretically O(2^n) wildcard types and each method could require O(2^n) 
bridges.  This has a big footprint cost, as well as burdening startup 
with code that will be rarely used.  (The Model 2 prototype erased all 
partial wildcards to a total wildcard, reducing the overhead back to a 
constant, at the cost of some potentially unnecessary boxing.)
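
To get a feel for the growth, this sketch enumerates every way of 
replacing a subset of the type arguments with "any".  The helper and its 
output format are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;

public class PartialWildcards {
    // Lists all 2^n forms of a class with n type variables, from the
    // fully concrete form (no "any") to the total wildcard (all "any").
    static List<String> wildcardTypes(String className, String... typeVars) {
        int n = typeVars.length;
        List<String> result = new ArrayList<>();
        for (int mask = 0; mask < (1 << n); mask++) {
            StringBuilder sb = new StringBuilder(className).append('<');
            for (int i = 0; i < n; i++) {
                if (i > 0) sb.append(", ");
                // Bit i of the mask decides whether position i is wildcarded.
                sb.append((mask & (1 << i)) != 0 ? "any" : typeVars[i]);
            }
            result.add(sb.append('>').toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [Map<K, V>, Map<any, V>, Map<K, any>, Map<any, any>]
        System.out.println(wildcardTypes("Map", "K", "V"));
    }
}
```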

Each of these has a potential answer, but the abstraction is starting to 
get pretty leaky.

Model 3
-------

Model 3 is a complete overhaul of the translation story (replacing the 
ad hoc, highly complex, and brittle Model 1 story with constant pool 
forms that allow us to express parameterization, including erasure), but 
only an incremental improvement to the wildcard story.  Specifically, it 
improves member-access; rather than lifting the methods and field 
accessors onto methods of Foo$any, we instead access them through indy.  
(We can do this because there is no existing code that uses Foo<any>.)  
This means the wildcard interfaces are super-simple (no members), the 
bridge method explosion is eliminated, and some of the interface-imposed 
restrictions (notably, those having to do with accessibility) can be 
handled more directly.
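
Java source can't emit invokedynamic directly, but the flavor of 
"member access through dynamic linkage rather than interface methods" 
can be approximated with method handles.  This is only a sketch of the 
idea, not the Model 3 translation: FooAny, FooOfInt, and invokeA are 
hypothetical names, and a real indy call site would cache its linkage 
rather than look the member up on every call:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;

public class IndyAccessDemo {
    // Marker only: the wildcard interface has no members.
    public interface FooAny {}

    // Hand-written approximation of a specialization Foo<int>.
    public static final class FooOfInt implements FooAny {
        private final int t;
        public FooOfInt(int t) { this.t = t; }
        public int a() { return t; }
    }

    // Resolves "a" against the receiver's actual class at run time and
    // invokes it; invoke() adapts the int result to Object by boxing.
    static Object invokeA(FooAny receiver) {
        try {
            MethodHandle mh = MethodHandles.lookup()
                .unreflect(receiver.getClass().getMethod("a"));
            return mh.invoke(receiver);
        } catch (Throwable e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(invokeA(new FooOfInt(7)));   // prints "7"
    }
}
```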

Still, the resulting story is unsatisfying.  We still need additional VM 
help (for impersonation, non-any superclasses, etc.).

And we still haven't addressed the mismatch of having separate meanings 
for Foo<any> and Foo<?>.  Not only is this confusing, but when we fold 
this in with instanceof/cast, we get something really bad...

The obvious interpretation of "x instanceof Foo" is "x instanceof 
Foo<raw>" (because the next thing the user is going to do is cast to 
Foo<raw>).  But this means that the following code will break when you 
anyfy the declaration and don't adjust the implementation:

     class Box<T> {
          T t;

          public boolean equals(Object o) {
              if (!(o instanceof Box))
                  return false;
              Box other = (Box) o;
              return (t == null) ? other.t == null
                                 : t.equals(other.t);
          }
     }

If we add an "any" in front of T and do nothing else, then the equals 
method will silently fail for (say) Box<int>.  This is terrible.
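
The failure can be simulated with today's Java by modeling Box<int> as a 
separate runtime class that is not a subclass of the erased Box.  In 
this sketch BoxOfInt is our hand-written stand-in for the specialized 
class, with the equals() logic carried over unchanged:

```java
public class BoxEqualsDemo {
    // Stand-in for the erased class, i.e. what "Box<raw>" means at run time.
    static class Box {
        Object t;
        Box(Object t) { this.t = t; }
    }

    // Hypothetical stand-in for a specialized Box<int>: a distinct runtime
    // class that is NOT a subclass of the erased Box.
    static class BoxOfInt {
        int t;
        BoxOfInt(int t) { this.t = t; }

        // Same shape as the original equals(): it still tests against the
        // erased class, so another BoxOfInt can never pass the check.
        @Override public boolean equals(Object o) {
            if (!(o instanceof Box))
                return false;
            Box other = (Box) o;
            return Integer.valueOf(t).equals(other.t);
        }

        @Override public int hashCode() { return t; }
    }

    public static void main(String[] args) {
        // Two boxes holding the same int compare unequal, with no error.
        System.out.println(new BoxOfInt(1).equals(new BoxOfInt(1)));   // prints "false"
    }
}
```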

So, we are (still) not there yet.

Looking Ahead
-------------

Our conclusion (which we mostly suspected from the beginning, but the 
experiments have borne out the details) is: simulating wildcard types 
with the classfile tools we have today will yield a decidedly 
dissatisfying simulation.  If we want to support wildcards, we'll need 
some VM help.  And we need to make some progress towards bringing 
non-erased instantiations into the raw type / wildcard fold.

This is not unreasonable; if we're adding parametric polymorphism to the 
VM, and we want a type system that supports wildcard/raw types, the VM 
should understand these types too -- all the defects above come from 
trying to "fake out" the VM. The success of Model 3 was about not faking 
out the VM, but providing a means of discussing parameterized types 
within the VM type system -- success for wildcards will come from this 
vector as well.

We have some new ideas.  Stay tuned.