[LW100] Specialized generics -- translation and binary compatibility issues

Wed Oct 17 19:38:44 UTC 2018

Number 2 of 100 in a series of “What we learned in Phase I of Project 
Valhalla.” This one focuses on the challenges of evolving a class to be 
any-generic, while interacting with existing erased code. No solutions 
here, just recaps of problems and challenges.

Let’s imagine a class today:

|interface Boxy<T> {
    T get();
    void set(T t);
}

class Foo<T> implements Boxy<T> {
    public T t;
    public T[] tArray;

    public Foo(T t) { set(t); }

    public static <T> Foo<T> of(T t) { return new Foo<>(t); }

    // Interface methods are implicitly public, so the implementations must be too.
    public T get() { return t; }

    @SuppressWarnings("unchecked")
    public void set(T t) {
        this.t = t;
        this.tArray = (T[]) new Object[] { t };
    }
}
|

and client code

|Foo<String> fs = new Foo<>("boo");
println(fs.t);
println(fs.tArray);
println(fs.get());
Foo<?> wc = fs;
if (wc instanceof Foo) { ... }
|

References to |Foo| can show up in client classfiles in several forms:

  * Foo extends Bar
  * instanceof/checkcast Foo
  * new Foo
  * anewarray Foo[]
  * getfield Foo.t:Object
  * invokevirtual Foo.get():Object
  * Method descriptors of |Foo::of|

We translate raw |Foo|, |Foo<String>|, and |Foo<?>| all the same way 
today — |LFoo|.

        Tentative simplification: reference instantiations are always erased

The specialization transform takes a template class and a set of type 
parameters and produces a specialized class. This can cause member (and 
supertype) signatures to change; for example, if we have

|T get() |

which erases to

|Object get() |

when we specialize with T=int, we’ll have

|int get() |

In theory, there’s nothing to stop us from specializing |List<T>| with 
T=String. However, in the earlier exploration, we settled on the 
tentative simplification of always erasing reference instantiations, and 
only specializing value instantiations. This is a tradeoff; we’re still 
throwing away potentially useful type information (erasure haters will 
be disappointed), in exchange for much greater sharing, and avoiding 
some compatibility issues (existing generic code is rife with tricks 
like “casting through wildcards” to coerce a |Foo<A>| to |Foo<B>|, which 
only works as long as we erase; dirty tricks like this are often 
necessary as there are some things that are hard to express in the 
generic type system, even though the programmer knows them.)
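
As an illustration of the “casting through wildcards” trick mentioned above, here is a minimal sketch. The class name |WildcardCastDemo| and the helper |coerce| are made up for this example, and it uses |java.util.List| rather than |Foo| so that it is self-contained. It only works because both parameterizations erase to the same runtime class.

```java
import java.util.ArrayList;
import java.util.List;

class WildcardCastDemo {
    // Hypothetical helper: coerces a List<A> to a List<B> by going through
    // the wildcard type. The compiler accepts this with an unchecked warning;
    // at run time no check is performed, because List<A> and List<B> erase
    // to the same class.
    @SuppressWarnings("unchecked")
    static <A, B> List<B> coerce(List<A> in) {
        List<?> wild = in;
        return (List<B>) wild;
    }

    public static void main(String[] args) {
        List<String> strings = new ArrayList<>();
        strings.add("boo");

        List<Object> objects = WildcardCastDemo.<String, Object>coerce(strings);
        objects.add(42); // heap pollution: the same list now holds a non-String

        System.out.println(objects == (Object) strings); // same object, two static types
        System.out.println(strings.size());
    }
}
```

If |List<String>| and |List<Object>| were distinct runtime types, a coercion like this would presumably have to fail or copy, which is the compatibility concern the simplification sidesteps.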

Ignoring multiple type parameters for the moment, when |Foo| becomes 
specializable, our model is that it will have an /erased/ species — call 
it |Foo<erased>|. (If you ask it what its type parameters are, it will 
say “erased”. That is, we reify the fact that it is erased…) While 
migrating from erased to specialized generics requires source changes 
and recompilation at the generic class declaration, it should not 
require any changes or recompilation for clients. That means that legacy 
client classfiles that talk about |Foo| must be considered to be talking 
about |Foo<erased>|. (Hierarchies can be specialized from the top down, 
so it is OK to specialize |Bar| before |Foo|, but not the other way around.)

While the generic specialization machinery will have no problem with 
specializing to L-types, I think it’s a simplification we should hold on 
to: we treat all L-type parameters as “erased” for purposes of 
specialization.

        Additional simplification: let’s not worry about primitives

In Burlington, we concluded that as long as there’s a Pox class for each 
primitive, we can convert primitives to/from poxes through source 
compiler transforms, and not worry about specializing over primitives. 
When the user wants to specialize |List<int>|, we instead specialize 
for int’s pox. Except for those pesky arrays … more on that later.

        Assumption: wild means wild

On the other hand, one of the non-simplifying assumptions we want to 
make is that a wildcard type — |Foo<?>| — should describe any 
instantiation of |Foo|, even when the wildcard-using code doesn’t know 
about specialization. (Same with raw usages of |Foo|.) For example, if 
the user has written a method:

|takeFoo(Foo<?> anyFoo) { anyFoo.m(); } |

in legacy (erased) code, we should be able to call |takeFoo()| with both 
erased and specialized instances of |Foo|. As we’ll see, this 
complicates member access, and really complicates arrays.

We will find utterances like

|invokevirtual Foo.get()Object
getfield Foo.m:Object |

in legacy code; we want these to work against any specialization of |Foo|.

In the case where the instance is erased, things obviously have a decent 
chance of lining up properly, as the erased members will not have been 
specialized away. If our receiver is a specialized |Foo|, it gets 
harder, as the member signatures will have changed due to specialization.

Starting in Model 2, we handled this with bridge methods; for each 
specialized method, we also had an erased bridge. This is possible 
because there’s an easy coercion from |QPoint| to |LObject|. (There are 
other ways to get there besides bridges.)
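
A hand-written approximation of the bridge idea, as a sketch only. Java source has no Q-types, so |int| stands in for a value type such as |Point|, and boxing stands in for the |QPoint| to |LObject| coercion. The names |ErasedBox|, |IntBox|, |getInt| and |BridgeDemo| are hypothetical. In a real specialized class both members would be named |get| and differ only in descriptor, which source code cannot express.

```java
// The erased view that legacy clients were compiled against.
interface ErasedBox {
    Object get(); // descriptor: ()Ljava/lang/Object;
}

// Hand-written stand-in for a species specialized with T=int.
class IntBox implements ErasedBox {
    private final int value;

    IntBox(int value) { this.value = value; }

    // Specialized member; its descriptor would be ()I.
    int getInt() { return value; }

    // Erased bridge: forwards to the specialized member and coerces the
    // result to Object (here, by boxing to Integer).
    @Override
    public Object get() { return getInt(); }
}

class BridgeDemo {
    // "Legacy" client code: it only knows the erased signature.
    static Object legacyRead(ErasedBox box) { return box.get(); }

    public static void main(String[] args) {
        System.out.println(legacyRead(new IntBox(7)));
    }
}
```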

        Wildcards

One of the central challenges of pushing specialization into the VM is 
how we’re going to handle wildcards. Given a generic class |Foo|, the 
wildcard type |Foo<?>| is a supertype of any instantiation |Foo<X>| of 
|Foo|. The wildcard type also erases to |LFoo|.

In Model 2, we modeled wildcards as interfaces, with lots and lots of 
bridges, but this still fell short in a number of ways: no support for 
non-public methods or for fields, and we had to deal with fields by 
hoisting them into virtual bridges on the interface.
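
To make the Model 2 shape concrete, here is a rough sketch of “wildcard as interface” with a field hoisted into a virtual bridge. All names (|FooAny|, |FooErased|, |FooInt|, |t$get|) are invented for illustration and do not reflect what the prototype actually generated; |int| again stands in for a value type.

```java
// Hypothetical interface playing the role of the wildcard type Foo<?>.
interface FooAny {
    Object get();   // erased bridge for the method get()
    Object t$get(); // virtual bridge standing in for `getfield t`
}

// Stand-in for the erased species.
class FooErased implements FooAny {
    public Object t;
    FooErased(Object t) { this.t = t; }
    public Object get() { return t; }
    public Object t$get() { return t; }
}

// Stand-in for a species specialized with T=int.
class FooInt implements FooAny {
    public int t;
    FooInt(int t) { this.t = t; }
    public int getInt() { return t; }   // specialized member
    public Object get() { return t; }   // erased bridge (boxes)
    public Object t$get() { return t; } // field bridge (boxes)
}

class WildcardInterfaceDemo {
    // Wildcard-using code: works against any species, but only through
    // public interface methods, never through a direct field access.
    static String describe(FooAny anyFoo) {
        return anyFoo.t$get() + "/" + anyFoo.get();
    }

    public static void main(String[] args) {
        System.out.println(describe(new FooErased("boo")));
        System.out.println(describe(new FooInt(3)));
    }
}
```

Because interface methods are public, this shape has no natural place for non-public members, which is one of the shortfalls noted above.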

Note that the wildcard subtyping also matters to the verifier, in 
addition to handling bytecodes; the verifier must know that any 
specialization of |Foo| is a subtype of the wildcard |LFoo|.

        But what does |LFoo| mean?

Careful readers will notice that we’ve been playing fast and loose with 
the meaning of |Foo|; sometimes it means the class, sometimes the 
wildcard, and sometimes the erased species.

The best intuition we’ve been able to come up with is:

  * There are /classes/ and /crasses/.
  * A crass describes a single runtime type; it has a layout, methods,
    constructors, etc.
  * A (template) class describes a family of runtime types.
  * A (template) class is like an abstract type; it has members and
    subtypes, but can’t be instantiated directly.
  * All the crasses derived from a class are subtypes of the class.
  * For purposes of instantiation, we interpret |new Foo| as creating an
    instance of the erased species, and a similar game with |<init>|
    methods.

    Model 3 classfile extensions

In Model 3, we extended the constant pool with some new entries:

*TypeVar[n, erasure].* This is a use of a type variable, identified by 
its index /n/. (There was a table-of-contents attribute listing all the 
type variables declared in a generic class or method, including those 
declared in enclosing generic classes or methods.) Since the erasure of 
a type variable is not merely a property of the type variable, but in 
fact a property of how it is used, each use of a type variable carries 
around its own erasure. For a field whose type is |T|, the |NameAndType| 
points not to |Object|, but to |TypeVar[0, Object]|.

When specializing a type variable to |erased|, any uses of that type 
variable are replaced with the erasure in the |TypeVar| entry.

*MethodType[D,T…].* This is largely a syntactic mechanism, allowing us 
to represent method descriptors with holes (it also had the benefit of 
compressing the constant pool somewhat). The parameter |D| was a method 
type descriptor, except that in addition to the existing types, one 
could specify |#| to indicate a hole; the |T...| parameters are CP 
indexes to other types (which could be UTF8 strings, |TypeVar|, or 
the other type CP entries listed below).

For example, a method

|int size(T t) |

would have a signature

|#1 = TypeVar[0, Object]
#2 = MethodType[(#)I, #1] |

When specializing a |MethodType|, its parameters are recursively 
specialized, and then the resulting strings concatenated.

*ArrayType[T,rank].* This represents an array of given rank.
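
The three entry kinds above can be modeled in a few lines to show how specialization reduces a template to a concrete descriptor string. This is a sketch under assumptions, not the Model 3 implementation: the Java class names, the |bindings| map (type-variable index to descriptor, with a missing index meaning “erased”), and the use of full descriptors such as |Ljava/lang/Object;| for the erasure are all choices made for this example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class TemplateDescriptors {
    /** A constant-pool entry that can be specialized to a descriptor string. */
    interface Entry {
        // bindings: type-variable index -> descriptor; absent means "erased".
        String specialize(Map<Integer, String> bindings);
    }

    /** Stand-in for a UTF8 entry holding an already-concrete descriptor. */
    static final class Utf8 implements Entry {
        final String value;
        Utf8(String value) { this.value = value; }
        public String specialize(Map<Integer, String> bindings) { return value; }
    }

    /** TypeVar[n, erasure]: each use carries its own erasure. */
    static final class TypeVar implements Entry {
        final int index;
        final String erasure;
        TypeVar(int index, String erasure) { this.index = index; this.erasure = erasure; }
        public String specialize(Map<Integer, String> bindings) {
            String bound = bindings.get(index);
            return bound != null ? bound : erasure; // erased: fall back to the erasure
        }
    }

    /** MethodType[D, T...]: a descriptor template where '#' marks a hole. */
    static final class MethodType implements Entry {
        final String template;
        final List<Entry> holes;
        MethodType(String template, List<Entry> holes) { this.template = template; this.holes = holes; }
        public String specialize(Map<Integer, String> bindings) {
            StringBuilder out = new StringBuilder();
            int next = 0;
            for (char c : template.toCharArray()) {
                // Recursively specialize each hole, then concatenate.
                if (c == '#') out.append(holes.get(next++).specialize(bindings));
                else out.append(c);
            }
            return out.toString();
        }
    }

    /** ArrayType[T, rank]: an array of the given rank over element type T. */
    static final class ArrayType implements Entry {
        final Entry element;
        final int rank;
        ArrayType(Entry element, int rank) { this.element = element; this.rank = rank; }
        public String specialize(Map<Integer, String> bindings) {
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < rank; i++) out.append('[');
            return out.append(element.specialize(bindings)).toString();
        }
    }

    public static void main(String[] args) {
        Entry t = new TypeVar(0, "Ljava/lang/Object;");
        Entry size = new MethodType("(#)I", Arrays.asList(t)); // int size(T t)
        Map<Integer, String> erased = Collections.emptyMap();
        Map<Integer, String> asInt = Collections.singletonMap(0, "I");

        System.out.println(size.specialize(erased)); // (Ljava/lang/Object;)I
        System.out.println(size.specialize(asInt));  // (I)I
        System.out.println(new ArrayType(t, 1).specialize(asInt)); // [I
    }
}
```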

We found that as a template language, these types allowed exactly the 
sort of expressiveness needed, and specialized efficiently down to 
concrete descriptors. (In the M3 prototype, we had concrete 
descriptors of the form |List$0=I| to describe |List<int>|; obviously we 
don’t want that here.) But these designs captured all the complexity we 
needed (especially that of erasure), and allowed a mechanical 
translation into Java 8 classfiles.