Primitives in Generics proposal.

Tue Mar 9 13:28:50 PST 2010

As Stephen Colebourne already suggested, the various problems with function
types can be sidestepped rather simply by sticking purely with (SAM) type
names, immediately fitting any lambda to an appropriate SAM type. However,
to feasibly go this route, generics _need_ to be able to work with primitive
types, in order to reduce ParallelArray's 132 SAM types in Ops down to a
nice lean 8.

I've written up a proposal to get the discussion going. If indeed the
existence of such a proposal results in a much simpler Project Lambda, then
the time saved implementing Project Lambda can be used to implement this
idea. It should go without saying that allowing primitives in generics would
be a great feature for JDK7 on its own.

The canonical version that will be updated and is nice on the eyes can be
found here:

http://projectlombok.org/primitive_generics/

but in the interests of saving all discussion in one thread I've pasted the
markdown source of v1.0 below.

Happy reading!

# Primitives in generics proposal

v1.0

by Reinier Zwitserloot

## &sect;1 Preamble and necessity of the feature

The goal of this proposal is to allow primitives in generics. To simplify
the proposal, `int` is used in all examples and in all text. All the other 7
primitive types are of course also allowed as type arguments. In particular,
a closure strategy for JDK7's [Project Lambda][] to always _box_ any closure
into a so-called SAM type is a far more realistic option if the used SAM
types can accept primitives in their generics.  (a type that contains
exactly 1 abstract method, usually an interface. For example,
`java.util.Comparator` is a SAM type. So is `java.lang.Runnable`). For
example, [jsr166y-extra - Parallel Arrays][jsr166yx], the standard usecase
example for Project Lambda, currently lists 132 SAM interface types in its
[Ops](
http://gee.cs.oswego.edu/dl/jsr166/dist/extra166ydocs/extra166y/Ops.html)
container class, but most of those are simply to provide a primitive-based
version of a few central concepts. If primitive generics were possible, this
list of 132 would be reduced to a much more manageable set. To be precise:

    Ops.Op<A, R>
    Ops.BinaryOp<A, B, R>
    Ops.Predicate<A>
    Ops.BinaryPredicate<A, B>
    Ops.Procedure<A>
    Ops.Generator<R>
    Ops.Reducer<A> extends BinaryOp<A, A, A>
    java.util.Comparator<A>
    Ops.Action

These names are more meaningful than their equivalent purely structural
function type constructs. For example, jsr166yx includes the concept of a
`Ops.Reducer` even though the signature of a reducer is 100% equivalent to
the more general `Op` class. It is self-evident then that the designers of
jsr166yx find this kind of purely name-based distinction important enough to
dedicate 4 of Ops' 132 SAM types to it even though they didn't have to.

Furthermore, as the inclusion of `java.util.Comparator` foreshadowed,
sticking with names for function types fits better with existing java code.
jsr166yx as it stands uses `java.util`'s version of Comparator for its
Object based `ParallelArray` but defined its own versions for the
primitives, such as `Ops.LongComparator`. If function types were introduced
instead, then jsr166yx would obviously replace its `Ops.LongComparator` with
`#int(long, long)` but it would face a dilemma for its `Object` based
`ParallelArray`: Should it take `#int(T, T)`, should it take
`java.util.Comparator<T>`, or should it take both? None of these three
options are fully consistent. This isn't a rare coincidence; the amount of
SAM types that are in active use across the java community is staggering,
and introducing function types to solve the primitives problem is going to
introduce a lot of inconsistency as APIs evolve to accept closures but keep
their old SAM based methods for backwards compatibility.

The solution is as clear as it is obvious: Make primitives allowable as type
argument. Together with a relatively minor and proven (by JRockit which
already includes it) JVM optimization, both the API and the performance
issues can be solved.

As a secondary step, one could start questioning if, once this feature has
been added, function types are even necessary. As recent traffic on the
[Project Lambda mailing list](
http://mail.openjdk.java.net/mailman/listinfo/lambda-dev) has indicated,
there are a host of issues with function types, from syntax ambiguities to
problems around allowing arrays of function types. By sticking with SAM
types, no new ideas are introduced for the java community to learn; they
presumably already know that for example the expression `new
Comparator<int>[10]` is not legal.

Last but certainly not least: Primitives as type arguments are a great
feature for the java language all by themselves and not only as a part of a
closure proposal. For example, if this proposal is implemented in JDK7, the
following could be allowed: `List<int> intList = new IntList();` and be
virtually as fast as a purely `int` array based list implementation such as
the ones that can be found in [GNU Trove](
http://trove4j.sourceforge.net/javadocs/gnu/trove/TIntArrayList.html), but
which is API compatible with `java.util.List`.

This v1.0 of the proposal does not highlight which sections of the JLS are
involved or need to be changed. Its primary intention is to start a
discussion, explore the feasibility, if it has clear solutions for all
potential issues, and if it can lead to a simplification of Project Lambda,
thus freeing up enough time in the JDK7 schedule to implement it.

[Project Lambda]: http://openjdk.java.net/projects/lambda/ "Project Lambda
Mailing List"
[jsr166yx]: http://gee.cs.oswego.edu/dl/concurrency-interest/ "jsr166y -
Parallel Arrays"

## &sect;2 Proposal

### &sect;2.1 Type Variables

Type variables are the `T` in `public class Foo<T> {}`. Type variables can
only include actual types in their bounds, such as `<T extends Number>`. As
`T extends int` doesn't really convey a useful construct, we simply won't
allow it. This means type variables themselves don't have to change at all.

### &sect;2.2 Type Arguments

Type Arguments come in two flavours: Just the type, as in: `List<String>`,
or as a bound, such as `List<? extends Number>`. As primitive types don't
really have subtype relations, there is again not much sense in allowing
either `List<? extends int>` or `List<? super int>`. It is therefore not
allowed; *only* `List<int>` would be legal. To be precise, `List<? extends
Set<int>>` would be allowed, but `List<Set<? extends int>>` would not be.

### &sect;2.3 Syntax Sugar

The JVM barely supports generics (instances don't preserve their type
arguments, and method signatures don't either), and changing the JVM is a
major undertaking so this proposal does not include any required changes to
the JVM. Therefore, Something like `List<int> x;` is effectively syntax
sugar similar to how `List<String> x;` is effectively syntax sugar.

### &sect;2.4 Legality of primitive type arguments.

A primitive can be bound to a type argument anywhere its wrapped type would
be legal. Therefore, `int` is valid when the bound is `T extends Number`,
but it would not be if the bound was, for example, `T extends InputStream`.

### &sect;2.5 The *typearg_int* imaginary type.

Anytime a primitive type is encountered as a type argument, the compiler
needs to keep track of the notion that the primitive type is used as type
bound. For example, the compiler will usually have to inject synthetic
methods that replace the primitive type with its wrapper type. To help
clarify the actions that the compiler needs to take, we'll call any `int`
that appears in a location where it is being used to fit a type variable as
the imaginary `typearg_int` type. This isn't a real type, just a virtual
construct of the compiler. This concept already exists in java. If, for
example, you write the following code:

    public abstract class Example implements List<String> {
        @Override public boolean void add(String x) {
            return true;
        }
    }

and inspect the class produced by javac, you'll notice 2, and not 1, `add`
method; you'll see `add(Object)` as well as `add(String)` in the generated
class file. The `add(Object)` simply passes the call to the `add(String x)`
method, which involves a type cast. For primitive types, this concept is no
different:

    public abstract class Example implements List<int> {
        @Override public boolean void add(int x) {
            return true;
        }
    }

would result in not just `add(int)` but also `add(Object)` which simply
passes the call on. In addition to a type cast (to `java.lang.Integer`),
this wrapper method will also auto-unbox the object.

### Sugar

#### &sect;2.6 Field Read

Anytime a field of type "int" is read, the compiler emits bytecode:

    GETFIELD/GETSTATIC "java/awt/Point" "x" "I"

However, if a field of type "typearg_int" is read, it is unboxed first, so,
the compiler emits:

    GETFIELD/GETSTATIC "java/awt/Point" "x" "Ljava/lang/Object;"
    INVOKEVIRTUAL "java/lang/Integer" "intValue" "()I"

In other words, the field is assumed to be of the erased typevar's type
(Object in the example, but could also be e.g. `Number`), is assumed to
contain an `Integer` object, and `intValue()` is called on it.

This act may result in the usual `ClassCastException` if heap pollution has
occurred, but can additionally also result in a `NullPointerException`, even
without heap pollution. This is analogous to auto-unboxing causing an NPE,
which is in the current Java6 also possible even if heap pollution has not
taken place. The java community has embraced autoboxing/unboxing even with
this shortcoming in the typing system, hence this similar case, which is not
readily solvable, can be considered the same way: A problem, but not a
show-stopper.

#### &sect;2.7 Field Write

Analogous to field read; instead of generating a `PUTFIELD` of type "I", the
erased typevar's type is used instead, and a call to `INVOKESTATIC
"java/lang/Integer" "valueOf" "(I)Ljava/lang/Integer;"` is inserted before
the `PUTFIELD` instruction.

Unlike the field read, there's no NullPointerException issue here.

#### &sect;2.8 Creating a field with a "typearg_int" type.

This isn't possible; Fields cannot be overridden, they can only be shadowed.
Fields *can* of course be defined as having a type variable as type, but a
type variable's actual type is unknown. For example:

    public class Foo<T> {
        T x;
    }

The `T` can be anything and is in practice erased to `java.lang.Object`.
This erasure will not change because of this proposal, and an erasure to a
primitive type isn't possible, because `<T extends int>` is not legal.

#### &sect;2.9 Method invocation

When invoking a method where one or more parameters is of type
`typearg_int`, then during the creation of the bytecode to load the
parameters on the stack, an `INVOKESTATIC` to `Integer.valueOf(int)` is
generated to box the primitive. Similarly, the signature generated in the
`INVOKEVIRTUAL`/`INVOKESTATIC` call is of the erased typevar's type and not
`I`.

If the invoked method's return type is of type `typearg_int`, the return
type in the signature is as usual replaced with the erased typevar's type.
Unless the return value is discarded because the method invocation appears
as expression statement, an `INVOKEVIRTUAL` call to `Integer.intValue()` is
inserted immediately after the `INVOKEVIRTUAL`/`INVOKESTATIC` instruction.

#### &sect;2.10 Method declaration (Preamble)

This part of the spec gets somewhat complicated, but this complication is
not new! For example, imagine the following two types:

    public class A<T> {
        void foo(T in) {}
    }

    public class B extends A<String> {
        void foo(String in) {}
    }

The current javac compiler solves this scenario by generating a synthetic
wrapper with the signature: `void foo(Object)`, as `A`'s `foo` method has
that erased type (`Object` is `T`'s type bound in class `A`). This synthetic
method wraps the call to `B`. You can actually see this in action by looking
at the stack trace you get when you throw an exception from `B.foo` and then
calling it via: `A<String> x = new B(); x.foo("");`

The current java gets even more complicated if the supertype contains both
`void foo(T)` and `void foo(String)`, which is of course legal as `T` erases
to `Object`, not `String`. When defining `void foo(String)` in `B`, *BOTH*
methods are overridden. Furthermore, trying to call `foo("")` on a variable
of type `A<String>` is not possible; a *this call is ambiguous* error is
generated. This error cannot be avoided by casting the argument; only by
casting the receiver (an instance of `A`) to a raw type can `foo` even be
called in such a circumstance.

The same problems occur when primitives are legal type bounds for type
variables, and the same solutions suffice as well. There is some friction
here, as the above java puzzler-esque exercise shows, but I'm not aware of
any proposal for e.g. Project Coin to adress this problem, and thus one can
safely conclude that this oddity is not that important.

#### &sect;2.11 Method Declaration (parameter)

Let's say we have:

    public abstract class IntList implements List<int> {
        public boolean add(int v) {return false;}
   }

In this case, both the actual method is generated, as well as a synthetic
method where every `typearg_int` is replaced with `Ljava/lang/Object;` (the
erased type of typevar), i.e. both of these:

    add(I)Z
    add(Ljava/lang/Object;)Z

are generated, with the second method typecasting its argument to `Integer`,
unboxing it, and calling the first. Both `ClassCastException` and
`NullPointerException` can occur as part of this operation, but the former
only if heap pollution has taken place.

Note there is no risk of an explosion of method signatures when a large
number of `typearg_int` typed parameters show up. In a hypothetical method
`add(T1 a, T2 b, T3 c)` with all 3 parameters a type variable, when all type
variables are bound to a primitive type, then only two methods are
generated: all 3 as primitive, and all 3 as wrapper type. Entirely analogous
to the convoluted example at the beginning of this section, if this method
is accessed as a `List<int>`, the wrapper type version will be called, but
code that works directly with the `IntList` type will generate a call to the
primitive version.

#### &sect;2.12 Method Declaration (return type)

Let's say we have:

    public abstract class IntList implements List<int> {
        public int get(int idx) {return 0;}
    }

In this case, again two versions are generated:

    get(I)I
    get(I)Ljava/lang/Object;

Where `Object` comes from the erased type of this particular version of
`typearg_int`. The second method simply calls on the first, boxes the
result, and returns that.

If a method contains both `typearg_int` in its parameter list as well as its
return type, still only 2 methods are produced; one with all-primitives, and
one with all-wrappers. The wrapper will unbox parameters and box the
returned value.

### &sect;2.13 The JRockit gambit

It is virtually impossible to make something like this:

    List<int> list = new ArrayList<int>();

as efficient as an `int[]` without reification; `ArrayList`'s implementation
creates an `Object[]` to store the list's elements in, and once it does
this, the performance game is automatically lost - it must create an `int[]`
array to get performance that is comparable to the performance of a raw
`int` array. However, it couldn't possibly do this; when `ArrayList`'s
constructor is called, its type parameter has been erased, so `ArrayList`
simply cannot tell what kind of array it should make. However, all hope is
not lost here, as this can be made quite efficient:

     List<int> list = new IntArrayList();

or, to avoid up to 8 new classes in java.util, these specialized classes can
be turned into inner classes of ArrayList, and ArrayList can gain a new
"static" constructor:

    List<int> list = ArrayList.with(int.class);

The above example is poor man's reification: This way ArrayList's "with"
static method can in fact create a new IntArrayList by runtime-inspecting
the `int.class` parameter. However, even with the receiver (`IntArrayList`)
being specifically built with the `int` primitive type in mind, and the
caller (the one that declared `List<int> list`) also operating solely on the
primitives level, the above compilation strategy still means that the
following code:

    list.add(10);

results in the following bytecode:

    ICONST 10
    INVOKESTATIC java/lang/Integer valueOf(I)Ljava/lang/Integer;
    INVOKEVIRTUAL java/util/List add(Ljava/lang/Object;)Z

    .... execution moves to `IntArrayList`'s code

    INVOKEVIRTUAL java/lang/Integer intValue()I
    INVOKEVIRTUAL java/util/IntArrayList add(I)Z

    ....

The hotspot compiler will eventually inline the static call and the last 2
calls (as they are to small final methods), but nevertheless there's an
object on the heap involved (the boxed integer), and two calls, even if
inlined.

However, the JRockit VM already eliminates sequential box/unbox calls even
across invocations, which eliminates the heap object entirely. Together with
inlining the pass-through call from add(Object) to add(int) this call is
then just as fast as calling the *GNU Trove* `TIntArrayList`. See <
http://blogs.oracle.com/ohrstrom/2009/05/pulling_a_machine_code_rabbit.html>
for a treatise on how the JRockit VM handles eliminating sequential
box/unbox operations.

In situations where either the receiver (such as `java.util.ArrayList`) or
the sender (some generic code that created a new list of it's own type
variable `E`) is not working exclusively with `int` this optimization
obviously won't work, _but_, this optimization would be impossible without
at least reification, which is not feasible for JDK7.

### &sect;2.14 Generics Conversion

A type such as `Map<String, String>` can be treated as a `Map<String, ?
extends Object>` implicitly. The same property is granted to primitives; a
primitive will readily convert to its wrapper type, which includes
conversion to a (bound) wildcard as `Integer` has a place in the type
hierarchy. In other words, a `Map<int, boolean>` will implicitly convert to
a `Map<? extends Number, ?>`, for example. This leads automatically to a
dilemma; it means `List<int>` can be treated as a `List<Integer>`, and in
that way, `null` can be added to this list which isn't technically
compatible with primitives, and as a result, treating this list as a
`List<int>` again can result in a `NullPointerException`. This type
conversion feels like autoboxing/unboxing which suffers from the same
problem, and as covered this omission in the type system has been deemed
acceptable before. As has been stated, `int` as a type bound is legal
anywhere `Integer` would be, which implies that converting from for example
`List<Integer>` to `List<int>` is also legal.

### &sect;2.15 The benefit to the java community - recap

Even without the JRockit optimization (which can at any time be added
without requiring changes to this spec, which is focussed on the JLS and the
compiler, and not the VM), this proposal is highly beneficial to the java
community. Generified libraries virtually never offer primitive-optimized
versions. For example, there is no ArrayList-like class for primitive ints
in the java runtime; one would have to reach for third party libraries such
as GNU Trove to find them. The large amount of code duplication that is
required to produce a primitive-optimized version for all 8 primitive types
is also very inpractical. It is clear, then, that the java community has
decided to accept the performance penalty, and more problematically, the
readability penalty, by choosing to use for example `List<Integer>`. The
readily available alternative, arrays, are continuously problematic in new
proposals, as they don't work well together with generics and have different
variance rules.

This proposal would also help greatly in APIs which are designed with
primitive arrays in mind. For example, the java.util.Arrays class has a
utility method to sort an int[], but only by the natural order of integers.
There is no `Comparator` based method for `int[]` in `java.util.Arrays`;
there is only such a method for sorting `T[]`, and `T` can currently only
stand for non-primitive types. With this feature, the following code could
be legal:

    int[] array = {1, -3, 2, 8, 6};
    Arrays.sort(array, new Comparator<int> {
        public int compare(int a, int b) {
            a = Math.abs(a); b = Math.abs(b);
            return a < b ? -1 : a > b ? 1 : 0;
        }
    });

or, with closure support:

    Arrays.sort(array, #(int a, int b) {
        a = Math.abs(a); b = Math.abs(b);
        return a < b ? -1 : a > b ? 1 : 0;
    });

In other words, convenient syntax, fast execution, while *keeping* the
`Comparator` concept. Comparator cannot obviously be removed (it would not
be backwards compatible), and it would be inconsistent for the sorting APIs
to work with `Comparator` when sorting Objects, but to work with `#int(int,
int)` closures when trying to sort an array of primitive ints.

### &sect;3 Curses! What about arrays!

There's one case that this proposal has not yet covered: What happens when
`typearg_int` is used in an array?

Let's say we have:

    public class Foo<T> {
        protected T[] array = null;
        public abstract void method(T[] param);
        public abstract T[] returnType();
   }

What would happen if we tried to write:

    public class Bar extends Foo<int> {
        @Override public void method(int[] param) {
            this.array = param;
        }

        @Override public int[] returnType() {
            return param;
        }
    }

This question seems academic, in that generics and arrays almost never meet,
but that isn't quite true: `java.util.Collection` itself contains `public
abstract <T> T[] toList(T[] in);`. It is important to realize that because
arrays and generics are already convoluted when used together in java6, the
`T` in that signature is not related to the `E` of List itself (the `E` in
`public interface Collection<E>`!). For usage in method declarations the
wrapper methods can create a new array and copy each element,
autoboxing/unboxing on the way. This is mind-bogglingly inefficient, but it
is simple. In practice the authors strongly suggest the implementer of
`IntArrayList` write a specific method to get a straight `int[]` array
without this convolution which obviously cannot be optimized by the VM the
way a single `int` to/from `Integer` conversion can be. For array access,
the proposal suggests treating this concept as extremely rare and thus
coming up with the simplest thing that could possibly work: Access to the
array (read/write) is allowed and involves boxing as usual, but a reference
to `Foo.array` itself will simply remain its erased type and emits a
warning. Therefore:

    public abstract class Bar extends Foo<int> {
        void test() {
            Object[] o = array; //warning but legal
            int[] p = array;    //error
        }
    }

Nevertheless this concept introduces an API incompatibility: The javadoc for
`toList` states that if the input array has enough room, the same array will
be returned, and this cannot of course be done if the input array is first
number-for-number copied into a `Object[]` array, and then after the copy
operation has completed, this array is translated back to `int[]`. This
proposal offers no solution to this dilemma. It can only note that as `int`
is currently illegal in generics, it is technically backwards compatible to
add a note to the javadoc that explains the same array is *not* returned
even if it has enough room if the type bound is a primitive.

It also shows a gap in the API of a hypothetical `IntArrayList`: The
signature (as mandated by the `Collection` interface) must be `public <T>
T[] toArray(T[] in)`, and as `T[]`'s erasure, which is `Object[]`, is not
compatible with `int[]`, this method cannot be made fast even for a specific
subtype such as `IntArrayList`. This proposal offers no pragmatic solution
other than to create a custom method in `IntArrayList` as well as a static
utility method in `java.util.Collections` that can translate any list to
`int[]`, which uses an instanceof check to pick up
`java.util.ArrayList.IntArrayList`s special method and use that.

The second issue raised is, in `public <T> T[] toArray(T[] in)`, what
happens if the calling code ends up binding `int` to `T`? This is a
particularly painful process: Javac needs to generate the conversion from
`int[]` to `Integer[]` prior to calling the method, and then the method will
most likely convert back to `int[]`, and on the return phase the same double
convert occurs for a grand total of 4 conversions. To avoid accidental
triggering of this potentially very slow operation, a warning should be
emitted anytime the compiler generates conversion code to convert a
primitive array to a non-primitive array in order to fit with the erased
signature of an invoked method. (For fields, as mentioned, the arrays remain
their erased type, and an attempt to treat them as primitive array results
in an error).