Migrating methods in Collections

Wed Dec 16 17:18:27 UTC 2015

I'd like to start with the "What about Collections" discussion. While 
this topic is inextricably linked to all the other topics (value types, 
generic specialization, nullability, etc), to avoid getting overwhelmed, 
let's start by trying to keep our focus on the Collections API and the 
process of migrating it into an anyfied world.  So, I'll offer an 
ahead-of-time warning: there's going to be lots here that you will want 
to question, and to many of those questions I'm going to answer "OK, but 
let's come back to that later", because I'd like to stay focused for the 
time being on the Collection-related issues.  (Let's also try to create 
separate threads for separate discussion topics, though I know this is 
harder to remember to do.)

Most of these are already prototyped in some form in the Valhalla repos 
(which at least demonstrates that they are plausible.)

BackgroundMaterial
-------------------

The last "State of the Specialization" 
(http://cr.openjdk.java.net/~briangoetz/valhalla/specialization.html, 
Dec 2014) outlines an early approach towards the specialization 
challenges and features.  My talk at JVMLS 2015 
(https://www.youtube.com/watch?v=uNgAFSUXuwc) outlines the progress that 
had been made between last December and August.  The State of the Values 
(http://cr.openjdk.java.net/~jrose/values/values-0.html) outlines the 
basic model for value types.  If you've not read/viewed these, that's a 
good place to start.

Core Goals
----------

The key requirement that drove many of the design choices in Java 5 when 
adding generics is *gradual migration compatibility*.  This means it 
must be possible to evolve a non-generic class to be generic in a manner 
that is binary-compatible and source-compatible for both clients and 
subtypes.  So an existing client or subtype should continue to work 
whether it is recompiled or not, and generified or not, and it should be 
equally practical to generify subtypes and clients at the same time as a 
library class, or later, or never.

We adopt similar goals for "anyfication" -- anyfying a class should be 
binary- and source-compatible for clients and subclasses, and it should 
be possible to anyfy a generic class without requiring that either its 
clients or subclasses be anyfied, and whether or not they are recompiled.

Basic Model, Ultra-summarized
-----------------------------

Just as the requirement of gradual migration compatibility was a 
significant design constraint in Java 5 (and which pushed us strongly 
towards erasure), this requirement is going to be a significant 
constraint here.  Essentially, this means that parameterizations like 
ArrayList<String> need to continue to be erased (otherwise they'd not be 
binary compatible), but parameterizations like ArrayList<int> need to be 
reified (since we cannot erase int to Object without boxing, and if 
boxing were good enough, we wouldn't be bothering at all.)  This pushes 
us towards an interpretation of anyfied generics that is erased over 
reference instantiations and reified over value instantiations.

At the same time, there is code that is valid under the assumption that 
T <: Object that would not be valid if T can take on 'int'.  These 
include domain assumptions (e.g., assignment to null) and identity-based 
assumptions (a T can be used as a monitor lock.)  Accordingly, we cannot 
simply reinterpret existing reference-generic code to be any-generic; we 
need some indication from the user that they are opting into the 
broadened genericity domain (and therefore willing to accept the 
limitations of this broadened domain.)  Our working model for this is to 
annotate the declaration of a type variable with an "any" modifier 
(e.g., "class Foo<any T>").  We sometimes use the abbreviation 'tvar' to 
describe a type variable, and 'avar' to describe an any-type variable 
(similarly, 'ivar' for an inference type variable.)

Conceptually, the enhanced generics model is simple:
  - Non-anyfied generic code means exactly the same thing in Valhalla as 
it does in Java 5-9;

  - A reference instantiation of an anyfied generic class means exactly 
the same thing in Valhalla as it does under erased generics;

  - A class or method tvar declaration can be annotated with 'any', in 
which case it can range over value types (including primitives) as well 
as reference types.  There are some restrictions on what you can do to 
an avar-typed variable to reflect the fact that values may not be 
nullable, have no monitor locks, that they don't participate in a 
subtyping relationship with Object (instead, they have a boxing 
conversion to Object), and that a V[] is not a subtype of Object[];

  - There is a new wildcard type, Foo<any>, which is a supertype of both 
value and reference instantiations of Foo.

The enhanced model embraces erasure more explicitly than the current model.

Migration Challenges
--------------------

Ideally, we could just sprinkle 'any' over our codebase, and be done -- 
and hopefully for many classes, this is all that will be required.  But 
for some classes, there may be migration challenges, which come in two 
forms:

  - A method signature is fundamentally incompatible with anyfication 
(such as Map.get(), which returns null when the mapping is not present, 
which makes no sense if V=int);

  - The existing implementation appeals to assumptions of object-ness, 
and needs to be adjusted.

For the purposes of this memo, I'm going to restrict myself to the first 
category; we'll come back to the second category later.  Here's a 
more-or-less complete list of methods in Collections that may need some 
sort of migration help. A blank cell indicates "same as above."

*Class**
* 	*Method**
* 	*Issues**
*
Collection
	contains(Object)
	Assumes all Rs are castable to Object without boxing.

	remove(Object)

	removeAll(Collection<?>)
	Should take Collection<? extends E>. Collection<?> not a supertype of 
all instantiations.

	retainAll(Collection<?>)

	containsAll(Collection<?>)

	toArray()
	Returns Object[], which is not a supertype of V[] for any value type V.

	toArray(T[])
	Not strictly problematic, but relies on runtime checking that E <: T 
and on reflection for implementation.
List
	remove(int)
	Not strictly problematic, but if remove(Object) becomes remove(T), will 
be a confusing overload.

	indexOf(Object)
	Same as contains(Object).  Also, would be better as a lambda-consuming 
method, now that we have lambdas.

	lastIndexOf(Object)

Map
	containsKey(Object)
	Same as contains(Object)

	containsValue(Object)

	remove(Object)

	put(K,V)
	Returns what was there before, or null if there was no mapping.  Not 
strictly problematic, but doesn't project so well to non-nullable Vs.

	putIfAbsent, replace, computeIfAbsent

	get(K)
	Uses null to signal no mapping

	getOrDefault(Object, V)
	Same as contains(Object)
Queue
	poll(),peek() 	Uses null to signal no element
Deque
	poll(), peek()

	pollFirst(), pollLast()

	peekFirst(), peekLast()

	removeFirstOccurrence(), removeLastOccurrence()
	Same as remove(Object)

It has often been suggested that "maybe it is time for Collections 2.0"; 
some would like to declare this to be the end of the road for existing 
collections.  However, redoing collections is only part of the battle; 
the Collection types are riddled throughout the JDK and other libraries, 
so we would need a mechanism for migrating all the methods that 
consume/dispense collections, and if we were to build new collections, 
saddling them with compatibility with old collections would be a big 
stone to hang around their neck.  So I think this path is less appealing 
that it might first appear.  However, the techniques described here 
allow us to fix some of the errors of the past.

Partial Methods
---------------

Our approach is guided by the following observation: not all methods 
declared in a generic class Foo<any T> need be members of all 
instantiations of Foo; it is reasonable for some members to be 
restricted to certain instantiations. In particular, the instantiation 
Foo<all type vars instantiated with erased ref> is an important 
distinguished instantiation.

If a method m() in Foo<any T> is a member of all instantiations of 
Foo<T>, we say m() is a *total* method. Otherwise we call it a 
*restricted* or *partial* method.  For each method in an existing 
generic class to be anyfied, it could be made a total method in the new 
anyfied class, or it could be restricted to reference instantiations -- 
and either approach is fully compatible with existing clients and 
subclasses. This provides us a migration path to "leave behind" certain 
problematic methods (making them members only of reference 
instantiations), and also to add new methods as long as there is a 
default implementation at least for reference instantiations.

As a simple illustration of this approach, let's say that we want to 
effectively rename the method List.remove(int) to removeAt(int).  
Clearly we cannot simply take away remove(int); existing clients would 
fail to link.  Similarly we cannot simply add a new method removeAt(int) 
without a default; then existing subclasses would fail to compile.

What we do here is (using a strawman syntax) add a new total method, 
with a ref default, and demote the existing method to a partial method 
on ref instantiations:

interface List<any T> {
     // A new, total method
     void removeAt(int i);

     // demote the old method to ref-only
     <where ref T>  // Just a strawman syntax
     void remove(int i);

     // give the new method a default in terms of the old, for ref
     <where ref T>
     default void removeAt(int i) { remove(i); }
}

Now:
  - Clients of List<ref T> can invoke either remove(int) or 
removeAt(int), and both will do the same thing; this allows existing 
clients to (if they want) migrate away from the old method completely, 
since removeAt(int) is total, or keep using the old method, at their 
choice (IDEs can also provide migration help);

  - New clients of List<any T> can only use removeAt(int) -- they won't 
even *see* remove(int) when they ask their IDE to auto-complete -- and 
this is not a compatibility issue as there were no existing clients of 
List<any T>;

  - Existing (ref) subclasses of List<T> still see the same set of 
abstract methods, and therefore continue to work exactly as before;

  - When anyfying a subclass of List<T>, now you have to provide a new 
implementation for removeAt(int), but this is OK because *you* have 
decided to anyfy your class, and it is reasonable to expect some 
possible code changes in this situation;

  - Further, AbstractList can insulate most subclasses from this change.

This gives us at least two tactics for dealing with problematic methods:

  - Migration: leave the old method in the "ref layer", and create a new 
total method with a ref-default in the "any layer", as with the removeAt 
example above;

  - Abandonment: leave the old method in the ref layer, and do nothing 
else, which may be suitable if an alternate idiom already exists (e.g., 
we could abandon removeAll because now we can express removeAll(c) as 
removeIf(c::contains), without adding any new method.)

With the controlled-migration option, the primary impediment is that 
many of the good names are already taken.

These two tactics get us much of the way there, but not all the way 
there.  I'm going to close this note with where these tactics get us, 
and then open a separate note for some of the options for the remaining 
cases.  The following table shows some possibilities.  There's plenty of 
time to bikeshed on the names.

*Class**
* 	*Method**
* 	*Possible Approaches**
*
Collection
	contains(Object)
	Migrate to containsElement(E)

	remove(Object)
	Migrate to removeElement(E)

	removeAll(Collection<?>)
	Migrate to removeElements(Collection<? extends T>).
Alternately, abandon in favor of existing removeIf(Predicate<? super E>).

	retainAll(Collection<?>)
	Same

	containsAll(Collection<?>)
	Migrate to containsElements(Collection<? extends E>).

	toArray()
	Nothing good yet ...

	toArray(T[])
	Leave as is, or abandon in favor of new method 
toArray(IntFunction<E[]>), like Stream has.
List
	remove(int)
	Leave as is, or migrate to removeAt(int).

	indexOf(Object)
	Migrate to Optional-bearing findFirst(Predicate<? super E>)

	lastIndexOf(Object)
	Migrate to findLast(Predicate<? super E>)
Map
	containsKey(Object)
	Migrate to hasKey(K)

	containsValue(Object)
	Migrate to hasValue(V)

	remove(Object)
	Migrate to removeMapping(K)

	put(K,V)
	Leave as is; accept that we cannot distinguish between "nothing was 
there before" and "default value was there before" (as is true with null 
today.)

	get(K)
	Migrate to one (or all) of:
Optional<V> map(K)
mapOrElse(K, V)
tryMap(K, Consumer<? super V>)

	getOrDefault(Object, V)
	Migrate to mapOrElse, as above
Queue
	poll(), peek() 	Migrate to tryPoll(Consumer) *or* optional-bearing method
Deque
	poll(), peek()

	pollFirst(), pollLast()

	peekFirst(), peekLast()

	removeFirstOccurrence(), removeLastOccurrence()
	Migrate to predicate-accepting method, or simply migrate to new name

Notes:

For Collection.contains() and remove(), while it may be tempting to try 
and narrow the argument type from Object to T, this is likely 
problematic.  In the following case, we can't express something that 
seems pretty natural:

     Collection<Animal> animals = ...
     Collection<Dog> dogs = ...
     Dog d = ...

     // Can't say these
     for (Animal a : animals)
         if (dogs.contains(a)) ...
     animals.removeIf(a -> dogs.contains(a))

Its actually pretty convenient for contains/remove to accept anything 
that might be a member of the collection, but Object is not the top type 
we're looking for here (since values require boxing to convert to 
Object.)  So neither the status quo (accept Object) nor recasting to a 
T-accepting method is all that great. (But see next memo for an 
alternative.)  Its not clear whether all the other Object-accepting 
methods need the same treatment, or if migration to an E-accepting 
method is acceptable there.

For removeAll/retainAll, we probably just want to abandon in favor of 
the more powerful removeIf added in 8.  (Kevin/Louis -- would be good to 
get stats on actual usage of removeAll/retainAll.)

For Map.get(), it really sucks that the good name is taken but the 
existing signature is terminally polluted with nullness (even for 
non-null-supporting maps.)  We can migrate to multiple new methods (most 
of which have defaults in terms of the others):
     Optional<V> map(K k)
     V mapOrElse(K k, V defaultV)
     boolean tryMap(K k, Consumer<? super V>)

With Optional as a value type, there is essentially no cost (either 
footprint or invocation overhead) for using Optional over a naked 
reference.  So the first sig is not as problematic as it might look.  
The second has an obvious default in terms of the first.  The Optional 
version, though, has an issue: if the map can contain null values, 
there's no way to express that.  However, I think its OK if this method 
throws on null values -- the other three get-like methods (the two here 
plus legacy get()) can all express null values.  (So null-lovers don't 
get to enjoy the Optional goodness.  More motivation for them to give up 
on their nulls!)