diamond operator & implementation strategies (v3)

Fri Aug 21 11:07:22 PDT 2009

Hi
We have spent some time playing with the diamond operator[1, 6] - the
result of this work are two different prototypes, one implementing the
'simple' strategy discussed in my earlier post[2, 3], another one
implementing a more sophisticated approach that has been proposed by
Neal [4,5] (in the remainder of this mail I'll refer to the latter as
the 'complex' approach).

The goal of this work has been to evaluate the impact of both strategies
on existing code bases - in particular we were interested in knowing the
fraction of constructors that could take advantage of the diamond
operator and to see whether the two implementation strategies led to a
significant difference w.r.t. the applicability of the new diamond operator.

We have spent some more time performing benchmarks in order to evaluate
both implementation strategies. The results produced by the two
algorithms have been compared in order to check whether the inferred
type was still applicable to the given site. More formally, let S be the
generic type specified by the new expression in the pre-diamond code and
T the type inferred by the diamond inference algorithm for that
expression. There are 3 possible cases:

    a) T == S
    b) T != S, but T and S are interchangeable given the call site (e.g.
because T is a subtype of S)
    c) T != S, with T and S not interchangeable

a) and b) mean success, while c) means failure (the algorithm yields a
result which is not equivalent to the explicit type specified by the
programmer). In order to evaluate the fitness of a given implementation
strategy, we have instrumented the two prototypes in order to emit
detailed info about each new expression encountered by the compiler. In
particular, for any given new expression whose target type is a generic
type, both inference schemes have been applied and their results
compared to the original generic type specified in the source code. The
info generated by the two prototypes can be grouped into 3 categories
(roughly corresponding to the classification above):

*) "redundant" - this denotes a situation where generic type annotations
can safely be removed from a new expression (see case (a,b) above). In
order to provide more detailed info, We have splitted this category into
two sub-categories:

-) "redundant.normal" - this denotes a new expression appearing in an
assignment context, more specifically where the type inferred by the
diamond operator is a subtype of the expected type

-) "redundant.no.context" - this denotes a new expression appearing in a
place lacking of an assignment context, more specifically where the
inferred type is a subtype of the original generic type

*) "not.redundant" - this denotes a situation where there is no
guarantee that removing generic type annotations from a new expression
would yield to a sound expression (see case (c) above). In this case the
inference algorithm exploited by the diamond operator has inferred a
type T which is incompatible with both the expected type PT and the
original type G (the one including generic type annotations). Note: it
is possible that, by chance, the type inferred by the diamond operator
would type-check where it is used (e.g. because that type is the type of
an actual argument in a method call, where the corresponding formal is
Object).

*) "failure" - this denotes a failure of the inference algorithm. In
this case it is definitively not possible to omit the generic type
annotations, as the type inferred by the diamond operator is not valid
(e.g. violates some constraints specified by the type-variables'
declared bounds).

Here's a brief summary of the statistics regarding the applicability of
the diamond operator in three systems (see [7] for more detailed info):

*** OpenJDK ***

Overview:

new expressions: 104138
   new expressions w/ generic type: 5076 (4.87%)

Results:

redundant types (overall):  simple 4409 (86.88%) / complex 4533 (89.30%)
   redundant (assignment): simple 4353 (85.76%) / complex 4352 (85.74%)
   redundant (other contexts: simple 56 (1.1%) / complex 181 (3.57%)
non redundant: simple 662 (13.04%) / complex 536 (10.56%)
failures: simple 4 (0.08%) / 7 (0.13%)

*** Tomcat ***

Overview:

new expressions: 6048
   new expressions w/ generic type: 153 (2.53%)

Results:

redundant types (overall):  simple 148 (96.73%) / complex 148 (96.73%)
   redundant (assignment): simple 148 (96.73%) / complex 148 (96.73%)
   redundant (other contexts: simple 0 (0%) / complex 0 (0%)
non redundant: simple 5 (3.38%) / complex 5 (3.38%)
failures: simple 0 (0%) / complex 0 (0%)

*** Netbeans ***

Overview:

new expressions: 94768
   new expressions w/ generic type: 12010 (12.67%)

Results:

redundant types (overall):  simple 10670 (88.84%) / complex 11085 (92.30%)
   redundant (assignment): simple 10628 (88.49%) / complex 10610 (88.34%)
   redundant (other contexts: simple 42 (0.35%) / complex 475 (3.96%)
non redundant: simple 1338 (11.14%) / complex 907 (7.55%)
failures: simple 2 (0.02%) / complex 18 (0.15%)

The numbers are relatively stable across all benchmarks - to summarize,
around 90% of all new expression whose target type is generic can be
simplified using diamond (regardless of the chosen implementation
strategy). The remaining 10% correspond to situations in which some type
have been inferred - but there is no guarantee that the code would
type-check (e.g. because the inferred type is not a subtype of the type
in the code).

No approach looks noticeably better than the other; the simple approach
is weaker when it comes to infer types in an argument position - on the
other hand the additional amount of types inferred by the complex
approach is relatively low, and often the complex approach is less
stable - that is there are more cases in which complex inference fails
without inferring any type (this problem could be fixed, in principle,
by improving standard javac type-inference, but such a task is outside
the scope of the project Coin).

The area in which the complex approach is stronger is where a class has
recursive bounds; in such cases there's almost no way for the simple
approach to infer a correct type, while the complex approach usually can
rely on the type of some actual arguments in order to infer the right
type (obviously this doesn't apply if the constructor is a no-arg
constructor). Here's an example (extracted from OpenJDK):

from DefaultMXBeanMappingFactory.java

private static <T extends Enum<T>> MXBeanMapping
            makeEnumMapping(Class<?> enumClass, Class<T> fake) {
        return new EnumMapping<T>(Util.<Class<T>>cast(enumClass));
//would work with complex but not with basic
    }

[...]

private static final class EnumMapping<T extends Enum<T>> [...]

On the other hand there are situations in which the complex approach
infers a type which is too specific and is thus incompatible with the
LHS. Because of that, the amount of failures in the complex approach is
slightly higher. Consider the following sample snippet (extracted from
OpenJDK):

from Snapshot.java

    private SoftReference<Vector> finalizablesCache;

    [...]

    Vector<JavaHeapObject> finalizables = new Vector<JavaHeapObject>();

    [...]

    finalizablesCache = new SoftReference<Vector>(finalizables); //would
work with basic but not with complex

    [...]

As a final notice, it can be seen that collection classes are the
'killer app' for the diamond operator. Moreover the amount of raw
constructor calls where the target type is a raw collection is
relatively high across all benchmarks [7] (the 'most generified source'
award goes to NetBeans).

 From our analysis, we conclude that the simple approach should be
preferred, as the extra complexity required by the complex approach does
not entirely pay off. As said above, this is because the scope of
javac's method type inference is limited (e.g. no return type-inference
is applied when a generic method call occur in argument position).
Because of that, the complex approach is able to deliver only a small
improvement over a simpler and straightforward approach. Moreover, our
initial concerns about code refactoring seem to be confirmed by the
numbers given in the benchmarks - it is possible to notice how, across
all benchmark, the simple approach gives better results in straight
assignment contexts (except for the Tomcat benchmark, where the two
prototypes give the same results). In order to improve the behavior of
the complex approach in assignment contexts some non-trivial changes to
javac's method type inference are required (e.g. perform a single round
of type inference where all constraints - formals and return type - are
considered at once). Again, such changes are risky, and definitively
outside the scope of project Coin.

Here's a brief outline of the JLS changes that would be required in
order to support the simple approach:

15.9 (grammar)

This production:

"ClassBodyopt:
         Primary. new TypeArgumentsopt Identifier TypeArgumentsopt
(ArgumentListopt) ClassBodyopt"

Should be changed as follows:

"ClassOrInterfaceType:
         Primary. new TypeArgumentsopt Identifier
TypeArgumentsOrDiamondopt (ArgumentListopt) ClassBodyopt

TypeArgumentsOrDiamondopt: TypeArguments
                           <>"

Moreover, section 15.9.1 (Determining the Class being Instantiated)
should be adjusted so that the following paragraph is added:

"If the type denoted by ClassOrInterfaceType has an empty type argument
list (diamond operator), then the following steps must be applied in
order to infer the actual type of the class being instantiated:

*) Let C be the name of the (generic) class being instantiated, and X1,
X2... Xn the formal type arguments declared by class C. Moreover, if the
instance creation expression occurs in a context where it will be
subject to assignment conversion (§5.2) to a type S, then let G be C<X1,
X2 ... Xn>,

*) Consider the following initial set of constraints:

     * the constraint S >> G
     * additional constraints Bi >> Ti, where Bi is the declared bound
of Xi,

*) The above constaints are used to infer constraints on the type
arguments using the algorithm of section (§15.12.2.7). Any equality
constraints are resolved, and then, for each remaining constraint of the
form Xi <: Uk, the argument Xi is inferred to be glb(U1, ..., Uk) (§5.1.10).

Any remaining type variables that have not yet been inferred are then
inferred to have type Object"

[1]
http://mail.openjdk.java.net/pipermail/coin-dev/2009-February/000009.html
[2] http://mail.openjdk.java.net/pipermail/coin-dev/2009-May/001720.html
[3] http://cr.openjdk.java.net/~mcimadamore/6840638/webrev.0/
[4] http://mail.openjdk.java.net/pipermail/coin-dev/2009-March/000075.html
[5] http://cr.openjdk.java.net/~mcimadamore/6840638/webrev.1/
[6] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6840638
[7] http://blogs.sun.com/mcimadamore/resource/diamonds_statistics.pdf

Maurizio