Project Lambda: Java Language Specification draft

Sat Jan 23 00:02:43 PST 2010

On Jan 22, 2010, at 2:55 PM, Alex Buckley wrote:

That's a very good cut at a complex problem; thanks.

> * Issues
> 
> - The draft tries to take no position on when lambda expression
>  evaluation physically occurs.

+1 It's a similar situation to Integer.valueOf, but with more degrees of freedom.

Basically, if you can get away with hoisting a closure up a few scopes (because no inner scopes are captured), the compiler should be free to do so.  The only way to catch the "cheat" is to ask about object identities.  Such questions should be defined as having indeterminate answers. Note that the ECMAScript spec. very carefully allows such hoisting for JavaScript closures.

Phrasing this in terms of "when evaluation occurs" may not be the best formula, though.  How about saying that evaluation occurs at the normal time, but that if two evaluations would produce functions with identical behavior, the system is free to return the same function object both times?
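A minimal sketch of the Integer.valueOf analogy, in plain Java.  The class and method names are hypothetical.  Value equality is always defined, while object identity is only guaranteed inside the cached range -128..127, so code that asks the identity question outside that range gets an implementation-dependent answer.

```java
class IdentityDemo {
    // Value equality is well defined for every int.
    static boolean sameValue(int n) {
        return Integer.valueOf(n).equals(Integer.valueOf(n));
    }

    // Identity is only guaranteed for values in -128..127.  Outside that
    // range the result depends on the implementation's cache, so callers
    // should treat the answer as indeterminate.
    static boolean sameObject(int n) {
        return Integer.valueOf(n) == Integer.valueOf(n);
    }
}
```

The same contract could apply to function objects: equal behavior is observable, but whether two evaluations share one object is not something a program should depend on.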

BTW, JSR 292 is likely to support a very efficient implementation for statically definable method handles, using the 'ldc' bytecode.  This combined with partial application is a promising implementation technique for function objects.
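A rough sketch of "method handle plus partial application", assuming the java.lang.invoke API in the form JSR 292 eventually shipped (Java 7 and later).  The class and method names are hypothetical, and the reflective lookup here only stands in for a handle that a compiler could instead load as a constant with 'ldc'.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

class PartialApplication {
    static int add(int a, int b) { return a + b; }

    // Builds a function object "x -> captured + x" by binding the first
    // argument of a statically known method handle.
    static MethodHandle adder(int captured) {
        try {
            MethodHandle add = MethodHandles.lookup().findStatic(
                    PartialApplication.class, "add",
                    MethodType.methodType(int.class, int.class, int.class));
            return MethodHandles.insertArguments(add, 0, captured);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Invokes a handle of type (int)int.
    static int apply(MethodHandle f, int x) {
        try {
            return (int) f.invokeExact(x);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }
}
```

The captured value plays the role of the closed-over variable; the underlying handle for 'add' never changes and can be shared by every closure built from it.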

> - As background reading to why loop variables should not be shared, see
>  http://blogs.msdn.com/ericlippert/archive/2009/11/12/closing-over-the-loop-variable-considered-harmful.aspx.

What a tale of woe, and all the details are there!  I added a comment explaining why Java avoided this problem, by taking Lisp's DOTIMES into account.  I hope we can continue to avoid the problem. :-)

If we want to be kind to our programmers, there is probably a way to hack a 'final' keyword into the iteration variable of a C-style 'for' loop.  That way the familiar 'final' rule will allow iteration variables to be captured.  I'm thinking of something like this:

for (final int i = 0; i < limit; i++) {S;}
==>
for (int i$var = 0; i$var < limit; i$var++) {final int i = i$var; {S;}}

This is sugar for the workaround that C# programmers have to apply by hand.  The special pleading is that if a 'final' variable is (otherwise illegally) modified in the 'for' header (only), then the variable is split and the 'for' body is wrapped as above with a truly 'final' copy of the variable.  If we don't do this, programmers will hack @Shared into their 'for' variables.  They will then suffer the same consequences as the C# programmers (except with a little more warning, which is good).
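For comparison, here is the hand-written form of that desugaring as a runnable sketch.  It assumes the lambda syntax Java later adopted rather than the '#' syntax of this draft, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntSupplier;

class LoopCapture {
    // Returns one closure per iteration, each capturing its own copy of
    // the loop counter.
    static List<IntSupplier> capturePerIteration(int limit) {
        List<IntSupplier> suppliers = new ArrayList<>();
        for (int iVar = 0; iVar < limit; iVar++) {
            final int i = iVar;        // fresh 'final' copy for this iteration
            suppliers.add(() -> i);    // captures the copy, not the counter
        }
        return suppliers;
    }
}
```

Each closure sees the value its own iteration had, which is the behavior the proposed 'for (final int i ...)' sugar would produce without the manual copy.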

>  Note that a lambda in C++0x requires the programmer to enumerate the
>  variables in lexical scope accessed by the lambda expression.

I personally find that notation repellent.  It is only necessary for a language in which storage is managed by hand.  Any other benefit of explicitly calling out uplevel references is outweighed by the noisiness and increased friction for refactoring transformations.

> - This document avoids new rules by proposing a
>  specific lambda invocation syntax. See the discussion in-line.

+1 for dot-paren; it reads OK, because dot is a connective (in Java).
-1 for "pling", that excitable little non-connective.

> - Lambda conversion: ...Need a way to disambiguate.

Disambiguation is easy, for the user:  Use a SAM-type cast on the offending argument.  Given that fact, after applicability is determined (including SAMwise lambda conversions) it would be fine to give up on multiple applicable methods, report an error, and require a cast from the user.

It seems that lambda conversion is like an autoboxing conversion.  Perhaps it deserves to be a second-class conversion?  That way if there is a matching overload that takes a function type (or Object) then SAMs would not be considered at all.  This would allow some overloadings to have function types and others to have SAM types, and would provide a way to select the function types (with or without a cast).  If lambda conversion is not second-class, perhaps you can have problems with SAM types hiding plain function types, even though the SAM types are not subtypes of the function types.  Example:

  void m(#void() x);
  void m(Runnable x);
  ...
  m(#(){});  // ambiguous: neither x is a subtype of the other
  m((Runnable) #(){});  // more information leads to success.
  m((#void()) #(){});   // does this help?  Only if lambda conversion is disabled by the cast.

If SAMwise lambda conversion were allowed on arbitrary function expressions (not just lambda expressions), an autoboxing-like treatment of overloads would be required.
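A small sketch of disambiguation by SAM-type cast, assuming the lambda syntax Java later adopted.  'Task' is a hypothetical SAM type with the same shape as Runnable; function types as drafted here have no direct equivalent, so two SAM overloads stand in for the m(#void()) / m(Runnable) pair.

```java
class OverloadDemo {
    // Hypothetical SAM type with the same shape as Runnable.
    interface Task { void run(); }

    static String m(Runnable x) { return "Runnable"; }
    static String m(Task x) { return "Task"; }

    static String viaRunnableCast() {
        // An uncast call, m(() -> {}), has two applicable overloads and
        // neither parameter type is a subtype of the other.
        return m((Runnable) () -> {});
    }

    static String viaTaskCast() {
        return m((Task) () -> {});
    }
}
```

The cast supplies the target type before overload resolution runs, so exactly one overload is applicable in each call.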

By the way, if #()() means the same as #(){}, I don't see why the former has to be supported.  It just adds corner cases (as Neal pointed out).  Plus, like any periodic string, it is tricky to partially parse, hence to read at a glance.  Finally, it looks like Kilroy with an ear-phone.  :-)

> - Function types: better syntax is desirable.

+1 I've never liked the C-style prefix syntax for function types R(A).  Note that C++0x (though for their own specific reasons) is considering the classic postfix function type syntax (A)->R, so the C family will probably not remain a supporting precedent for prefix syntax.

Higher-order function notations would look like R(A)(B) vs. (A)->(B)->R.  The latter makes the order of application much clearer, which presumably is why it is the norm for typed functional languages.
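As a loose illustration of the postfix reading order, here is a curried function written against java.util.function.Function, a library type that postdates this draft.  The nested generic type reads left to right in the same order as (A)->(B)->R; the class and field names are hypothetical.

```java
import java.util.function.Function;

class CurriedTypes {
    // Take an Integer, return a function that takes an Integer and
    // returns an Integer: arguments are applied in the order written.
    static final Function<Integer, Function<Integer, Integer>> ADD =
            a -> b -> a + b;
}
```

Applying it as ADD.apply(2).apply(3) follows the same left-to-right order as the type.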

I understand that postfix arrow makes parsing harder.  Put another way, the prefix return type is a parsing crutch.  Surely we can do without it.

> - Class literals: Can you get the class literal of a function type,
>  e.g. #int(int).class ? We don't yet know the implementation of
>  function types, so the Class object underlying a lambda expression
>  is unknown. Deferred.

This, like object identity of functions, is something to be permanently coy about.  Allowing #int(int).class guarantees an infinitude of function classes, and a correspondence to #(int x)(x+1).getClass().  This forces implementors into a class-per-function-type scheme.

If JSR 292 is used, or if some other erasure technique is used which is parsimonious with classes, then #(int x)(x+1).getClass() will return some uninteresting system class, and lose information.  Meanwhile, #int(int).class would be some other system class, representing an erased function type.

I think it is better to define the runtime representation of (reified) function types independently of java.lang.Class.  Besides implementation flexibility, the main advantage of this would be completely accurate representation at runtime of all argument types, with no erasure.  This implies in turn that function references would support a runtime 'type' attribute independently of whatever class they happened to have.
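One existing model for a 'type' attribute that is separate from the class, sketched with the java.lang.invoke API as it eventually shipped (an assumption relative to this draft).  The class and method names are hypothetical.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

class FunctionTypeDemo {
    static int increment(int x) { return x + 1; }

    // Returns a handle whose type() is (int)int, regardless of which
    // implementation class the handle object happens to have.
    static MethodHandle incrementHandle() {
        try {
            return MethodHandles.lookup().findStatic(
                    FunctionTypeDemo.class, "increment",
                    MethodType.methodType(int.class, int.class));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Here incrementHandle().type() reports the exact parameter and return classes, while incrementHandle().getClass() is just some system class that carries no such information.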

> [15.8.3 this]

-1 on this with Neal and Stephen for similar reasons.

Some examples use 'this' for self-recursion, as in JavaScript.  But there are more standard and clearer ways to write a self-recursive closure: just bind it to a name!

A special binding for 'this' is a flaw in transparency, with the usual problems of refactoring friction and hard-to-read code.  It's similar to (the first half of) my beef with binding 'return' in lambda blocks, which is on record elsewhere.

Suggestion:  Either leave 'this' alone, or else (if we think users will refuse to distinguish closures from inner classes) make unqualified 'this' illegal, requiring an explicit disambiguating type prefix as 'Foo.this'.
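To show what "leaving 'this' alone" means in practice, here is a sketch in the lambda syntax Java later adopted, where 'this' inside a lambda is the enclosing instance, while 'this' inside an anonymous inner class is the inner object.  The class and method names are hypothetical.

```java
import java.util.function.Supplier;

class ThisDemo {
    // 'this' is not rebound: it names the enclosing ThisDemo instance.
    Supplier<Object> viaLambda() {
        return () -> this;
    }

    // Inside an anonymous inner class, 'this' names the inner object.
    Supplier<Object> viaInnerClass() {
        return new Supplier<Object>() {
            @Override public Object get() { return this; }
        };
    }
}
```

Moving code between a method body and the lambda does not change what 'this' refers to, which is the transparency property argued for above.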

> If the body of a lambda expression is a block, then either all or none
> of the return statements in the block must have an Expression. If no
> return statement has an Expression, then the body of the lambda
> expression is void, i.e. has no type. If all return statements have an
> Expression, then the types of the Expressions must be
> assignment-compatible with each other...

An explicit return type here would (a) remove a lot of messy and buggy specification but (b) not add greatly to verbosity.  If your closure is going to need a block, it already is more complex than an expression, and therefore heavy enough that an explicit return type is a relatively small incremental addition.

Instead of your example:
  #() { if (false) return 5; throw new Error(); } // has type #int()

Here's the sort of thing I mean:
  #()->int { if (false) return 5; throw new Error(); } // has type #()->int

Or, FTR, here's a version in which the return value gets a name instead of 'return' getting overloaded:
  #()(int x){ if (false) x=5;  else  throw new Error(); } // has type #()->int

> - If the body of the lambda expression is void, then T is void,
>  indicating no return type; otherwise, T is the type of the body of
>  the lambda expression after capture conversion.

Can a lambda expression which is a call to a void method be counted as a void body?  If so, then any expression can be a lambda body expression.  Otherwise, some body expressions will be illegal because of type, which may be surprising to users.
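For reference, a sketch of the permissive answer, using the lambda syntax Java later adopted with Runnable as a stand-in for a void function type.  The names are hypothetical.  Both a call to a void method and a value-returning call whose result is discarded are accepted as bodies.

```java
import java.util.ArrayList;
import java.util.List;

class VoidBodyDemo {
    static List<String> runBoth() {
        List<String> log = new ArrayList<>();
        Runnable voidCall = () -> log.clear();        // body is a call to a void method
        Runnable discarded = () -> log.add("added");  // boolean result is dropped
        voidCall.run();
        discarded.run();
        return log;
    }
}
```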

In general, I appreciate the attention you give to the various 'void' and 'Void' corner cases.  It's laborious, but will pay off when it's time to work with function type conversions.  Same point for 'Nothing', unfortunately.

> Any local variable, formal method parameter, or exception handler
> parameter used but not declared in a lambda expression must be
> effectively-final.

The effectively-final permission (and @Shared, if any) should be extended to inner classes, too.  It would be a compatible extension.
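A sketch of what the extension looks like for inner classes, assuming a compiler that accepts effectively-final captures there (Java 8 and later do).  The names are hypothetical.

```java
import java.util.function.IntSupplier;

class EffectivelyFinalDemo {
    static IntSupplier tenTimes(int start) {
        int base = start * 10;  // never reassigned, so effectively final
        return new IntSupplier() {
            // Captures 'base' without an explicit 'final' modifier.
            @Override public int getAsInt() { return base; }
        };
    }
}
```

Existing code that already writes 'final' keeps compiling unchanged, which is why the extension is compatible.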

As a lower-priority issue, it would be good to use Neal's work on safely-copyable variables, which generalizes effectively-final variables in useful ways.

> /*
> ...Unfortunately, one member or the other will still be
> obscured in this case:
> 
>  class C {
>    #int(int) f = #(int x)(x);
>    int f(int x) { return x; }
>  }

If that's fixable with 'f.invoke(0)' it's also fixable with 'f.(0)', as a disambiguation of 'f(0)'.

But I think we can live with requiring the extra dot.  (With method handles and reflection, we're used to 'x.invoke()', so plain 'x.()' seems like a relief.)
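A sketch of the field/method name clash with an explicit invocation step, using java.util.function.IntUnaryOperator (a later library type) as a stand-in for #int(int), with 'applyAsInt' playing the role of 'invoke' or dot-paren.  The names are hypothetical.

```java
import java.util.function.IntUnaryOperator;

class NameClash {
    IntUnaryOperator f = x -> x + 100;  // function-valued field named f
    int f(int x) { return x; }          // method named f

    int callMethod() { return f(0); }             // resolves to the method
    int callField()  { return f.applyAsInt(0); }  // explicit call on the field
}
```

Because invoking the field always needs the extra member access, 'f(0)' is never ambiguous between the two declarations.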

> Specifying that every function type has an invoke() method is
> plausible but concision is moderately important for lambdas, so it's
> not my preferred option.

(BTW, there should be a reflectable name for function invocation, if possible, so the usual f.getClass().getMethod() boilerplate can work.)

> Because 'this' has function type, it may be used as a receiver:
>  #int(int) factorial =  #(int i)(i == 0 ? 1 : i * this!(i - 1));

Even after getting rid of the 'this' overload, we can still write something even more readable:
 #int(int) factorial =  #(int i)(i == 0 ? 1 : i * factorial.(i - 1));
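The same idea as a runnable sketch in the lambda syntax Java later adopted, with IntUnaryOperator standing in for #int(int).  The names are hypothetical.

```java
import java.util.function.IntUnaryOperator;

class Recursion {
    // The closure recurses through the name it is bound to.  The
    // qualified form Recursion.FACTORIAL is used because a field's
    // initializer may not refer to the field by its simple name.
    static final IntUnaryOperator FACTORIAL =
            i -> i == 0 ? 1 : i * Recursion.FACTORIAL.applyAsInt(i - 1);
}
```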

> [8.4.1 Formal Parameters]
> 
> [[Replace all occurrences of "method" with "method or lambda
> expression". This sets up the scope of a lambda expression's formal
> parameters correctly, and allows variadic lambda expressions.

+1 for ellipsis in functions; varargs helps with combinator building

The variadics are nice, and I think are worth fighting for, but the types are tricky.  Is the ellipsis '...' part of the function type, or is it erased?

What's the relation between #void(Object...) and #void(Object[])?  #void(Object...) and #void(Object,Object)?  #void(Object...) and #void(Object)?

I suggest that conversions between varargs and non-varargs be allowed but that there be no subtyping between the cases above.  Cf. SAM conversion.

-1 for erasure of ellipsis or any other aspect of function types, except perhaps throws, which are checkable only statically.
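One way to picture "conversion without subtyping" is the collector adapter in the java.lang.invoke API as it eventually shipped (an assumption relative to this draft).  The array form and the two-argument form below have different types, and the second is derived from the first by an explicit conversion.  The class and method names are hypothetical.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

class VarargsConversion {
    static int count(Object[] args) { return args.length; }

    // Type (Object[])int: takes the array form.
    static MethodHandle arrayForm() {
        try {
            return MethodHandles.lookup().findStatic(
                    VarargsConversion.class, "count",
                    MethodType.methodType(int.class, Object[].class));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Type (Object,Object)int: an adapter that collects two positional
    // arguments into an array before calling the array form.
    static MethodHandle twoArgForm() {
        return arrayForm().asCollector(Object[].class, 2);
    }

    static int callTwo(Object a, Object b) {
        try {
            return (int) twoArgForm().invokeExact(a, b);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }
}
```

Neither type is a subtype of the other; the adapter is a new function object, much as a SAM conversion would be.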

> * Types, Values, and Variables
> 
> [4.3 Reference Types and Values]
> 
> There are *four* kinds of reference types: class types, interface
> types, array types, *and function types*. *Class types, interface
> types, and array types* may be parameterized with type arguments.

+1 for disjointness.  This implies that a class cannot extend or implement a function type.

At runtime, getClass on a function will have to return something, but (as noted above) there should probably be a different way to reflect on the (unerased!) function type.

> FunctionType:
>  '#' ResultType '(' [Type] ')' FunctionThrows_opt

(No ellipsis here, yet, but this is where it would count!)

> A function type that is void (i.e. has no return type) and has formal
> parameter types P1..Pn is a supertype of a function type #T(S1..Sn)
> iff Si is a supertype of Pi (i in 1..n).

As Neal points out, this conversion, as well as non-bitwise conversions of primitive types, will require some implementation complexity.  Some of these conversions are better understood, I think, as assignment conversions (or even casts) akin to boxing, and not as subtyping.  Subtyping implies (I think) that reference identity does not change during a conversion that uses only subtype relations.  In any implementation (with inner classes or method handles) there will be some conversions that require wrapper creation, and hence will not work well as subtyping relations.

This means that subtyping relations defined at the language level should be relatively sparse, and some useful conversions should be supplied by other means.  But, AFAIK, both inner classes and method handles could be tweaked to support return value elision as a subtyping move.  I don't think this is true of non-bitwise primitive conversions like int-to-float, unless there is some special JVM support.

> * Conversions

I suggest a checked-cast conversion which goes beyond the subtyping rules.  Although it's not statically correct, it is still useful to make conversions like #Integer(Integer) to #Object(Object).  The JVM provides ways to make this safe (an explicit cast) and generics even force the use of such casts at times.

A checked-cast conversion would (if necessary) create a new function object wrapping the old one, which would perform the necessary runtime casts on the fly.  In JSR 292 code, we need to do this sort of thing all the time, so that algorithms can be written in terms of erased types, and then applied to functional values of more specific types.

Since this conversion would imply casts being applied to arguments or return values, it should have a syntax at least as explicit as a cast (from a function expression to a function type).
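A sketch of such a wrapping conversion, using MethodHandle.asType from the java.lang.invoke API as it eventually shipped (an assumption relative to this draft).  It adapts a handle of type (Integer)Integer to (Object)Object, inserting a runtime cast on the argument.  The class and method names are hypothetical.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

class CheckedCast {
    static Integer negate(Integer x) { return -x; }

    // Adapts (Integer)Integer to the erased type (Object)Object.  The
    // adapter casts its argument to Integer on every call.
    static MethodHandle erased() {
        try {
            MethodHandle specific = MethodHandles.lookup().findStatic(
                    CheckedCast.class, "negate",
                    MethodType.methodType(Integer.class, Integer.class));
            return specific.asType(
                    MethodType.methodType(Object.class, Object.class));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Calls the erased handle; a wrongly typed argument surfaces as a
    // ClassCastException at call time rather than at conversion time.
    static Object callErased(Object arg) {
        try {
            return erased().invoke(arg);
        } catch (RuntimeException | Error e) {
            throw e;
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }
}
```

The conversion itself always succeeds; the check is deferred to each invocation, which is why a syntax as explicit as a cast seems appropriate.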

-- John