Pattern features for next iteration
Brian Goetz
brian.goetz at oracle.com
Tue Jan 19 19:59:00 UTC 2021
We're zeroing in on a feature list for the next iteration of pattern
matching, which includes:
- New kinds of patterns
- Array patterns
- Record deconstruction patterns, including varargs records
- maybe: guard patterns (true(e), false(e))
- maybe: AND patterns (P & Q), syntax TBD
- New place for patterns: switch
- Extend switch to support more types (all types?)
- Extend switch on non-legacy types to support patterns as case labels
- New nullity behavior (including case null)
- New scoping + fallthrough
- Refined exhaustiveness analysis
This document is an overview; the immediate goal is coming to consensus
that this is the set of features we should be targeting next. (Please,
no deep-dives into details of, say, translation until this is locked down.)
#### Array patterns
An array pattern has the form
T[] { CommaSeparatedListOfPatternsWithOptionalTrailingDots }
OptionalIdentifier
The target type of an array pattern is T[], and it matches if:
- the target is non-null
- the target is an instance of T[]
- the target has a length equal to the arity of the nested
pattern list, or, if the nested pattern list ends with a `...`, a length
greater than or equal to the arity of the nested pattern list
- for each nested pattern P_i, the i'th element of the array matches P_i
It binds the union of the bindings of the nested patterns. Like
deconstruction patterns, it may optionally bind a name to the target,
cast to the right type, if a name is provided:
case T[] { P, Q } ts:
For multi-dimensional arrays, we interpret the pattern
T[][] { P, Q }
as matching P and Q to the first and second elements of the outermost array:
case T[][] { T[] a, T[] b }
or
case T[][] { T[] { var a, var b }, T[] { var c, ... } }
Note that `T[] ts` is an ordinary type pattern, not an array pattern.
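To make the semantics concrete, here is a hand-written equivalent in
today's Java of matching a target against `String[] { var a, var b }`
(a sketch; the method name and element type are illustrative only):

    static boolean matchesPair(Object o) {
        // dynamic type test: target must be a non-null instance of String[]
        if (!(o instanceof String[]))
            return false;
        String[] ts = (String[]) o;
        // with no trailing `...`, length must equal the pattern-list arity
        if (ts.length != 2)
            return false;
        // nested `var` patterns are total, so they always match
        String a = ts[0], b = ts[1];
        return true;
    }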
#### Record patterns
A record pattern has the form
TypeName( CommaSeparatedListOfPatternsWithOptionalTrailingDots )
OptionalIdentifier
The type name must be a record class. It matches if:
- the target is non-null
- the target is an instance of the specified record type
- each of the nested patterns matches the corresponding component of
the record, as served by the accessor method
The binding list is the union of the binding lists of the nested
patterns; if the optional identifier is specified, it is bound to the
target, cast to TypeName.
The treatment of null components is as per the matching rules for the
nested pattern; for type patterns, if the type is total on (a supertype
of) the type of the component, it is presumed to match, otherwise a
dynamic (instanceof) check is performed.
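For illustration, here is a nested record pattern in the provisional
syntax (the record declarations are hypothetical):

    record Point(int x, int y) {}
    record Line(Point start, Point end) {}

    if (obj instanceof Line(Point(var x, var y), var end) line) {
        // bindings: x and y (from the nested pattern), end, and
        // line (the target, cast to Line)
    }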
#### Guard patterns
A guard pattern has the form true(BooleanExpr) or false(BooleanExpr).
It matches if the boolean expression is true or false, respectively. It
is applicable to any target type and introduces no bindings.
#### AND patterns
Two patterns P and Q can be combined into an AND pattern, denoted P&Q
(syntax provisional.) P and Q must both be applicable to the target
type for P&Q to be applicable; the binding list is the union of the two
binding lists.
If P does not match the target, no attempt is made to match Q.
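For example (provisional syntax), an AND pattern can combine a record
pattern with a guard; if the record pattern fails to match, the guard
expression is never evaluated:

    case Point(var x, var y) & true(x >= 0 && y >= 0): ...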
#### Translation
I'll write VARS(P) to describe the binding variables (name and type) of
a pattern P, and TR(target, P) to describe the compiler-tree desugaring
of a pattern match expression. When translating a pattern, its bindings
are hoisted out to a suitable scope. Javac has a "LetExpr" tree type,
`LetExpr{S*} in e`, which means we execute statements S* and evaluate
expression `e`, and its result is the result of evaluating `e`. union_i
is the union over a set indexed by i; product_i is the &&-product over
the set. |S| is the cardinality of a set.
Type patterns:
VARS(T t) = { T t }
TR(target, T t) = true // for statically total type patterns,
otherwise:
TR(target, T t) = LetExpr {
boolean b = target instanceof T;
if (b) t = (T) target;
} in b
Record patterns:
VARS(R(P*)) = union_i VARS(P_i)
TR(target, R(P*)) = LetExpr {
boolean b = target instanceof R;
R res = b ? (R) target : null;
} in (b && product_i TR(res.comp_i(), P_i))
Array patterns:
VARS(T[] { P* }) = union_i VARS(P_i)
TR(target, T[] { P* }) = LetExpr {
boolean b = target instanceof T[];
T[] res = b ? (T[]) target : null;
} in (b && res.length == |P*| && product_i
TR(res[i], P_i))
TR(target, T[] { P*, ... }) = LetExpr {
boolean b = target instanceof T[];
T[] res = b ? (T[]) target : null;
} in (b && res.length >= |P*| && product_i
TR(res[i], P_i))
Guard patterns:
VARS(true(e)) = { }
VARS(false(e)) = { }
TR(target, true(e)) = e
TR(target, false(e)) = !e
AND patterns:
VARS(P&Q) = VARS(P) union VARS(Q)
TR(target, P&Q) = TR(target, P) && TR(target, Q)
The above represents an unrolling of nested patterns, where t~P(Q) is
unrolled into t~P(a) && a~Q (where t~P means "P matches t").
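As a concrete sketch of this unrolling in today's Java, `target
instanceof Foo(Bar(var x))` would desugar roughly as below (the record
declarations are hypothetical, and the LetExprs are rendered as
ordinary statements):

    record Bar(int x) {}
    record Foo(Object payload) {}

    static boolean match(Object target) {
        // t~Foo(a): test and cast the target
        boolean b1 = target instanceof Foo;
        Foo f = b1 ? (Foo) target : null;
        if (!b1)
            return false;
        // a~Bar(var x): match the component against the nested pattern
        Object a = f.payload();
        boolean b2 = a instanceof Bar;
        Bar bar = b2 ? (Bar) a : null;
        if (!b2)
            return false;
        int x = bar.x();  // binding of the nested var pattern
        return true;      // x is well-defined whenever we return true
    }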
#### Binary compatibility
In the future, it will be possible to explicitly declare deconstruction
patterns in records. In order to make adding an explicit canonical
deconstruction pattern a binary-compatible operation, by the time we
exit preview, some of this logic should be funneled through a
condy-based pattern runtime so that old binaries that ask for the
canonical deconstruction pattern for a record will get a reasonable
default if no explicit one exists, or the one in the class file if one
does. This entails replacing `res.comp_i()` in the translation with
calls to the pattern runtime.
## Patterns in switch
Introducing patterns into case labels creates another round of
adjustments for switch, and probably some new seams between old and new
functionality.
#### More types
Switch is currently limited to integral primitive types and their boxes,
strings, and enums. We'll call these types "legacy switch types", and
call switches on legacy switch types, all of whose case labels are
"old-style" case labels, "legacy switches". (I'll develop the list of
candidate restrictions on legacy switches as I go; ideally this list
should be empty, but it probably won't be.)
Pattern matching surely lets us switch over any reference type. It is an
open question whether we should fill out the three remaining primitive
types (boolean, float, double.) On the one hand, it seems oddly
incomplete to be able to switch over all but three types, while on the
other, switching over floats seems pretty niche, and we already have a
statement for switch over booleans (`if`).
Assuming we decide for completeness, there are two possible
interpretations of a switch on float: as a legacy switch or as a
pattern switch.
Note that for nested patterns, we need to have type patterns for all the
primitive types, so we can match against `Foo(boolean b)` or `Bar(float
f)`, even if we don't allow switching on these types.
#### Patterns as case labels
The main change here, of course, is allowing patterns as case labels.
The specified pattern must be applicable to the static type of the
switch operand.
One question we need to answer is how to integrate legacy case labels
into the story. There are basically three options here:
- Retcon them. In this approach, we define a constant literal to be a
constant pattern. Then, all switches can be interpreted as pattern
switches.
- Tolerated legacy. With `instanceof`, rather than defining a bare
type to be a pattern, we defined two forms of the `instanceof`
production, a legacy one (`instanceof T`) and a pattern one. We can do
the same with case labels, and possibly then say that legacy case labels
are only allowed in legacy switches.
- Leave them to rot. In this approach, legacy case labels are allowed
on legacy switches only.
The difference between the first option (retcon them as constant
patterns) and the other two is whether literals are valid as patterns
in contexts other than `case` labels (instanceof, nested patterns.) I
am steadily converging on the
conclusion that constant patterns are an "obvious but wrong" idea, more
"cute and clever" than sound language design.
Note that with a sensible guard feature, constant patterns become a pure
syntactic shortcut; `Foo(0)` can be expressed as `Foo(var x) & true(x ==
0)` in any pattern context. That's not to say that the benefit of the
shortcut is zero, but there's also a cost: further user confusion as to
whether `id(0)` is an invocation of a method with a constant, or
matching a pattern with a nested constant pattern. (If we decide we
really want constant patterns because the nestability leads to better
readability, I prefer we find another way to denote them, such as `== 0`
or `const(0)` or such.)
So, let's assume that we're going to be somewhere on the spectrum
between "tolerate" and "rot". There are two questions to answer here:
- In switches on legacy types, do we allow a mix of constant cases and
pattern cases? (Note that this only makes a difference when we have
more interesting patterns that can apply to these types, which mostly
means waiting for declared patterns.)
- In non-legacy-switches, do we allow constant cases at all?
If we say "yes" to either of these, then essentially we treat any
pattern with a mix of these as a pattern switch, and lower `case C` to
`case TargetType t & true(t.equals(C))` or similar.
#### Nullity
We've gone through an extensive discussion (actually, many extensive
discussions) on nullity in switch. It took several rounds to get over
the "why can't we just always throw on null", but I hope we're there
now. It seems that the sensible strategy is:
- At the top level of a switch, only total type patterns and the new
`case null` accept null. (`default` does not accept null, but you can
fall out of a `case null` into a `default`.) If a switch has no
null-accepting patterns, then the switch throws NPE preemptively
(simulating current behavior), as if there were an invisible trailing
(possibly dead) `case null: NPE` in all switches.
This strategy is fully backward compatible; anything that throws today,
will throw tomorrow. Total patterns must come last, so only the last
pattern, or an explicit `case null`, will see a null.
For nested patterns, we've already defined the semantics of pattern
matching such that pattern matching never NPEs, and never introduces any
failures to match null that don't derive from an underlying dynamic type
test. The problem is solely with expectations having to do with the top
level of switch, based on historical preemptive null checks.
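A sketch of the resulting top-level behavior (provisional syntax):

    switch (o) {
        case null: ...       // only an explicit case null (or a total
                             // pattern) sees a null target
        case String s: ...   // non-total pattern: never sees null
        case Object obj: ... // total pattern: would accept null if the
                             // explicit case null were absent
    }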
#### Scoping and fallthrough
Having decided to backtrack on merging binding variables in instanceof
(`x instanceof Foo(var a) || x instanceof Bar(var a)`), the
corresponding backtracking is to restrict fallthrough such that you
cannot fall into, or out of, a case clause that has bindings.
This may require some adjustment in the scoping of binding variables,
which is probably not yet fully defined for switches, to ensure that
their scope ends at the next case declaration, so that we don't have a
problem with reusing names:
case Foo(var x): break;
case Bar(var x): // OK to use `x` here
It is OK to fall out of a case without bindings into a case without
bindings, such as falling out of `case null` into `default`.
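So, for example, the following would be disallowed, since control would
fall into a case with bindings:

    case Foo(var x):
        A;               // no break: falls through
    case Bar(var y):     // error: cannot fall into a case with bindings
        B;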
#### Exhaustiveness
There's a definition of a set of patterns being total on a target type,
outlined here:
https://github.com/openjdk/amber-docs/blob/master/site/design-notes/type-patterns-in-switch.md
The main new feature here is allowing pattern sets that are
(optimistically) total on the target to be considered total in a switch
expression, and inserting the appropriate throwing catch-all to handle
the remainder.
Secondarily, we still have a question to confront about how to make
statement switches total like expression switches, so that users can opt
into stricter type checking.
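For example (a sketch, assuming a sealed hierarchy as in the linked
design note):

    sealed interface Shape permits Circle, Square {}
    record Circle(double radius) implements Shape {}
    record Square(double side) implements Shape {}

    double area = switch (shape) {
        case Circle c -> Math.PI * c.radius() * c.radius();
        case Square s -> s.side() * s.side();
        // optimistically total: the compiler inserts a throwing
        // catch-all to cover the remainder (e.g., a novel Shape
        // subtype appearing at runtime)
    };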
#### Translation considerations
Legacy switches on primitives can continue to get translated as they are
now.
Pattern switches are lowered to densely numbered int switches, and we
use an indy-based classifier to select the candidate case. The
bootstrap takes as its static argument list a list of Class objects (or
null), and as its dynamic arguments, the switch target and the case
number to start at. For example, we translate:
switch (o) {
case Telegram t: A
case Integer i & true(i > 0): B
case Integer i: C
case Foo(Bar(var x)): D
case "moose": ...
}
as (pseudocode):
int index = -1;
LOOP:
while (true) {
index = indy[bootstrap=SwitchBootstrap[Telegram.class,
Integer.class, Integer.class, Foo.class, String.class]](o, index);
switch (index) {
case 0: A
case 1:
if (!(i > 0))
continue LOOP;
B;
case 2: C;
case 3:
if (!(fooBinding1 instanceof Bar(var x)))
continue LOOP;
D
case 4:
if (!(o.equals("moose")))
continue LOOP;
}
break LOOP;
}
The classifier builds a decision tree based on types, but does not
include nested patterns or guards -- these are added by the desugaring.
If there is anything that causes us to enter a case and then later
decide "whoops", we break out of the switch, and re-run the classifier,
starting at the last index we tried. (Earlier experiments suggested this
was the sweet spot; while we can accurately model patterns and guards as
constant bundles of method handles, the complexity is high and the
benefit is not so high.)
For record/array patterns, we use the classifier to select the type, and
then unroll any nested patterns into the switch body. For ANDed
patterns, we classify on the type of the first pattern and unroll the
rest into the switch body.
For legacy switches on enums and strings, we have the option of
replacing the current desugaring with an indy-based classification as we
do with pattern switches, to move complex desugaring code from the
compiler to simpler runtime code. We can do this now, later, or never.
#### Open issues
We have a few open issues to sync on, though almost all can probably be
kicked down the road past first preview:
- Boundaries between legacy switches and pattern switches
- allow mixing of constant and pattern cases?
- allow switch on float/double/boolean with constant labels?
- Allow statement switches to opt into totality? How?