Pattern features for next iteration
Brian Goetz
brian.goetz at oracle.com
Tue Jan 19 19:59:00 UTC 2021
We're zeroing in on a feature list for the next iteration of pattern
matching, which includes:
- New kinds of patterns
- Array patterns
- Record deconstruction patterns, including varargs records
- maybe: guard patterns (true(e), false(e))
- maybe: AND patterns (P & Q), syntax TBD
- New place for patterns: switch
- Extend switch to support more types (all types?)
- Extend switch on non-legacy types to support patterns as case labels
- New nullity behavior (including case null)
- New scoping + fallthrough
- Refined exhaustiveness analysis
This document is an overview; the immediate goal is coming to consensus
that this is the set of features we should be targeting next. (Please,
no deep-dives into details of, say, translation until this is locked down.)
#### Array patterns
An array pattern has the form
T[] { CommaSeparatedListOfPatternsWithOptionalTrailingDots }
OptionalIdentifier
The target type of an array pattern is T[], and it matches if:
- the target is non-null
- the target is an instance of T[]
- the target has a length equal to the arity of the nested
pattern list, or, if the nested pattern list ends with a `...`, a length
greater than or equal to the arity of the nested pattern list
- for each nested pattern P_i, the i'th element of the array matches P_i
It binds the union of the bindings of the nested patterns. Like
deconstruction patterns, it may optionally bind a name to the target,
cast to the right type, if a name is provided:
case T[] { P, Q } ts:
For multi-dimensional arrays, we interpret the pattern
T[][] { P, Q }
as matching P and Q to the first and second elements of the outermost array:
case T[][] { T[] a, T[] b }
or
case T[][] { T[] { var a, var b }, T[] { var c, ... } }
Note that `T[] ts` is an ordinary type pattern, not an array pattern.
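To make the semantics concrete, here is a hand-written equivalent in
today's Java of matching a target against `String[] { var a, var b }`
(a sketch; the method name and element type are illustrative only):

    static boolean matchesPair(Object o) {
        // dynamic type test: target must be a non-null instance of String[]
        if (!(o instanceof String[]))
            return false;
        String[] ts = (String[]) o;
        // with no trailing `...`, length must equal the pattern-list arity
        if (ts.length != 2)
            return false;
        // nested `var` patterns are total, so they always match
        String a = ts[0], b = ts[1];
        return true;
    }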
#### Record patterns
A record pattern has the form
TypeName( CommaSeparatedListOfPatternsWithOptionalTrailingDots )
OptionalIdentifier
The type name must be a record class. It matches if:
- the target is non-null
- the target is an instance of the specified record type
- each of the nested patterns matches the corresponding component of
the record, as served by the accessor method
The binding list is the union of the binding lists of the nested
patterns; if the optional identifier is specified, it is bound to the
target, cast to TypeName.
The treatment of null components is as per the matching rules for the
nested pattern; for type patterns, if the type is total on (a supertype
of) the type of the component, it is presumed to match, otherwise a
dynamic (instanceof) check is performed.
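For illustration, here is a nested record pattern in the provisional
syntax (the record declarations are hypothetical):

    record Point(int x, int y) {}
    record Line(Point start, Point end) {}

    if (obj instanceof Line(Point(var x, var y), var end) line) {
        // bindings: x and y (from the nested pattern), end, and
        // line (the target, cast to Line)
    }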
#### Guard patterns
A guard pattern has the form true(BooleanExpr) or false(BooleanExpr).
It matches if the boolean expression is true or false, respectively. It
is applicable to any target type and introduces no bindings.
#### AND patterns
Two patterns P and Q can be combined into an AND pattern, denoted P&Q
(syntax provisional.) P and Q must both be applicable to the target
type for P&Q to be applicable; the binding list is the union of the two
binding lists.
If P does not match the target, no attempt is made to match Q.
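For example (provisional syntax), an AND pattern can combine a record
pattern with a guard; if the record pattern fails to match, the guard
expression is never evaluated:

    case Point(var x, var y) & true(x >= 0 && y >= 0): ...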
#### Translation
I'll write VARS(P) to describe the binding variables (name and type) of
a pattern P, and TR(target, P) to describe the compiler-tree desugaring
of a pattern match expression. When translating a pattern, its bindings
are hoisted out to a suitable scope. Javac has a "LetExpr" tree type,
`LetExpr{S*} in e`, which means we execute statements S* and evaluate
expression `e`, and its result is the result of evaluating `e`. union_i
is the union over a set indexed by i; product_i is the &&-product over
the set. |S| is the cardinality of a set.
Type patterns:
VARS(T t) = { T t }
TR(target, T t) = true // for statically total type patterns,
otherwise:
TR(target, T t) = LetExpr {
boolean b = target instanceof T;
if (b) t = (T) target;
} in b
Record patterns:
VARS(R(P*)) = union_i VARS(P_i)
TR(target, R(P*)) = LetExpr {
boolean b = target instanceof R;
R res = b ? (R) target : null;
} in (b && product_i TR(res.comp_i(), P_i))
Array patterns:
VARS(T[] { P* }) = union_i VARS(P_i)
TR(target, T[] { P* }) = LetExpr {
boolean b = target instanceof T[];
T[] res = b ? (T[]) target : null;
} in (b && res.length == |P*| && product_i
TR(res[i], P_i))
TR(target, T[] { P*, ... }) = LetExpr {
boolean b = target instanceof T[];
T[] res = b ? (T[]) target : null;
} in (b && res.length >= |P*| && product_i
TR(res[i], P_i))
Guard patterns:
VARS(true(e)) = { }
VARS(false(e)) = { }
TR(target, true(e)) = e
TR(target, false(e)) = !e
AND patterns:
VARS(P&Q) = VARS(P) union VARS(Q)
TR(target, P&Q) = TR(target, P) && TR(target, Q)
The above represents an unrolling of nested patterns, where t~P(Q) is
unrolled into t~P(a) && a~Q (where t~P means "P matches t").
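As a concrete sketch of this unrolling in today's Java, `target
instanceof Foo(Bar(var x))` would desugar roughly as below (the record
declarations are hypothetical, and the LetExprs are rendered as
ordinary statements):

    record Bar(int x) {}
    record Foo(Object payload) {}

    static boolean match(Object target) {
        // t~Foo(a): test and cast the target
        boolean b1 = target instanceof Foo;
        Foo f = b1 ? (Foo) target : null;
        if (!b1)
            return false;
        // a~Bar(var x): match the component against the nested pattern
        Object a = f.payload();
        boolean b2 = a instanceof Bar;
        Bar bar = b2 ? (Bar) a : null;
        if (!b2)
            return false;
        int x = bar.x();  // binding of the nested var pattern
        return true;      // x is well-defined whenever we return true
    }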
#### Binary compatibility
In the future, it will be possible to explicitly declare deconstruction
patterns in records. In order to make adding an explicit canonical
deconstruction pattern a binary-compatible operation, by the time we
exit preview, some of this logic should be funneled through a
condy-based pattern runtime so that old binaries that ask for the
canonical deconstruction pattern for a record will get a reasonable
default if no explicit one exists, or the one in the class file if one
does. This entails replacing `res.comp_i()` in the translation with
calls to the pattern runtime.
## Patterns in switch
Introducing patterns into case labels creates another round of
adjustments for switch, and probably some new seams between old and new
functionality.
#### More types
Switch is currently limited to integral primitive types and their boxes,
strings, and enums. We'll call these types "legacy switch types", and
call switches on legacy switch types, all of whose case labels are
"old-style" case labels, "legacy switches". (I'll develop the list of
candidate restrictions on legacy switches as I go; ideally this list
should be empty, but it probably won't be.)
Pattern matching surely lets us switch over any reference type. It is an
open question whether we should fill out the three remaining primitive
types (boolean, float, double.) On the one hand, it seems oddly
incomplete to be able to switch over all but three types, while on the
other, switching over floats seems pretty niche, and we already have a
statement for switch over booleans (`if`).
Assuming we decide for completeness, there are two possible
interpretations of a switch on float: as a legacy switch or as a
pattern switch.
Note that for nested patterns, we need to have type patterns for all the
primitive types, so we can match against `Foo(boolean b)` or `Bar(float
f)`, even if we don't allow switching on these types.
#### Patterns as case labels
The main change here, of course, is allowing patterns as case labels.
The specified pattern must be applicable to the static type of the
switch operand.
One question we need to answer is how to integrate legacy case labels
into the story. There are basically three options here:
- Retcon them. In this approach, we define a constant literal to be a
constant pattern. Then, all switches can be interpreted as pattern
switches.
- Tolerated legacy. With `instanceof`, rather than defining a bare
type to be a pattern, we defined two forms of the `instanceof`
production, a legacy one (`instanceof T`) and a pattern one. We can do
the same with case labels, and possibly then say that legacy case labels
are only allowed in legacy switches.
- Leave them to rot. In this approach, legacy case labels are allowed
on legacy switches only.
The difference between the first option (retcon them as constant
patterns) and the other two is whether literals are valid as patterns
in contexts other than `case` labels (instanceof, nested patterns.) I
am steadily converging on the
conclusion that constant patterns are an "obvious but wrong" idea, more
"cute and clever" than sound language design.
Note that with a sensible guard feature, constant patterns become a pure
syntactic shortcut; `Foo(0)` can be expressed as `Foo(var x) & true(x ==
0)` in any pattern context. That's not to say that the benefit of the
shortcut is zero, but there's also a cost: further user confusion as to
whether `id(0)` is an invocation of a method with a constant, or
matching a pattern with a nested constant pattern. (If we decide we
really want constant patterns because the nestability leads to better
readability, I prefer we find another way to denote them, such as `== 0`
or `const(0)` or such.)
So, let's assume that we're going to be somewhere on the spectrum
between "tolerate" and "rot". There are two questions to answer here:
- In switches on legacy types, do we allow a mix of constant cases and
pattern cases? (Note that this only makes a difference when we have
more interesting patterns that can apply to these types, which mostly
means waiting for declared patterns.)
- In non-legacy-switches, do we allow constant cases at all?
If we say "yes" to either of these, then essentially we treat any
pattern with a mix of these as a pattern switch, and lower `case C` to
`case TargetType t & true(t.equals(C))` or similar.
#### Nullity
We've gone through an extensive discussion (actually, many extensive
discussions) on nullity in switch. It took several rounds to get over
the "why can't we just always throw on null", but I hope we're there
now. It seems that the sensible strategy is:
- At the top level of a switch, only total type patterns and the new
`case null` accept null. (`default` does not accept null, but you can
fall out of a `case null` into a `default`.) If a switch has no
null-accepting patterns, then the switch throws NPE preemptively
(simulating current behavior), as if there were an invisible trailing
(possibly dead) `case null: NPE` in all switches.
This strategy is fully backward compatible; anything that throws today,
will throw tomorrow. Total patterns must come last, so only the last
pattern, or an explicit `case null`, will see a null.
For nested patterns, we've already defined the semantics of pattern
matching such that pattern matching never NPEs, and never introduces any
failures to match null that don't derive from an underlying dynamic type
test. The problem is solely with expectations having to do with the top
level of switch, based on historical preemptive null checks.
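A sketch of the resulting top-level behavior (provisional syntax):

    switch (o) {
        case null: ...       // only an explicit case null (or a total
                             // pattern) sees a null target
        case String s: ...   // non-total pattern: never sees null
        case Object obj: ... // total pattern: would accept null if the
                             // explicit case null were absent
    }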
#### Scoping and fallthrough
Having decided to backtrack on merging binding variables in instanceof
(`x instanceof Foo(var a) || x instanceof Bar(var a)`), the
corresponding backtracking is to restrict fallthrough such that you
cannot fall into, or out of, a case clause that has bindings.
This may require some adjustment in the scoping of binding variables,
which is probably not yet fully defined for switches, to ensure that
their scope ends at the next case declaration, so that we don't have a
problem with reusing names:
case Foo(var x): break;
case Bar(var x): // OK to use `x` here
It is OK to fall out of a case without bindings into a case without
bindings, such as falling out of `case null` into `default`.
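So, for example, the following would be disallowed, since control would
fall into a case with bindings:

    case Foo(var x):
        A;               // no break: falls through
    case Bar(var y):     // error: cannot fall into a case with bindings
        B;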
#### Exhaustiveness
There's a definition of a set of patterns being total on a target type,
outlined here:
https://github.com/openjdk/amber-docs/blob/master/site/design-notes/type-patterns-in-switch.md
The main new feature here is allowing pattern sets that are
(optimistically) total on the target to be considered total in a switch
expression, and inserting the appropriate throwing catch-all to handle
the remainder.
Secondarily, we still have a question to confront about how to make
statement switches total like expression switches, so that users can opt
into stricter type checking.
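For example (a sketch, assuming a sealed hierarchy as in the linked
design note):

    sealed interface Shape permits Circle, Square {}
    record Circle(double radius) implements Shape {}
    record Square(double side) implements Shape {}

    double area = switch (shape) {
        case Circle c -> Math.PI * c.radius() * c.radius();
        case Square s -> s.side() * s.side();
        // optimistically total: the compiler inserts a throwing
        // catch-all to cover the remainder (e.g., a novel Shape
        // subtype appearing at runtime)
    };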
#### Translation considerations
Legacy switches on primitives can continue to get translated as they are
now.
Pattern switches are lowered to densely numbered int switches, and we
use an indy-based classifier to select the candidate case. The
bootstrap takes as its static argument list a list of Class objects (or
null), and as its dynamic arguments, the switch target and the case
number to start at. For example, we translate:
switch (o) {
case Telegram t: A
case Integer i & true(i > 0): B
case Integer i: C
case Foo(Bar(var x)): D
case "moose": ...
}
as (pseudocode):
int index = -1;
LOOP:
while (true) {
index = indy[bootstrap=SwitchBootstrap[Telegram.class,
Integer.class, Integer.class, Foo.class, String.class]](o, index);
switch (index) {
case 0: A
case 1:
if (!(i > 0))
continue LOOP;
B;
case 2: C;
case 3:
if (!(fooBinding1 instanceof Bar(var x)))
continue LOOP;
D
case 4:
if (!(o.equals("moose")))
continue LOOP;
}
break LOOP;
}
The classifier builds a decision tree based on types, but does not
include nested patterns or guards -- these are added by the desugaring.
If there is anything that causes us to enter a case and then later
decide "whoops", we break out of the switch, and re-run the classifier,
starting at the last index we tried. (Earlier experiments suggested this
was the sweet spot; while we can accurately model patterns and guards as
constant bundles of method handles, the complexity is high and the
benefit is not so high.)
For record/array patterns, we use the classifier to select the type, and
then unroll any nested patterns into the switch body. For ANDed
patterns, we classify on the type of the first pattern and unroll the
rest into the switch body.
For legacy switches on enums and strings, we have the option of
replacing the current desugaring with an indy-based classification as we
do with pattern switches, to move complex desugaring code from the
compiler to simpler runtime code. We can do this now, later, or never.
#### Open issues
We have a few open issues to sync on, though almost all can probably be
kicked down the road past first preview:
- Boundaries between legacy switches and pattern switches
- allow mixing of constant and pattern cases?
- allow switch on float/double/boolean with constant labels?
- Allow statement switches to opt into totality? How?