Primitive type patterns and conversions
John Rose
john.r.rose at oracle.com
Thu Mar 4 08:43:26 UTC 2021
On Mar 1, 2021, at 2:04 PM, Brian Goetz <brian.goetz at oracle.com> wrote:
>
> Right now, we've spent almost all our time on patterns whose targets are reference types (type patterns, record patterns, array patterns, deconstruction patterns). It's getting to be time to nail down (a) the semantics of primitive type patterns, and (b) the conversions-and-contexts (JLS 5) rules. And, because we're on the cusp of the transition to Valhalla, we must be mindful of both the current set of primitive conversions, and the more general object model as it will apply to primitive classes.
OK, good news and bad news: On the plus side, this generally
makes sense and seems useful and powerful. On the minus side,
I think you are listening to the wrong siren.
When we did JSR 292, we had to decide whether to embrace the
primitive conversions (in combinators like asType), or whether
to keep our distance. For MH’s the gravitational pull came from
the language and reflection. The language freely inserts widening
primitive conversions. The reflective APIs also do this. It turns
out that aligning MH conversions with those precedents was
a little tricky, but once the details were nailed down, the user
experience was probably better; you could refactor between
MH’s and either hand-written code or reflective code with
fewer surprises (in the handling of primitive conversions).
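For concreteness, here is a minimal sketch of the asType behavior being described (the `square` method and class name are my own invention, but `MethodHandle.asType` and its willingness to apply the JLS widening conversions are the real API):

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class AsTypeWidening {
    static int square(int x) { return x * x; }

    public static void main(String[] args) throws Throwable {
        MethodHandle mh = MethodHandles.lookup().findStatic(
                AsTypeWidening.class, "square",
                MethodType.methodType(int.class, int.class));

        // asType freely inserts the same widening primitive conversions
        // the language does: the (int)int handle is viewed as (byte)long,
        // widening the byte argument on the way in and the int result
        // on the way out.
        MethodHandle widened = mh.asType(
                MethodType.methodType(long.class, byte.class));

        long r = (long) widened.invokeExact((byte) 7);
        System.out.println("r = " + r);
    }
}
```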
> If we focus on type patterns alone, let's bear in mind that primitive type patterns are not nearly as powerful as other type patterns, because (under the current rules) primitives are "islands" in the type system -- no supertypes, no subtypes.
The new question is, how many islands? If there are fewer islands,
then bytes live in part of Int Island, for example. If there are more,
then we have a different “42” on each island (except Boolean Island).
The JLS encourages us to think there are just integral and floating
mathematical values, which happen to fit in variously sized boxes.
It does this by asserting baldly that byte <: short <: …, for some
meaning of “<:” which closely approximates set inclusion.
So I think there are 3-4 islands: certainly boolean, certainly double,
probably long (because it has values not in double), maybe char
(because it is not fully numeric). Or maybe the set long+double
is an island, which happens not to be a denotable type.
The JLS doesn’t give us grounds to distinguish two-word from
one-word primitives, but if it did, int and float would also be islands.
As it is, I think they just include seamlessly into long and double.
The inclusion mapping is compiled to an i2l or f2d bytecode.
It’s not a bug that the same source value can have multiple
compiled representations.
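A small illustration of those island boundaries (my own example, not part of the original message): the int-to-long and float-to-double widenings are lossless inclusions, but long-to-double is not, which is why long gets its own island above:

```java
public class WideningIslands {
    public static void main(String[] args) {
        // int -> long widening (the i2l bytecode) never loses a value:
        int i = -123_456_789;
        long asLong = i;            // implicit widening, compiled to i2l
        System.out.println(asLong == i);

        // float -> double widening (f2d) is likewise lossless:
        float f = 0.1f;
        double asDouble = f;        // implicit widening, compiled to f2d
        System.out.println(asDouble == f);

        // But long -> double is *not* an inclusion: double carries only
        // 53 bits of precision, so some longs do not survive the round trip.
        long big = (1L << 53) + 1;
        System.out.println((long) (double) big == big);
    }
}
```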
> In other words, they are *always total* on the types they would be strictly applicable to, which means any conditionality would come from conversions like boxing, unboxing, and widening. But I'm not sure pattern matching has quite as much to offer these more ad-hoc conversions.
>
> We have special rules for integer literals; the literal `0` has a standalone type of `int`, but in most contexts, can be narrowed to `byte`, `short`, or `char` if it fits into the range.
An unmarked literal behaves as if it has a one-point type, which,
under set inclusion, is a subtype of whatever standard types
happen to contain it.
The more intrusive phenomenon is byte <: short <: int <: ….
> When we were considering constant patterns, we considered whether those rules were helpful for applying in reverse to constant patterns, and concluded that it added a lot of complexity for little benefit.
Yes, that sounds right. But if we were to add constant patterns,
I think we should try hard to embrace the JLS subtyping rules
for primitive numbers. At least, I’m glad we did in JSR 292.
> Now that we've decided against constant patterns for the time being, it may be moot anyway, but let me draw the example as I think it might be helpful.
OK, switching gears to your hypothetical case…
>
> Consider the following switch:
>
> int anInt = 300;
>
> switch (anInt) {
> case byte b: A
> case short s: B
> case int i: C
> }
>
> What do we expect to happen?
Drawing on the precedents I mentioned, I think “anInt” is clearly
marked as “int”. (With “switch (300)” the target value is unmarked,
but that’s a stupid use case.) So we start with an integer value which
is in fact typed as “int”; the treatment of unmarked literals is a
red herring.
Next, we are being asked to perform a dynamic type test of an int
value against a byte case label. If we take the JLS at its word,
then byte <: int, and it is legitimate to ask an instance of int
whether, in fact, it is also an instance of byte. (This is the
“few islands” interpretation.) Compare switch (anObject) {
case String s: }, which makes an exactly parallel query of a
dynamic Object reference, asking whether it happens to also be a
reference to String.
A dynamic type test can also be understood as an attempt
to cast a value, to see if something goes wrong. If you
have an Object and cast it to (String) and get the object
back, you win; it matches a String. Likewise, if you have
an int and cast it to (byte) and get the int back (tricksy,
but consistent), you win again; it matches a byte.
If the conversion either changes the value or throws
an exception (which changes the result in a different
way), the pattern match should fail. I think this model
ties pattern matching tightly to the rules for conversions
in both directions, since if a casts to b, then b (usually)
implicitly converts to a. When a cast would win,
the data can flow in all directions without loss.
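The as-if-cast test can be written today as an ordinary helper method (the name `matchesByte` is hypothetical, chosen for illustration): cast, cast back, and see whether the value survived:

```java
public class CastCriterion {
    // "Cast and see if you get the value back": an int matches `byte b`
    // exactly when narrowing to byte and widening back is the identity.
    static boolean matchesByte(int x) {
        return (byte) x == x;
    }

    public static void main(String[] args) {
        System.out.println(matchesByte(42));    // fits in a byte
        System.out.println(matchesByte(300));   // narrows to 44, so no match
        System.out.println(matchesByte(-128));  // edge of the byte range
    }
}
```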
> One interpretation is that `byte b` is a pattern that is applicable to all integral types, and only matches the range of byte values. (In this interpretation, the second case would match.)
If we take the JLS at its word then this is a natural and unsurprising
interpretation.
> The other is that this is a type error; the patterns `byte b` and `short s` are not applicable to `int`, so the compiler complains. (In fact, in this interpretation, these patterns are always total, and their main use is in nested patterns.)
At least the compiler will complain, so people who unwittingly rely
on the JLS rules won’t get surprised. This is the “many islands,
no sharing” interpretation.
> If your initial reaction is that the first interpretation seems pretty good, beware that the sirens are probably singing to you. Yes, having the ability to say "does this int fit in a byte" is a reasonable test to want to be able to express.
The siren singing to me is sitting in the middle of the JLS. I think
I should listen to her. I don’t care whether the behavior is useful
or not; programmers will use it or not as they see fit.
> But cramming this into the semantics of the type pattern `byte b` is an exercise in complexity, since now we have to have special rules for each (from, to) pair of primitives we want to support.
As it was in JSR 292.
> Another flavor of this problem is:
>
> Object o = new Short(3);
>
> switch (o) {
> case byte b: A
> case short s: B
> }
>
> 3 can be crammed into a `byte`, and therefore could theoretically match the first case, but is this really the kind of complexity we want to layer atop the definition of primitive type patterns?
Nope. We confronted this in JSR 292 also, and realized that the
JLS (bless its heart) didn’t require us to do all that much work.
When a reference object is present, then casting to a primitive
type P *always* entails a cast to the wrapper type W corresponding
to P, and then an unboxing operation, with zero widening or narrowing.
(This cast criterion is really just a standard exercise in P/E
pair manipulation. The cast is P, the embedding E is an
inclusion or approximates one, and the condition is x=E(P(x)),
which detects whether x is identified with an element in the
range of P.)
Thus, applying the cast criterion above to your second example,
we find that “case byte” as applied to Object means “cast to Byte”
followed by “unbox the byte”. This fails at the first step in the
example. The JLS tells you what to do here, I think. At least
it told us what to do in JSR 292, in the parallel case.
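The two-step rule can be observed directly in today's cast behavior (the class and printed messages below are my own illustration): casting the boxed Short to a byte means casting the reference to Byte first, and that step is where the match fails, even though the value 3 would fit in a byte:

```java
public class WrapperCast {
    public static void main(String[] args) {
        Object o = Short.valueOf((short) 3);

        // Casting a reference to a primitive type P is a cast to the
        // corresponding wrapper W followed by an unbox, with zero
        // widening or narrowing. Here the first step, (Byte) o, fails:
        // a Short is not a Byte.
        try {
            byte b = (byte) (Byte) o;
            System.out.println("matched byte " + b);
        } catch (ClassCastException e) {
            System.out.println("no match: Short is not a Byte");
        }

        // The two-step cast through Short succeeds, of course:
        short s = (short) (Short) o;
        System.out.println("matched short " + s);
    }
}
```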
> I think there's a better answer: lean on explicit patterns for conversions.
Explicit is good. Implicit is fraught; sometimes good
but when it goes bad it’s very mysterious. Since it’s
a new part of the language, you can go for explicit,
even where corresponding parts of the JLS have already
opted for implicit. But if you do that, you may fragment
the language needlessly.
> The conversions from byte <--> int form an embedding projection pair,
In fact, taking the JLS at its word, it is a special P/E pair where
the byte -> int arrow is an inclusion (an identity map on one
island), not just a mapping between islands.
> which means that they are suited for a total factory + partial pattern pair:
>
> class int {
> static int fromByte(byte b) { return b; }
> pattern(byte b) fromByte() { ... succeed if target in range ... }
> }
Yes, explicit is good. In fact it may be good enough to rationalize
what the JLS says about byte <: short <: int. We can do this
by marking those embeddings that are to be treated as identity
embeddings:

class int {
    __IdentityEmbedding static int fromByte(byte b) { return b; }
    __IdentityEmbedding pattern(byte b) fromByte() { ... succeed if target in range ... }
}
In some other languages, __IdentityEmbedding is spelled “implicit”.
I think “implicit” conversions can be defined which cannot be
understood as identity embeddings, and I suppose that it is
such “rude implicits” which give implicit a bad name.
> Then we can replace the first switch with:
>
> switch (anInt) {
> case fromByte(var b): A // static or instance patterns on `int`
> case fromShort(var s): B
> }
Or if those conversions are identity embeddings
we can recover the earlier example as a shorthand.
> which is (a) explicit and (b) uses straight library code rather than complex language magic, and (c) scales to non-built-in primitive classes. (Readers may first think that the name `fromXxx` is backwards, rather than `toXxx`, but what we're asking is: "could this int have come from a byte-to-int conversion".)
(And we are also asking “if we cast this to a byte,
do we get the same value back from the byte?”)
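Until a `pattern` declaration exists in the language, the total-factory/partial-pattern pair can be sketched with ordinary methods (the names `fromByte` and `asByte` are illustrative, not proposed API): a total embedding one way, and a partial projection that succeeds exactly when the round trip is lossless:

```java
import java.util.Optional;

public class EmbeddingProjection {
    // Total embedding: every byte is an int (the E of the P/E pair).
    static int fromByte(byte b) { return b; }

    // Partial projection: succeeds exactly when the int round-trips,
    // i.e. when it "could have come from a byte-to-int conversion".
    static Optional<Byte> asByte(int i) {
        return (byte) i == i ? Optional.of((byte) i) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(asByte(fromByte((byte) 42)));  // present
        System.out.println(asByte(300));                  // empty
    }
}
```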
>
> So, strawman:
>
> A primitive type pattern `P p` is applicable _only_ to type `P` (and therefore is always
> total). Accordingly, their primary utility is as a nested pattern.
In other words, although reference type patterns respect
the reference type hierarchy, we choose to discard the
JLS-specified primitive type hierarchy. Instead of identity
embeddings (which come from sub/super links), we require
ad hoc named embeddings all around.
You haven’t persuaded me, yet, that it’s all so very hairy.
JSR 292 survived and thrived through a parallel exercise
in adhering to the JLS primitive rules. I don’t think you
would need a copy of the JLS primitive conversion rules
in the pattern matching spec; you just appeal to an
“as if cast” rule or some equivalent. You wouldn’t
even need new kinds of conversions, I think.
> Now, let's ask the same questions about boxing and unboxing. (Boxing is always total; unboxing might NPE.)
>
> Here, I think introducing boxing/unboxing conversions into pattern matching per se is even less useful. If a pattern binds an int, but we wanted an Integer (or vice versa), then we are free (by virtue of boxing/unboxing in assignment and related contexts) to just use the binding. For example:
>
> void m(Integer i) { ... }
> ...
> plus some pattern Foo(int x)
> ...
>
> switch (x) {
> case Foo(int x): m(x);
> }
>
> We don't care that we got an int out; when we need an Integer, the right thing happens. In the other direction, we have to worry about NPEs, but we can fix that with pattern tools we have:
>
> switch (x) {
> case Bar(Integer x & true(x != null)): ... safe to unbox x ...
Huh. Or, you could use an as-if-cast rule, and get the null
check without further ado.
switch (x) {
    case Bar(int x): … it was safe to cast (int)x, hence to unbox x ...
}
> So I think our strawman holds up: primitive type patterns are total on their type, with no added boxing/narrowing/widening weirdness.
Yeah, none of that JLS weirdness. I hear another
siren singing here: “Ugly mistakes of the past, you
shall be fixed…”
> We can characterize this as a new context in Ch5 ("conditional pattern match context"), that permits only identity and reference widening conversions.
(It’s an odd kind of conversion, one that is never performed,
just one that models a restricted set of type relations.
I would prefer to lean harder on the ones we have already.
Maybe there’s some reason that won’t work. The reasons
given above are *not* that reason; the complexity is
superficial and dodging it adds other complexity, such
as new niche conversion rules and needless asymmetry.)
> And when we get to Valhalla, the same is true for type patterns on primitive classes.
When we get there, we will still have to decide whether to
emulate some or all of the primitive numeric conversions.
If we do our homework now, it will be easier then.
> ** BONUS ROUND **
>
> Now, let's talk about pattern assignment statements, such as:
>
> Point(var x, var y) = aPoint
>
> The working theory is that the pattern on the LHS must be total on the type of the expression on the RHS, with some remainder allowed, and will throw on any remainder (e.g., can throw NPE on null).
Good.
> If we want to align this with the semantics of local variable declaration + initializer, we probably *do* want the full set of assignment-context conversions, which I think is fine in this context (so, a second new context: unconditional pattern assignment, which allows all the same conversions as are allowed in assignment context.)
That sounds right.
> If the set of conversions is the same, then we are well on our way to being able to interpret
>
> T t = e
>
> as *either* a local variable declaration, *or* a pattern match, without the user being able to tell the difference:
>
> - The scoping is the same (since the pattern either completes normally or throws);
> - The mutability is the same (we fixed this one just in time);
> - The set of conversions, applicable types, and potential exceptions are the same (exercise left to the reader.)
I like this a lot.
> Which means (drum roll) local variable assignment is revealed to have been a degenerate case of pattern match all along. (And the crowd goes wild.)
Yes, plain assignment also, if we wish. I think the best way
to do this may be to use a full declaration with a pattern at
its head, but to mark some of the binding variables inside
the pattern as “assign, do not bind fresh”. This would also
be useful in “instanceof” expressions.
After finishing our victory dance, we will realize that assignment
and declaration still have to be distinguished syntactically.
See also my previous message, which started on arrays, and
then worked around to list patterns and other extractors,
and proposed an “assign do not bind” marker to request
assignment, rather than distinct declaration and assignment
syntaxes.
HTH!
— John
More information about the amber-spec-experts
mailing list