null checks vs class resolution: taking a few steps back

Brian Goetz brian.goetz at oracle.com
Mon Apr 20 18:23:57 UTC 2020


As Fred mentions, he shared this with me a few days ago, and I found 
these arguments very persuasive.  We have already, several times, 
fixated on nullity when the problem was elsewhere, and this feels like 
yet another case of that.  Fixing the underlying technical debt -- that 
the language that `checkcast` and friends have for referring to classes 
is inadequate -- feels like addressing the real problem.

On 4/20/2020 1:20 PM, Frederic Parain wrote:
>
>
>
> Here are a few thoughts about the null checks vs class resolution issue
> (many thanks to Brian for his review and his improvements to this
> document).
>
>
>     Checkcast: is it a null issue or a type issue?
>
> There has been some discussion recently on how casts should be translated.
> While the static compiler has considerable latitude on how to 
> translate language
> constructs to bytecode, I’d like to make sure that we first have a 
> clean story
> at the bytecode level, and then take up the translation story (if we 
> still need
> to.)
>
>
>         History, and historical inconveniences
>
> Before Valhalla, classfiles had two ways to denote a reference type: 
> the plain
> name used in |CONSTANT_Class_info| entries, and the name within an 
> envelope in
> the field and method descriptors used in |CONSTANT_Fieldref_info|,
> |CONSTANT_Methodref_info| and |CONSTANT_InterfaceMethodref_info| entries.
>
> Having two syntaxes was already a sign that something was weird, but 
> we mostly
> wrote that off as a historical accident. (Worse, it is not even applied
> uniformly: arrays are always denoted with their envelope, even in
> |CONSTANT_Class_info| entries.) Aesthetics aside, it worked because 
> there was a
> single unambiguous translation from a class name to a class name with 
> envelope.
>
> In the bytecode sequence:
>
>     aload_1
>     checkcast    #10   // class Foo
>     invokestatic #19   // Method Bar:(LFoo;)V
>
> the real meaning of the |checkcast| was: “I guarantee that the top of
> stack is a reference to an instance of class |Foo| (a.k.a. |LFoo;|),
> otherwise I’ll throw an exception”. Because |null| is a valid value of
> all reference types, the JVM does not load the class |Foo| if the value
> on the top of the stack is |null|, and the verifier is still satisfied
> that the arguments on the stack match the signature of the method being
> invoked.
>
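To make the legacy contract concrete: a Java source cast compiles to `checkcast`, so the behavior described above is visible from plain Java. A minimal sketch; `LegacyCastDemo` and its helpers are names invented for this illustration, and it shows only the pass/fail behavior, not whether the target class gets loaded.

```java
// Illustrative only: LegacyCastDemo and its methods are hypothetical names.
class LegacyCastDemo {
    // javac emits a checkcast against java/lang/String for this cast.
    static String castToString(Object o) {
        return (String) o;
    }

    // Reports whether the cast above fails for the given reference.
    static boolean throwsClassCast(Object o) {
        try {
            castToString(o);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(castToString(null));                    // null passes the cast
        System.out.println(throwsClassCast("hello"));              // false: a String is a String
        System.out.println(throwsClassCast(Integer.valueOf(42)));  // true: an Integer is not
    }
}
```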
>
>         Valhalla turns up the pressure
>
> The Valhalla project introduces a new kind of envelope: |Q*;|. The
> spelling has remained the same, but its meaning has evolved with each
> prototype:
>
>   * With the |v*| bytecodes, it was a marker of a /new kind of type/;
>   * In L-world, it became a marker of /null-hostility/;
>   * In the current user model, it has become /part of the type/.
>
> The last two points require some explanation. In L-world, the L and Q 
> flavors
> of an inline class were projected from a single set of class metadata. 
> In this
> world, there were really three names — the L projection of C, the Q 
> projection
> of C, and the class C itself — all of which could be given meaning. So it
> still could make sense to denote a class just by name — but it’s not 
> clear this
> was a very good idea.
>
> For instance, the |defaultvalue| bytecode used a |CONSTANT_Class_info|
> entry referring to the value class by its plain name. This was
> unambiguous, because /of course/ the |defaultvalue| bytecode was
> referring to the Q-version of the type. (Until some future when we want
> to apply |defaultvalue| to reference types, and get |null| out.) The
> information was missing from the constant pool entry but deduced from
> the context because of the implicit assumption that |defaultvalue| only
> applies to Q-types. But there were other cases where even such implicit
> assumptions were not sufficient to deduce which variant of a value type
> should be used. The |checkcast| bytecode was one of these cases; it then
> became necessary to denote the class argument with the full envelope in
> order to express the expected behavior.
>
> With the new model of inline types, a class can only have one envelope:
> either |Q| if it is an inline type, or |L| otherwise. Which means that
> |LFoo;| and |QFoo;| are not two variants of the same type, but are in
> fact /two different types/.
>
> As much as we’d like to ignore it, if |Foo| is an inline type, it is still
> possible to forge a reference with type |LFoo;| — we can create a 
> class that
> declares a field of type |LFoo;|, instantiate an instance, and read 
> the field.
> This |LFoo;| is a pretty silly type; it cannot interact with any other 
> type, and
> it can only hold |null|. But the JVM has to deal with such silly types 
> all the
> time, such as |LBar;| when |Bar| is a nonexistent class. But the 
> reality is
> that |LFoo;| and |QFoo;| are two different types (with completely 
> disjoint value
> sets!), and we should be honest about it.
>
>     In the current inline type model, the envelope is an essential
>     part of the
>     identification of a type.
>
>
>     Checkcast
>
> The legacy behavior of |checkcast| is on a collision course with the 
> new type
> system. If the following bytecode sequence:
>
>     aload_1
>     checkcast    #10   // class Foo
>
> still means the same as before — checking that the reference on the top
> of the stack is of type |LFoo;| — we have a problem if |Foo| is an
> inline class, because if the top of stack holds |null|, the |checkcast|
> will succeed (because |null| is indeed a valid value of the
> otherwise-useless type |LFoo;|), but this is not really what we had in
> mind when we asked whether the top of the stack held a |Foo|.
>
> It is easy to assume that this is just yet another bad nullity 
> behavior, and
> forgivable to make this assumption because |null| has been the source 
> of so much
> bad behavior in the past. But this would be putting the blame in the wrong
> place.
>
>     In this example, the |checkcast| operation is simply operating on
>     the wrong
>     type, assuming |LFoo;| where it has no right to do so —
>     |LFoo;| and |QFoo;| are
>     completely distinct types.
>
>
>         Quick, plug the hole!
>
> There was a lot of discussion on the EG mailing list, and many
> proposals for ways to restore peace and tranquility. Unfortunately, they
> all seem to be “quick fixes”, each likely to generate new problems of
> its own. Without recapitulating the details of each of them, here’s a
> summary of their shortcomings:
>
>  *
>     *Generate a different sequence of bytecodes when casting to an inline
>     type.* This is a workaround for the current |checkcast| behavior,
>     but is
>     likely to cause trouble for generic code in the future that is
>     specializable
>     over both identity and inline types, because the goal is to share the
>     bytecode across instantiations, and only patch the constant pool
>     or type
>     descriptors.
>  *
>     *Use |Class::cast|.* |Class::cast| is a generic method returning
>     T, which is erased to |Object|, which will hide the type
>     information the verifier needs to guarantee correctness of method
>     argument types.
>  *
>     *Use |invokedynamic| to call custom behavior.* This carries a
>     serious risk of bootstrapping issues.
>  *
>     *Invent a |checknull| bytecode.* This, and other solutions
>     focusing on the handling of |null|, address the symptom, not the
>     problem. The problem is not the handling of |null|, it is
>     /checking that a particular value is within the value set of this
>     particular type/. The |null| reference should not be handled
>     separately; its treatment should just fall out of addressing the
>     general question of whether a given value is in the value set of a
>     given type.
>
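The `Class::cast` objection in the list above can be checked with reflection: the erased return type of the method is `Object`, which is all the verifier would get to see. A small sketch; `ClassCastErasureDemo` is a made-up name.

```java
import java.lang.reflect.Method;

// Illustrative only: ClassCastErasureDemo is a hypothetical name.
class ClassCastErasureDemo {
    // Returns the erased (bytecode-level) return type of Class::cast.
    static Class<?> erasedReturnTypeOfCast() {
        try {
            Method cast = Class.class.getMethod("cast", Object.class);
            return cast.getReturnType();
        } catch (NoSuchMethodException e) {
            // Class::cast(Object) is a public method of java.lang.Class.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The declared return type T has been erased.
        System.out.println(erasedReturnTypeOfCast().getName()); // java.lang.Object
        // Like checkcast on an L-type, Class::cast lets null through.
        System.out.println(String.class.cast(null) == null);    // true
    }
}
```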
> All of these solutions feel like quick fixes that are likely to come
> back to bite us in the future. Let’s solve the real problem instead.
>
>
>     Concrete proposal
>
> Let’s fix this by fixing the underlying problem — being explicit about 
> what
> type we are dealing with. Specifically, from Valhalla and beyond, the 
> way to
> denote a class type in a classfile is always a class name with an 
> envelope.
>
> The two possible envelopes (currently) are the L-envelope for types 
> with a value
> set containing |null|, and the Q-envelope for types with a value set not
> containing |null|.
>
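As a rough model of the rule just stated, here is a sketch that splits an enveloped name into its two parts and answers whether `null` is in the value set. `EnvelopedType` and its methods are hypothetical, not part of any JVM or prototype API.

```java
// Illustrative only: EnvelopedType is a hypothetical model class.
final class EnvelopedType {
    final char envelope; // 'L' or 'Q'
    final String name;   // class name without the envelope, e.g. "Foo"

    private EnvelopedType(char envelope, String name) {
        this.envelope = envelope;
        this.name = name;
    }

    // Parses "LFoo;" or "QFoo;" into envelope + name.
    static EnvelopedType parse(String descriptor) {
        if (descriptor.length() < 3 || !descriptor.endsWith(";")) {
            throw new IllegalArgumentException("not an enveloped name: " + descriptor);
        }
        char envelope = descriptor.charAt(0);
        if (envelope != 'L' && envelope != 'Q') {
            throw new IllegalArgumentException("unknown envelope: " + envelope);
        }
        String name = descriptor.substring(1, descriptor.length() - 1);
        return new EnvelopedType(envelope, name);
    }

    // L-types have null in their value set, Q-types do not.
    boolean valueSetContainsNull() {
        return envelope == 'L';
    }

    public static void main(String[] args) {
        System.out.println(parse("LFoo;").valueSetContainsNull()); // true
        System.out.println(parse("QFoo;").valueSetContainsNull()); // false
    }
}
```

Note that both descriptors carry the same name, `Foo`; only the envelope distinguishes the two types.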
> This has several pleasant consequences:
>
>  *
>     All representations within the class file itself are unified:
>     |CONSTANT_Class_info|, |CONSTANT_Fieldref_info|,
>     |CONSTANT_Methodref_info|
>     and |CONSTANT_InterfaceMethodref_info| will all use the same
>     syntax, with no
>     more translation required between names and type descriptors.
>  *
>     Class denotation will be aligned with array denotation, which
>     already uses
>     type descriptors in |CONSTANT_Class_info| entries.
>  *
>     All bytecodes referencing a |CONSTANT_Class_info| entry will have
>     access to
>     the full denotation, envelope + name, even when the class has not been
>     loaded yet.
>  *
>     The verifier will no longer have to translate between names and type
>     descriptors.
>
> For the |checkcast| bytecode, the semantics have to be rephrased:
> |checkcast| must ensure that the reference on the top of the stack is
> within the value set of the type specified by its operand, or throw an
> exception. For |L| types, this is the same behavior as before, but for
> |Q| types, the behavior reflects the value set of the type specified in
> the classfile. If we have:
>
>     aload_1
>     checkcast    #10   // class LFoo;
>
> then |checkcast| is being used with a type using an L-envelope, so we
> still know |null| is within the value set of |Foo| without having to
> load |Foo|. If the top of stack is not the |null| reference, then |Foo|
> must be loaded to check if this value is part of the rest of |Foo|’s
> value set, as before.
>
> On the other hand, if we have:
>
>     aload_1
>     checkcast    #11   // class QBar;
>
> then |checkcast| is used with a type using a Q-envelope, which means
> |null| cannot be part of the value set of |Bar|. So if the top of stack
> contains the |null| reference, an exception can be thrown (again,
> without loading |Bar| if we so desire). If the top of stack is not the
> |null| reference, then |Bar| must be loaded to check if this value is
> part of |Bar|’s value set, as before.
>
> The bytecode sequence is the same for both inline types and 
> not-inline-types,
> with the behavior being controlled by a constant pool entry, making it 
> suitable
> for our specialization model, and the semantics being derived from the 
> type on
> which |checkcast| operates.
>
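The two cases above can be modeled in plain Java. This is a sketch, not the HotSpot implementation: `CheckcastModel`, `load`, `fails` and the `loads` counter are invented names, ordinary JDK classes stand in for `Foo` and `Bar` (so `Qjava/lang/Integer;` is pure pretense), and `ClassCastException` is an arbitrary choice because the text only says "an exception".

```java
// Illustrative only: a toy model of the proposed checkcast semantics.
class CheckcastModel {
    // Counts simulated class loads so "no load on null" is observable.
    static int loads = 0;

    // Hypothetical stand-in for class resolution.
    static Class<?> load(String internalName) {
        loads++;
        try {
            return Class.forName(internalName.replace('/', '.'));
        } catch (ClassNotFoundException e) {
            throw new NoClassDefFoundError(internalName);
        }
    }

    // Models checkcast against an enveloped name such as "Ljava/lang/String;".
    static Object checkcast(String descriptor, Object ref) {
        char envelope = descriptor.charAt(0);
        String name = descriptor.substring(1, descriptor.length() - 1);
        if (ref == null) {
            if (envelope == 'Q') {
                // Assumption: the exception type is not specified in the text.
                throw new ClassCastException("null is not in the value set of " + descriptor);
            }
            return null; // L-envelope: null is in the value set, nothing to load
        }
        Class<?> target = load(name); // non-null: the class is needed to decide
        if (!target.isInstance(ref)) {
            throw new ClassCastException(ref.getClass().getName() + " is not a " + name);
        }
        return ref;
    }

    // Reports whether the modeled checkcast throws.
    static boolean fails(String descriptor, Object ref) {
        try {
            checkcast(descriptor, ref);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(checkcast("Ljava/lang/String;", null)); // null
        System.out.println(fails("Qjava/lang/Integer;", null));     // true
        System.out.println(loads);                                  // 0
        System.out.println(fails("Qjava/lang/Integer;", 42));       // false
        System.out.println(loads);                                  // 1
    }
}
```

The same `checkcast` routine serves both envelopes; only the descriptor string differs, which mirrors the point that behavior is controlled by the constant pool entry rather than by a different bytecode sequence.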
> The benefits of always using a name+envelope will be less significant
> for other bytecodes, but they still do exist. (For example, using |new|
> on an inline type could be caught at verification time instead of at
> runtime.)
>
>     Let’s take this
>     opportunity to address the real problem — correct denotation of
>     types — rather
>     than pinning the blame on |null| (however many sins it committed
>     in the past.)
>     The current loose treatment of non-enveloped names has already
>     caused trouble,
>     and will be a huge source of technical debt going forward. Let’s
>     just pay it
>     off.
>
>
>         Backward compatibility
>
> Pre-Valhalla class files only know about the L-envelope, so the JVM can
> continue to deal with them by applying the old default translation from
> names to |L*;| descriptors. The implementation of |checkcast| won’t have
> to check the class file version, as the behavior can be deduced directly
> from the content of the |CONSTANT_Class_info| (plain name -> old syntax,
> name with envelope -> new syntax). New classfiles will reject the old
> syntax.
>
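The old-classfile path described in the quoted note amounts to a purely syntactic check. A sketch under assumptions: `ClassInfoNames` and its methods are hypothetical, and it covers only the default translation for pre-Valhalla class files, not the rejection of plain names in new ones. It relies on the fact that a plain class name in a class file cannot contain `;` or `[`.

```java
// Illustrative only: ClassInfoNames is a hypothetical helper.
class ClassInfoNames {
    // An entry already carries an envelope if it is an array descriptor
    // or has the shape L...; or Q...;
    static boolean hasEnvelope(String entry) {
        if (entry.startsWith("[")) {
            return true;
        }
        return entry.endsWith(";") && (entry.startsWith("L") || entry.startsWith("Q"));
    }

    // Plain names get the legacy default translation to an L-descriptor.
    static String toDescriptor(String entry) {
        return hasEnvelope(entry) ? entry : "L" + entry + ";";
    }

    public static void main(String[] args) {
        System.out.println(toDescriptor("Foo"));    // LFoo;
        System.out.println(toDescriptor("QBar;"));  // QBar;
        System.out.println(toDescriptor("[LFoo;")); // [LFoo;
    }
}
```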



More information about the valhalla-spec-observers mailing list