Evolving CONSTANT_Class

Mon Jun 15 20:54:14 UTC 2020

> On Jun 15, 2020, at 1:28 PM, Brian Goetz <brian.goetz at oracle.com> wrote:
> 
>> Another way to handle it is to distinguish between a *species*, which is a class-like entity, and a *species type*. It's helpful to remember that there may be inline types of species (that is, a "Q envelope" of a species).
> 
> I think this is a fruitful direction; I can have `ArrayList[T] extends List[T]` where it is a class-like use, and I can have `Foo[T].x` where it is a type-like use.  

Concretely, what does this mean for the class file? Are you suggesting that 'List[T]' and 'Foo[T]', above, should have different encodings? Or at least represent different entities?

What seems attractive to me for now is that we have CONSTANT_Species for the first one, and some sort of type encoding (probably referencing CONSTANT_Species) for the second one.

>> 1) Treat everything in the class/interface table as a degenerate use of a type. A class name is always interpreted as an L type.
> 
> Given that a specializable class Foo<T> gives rise to species Foo[x] and Foo[y], _and_ a class type Foo such that Foo[t] <: Foo for all t, the duality between class and type here seems inevitable.  

There *are* two concepts here. That seems inevitable. But it's possible that, as a lumping move, we'll *encode* all class-flavored uses as types, and then infer the intended class from whatever type gets used.

So, e.g.: a CONSTANT_Class encodes a type, full stop. 'this_class' refers to a type that is the type of 'this' in the current class. 'new' refers to a type that is the class type of a new class instance. NestHost refers to the type of 'this' for the class that acts as the nest host. Etc.

>> - How do we handle type variables, both top-level and nested? Either we embed constant pool pointers in Utf8 entries (yuck!), or we need to extend Class constants to support references both to Utf8 entries and to [some new thing].
> 
> This is the stringy-vs-tree problem we've been wrestling with for a long time.  The solution to this problem seems to hinge on the solution to that one.  

>> - Should we revisit "naked" descriptor references, allowing them to point to either bare Utf8 entries or Class constants and MethodType/[something else] constants? Do we try to migrate the world away from naked descriptor references?
> 
> I think this may well fall out of the "trees vs strings" discussion.

Without getting in the weeds on "trees vs. strings", let's just assume we come up with a solution. That solution is very likely not going to embed constant pool pointers in a Utf8 (because tools that manipulate constant pool pointers would be sad to be in the business of parsing/rewriting Utf8 strings). The solution is thus going to need at least 4 bytes (two pointers), so it can express "List[T]" with some encoding of "List" and a pointer to T. The implication is that it's a new flavor of constant. Call that CONSTANT_SpecializedDescriptor.
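To make the "two pointers" point concrete, here is a minimal sketch of what such a constant's payload could look like. Everything beyond the name CONSTANT_SpecializedDescriptor is an assumption for illustration: the field names, the choice of exactly one type argument, and the omission of a tag byte are all hypothetical.

```java
import java.nio.ByteBuffer;

// Hypothetical shape for the constant called CONSTANT_SpecializedDescriptor above.
// Field names, the single type argument, and the missing tag byte are assumptions.
record SpecializedDescriptor(int genericIndex, int argIndex) {

    SpecializedDescriptor {
        checkU2(genericIndex);
        checkU2(argIndex);
    }

    private static void checkU2(int index) {
        if (index < 1 || index > 0xFFFF) {
            throw new IllegalArgumentException("not a nonzero u2 index: " + index);
        }
    }

    /** Encodes the two constant pool pointers as 4 big-endian bytes. */
    byte[] payload() {
        return ByteBuffer.allocate(4)
                .putShort((short) genericIndex)
                .putShort((short) argIndex)
                .array();
    }

    /** Reads the two pointers back as unsigned 16-bit values. */
    static SpecializedDescriptor parse(byte[] payload) {
        ByteBuffer buf = ByteBuffer.wrap(payload);
        return new SpecializedDescriptor(buf.getShort() & 0xFFFF, buf.getShort() & 0xFFFF);
    }
}
```

A tool that renumbers the constant pool only has to rewrite the two indexes here; it never has to parse a string.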

So, to rephrase my questions in terms of the class file format:

- What does checkcast point to? A CONSTANT_Class is already allowed. We need to either add CONSTANT_SpecializedDescriptor or <something new> as a legal target, or stick with CONSTANT_Class and let a CONSTANT_Class point to a CONSTANT_SpecializedDescriptor.

- What does the descriptor_index of a field_info point to? A Utf8 is already allowed. CONSTANT_SpecializedDescriptor seems like a natural fit, too. What about CONSTANT_Class instead, or in addition? Is it a "bug" that descriptor_index can't be a CONSTANT_Class already, or is that an intentional design choice?

- What does the descriptor_index of a method_info point to? Same questions, except CONSTANT_MethodType seems to be the analog to CONSTANT_Class here. Or maybe we want to invent a new analog.

>> I'm appealing here to a design principle that seems to have driven the original constant pool design: Class constants are for things that get resolved (and can be cached); descriptor strings are little more than fancy names. This principle doesn't always get followed: the verifier sometimes loads classes named by descriptors; array type class constants resolve their element types without a separate entry; more recently, StackMapTables use Class constants to represent types, and MethodTypes resolve method descriptors "as if" there were class constants for all of the parameter types. But I think these, especially the recent ones, are mistakes, and I still think the original notion is a useful separation of concerns that we should try to follow in our design.
> 
> The tension that comes up here is that we want to be able to match descriptors between clients and declarations.  I don't want to invent one way to describe class constants for species, and another way to embed species in descriptors.  

But this is what the class file has already done! There's the descriptor 'Ljava/lang/Object;', and the constant CONSTANT_Class('java/lang/Object'). An over-arching thing here is whether we think that dual encoding is a mistake, or whether it's a feature of the design.
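As a small illustration of that existing dual encoding, the following sketch converts between the two spellings of the same class or array type. The class and method names are hypothetical; the only facts relied on are that a CONSTANT_Class names an ordinary class by its internal name (no 'L' and ';'), and names an array type by its descriptor.

```java
// Hypothetical helper; it only handles class and array types, not primitives.
final class DualEncoding {

    /** Field descriptor for the type named by a CONSTANT_Class entry. */
    static String descriptorOf(String classConstantName) {
        // Array types are already spelled as descriptors inside CONSTANT_Class.
        return classConstantName.startsWith("[")
                ? classConstantName
                : "L" + classConstantName + ";";
    }

    /** The name a CONSTANT_Class entry would use for a class or array descriptor. */
    static String classConstantNameOf(String descriptor) {
        if (descriptor.startsWith("[")) {
            return descriptor;
        }
        if (descriptor.startsWith("L") && descriptor.endsWith(";")) {
            return descriptor.substring(1, descriptor.length() - 1);
        }
        throw new IllegalArgumentException("not a class or array descriptor: " + descriptor);
    }
}
```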

My take is that CONSTANT_Classes (along with, say, CONSTANT_Methodrefs) are designed for resolution, while Utf8 descriptors are designed for matching. Whatever we want to do about descriptors, I think we should at least have a species encoding that is designed for resolution. (Of course, we can define a resolution algorithm that can handle any encoding. But the idea of breaking up steps of resolution into separate constant pool pointers seems quite useful, directly encoding the "resolution tree" that gets activated when you ask to resolve the species.)
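Here is a rough sketch of what "directly encoding the resolution tree" could mean. It is not a proposal for the actual format: the entry kinds, field names, and the use of a printable string as the "resolved" result are all assumptions. The point is only that each pointer is a separately resolvable, separately cacheable step.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical entry kinds: a plain class, and a species that points at
// a class entry plus one entry per type argument.
interface PoolEntry {}
record ClassEntry(String name) implements PoolEntry {}
record SpeciesEntry(int classIndex, List<Integer> argIndexes) implements PoolEntry {}

final class SketchPool {
    private final Map<Integer, PoolEntry> entries;
    private final Map<Integer, String> cache = new HashMap<>();
    int cacheMisses = 0; // counts how many entries were actually resolved

    SketchPool(Map<Integer, PoolEntry> entries) {
        this.entries = entries;
    }

    /** Resolves an entry once; a string stands in for the real resolved result. */
    String resolve(int index) {
        String cached = cache.get(index);
        if (cached != null) {
            return cached;
        }
        cacheMisses++;
        PoolEntry entry = entries.get(index);
        String result;
        if (entry instanceof ClassEntry c) {
            result = c.name();
        } else if (entry instanceof SpeciesEntry s) {
            // Each pointer is its own resolution step, shared with other users.
            StringBuilder sb = new StringBuilder(resolve(s.classIndex())).append('[');
            for (int i = 0; i < s.argIndexes().size(); i++) {
                if (i > 0) {
                    sb.append(", ");
                }
                sb.append(resolve(s.argIndexes().get(i)));
            }
            result = sb.append(']').toString();
        } else {
            throw new IllegalArgumentException("no resolvable entry at index " + index);
        }
        cache.put(index, result);
        return result;
    }
}
```

With entries 1 = List, 2 = String, 3 = species(1, [2]) and 4 = species(1, [3]), resolving entry 4 resolves each of the four entries exactly once, even though entry 1 is reached twice.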

> Now, it may be possible (depending on our translation strategy) that we don't need to embed species in descriptors, because we're just going to erase descriptors, and put the specialization information somewhere else, for the VM to use opportunistically.  That would make the splitting strategy more appealing.  

Back to my taxonomy in the first mail, we really need up to three things:
- A resolvable encoding of the species itself (e.g., for 'new')
- A resolvable encoding of the species type (e.g., for 'checkcast' or as a type argument)
- A descriptor-like encoding of the species type (e.g., for 'field_info' and CONSTANT_NameAndType)

Some of these you may be able to remove from the requirements list, but I don't think that gets you very far.
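The taxonomy above can be written down as a small table. This is only a restatement of the list, with hypothetical enum and constant names; it does not settle which constant pool forms would actually carry each encoding.

```java
// The three encodings from the list above; all names here are hypothetical.
enum SpeciesEncoding {
    RESOLVABLE_SPECIES,       // the species itself
    RESOLVABLE_SPECIES_TYPE,  // the species type, in resolvable form
    DESCRIPTOR_SPECIES_TYPE   // the species type, in descriptor-like form
}

// Example use sites named in the list, each mapped to the encoding it would need.
enum UseSite {
    NEW(SpeciesEncoding.RESOLVABLE_SPECIES),
    CHECKCAST(SpeciesEncoding.RESOLVABLE_SPECIES_TYPE),
    TYPE_ARGUMENT(SpeciesEncoding.RESOLVABLE_SPECIES_TYPE),
    FIELD_INFO(SpeciesEncoding.DESCRIPTOR_SPECIES_TYPE),
    NAME_AND_TYPE(SpeciesEncoding.DESCRIPTOR_SPECIES_TYPE);

    final SpeciesEncoding encoding;

    UseSite(SpeciesEncoding encoding) {
        this.encoding = encoding;
    }
}
```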

> Don't forget that when you have a local generic class nested in a generic method, the method args implicitly parameterize the nested class.  Which means that when we refer to a species of the local class, we have to supply the type arguments for both the method and for the local class (and any other enclosing classes.)  Again, there is a lump/split choice here; we can smoosh together the arguments, or provide a trail of witnesses to the enclosing arguments.  If we choose the latter, then it might be a mix of C_SMRef and C_Species.

Yeah, if we don't flatten these nests into a top-level class with a long list of type arguments, the outer class/method is one more step in the resolution algorithm that would map nicely to one more pointer in the constant pool encoding.
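A minimal sketch of the lump/split choice, assuming a hypothetical Witness record in which each link holds its own type arguments plus one pointer to its enclosing context. The "split" form is the linked chain itself; the "lumped" form is what you get by flattening it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical link in a "trail of witnesses": a species or specialized method,
// its own type arguments, and a pointer to the enclosing context (null at top level).
record Witness(String name, List<String> typeArgs, Witness enclosing) {

    /** The "lumped" view: every type argument in the chain, outermost first. */
    List<String> flattenedArgs() {
        List<String> out = enclosing == null
                ? new ArrayList<>()
                : new ArrayList<>(enclosing.flattenedArgs());
        out.addAll(typeArgs);
        return out;
    }
}
```

For a local class Local[String] declared inside a method m specialized at [int], the chain is two links long, and flattening it yields the single argument list [int, String].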