Trying out the API for qbicc

Wed Jul 27 15:56:10 UTC 2022

Thanks for giving it a try!

On 7/27/2022 10:42 AM, David Lloyd wrote:
> Our qbicc project handles classfiles extensively, as one might 
> imagine, for both parsing and generation of classes. I've been playing 
> around with using the new API instead. So far, I am liking this API a 
> lot; the design is overall very sensible and usable as far as I can 
> tell; though it is still pretty early and not a lot is working yet, I 
> do have some initial impressions.
>
> Parsing
>
> When we parse a method body, our parser is processing it directly into 
> SSA form for later analysis. We're doing this in a depth-first 
> recursive manner, where we start from the top of the method, and at 
> each instruction which corresponds to a basic block terminator (which 
> would be things like GOTO*, IF*, *SWITCH, *RETURN, ATHROW, and also 
> some special cases like method invocation inside of a `try` block), we 
> close up the current block and recursively process each unprocessed 
> successor block (if any). In this way we naturally ignore any 
> unreachable bytecodes - not just from bizarrely-formed class files 
> (though this is possible) but also from parsing conditional constructs 
> where we can establish a constant condition early in processing.
>
> However this approach depends on being able to randomly access the 
> bytecode body. This seems doable with the new API, but unless I missed 
> some helper method(s), to do so apparently requires iterating the 
> instruction list, collecting all of the labels, and building a 
> label-to-integer-key mapping to locate the list indices where 
> processing should be resumed for a given label. It would certainly be 
> nice to be able to have a more flexible seeking solution, like a 
> special iterator API which can seek based on labels for example.

In ASM, you would use the "tree" API, to materialize the body into a 
random-access data structure.  This is a bit unfortunate, because (a) 
the tree API is much slower than the streaming API, and (b) it is also 
somewhat different from the streaming API.  (And mutable.)

We intend to improve on that, by having the "materialized" API just be 
"put the elements in a list/tree structure".  For 
ClassModel/MethodModel, you can see the idea in play; you can stream the 
elements of a ClassModel, and you'll get methods, fields, etc, but you 
could also just call ClassModel::fields and it will materialize (and 
cache) a List<FieldModel> and return that. What you want is the 
equivalent for CodeModel, which is conceptually similar but we are 
missing a few things.

You can of course call CodeModel::elementList and get a 
List<CodeElement> out, which includes the label targets inline.  What's 
missing is the ability to map labels to *list indexes*.  We know we want 
this; we made a stab at it in an early prototype, which was a mess 
(because some other things were a mess), but we would like to return to it.
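
For illustration, here is a minimal sketch of the label-to-index map a 
client has to build by hand today.  The `Element`, `Label` and `Instr` 
types are hypothetical stand-ins, not the actual API types; the only 
point is the single pass that records the list index of each inline 
label target.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for code elements; these are NOT the real API types.
final class LabelIndexSketch {
    interface Element {}
    // A label target that appears inline in the element list.
    record Label(String name) implements Element {}
    // An instruction, with an optional branch target (null if none).
    record Instr(String mnemonic, Label target) implements Element {}

    // One pass over the list, recording the list index of each label target.
    static Map<Label, Integer> indexLabels(List<Element> elements) {
        Map<Label, Integer> index = new HashMap<>();
        for (int i = 0; i < elements.size(); i++) {
            if (elements.get(i) instanceof Label l) {
                index.put(l, i);
            }
        }
        return index;
    }

    // A small made-up method body with two labels.
    static List<Element> sample() {
        Label loop = new Label("loop");
        Label exit = new Label("exit");
        return List.of(
            new Instr("iconst_0", null),
            loop,
            new Instr("ifeq", exit),
            new Instr("goto", loop),
            exit,
            new Instr("return", null));
    }
}
```

With such a map, a depth-first parser can resume at 
`elements.get(index.get(label))` for each unprocessed successor.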

> Another issue with the strong encapsulation of BCI as labels is that 
> it does not seem possible to find the BCI of an arbitrary instruction.

This is related to a comment recently from Rafael, in that this works 
when we are traversing a *bound* CodeModel, but not a buffered code 
model (which might result from an intermediate stage of a 
transformation.)  If we are OK with making operations like bci() 
partial, we can address this by, say, defining a refined 
`Iterator<CodeElement>` that also has a bci() accessor.  This works when 
parsing, though not necessarily when transforming; that might be OK.
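
A rough sketch of what a "partial" bci() could look like, assuming it is 
modeled with `OptionalInt`.  The `CodeIterator` name, the `iterate` 
helper and the use of a null offset array to stand for a buffered model 
are all assumptions made for this example, not part of the API.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.OptionalInt;

// Hypothetical sketch; none of these names come from the actual API.
final class BciIteratorSketch {
    // An iterator refined with a partial bci() accessor, which reports the
    // offset of the element most recently returned by next(), if known.
    interface CodeIterator<E> extends Iterator<E> {
        OptionalInt bci();
    }

    // offsets == null models a buffered (unbound) code model with no known BCIs.
    static <E> CodeIterator<E> iterate(List<E> elements, int[] offsets) {
        return new CodeIterator<E>() {
            private int cursor = 0;

            public boolean hasNext() {
                return cursor < elements.size();
            }

            public E next() {
                if (!hasNext()) throw new NoSuchElementException();
                return elements.get(cursor++);
            }

            public OptionalInt bci() {
                // Empty before the first next() call, or when no offsets exist.
                if (offsets == null || cursor == 0) return OptionalInt.empty();
                return OptionalInt.of(offsets[cursor - 1]);
            }
        };
    }
}
```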

> Generation
>
> We also generate classes for various purposes so I was doing some 
> experiments with this as well. So far I have found this to be fairly 
> straightforward, but I have so far encountered one minor API issue 
> with this API (which to be completely fair ASM also suffers from).
>
> With ASM, when you're emitting instructions, you have to know not only 
> the opcode of the instruction you're emitting but also the particular 
> API method which corresponds to the correct instruction shape. This is 
> excusable to an extent within ASM, because the opcode argument is an 
> `int` so if there was only one overloaded method name with every 
> shape, it might be too easy to make a mistake (never mind how 
> relatively poetic it would have been for the main assembly method in 
> ASM to be called `asm` :-) ).

You should think of the generation methods as layered.  At the most 
abstract, there is `with(CodeElement)`.  Every other generation method 
bottoms out here.  At the next level, there are the ones that correspond 
to the coarse categories in the data model, such as `load(kind, slot)` 
or `operator(opc)`.  At the finest level, there are methods for 
aload_0() and ishl(), which again all bottom out in `with(CodeElement)`.
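
The layering can be sketched with a miniature builder.  The method names 
follow the ones mentioned above, but the `Builder`, `Load`, `Operator` 
and `TypeKind` types here are simplified stand-ins invented for the 
example, not the real API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature builder; NOT the real API.
final class LayeredBuilderSketch {
    enum TypeKind { INT, REFERENCE }
    interface CodeElement {}
    record Load(TypeKind kind, int slot) implements CodeElement {}
    record Operator(String opcode) implements CodeElement {}

    static final class Builder {
        final List<CodeElement> elements = new ArrayList<>();

        // Most abstract layer: every other method bottoms out here.
        Builder with(CodeElement e) {
            elements.add(e);
            return this;
        }

        // Middle layer: one method per coarse category in the data model.
        Builder load(TypeKind kind, int slot) { return with(new Load(kind, slot)); }
        Builder operator(String opcode) { return with(new Operator(opcode)); }

        // Finest layer: one method per concrete instruction.
        Builder aload_0() { return load(TypeKind.REFERENCE, 0); }
        Builder ishl() { return operator("ishl"); }
    }
}
```

In this sketch `aload_0()`, `load(TypeKind.REFERENCE, 0)` and 
`with(new Load(TypeKind.REFERENCE, 0))` all append an equal element.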

Our assumption is that most "hand coded" generation code will prefer the 
most fine-grained ones, pattern-driven transformation code will probably 
do things like match on `LoadInstruction` and turn around and call 
load() again, maybe with different arguments, and "purely mechanical" 
transformation code will probably prefer just making elements and 
shoveling them down the pipeline.

> However, this API is otherwise very strongly typed, taking full 
> advantage of the new pattern matching and sealing capabilities. So I 
> was a bit surprised when all instruction opcodes were still 
> represented by a single type (in this case an `enum`), even though 
> there are enough different opcode shapes or characteristics to warrant 
> *six* different constructors.

I think you may be mixing the Opcode and Instruction abstractions? The 
`Opcode` abstraction is explicitly about bytecodes and bytecode-specific 
metadata, whereas an Instruction is an instantiation of an Opcode + 
operands.  (Some instructions, of course, have no operands (e.g., 
`iadd`); in this case, you'll notice the implementation has a singleton 
cache.)

The Opcode type mostly serves the implementation, to facilitate mapping 
to metadata (instruction size, kind, etc), and to manage the weirdness 
of the WIDE opcodes.  (If it were not for WIDE, I'd probably have just 
gone with `byte` and lookup functions.)

I find it a little unfortunate that some methods like `branch` require 
an Opcode argument -- feels like mixing levels, as you suggest -- but 
the alternatives were worse.

> Would it not make sense to make `Opcode` a sealed interface, with an 
> enum for each opcode shape?

We tried something like this early on.  It ran into the problem that 
switching over multiple enums in one switch is not supported.  So having 
multiple enums may be more rich in modeling, but clients pay a penalty 
-- multiple switches.  This didn't feel like a good trade.  (It is 
possible the API and implementation have evolved since then, to make this 
less problematic, but that would have to be established.)
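
To make the penalty concrete, here is a sketch of the double dispatch a 
client ends up with when opcodes are split across enums.  `BranchOpcode`, 
`OperatorOpcode` and `describe` are made-up names for the example, and 
the split shown is only one possible way to carve things up.

```java
// Hypothetical opcode model split across two enums; NOT the real API.
final class SplitEnumSketch {
    interface Opcode {}
    enum BranchOpcode implements Opcode { IFEQ, IFNE, GOTO }
    enum OperatorOpcode implements Opcode { IADD, ISHL }

    static String describe(Opcode opc) {
        // First dispatch on which enum this is, then switch within that enum.
        if (opc instanceof BranchOpcode b) {
            return switch (b) {
                case GOTO -> "unconditional branch";
                case IFEQ, IFNE -> "conditional branch";
            };
        } else if (opc instanceof OperatorOpcode o) {
            return switch (o) {
                case IADD -> "int add";
                case ISHL -> "int shift left";
            };
        }
        throw new IllegalArgumentException("unknown opcode: " + opc);
    }
}
```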

> In this way, instead of having a method for each of *many* (but not 
> all) instructions (many of which are highly similar internally) and 
> several overlapping ASM-like "emit this shape by name" methods for 
> *some* other instructions - which ambiguously accepts a plain, 
> generally-typed Opcode - there could be (many fewer) emit methods 
> which accept a specific opcode type as the first argument and the 
> correct argument values for subsequent arguments?

I don't think there would be "many fewer" methods; it just means that 
some of the type checking can be moved from runtime to compile time 
(e.g., branch(opc, label) wouldn't let you use IADD as the opcode).   
But I would think all the same methods would be there, just with tighter 
types.
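
A small sketch of the difference, using invented types.  `Kind`, 
`Opcode`, `BranchOpcode` and both `branch` overloads are hypothetical, 
and labels are plain strings to keep the example short.

```java
// Hypothetical sketch; NOT the real API.
final class BranchCheckSketch {
    enum Kind { BRANCH, OPERATOR }

    // A single general opcode type carrying its kind as metadata.
    enum Opcode {
        IFEQ(Kind.BRANCH), GOTO(Kind.BRANCH), IADD(Kind.OPERATOR);

        final Kind kind;
        Opcode(Kind kind) { this.kind = kind; }
    }

    // A narrower type that contains only branch opcodes.
    enum BranchOpcode { IFEQ, GOTO }

    // General Opcode parameter: a non-branch opcode is only rejected at runtime.
    static String branch(Opcode opc, String label) {
        if (opc.kind != Kind.BRANCH) {
            throw new IllegalArgumentException(opc + " is not a branch opcode");
        }
        return opc + " -> " + label;
    }

    // Tighter parameter type: passing IADD here would not compile.
    static String branch(BranchOpcode opc, String label) {
        return opc + " -> " + label;
    }
}
```

Both overloads exist side by side here only for comparison; the method 
count is the same either way, and only where the check happens differs.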

> Obviously this is wandering dangerously close to the bikeshed 
> borderline, however one other real-world advantage is that an enum 
> constant in a more specific `*Opcode` subtype type can store more 
> useful information about itself that a consumer could use; for 
> example, the opcode constant for `IFEQ` could have a method 
> `complement` which yields `IFNE`, which can be useful for simplifying 
> some code generators (and I can think of specific cases both within 
> qbicc and within Quarkus where this would have been useful).

This method exists in the library as an Opcode -> Opcode method.
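
As a shape-only illustration of an Opcode -> Opcode function of this 
kind (the enum and the `complement` helper below are stand-ins written 
for this example, not the library's actual code or naming):

```java
// Hypothetical stand-in enum and helper; NOT the library's actual code.
final class ComplementSketch {
    enum Opcode { IFEQ, IFNE, IFLT, IFGE, IADD }

    // Maps a conditional branch opcode to the one testing the opposite condition.
    static Opcode complement(Opcode opc) {
        return switch (opc) {
            case IFEQ -> Opcode.IFNE;
            case IFNE -> Opcode.IFEQ;
            case IFLT -> Opcode.IFGE;
            case IFGE -> Opcode.IFLT;
            default -> throw new IllegalArgumentException(opc + " has no complement");
        };
    }
}
```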

Cheers,
-Brian