Trying out the API for qbicc

Wed Jul 27 14:42:44 UTC 2022

Our qbicc project handles classfiles extensively, as one might imagine, for
both parsing and generation of classes. I've been playing around with using
the new API instead. So far I am liking this API a lot: the design is
overall very sensible and usable as far as I can tell. It is still pretty
early and not a lot is working yet, but I do have some initial
impressions.

Parsing

When we parse a method body, our parser is processing it directly into SSA
form for later analysis. We're doing this in a depth-first recursive
manner, where we start from the top of the method, and at each instruction
which corresponds to a basic block terminator (which would be things like
GOTO*, IF*, *SWITCH, *RETURN, ATHROW, and also some special cases like
method invocation inside of a `try` block), we close up the current block
and recursively process each unprocessed successor block (if any). In this
way we naturally ignore any unreachable bytecodes - not just from
bizarrely-formed class files (though this is possible) but also from
parsing conditional constructs where we can establish a constant condition
early in processing.
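
To make the shape of that walk concrete, here is a minimal sketch over a toy instruction list. Everything in it (`DfsBlockWalk`, `Insn`, `Kind`, index-based branch targets) is made up for illustration and is not qbicc's actual parser; it also skips details a real parser needs, such as splitting a block when fall-through reaches a branch target.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical toy model; not qbicc's real parser and not the new API's types.
final class DfsBlockWalk {
    enum Kind { PLAIN, GOTO, IF, RETURN }

    // 'target' is an instruction index, meaningful only for GOTO and IF.
    record Insn(Kind kind, int target) {
        static Insn plain() { return new Insn(Kind.PLAIN, -1); }
    }

    // Returns the start index of every block visited, in visit order.
    static List<Integer> reachableBlockStarts(List<Insn> code) {
        List<Integer> starts = new ArrayList<>();
        visit(code, 0, new BitSet(), starts);
        return starts;
    }

    private static void visit(List<Insn> code, int start, BitSet seen, List<Integer> starts) {
        if (start >= code.size() || seen.get(start)) return;
        seen.set(start);
        starts.add(start);
        for (int i = start; i < code.size(); i++) {
            Insn insn = code.get(i);
            switch (insn.kind()) {
                case PLAIN -> { }
                // A terminator closes the block; recurse into unprocessed successors.
                case GOTO -> { visit(code, insn.target(), seen, starts); return; }
                case IF -> {
                    visit(code, insn.target(), seen, starts);
                    visit(code, i + 1, seen, starts);
                    return;
                }
                case RETURN -> { return; }
            }
        }
    }
}
```

Instructions that no visited block leads to are simply never touched, which is the property described above.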

However, this approach depends on being able to randomly access the bytecode
body. This seems doable with the new API, but unless I missed some helper
method(s), to do so apparently requires iterating the instruction list,
collecting all of the labels, and building a label-to-integer-key mapping
to locate the list indices where processing should be resumed for a given
label.  It would certainly be nice to be able to have a more flexible
seeking solution, like a special iterator API which can seek based on
labels for example.
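
The workaround I have in mind looks roughly like the sketch below. The `Element`, `Label` and `Insn` types here are hypothetical stand-ins that I made up so the example is self-contained; they are not the API's real element types, and real labels would presumably be compared by identity rather than by name.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for the API's code elements, for illustration only.
final class LabelIndex {
    sealed interface Element permits Label, Insn { }
    record Label(String name) implements Element { }
    record Insn(String mnemonic) implements Element { }

    // One linear pass: remember the list index at which each label appears,
    // so that processing can later resume at the position of any label.
    static Map<Label, Integer> index(List<Element> elements) {
        Map<Label, Integer> positions = new HashMap<>();
        for (int i = 0; i < elements.size(); i++) {
            if (elements.get(i) instanceof Label label) {
                positions.put(label, i);
            }
        }
        return positions;
    }
}
```

With such a map, "seek to label" becomes a lookup followed by `elements.listIterator(index)`, at the cost of an extra pass over the whole method body up front.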

Of course, another option is to rework the algorithm to process every basic
block in the bytecode from top to bottom, and let unreachable basic blocks
fall out of the graph. This is not an unreasonable option, though it does
generally require at least one more pass of processing in order to do
things like number the blocks, identify loops, and establish reachability.
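
As a rough illustration of what the top-to-bottom variant might look like, here is a sketch with one linear pass to find block starts and a second pass to establish reachability. The types are again a hypothetical toy model (index-based branch targets, four instruction kinds), not anything from qbicc or the new API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.TreeSet;

// Hypothetical toy model, for illustration only.
final class LinearBlockScan {
    enum Kind { PLAIN, GOTO, IF, RETURN }
    record Insn(Kind kind, int target) { }

    // Pass 1: every block start ("leader") in bytecode order, reachable or not.
    static TreeSet<Integer> leaders(List<Insn> code) {
        TreeSet<Integer> leaders = new TreeSet<>();
        if (!code.isEmpty()) leaders.add(0);
        for (int i = 0; i < code.size(); i++) {
            Insn insn = code.get(i);
            if (insn.kind() == Kind.PLAIN) continue;
            if (insn.kind() != Kind.RETURN) leaders.add(insn.target());
            // Whatever follows a terminator starts a new block.
            if (i + 1 < code.size()) leaders.add(i + 1);
        }
        return leaders;
    }

    // Pass 2: the subset of leaders reachable from the entry block.
    static TreeSet<Integer> reachable(List<Insn> code, TreeSet<Integer> leaders) {
        TreeSet<Integer> seen = new TreeSet<>();
        Deque<Integer> work = new ArrayDeque<>();
        if (!code.isEmpty()) work.push(0);
        while (!work.isEmpty()) {
            int start = work.pop();
            if (!seen.add(start)) continue;
            // The block ends just before the next leader (or at the end of the code).
            Integer next = leaders.higher(start);
            Insn last = code.get((next == null ? code.size() : next) - 1);
            switch (last.kind()) {
                case PLAIN -> { if (next != null) work.push(next); } // fall through
                case GOTO -> work.push(last.target());
                case IF -> { work.push(last.target()); if (next != null) work.push(next); }
                case RETURN -> { }
            }
        }
        return seen;
    }
}
```

Blocks that appear in `leaders` but not in `reachable` are the ones that fall out of the graph.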

Another issue with the strong encapsulation of BCI as labels is that it
does not seem possible to find the BCI of an arbitrary instruction. This
can be a problem for example when we need to record the BCI of a method
invocation within a try block. A usable solution could be to automatically
generate a Label before any invocation within a try/catch region (unless
this is already being done). It also makes debugging more difficult, as
presently every node in our program graph records original line number and
BCI information to make it easy to correlate subgraphs with their original
bytecodes.

This is less of a problem on the generation side since it appears that I
can generate a label before any instruction, and then collect the
corresponding BCIs after the method body is compiled.

Generation

We also generate classes for various purposes, so I was doing some
experiments with this as well. So far I have found this to be fairly
straightforward, but I have encountered one minor issue with this API
(which, to be completely fair, ASM also suffers from).

With ASM, when you're emitting instructions, you have to know not only the
opcode of the instruction you're emitting but also the particular API
method which corresponds to the correct instruction shape. This is
excusable to an extent within ASM, because the opcode argument is an `int`
so if there was only one overloaded method name with every shape, it might
be too easy to make a mistake (never mind how relatively poetic it would
have been for the main assembly method in ASM to be called `asm` :-) ).

However, this API is otherwise very strongly typed, taking full advantage
of the new pattern matching and sealing capabilities. So I was a bit
surprised when all instruction opcodes were still represented by a single
type (in this case an `enum`), even though there are enough different
opcode shapes or characteristics to warrant *six* different constructors.

Would it not make sense to make `Opcode` a sealed interface, with an enum
for each opcode shape? In this way, instead of having a method for each of
*many* (but not all) instructions (many of which are highly similar
internally) and several overlapping ASM-like "emit this shape by name"
methods for *some* other instructions - which ambiguously accept a plain,
generally-typed Opcode - there could be (many fewer) emit methods which
accept a specific opcode type as the first argument and the correct
argument values for subsequent arguments? Then as a developer, I need only
to know which opcode I want to emit, and in my IDE I can for example type
`cb.emit(GOTO, <ctrl-P>` and immediately see that the GOTO instruction
requires a `Label` argument, because that overload will be unambiguously
selected. This also makes it much harder to make an error involving the
wrong instruction shape; invalid-opcode errors that this API would only
raise at run time would then be detectable directly within one's IDE
without even having to compile, which improves ease of use.
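
To show what I mean, here is a small sketch of the idea. All of the names in it (`NoArgOpcode`, `BranchOpcode`, `LocalOpcode`, this `CodeBuilder`, this `Label`) are hypothetical and only cover a handful of opcodes; the builder records a textual listing rather than producing real bytecode.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these types are the API's real ones.
final class ShapedOpcodes {
    // One enum per instruction shape, tied together by a sealed interface.
    sealed interface Opcode permits NoArgOpcode, BranchOpcode, LocalOpcode { }
    enum NoArgOpcode implements Opcode { NOP, IADD, RETURN }
    enum BranchOpcode implements Opcode { GOTO, IFEQ, IFNE }
    enum LocalOpcode implements Opcode { ILOAD, ISTORE }

    record Label(String name) { }

    // Stand-in builder: one emit overload per shape, selected by the opcode's type.
    static final class CodeBuilder {
        private final List<String> listing = new ArrayList<>();

        CodeBuilder emit(NoArgOpcode op) {
            listing.add(op.name());
            return this;
        }

        CodeBuilder emit(BranchOpcode op, Label target) {
            listing.add(op.name() + " " + target.name());
            return this;
        }

        CodeBuilder emit(LocalOpcode op, int slot) {
            listing.add(op.name() + " " + slot);
            return this;
        }

        List<String> listing() { return List.copyOf(listing); }
    }
}
```

With this arrangement, something like `cb.emit(BranchOpcode.GOTO, 0)` or `cb.emit(LocalOpcode.ILOAD, someLabel)` fails to compile because no overload matches, instead of failing when the code runs.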

Obviously this is wandering dangerously close to the bikeshed borderline.
However, one other real-world advantage is that an enum constant in a more
specific `*Opcode` subtype can store more useful information about
itself that a consumer could use; for example, the opcode constant for
`IFEQ` could have a method `complement` which yields `IFNE`, which can be
useful for simplifying some code generators (and I can think of specific
cases both within qbicc and within Quarkus where this would have been
useful).
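
A minimal sketch of such a method, using a hypothetical `IfOpcode` enum that covers only the six compare-with-zero branches:

```java
// Hypothetical enum for illustration; not part of the API under discussion.
enum IfOpcode {
    IFEQ, IFNE, IFLT, IFGE, IFGT, IFLE;

    // Yields the opcode that branches exactly when this one does not.
    IfOpcode complement() {
        return switch (this) {
            case IFEQ -> IFNE;
            case IFNE -> IFEQ;
            case IFLT -> IFGE;
            case IFGE -> IFLT;
            case IFGT -> IFLE;
            case IFLE -> IFGT;
        };
    }
}
```

A code generator that wants to swap the taken and fall-through branches can then emit `op.complement()` with the targets exchanged, rather than carrying its own lookup table.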
-- 
- DML • he/him