<div dir="ltr"><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Our qbicc project handles classfiles extensively, as one might imagine, for both parsing and generation of classes. I've been playing around with using the new API instead. So far, I am liking this API a lot; the design is overall very sensible and usable as far as I can tell. Though it is still pretty early and not a lot is working yet, I do have some initial impressions.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Parsing</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">When we parse a method body, our parser processes it directly into SSA form for later analysis. We do this in a depth-first recursive manner: we start from the top of the method, and at each instruction which corresponds to a basic block terminator (things like GOTO*, IF*, *SWITCH, *RETURN, ATHROW, and also some special cases like method invocation inside of a `try` block), we close up the current block and recursively process each unprocessed successor block (if any). In this way we naturally ignore any unreachable bytecodes - not just from bizarrely-formed class files (though this is possible) but also from parsing conditional constructs where we can establish a constant condition early in processing.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">However, this approach depends on being able to randomly access the bytecode body. 
This seems doable with the new API, but unless I missed some helper method(s), to do so apparently requires iterating the instruction list, collecting all of the labels, and building a label-to-integer-key mapping to locate the list indices where processing should be resumed for a given label.  It would certainly be nice to be able to have a more flexible seeking solution, like a special iterator API which can seek based on labels for example.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Of course another option is to rework the algorithm to process every basic block in the bytecode from top to bottom, and let unreachable basic blocks fall out of the graph. This is not an unreasonable option, though it does generally require at least one more pass of processing in order to do things like number the blocks, identify loops, and establish reachability.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Another issue with the strong encapsulation of BCI as labels is that it does not seem possible to find the BCI of an arbitrary instruction. This can be a problem for example when we need to record the BCI of a method invocation within a try block. A usable solution could be to automatically generate a Label before any invocation within a try/catch region (unless this is already being done). 
It also makes debugging more difficult, as presently every node in our program graph records original line number and BCI information to make it easy to correlate subgraphs with their original bytecodes.</div><div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">This is less of a problem on the generation side since it appears that I can generate a label before any instruction, and then collect the corresponding BCIs after the method body is compiled.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Generation</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">We also generate classes for various purposes, so I have been experimenting with this as well. So far I have found this to be fairly straightforward, but I have encountered one minor issue with this API (which, to be completely fair, ASM also suffers from).</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">With ASM, when you're emitting instructions, you have to know not only the opcode of the instruction you're emitting but also the particular API method which corresponds to the correct instruction shape. 
This is excusable to an extent within ASM, because the opcode argument is an `int`, so if there were only one overloaded method name covering every shape, it might be too easy to make a mistake (never mind how relatively poetic it would have been for the main assembly method in ASM to be called `asm` :-) ).</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">However, this API is otherwise very strongly typed, taking full advantage of the new pattern matching and sealing capabilities. So I was a bit surprised that all instruction opcodes are still represented by a single type (in this case an `enum`), even though there are enough different opcode shapes or characteristics to warrant *six* different constructors.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Would it not make sense to make `Opcode` a sealed interface, with an enum for each opcode shape? In this way, instead of having a method for each of *many* (but not all) instructions (many of which are highly similar internally) and several overlapping ASM-like "emit this shape by name" methods for *some* other instructions - which ambiguously accept a plain, generally-typed Opcode - there could be (many fewer) emit methods which accept a specific opcode type as the first argument and the correct argument values for subsequent arguments. Then as a developer, I need only know which opcode I want to emit, and in my IDE I can for example type `cb.emit(GOTO, &lt;ctrl-P&gt;` and immediately see that the GOTO instruction requires a `Label` argument, because that overload will be unambiguously selected. 
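</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">To make the idea concrete, here is a rough, self-contained sketch of the kind of shape I have in mind. None of these names are taken from the actual API: `BranchOpcode`, `LocalVarOpcode`, `NoOperandOpcode`, the stand-in `Label` record, and the `SketchBuilder` class are all hypothetical, invented purely for illustration, and only a handful of opcodes are shown.</div>
<pre>
```java
// Hypothetical sketch only: none of these types exist in the real API.

/** Root of the opcode hierarchy; sealed so the set of shapes is closed. */
sealed interface Opcode permits BranchOpcode, LocalVarOpcode, NoOperandOpcode {
    /** The numeric opcode value as defined by the JVM specification. */
    int bytecode();
}

/** Opcodes whose only operand is a branch target. */
enum BranchOpcode implements Opcode {
    GOTO(0xa7), IFEQ(0x99), IFNE(0x9a);

    private final int bytecode;
    BranchOpcode(int bytecode) { this.bytecode = bytecode; }
    public int bytecode() { return bytecode; }
}

/** Opcodes whose only operand is a local variable slot. */
enum LocalVarOpcode implements Opcode {
    ILOAD(0x15), ISTORE(0x36);

    private final int bytecode;
    LocalVarOpcode(int bytecode) { this.bytecode = bytecode; }
    public int bytecode() { return bytecode; }
}

/** Opcodes which take no operands at all. */
enum NoOperandOpcode implements Opcode {
    IADD(0x60), RETURN(0xb1);

    private final int bytecode;
    NoOperandOpcode(int bytecode) { this.bytecode = bytecode; }
    public int bytecode() { return bytecode; }
}

/** Stand-in for the API's label type (assumption: a name is enough here). */
record Label(String name) {}

/** Stand-in builder with exactly one emit overload per opcode shape. */
final class SketchBuilder {
    // Collects a textual listing instead of real bytecode, to keep this short.
    private final StringBuilder out = new StringBuilder();

    SketchBuilder emit(BranchOpcode op, Label target) {
        out.append(op).append(' ').append(target.name()).append('\n');
        return this;
    }

    SketchBuilder emit(LocalVarOpcode op, int slot) {
        out.append(op).append(' ').append(slot).append('\n');
        return this;
    }

    SketchBuilder emit(NoOperandOpcode op) {
        out.append(op).append('\n');
        return this;
    }

    String listing() {
        return out.toString();
    }
}
```
</pre>
<div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Given a `SketchBuilder cb`, the call `cb.emit(BranchOpcode.GOTO, new Label("top"))` compiles, whereas `cb.emit(BranchOpcode.GOTO, 3)` is rejected by the compiler because no overload pairs a branch opcode with an `int`.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">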
This also makes it much harder to make an error involving the wrong instruction shape; invalid-opcode errors that this API would only raise at run time would then be detectable directly within one's IDE without even having to compile, which improves ease of use.</div></div><div><br></div><div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Obviously this is wandering dangerously close to the bikeshed borderline; however, one other real-world advantage is that an enum constant in a more specific `*Opcode` subtype can store more useful information about itself that a consumer could use. For example, the opcode constant for `IFEQ` could have a method `complement` which yields `IFNE`, which can be useful for simplifying some code generators (and I can think of specific cases both within qbicc and within Quarkus where this would have been useful).</div></div><div><span style="font-family:arial,helvetica,sans-serif"></span></div><div><div>-- <br><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">- DML • he/him<br></div><div dir="ltr"><br></div></div></div></div></div>