<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
Thanks for giving it a try! <br>
<br>
<div class="moz-cite-prefix">On 7/27/2022 10:42 AM, David Lloyd
wrote:<br>
</div>
<blockquote type="cite" cite="mid:CANghgrT=r6qSJTck80cVNMu8nZepzhHOu7CDaaJzmcay3m-PCQ@mail.gmail.com">
<div dir="ltr">
<div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Our qbicc
project handles classfiles extensively, as one might imagine,
for both parsing and generation of classes. I've been playing
around with using the new API instead. <span class="gmail_default" style="font-family:Arial,Helvetica,sans-serif">S</span><span class="gmail_default">o far, I am liking this API a lot; the
design is overall very sensible and usable as far as I can
tell; though </span>it is still pretty early and not a lot
is working yet, I do have some initial impressions.</div>
<div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br>
</div>
<div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Parsing</div>
<div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br>
</div>
<div class="gmail_default" style="font-family:arial,helvetica,sans-serif">When we parse a
method body, our parser is processing it directly into SSA
form for later analysis. We're doing this in a depth-first
recursive manner, where we start from the top of the method,
and at each instruction which corresponds to a basic block
terminator (which would be things like GOTO*, IF*, *SWITCH,
*RETURN, ATHROW, and also some special cases like method
invocation inside of a `try` block), we close up the current
block and recursively process each unprocessed successor block
(if any). In this way we naturally ignore any unreachable
bytecodes - not just from bizarrely-formed class files (though
this is possible) but also from parsing conditional constructs
where we can establish a constant condition early in
processing.</div>
<div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br>
</div>
<div class="gmail_default" style="font-family:arial,helvetica,sans-serif">However this
approach depends on being able to randomly access the bytecode
body. This seems doable with the new API, but unless I missed
some helper method(s), to do so apparently requires iterating
the instruction list, collecting all of the labels, and
building a label-to-integer-key mapping to locate the list
indices where processing should be resumed for a given label.
It would certainly be nice to be able to have a more flexible
seeking solution, like a special iterator API which can seek
based on labels for example.</div>
</div>
</blockquote>
<br>
In ASM, you would use the "tree" API to materialize the body into a
random-access data structure. This is a bit unfortunate, because
(a) the tree API is much slower than the streaming API, and (b) it
is also somewhat different from the streaming API. (And mutable.)
<br>
<br>
We intend to improve on that, by having the "materialized" API just
be "put the elements in a list/tree structure". For
ClassModel/MethodModel, you can see the idea in play; you can stream
the elements of a ClassModel, and you'll get methods, fields, etc,
but you could also just call ClassModel::fields and it will
materialize (and cache) a List&lt;FieldModel&gt; and return that.
What you want is the equivalent for CodeModel, which is conceptually
similar but we are missing a few things. <br>
<br>
You can of course call CodeModel::elementList and get a
List&lt;CodeElement&gt; out, which includes the label targets
inline. What's missing is the ability to map labels to *list
indexes*. We know we want this. We made a stab at it in an early
prototype, and it was a mess (because some other things were a mess),
but we would like to return to it. <br>
<br>
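For illustration, here is a minimal sketch of the label-to-list-index
mapping a client has to build by hand today. `Element`, `Label`, `Insn`
and `LabelIndex` are made-up stand-ins, not the library's types, and the
sketch assumes labels carry small dense integer ids. <br>
<br>
<pre>
```java
import java.util.Arrays;

class LabelIndex {
    // Returns an array whose slot i holds the list index of the label with id i,
    // or -1 if that label never appears in the body.
    static int[] build(Element[] body, int labelCount) {
        int[] indexOf = new int[labelCount];
        Arrays.fill(indexOf, -1);
        for (int i = 0; i < body.length; i++) {
            if (body[i] instanceof Label label) {
                indexOf[label.id()] = i;
            }
        }
        return indexOf;
    }

    public static void main(String[] args) {
        Element[] body = {
            new Insn("iload_0"),
            new Insn("ifeq L1"),
            new Insn("iconst_1"),
            new Insn("ireturn"),
            new Label(1),
            new Insn("iconst_0"),
            new Insn("ireturn"),
        };
        int[] indexOf = build(body, 2);
        // A parser would resume processing the successor block from this index.
        System.out.println("label 1 is at list index " + indexOf[1]);
    }
}

// Hypothetical stand-ins for code elements (assumptions, not the library's API).
interface Element {}
record Label(int id) implements Element {}
record Insn(String text) implements Element {}
```
</pre>
One pass over the element list is enough, after which seeking to a
label is an array lookup. <br>
<br>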
<blockquote type="cite" cite="mid:CANghgrT=r6qSJTck80cVNMu8nZepzhHOu7CDaaJzmcay3m-PCQ@mail.gmail.com">
<div dir="ltr">Another issue with the strong encapsulation of BCI
as labels is that it does not seem possible to find the BCI of
an arbitrary instruction. </div>
</blockquote>
<br>
This is related to a recent comment from Rafael: finding the BCI
works when we are traversing a *bound* CodeModel, but not a buffered
one (which might result from an intermediate stage of a
transformation). If we are OK with making operations like bci()
partial, we can address this by, say, defining a refined
`Iterator&lt;CodeElement&gt;` that also has a bci() accessor. This
works when parsing, but not necessarily when transforming, and that
might be OK. <br>
<br>
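A rough sketch of what such a partial bci() accessor could look like,
under stated assumptions: `BciCursor` and `Insn` are hypothetical, the
cursor only mirrors the hasNext()/next() shape of an iterator, and each
stand-in instruction carries its own encoded size instead of being
decoded from bytes. <br>
<br>
<pre>
```java
import java.util.NoSuchElementException;
import java.util.OptionalInt;

// Hypothetical cursor over a method body; none of these names come from the library.
class BciCursor {
    private final Insn[] body;
    private final boolean bound;   // true when the body was parsed from a classfile
    private int nextIndex = 0;     // index of the element next() will return
    private int nextBci = 0;       // bci of that element
    private int currentBci = -1;   // bci of the element last returned, -1 before the first

    BciCursor(Insn[] body, boolean bound) {
        this.body = body;
        this.bound = bound;
    }

    boolean hasNext() {
        return nextIndex < body.length;
    }

    Insn next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        Insn insn = body[nextIndex++];
        currentBci = nextBci;
        nextBci += insn.sizeInBytes();
        return insn;
    }

    // Partial operation: empty for buffered bodies and before the first next().
    OptionalInt bci() {
        return bound && currentBci >= 0 ? OptionalInt.of(currentBci) : OptionalInt.empty();
    }

    public static void main(String[] args) {
        Insn[] body = {
            new Insn("iload_0", 1), new Insn("ifeq", 3),
            new Insn("iconst_1", 1), new Insn("ireturn", 1),
        };
        BciCursor cursor = new BciCursor(body, true);
        while (cursor.hasNext()) {
            Insn insn = cursor.next();
            System.out.println(cursor.bci().getAsInt() + ": " + insn.mnemonic());
        }
    }
}

// Stand-in instruction that carries its own encoded size in bytes.
record Insn(String mnemonic, int sizeInBytes) {}
```
</pre>
Returning an empty `OptionalInt` is just one way to express
partiality; throwing for unbound bodies would be another. <br>
<br>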
<blockquote type="cite" cite="mid:CANghgrT=r6qSJTck80cVNMu8nZepzhHOu7CDaaJzmcay3m-PCQ@mail.gmail.com">
<div dir="ltr">
<div>Generation
<div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br>
</div>
<div class="gmail_default" style="font-family:arial,helvetica,sans-serif">We also
generate classes for various purposes so I was doing some
experiments with this as well. So far I have found this to
be fairly straightforward, but I have so far encountered one
minor API issue with this API (which to be completely fair
ASM also suffers from).</div>
<div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br>
</div>
<div class="gmail_default" style="font-family:arial,helvetica,sans-serif">With ASM,
when you're emitting instructions, you have to know not only
the opcode of the instruction you're emitting but also the
particular API method which corresponds to the correct
instruction shape. This is excusable to an extent within
ASM, because the opcode argument is an `int` so if there was
only one overloaded method name with every shape, it might
be too easy to make a mistake (never mind how relatively
poetic it would have been for the main assembly method in
ASM to be called `asm` :-) ).</div>
</div>
</div>
</blockquote>
<br>
You should think of the generation methods as layered. At the most
abstract level, there is `with(CodeElement)`. Every other generation
method bottoms out here. At the next level, there are the ones that
correspond to the coarse categories in the data model, such as
`load(kind, slot)` or `operator(opc)`. At the finest level, there
are methods such as `aload_0()` and `ishl()`, which again all bottom
out in `with(CodeElement)`. <br>
<br>
Our assumption is that most "hand coded" generation code will prefer
the most fine-grained methods; pattern-driven transformation code will
probably do things like match on `LoadInstruction` and turn around
and call load() again, maybe with different arguments; and "purely
mechanical" transformation code will probably prefer just making
elements and shoveling them down the pipeline. <br>
<br>
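The layering can be sketched with a toy builder. The method names echo
the ones mentioned above, but `MiniCodeBuilder`, `TypeKind`, `Op`,
`Load` and `Operator` are invented for this illustration and their
signatures are assumptions, not the library's. <br>
<br>
<pre>
```java
// Hypothetical miniature builder showing three layers that share one sink.
class MiniCodeBuilder {
    private final StringBuilder emitted = new StringBuilder();

    // Most abstract layer: every other generation method bottoms out here.
    MiniCodeBuilder with(Element element) {
        emitted.append(element).append('\n');
        return this;
    }

    // Middle layer: one method per coarse category of the data model.
    MiniCodeBuilder load(TypeKind kind, int slot) {
        return with(new Load(kind, slot));
    }

    MiniCodeBuilder operator(Op op) {
        return with(new Operator(op));
    }

    // Finest layer: one method per instruction, convenient for hand-written code.
    MiniCodeBuilder aload_0() {
        return load(TypeKind.REFERENCE, 0);
    }

    MiniCodeBuilder ishl() {
        return operator(Op.ISHL);
    }

    String dump() {
        return emitted.toString();
    }

    public static void main(String[] args) {
        // Three spellings of the same load, one per layer.
        System.out.print(new MiniCodeBuilder()
                .aload_0()
                .load(TypeKind.REFERENCE, 0)
                .with(new Load(TypeKind.REFERENCE, 0))
                .dump());
    }
}

// Stand-in model types (assumptions made for this sketch).
enum TypeKind { INT, REFERENCE }
enum Op { IADD, ISHL }

interface Element {}
record Load(TypeKind kind, int slot) implements Element {}
record Operator(Op op) implements Element {}
```
</pre>
Because everything funnels through `with`, a transformation can
intercept a single method and still see every element. <br>
<br>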
<blockquote type="cite" cite="mid:CANghgrT=r6qSJTck80cVNMu8nZepzhHOu7CDaaJzmcay3m-PCQ@mail.gmail.com">
<div dir="ltr">
<div>
<div class="gmail_default" style="font-family:arial,helvetica,sans-serif">However, this
API is otherwise very strongly typed, taking full advantage
of the new pattern matching and sealing capabilities. So I
was a bit surprised when all instruction opcodes were still
represented by a single type (in this case an `enum`), even
though there are enough different opcode shapes or
characteristics to warrant *six* different constructors.</div>
</div>
</div>
</blockquote>
<br>
I think you may be mixing the Opcode and Instruction abstractions?
The `Opcode` abstraction is explicitly about bytecodes and
bytecode-specific metadata, whereas an Instruction is an
instantiation of an Opcode + operands. (Some instructions, of
course, have no operands (e.g., `iadd`); in this case, you'll
notice the implementation has a singleton cache.) <br>
<br>
The Opcode type mostly serves the implementation, to facilitate
mapping to metadata (instruction size, kind, etc), and to manage the
weirdness of the WIDE opcodes. (If it were not for WIDE, I'd
probably have just gone with `byte` and lookup functions.) <br>
<br>
I find it a little unfortunate that some methods like `branch`
require an Opcode argument -- feels like mixing levels, as you
suggest -- but the alternatives were worse. <br>
<br>
<blockquote type="cite" cite="mid:CANghgrT=r6qSJTck80cVNMu8nZepzhHOu7CDaaJzmcay3m-PCQ@mail.gmail.com">
<div dir="ltr">
<div>
<div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Would it not
make sense to make `Opcode` a sealed interface, with an enum
for each opcode shape? </div>
</div>
</div>
</blockquote>
<br>
We tried something like this early on. It ran into the problem that
switching over multiple enums in one switch is not supported. So
having multiple enums may be richer in modeling terms, but clients pay
a penalty -- multiple switches. This didn't feel like a good
trade. (It is possible the API and implementation have evolved since
then, to make this less problematic, but that would have to be
established.)<br>
<br>
<blockquote type="cite" cite="mid:CANghgrT=r6qSJTck80cVNMu8nZepzhHOu7CDaaJzmcay3m-PCQ@mail.gmail.com">
<div dir="ltr">
<div>
<div class="gmail_default" style="font-family:arial,helvetica,sans-serif">In this way,
instead of having a method for each of *many* (but not all)
instructions (many of which are highly similar internally)
and several overlapping ASM-like "emit this shape by name"
methods for *some* other instructions - which ambiguously
accepts a plain, generally-typed Opcode - there could be
(many fewer) emit methods which accept a specific opcode
type as the first argument and the correct argument values
for subsequent arguments? </div>
</div>
</div>
</blockquote>
<br>
I don't think there would be "many fewer" methods; it just means
that some of the type checking can be moved from runtime to compile
time (e.g., branch(opc, label) wouldn't let you use IADD as the
opcode). But I would think all the same methods would be there,
just with tighter types. <br>
<br>
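To make the trade-off concrete, here is a small sketch of the
one-enum-per-shape idea. The split into `BranchOpcode` and
`OperatorOpcode`, and the `TypedEmit` helper, are hypothetical; the
point is only that the same `branch` method exists, with a tighter
parameter type, while a client that inspects opcodes dispatches on the
type first and then switches. <br>
<br>
<pre>
```java
import java.util.Locale;

class TypedEmit {
    // Only branch opcodes are accepted, so passing OperatorOpcode.IADD does not compile.
    static String branch(BranchOpcode opc, String label) {
        return opc.name().toLowerCase(Locale.ROOT) + " " + label;
    }

    // The cost on the client side: one switch per enum instead of a single switch.
    static String describe(Opcode opc) {
        if (opc instanceof BranchOpcode b) {
            return switch (b) {
                case GOTO -> "unconditional branch";
                case IFEQ, IFNE -> "conditional branch";
            };
        }
        return switch ((OperatorOpcode) opc) {
            case IADD -> "addition";
            case ISHL -> "shift";
        };
    }

    public static void main(String[] args) {
        System.out.println(branch(BranchOpcode.IFEQ, "L1"));
        System.out.println(describe(OperatorOpcode.IADD));
        // branch(OperatorOpcode.IADD, "L1");  // rejected at compile time
    }
}

// Hypothetical split of opcodes into one enum per shape (not the library's API).
sealed interface Opcode permits BranchOpcode, OperatorOpcode {}
enum BranchOpcode implements Opcode { IFEQ, IFNE, GOTO }
enum OperatorOpcode implements Opcode { IADD, ISHL }
```
</pre>
<br>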
<blockquote type="cite" cite="mid:CANghgrT=r6qSJTck80cVNMu8nZepzhHOu7CDaaJzmcay3m-PCQ@mail.gmail.com">
<div dir="ltr">
<div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Obviously this
is wandering dangerously close to the bikeshed borderline,
however one other real-world advantage is that an enum
constant in a more specific `*Opcode` subtype type can store
more useful information about itself that a consumer could
use; for example, the opcode constant for `IFEQ` could have a
method `complement` which yields `IFNE`, which can be useful
for simplifying some code generators (and I can think of
specific cases both within qbicc and within Quarkus where this
would have been useful).</div>
</div>
</blockquote>
<br>
This method exists in the library as an Opcode -> Opcode method.<br>
<br>
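As a rough idea of the shape such a method takes, here is a sketch on
a made-up `CondBranch` enum covering six conditional branches. The
names are assumptions for illustration, and the library's own method
may well look different. <br>
<br>
<pre>
```java
// Hypothetical enum of conditional branches; not the library's Opcode type.
enum CondBranch {
    IFEQ, IFNE, IFLT, IFGE, IFGT, IFLE;

    // Returns the opcode that branches exactly when this one does not.
    CondBranch complement() {
        return switch (this) {
            case IFEQ -> IFNE;
            case IFNE -> IFEQ;
            case IFLT -> IFGE;
            case IFGE -> IFLT;
            case IFGT -> IFLE;
            case IFLE -> IFGT;
        };
    }

    public static void main(String[] args) {
        for (CondBranch op : values()) {
            System.out.println(op + " -> " + op.complement());
        }
    }
}
```
</pre>
<br>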
Cheers,<br>
-Brian<br>
</body>
</html>