a model for jextract runs
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Thu Aug 30 15:35:36 UTC 2018
Hi,
here are some thoughts on how I envision jextract to work - I think it
leads to some very desirable properties, but this is also my part of the
elephant [1] - others might be equally valid and/or applicable. This is
part of an ongoing effort to make jextract more usable, by avoiding
common user mistakes, a property that we have internally dubbed
'batteries-included'. You will soon start to see some code change
activity related to this - so this email sets the tone for the kind of
jextract we are moving towards: as the set of libraries accepted by
jextract grows, it is time to lay some more solid foundations beneath it.
First, I believe the right angle to look at what jextract does is that
of dynamic libraries. E.g. think of jextract as a tool that takes a
bunch of logically related header files and generates an artifact (*)
which models the dynamic library (e.g. .so/.dylib file) that would be
obtained by building the native code for real.
(*) In this email I will not dwell on how jextract should represent its
output artifact; should it be a jar file with some special manifest? A
jmod? This is largely an orthogonal question and I'm not going to make
any assumptions here on the shape of the generated artifact. But I'm
going to assume that the generated artifacts will provide some basic
info about how they have been obtained, which library they refer to, etc.
Now, since most (all?) of the libraries out there are going to assume
the availability of some 'standard library', let's also assume that some
extracted artifact for such library is available and that jextract
always knows how to find it - this is the equivalent of java.base for
the module system, or java.lang for the Java import system. This
addresses the bootstrapping issue.
Now, let's assume I want to extract the headers of a library A. For
simplicity, let's assume that such a library will come with:
* a set of headers H1, H2 ... Hn
* a set of shared or linkable object files L1, L2 … Lm
* a set of required libraries (dependencies) B1, B2, … Bk
where in general n is positive, m is often one or zero, and k is often zero.
Each object file Li is typically backed by one or more source files Cij,
but such source files, with any internal APIs they may contain, are
usually disregarded by the user of the library A, and we will ignore
them in the remainder of this email.
Note that, in general, library A can depend on symbols defined in other
libraries - A will include all of them via its headers H1 ... Hn. That
said, this doesn't mean that the library artifact for A contains all
such dependent symbols. In fact such symbols won't be defined in any of
L1 ... Lm, and the linker is responsible for finding these
symbols (in B1 ... Bk), and linking the generated library artifact to
them (or, in case of dynamic libraries, simply verifying that they exist
- as real linking happens at runtime).
This simple example is already enough to illustrate a crucial problem:
how does jextract distinguish between _external_ dependencies and
_internal_ ones?
From a C perspective, the difference between an internal and an
external dependency on some symbol S can be summarized by this question:
where is S defined?
* S is defined by L1 ... Lm -> internal dependency
* S is not defined by L1 ... Lm -> external dependency
This immediately shows why it is so hard for jextract to classify
dependencies: in the status quo, jextract is a tool that mostly operates
on headers (e.g. H1 ... Hn), but not on object files (e.g. L1 ... Lm).
This means that, while we can see all the symbols a library is going to
require (by recursively following headers), we have no idea (at
extraction time) as to whether these symbols should be co-generated in
the same artifact.
To solve this conundrum there are a few possible ways out:
* use some classification heuristic - (this scheme was first proposed by
Sundar) all headers in the same folder F belong to the same logical unit
(e.g. library A) - so, we can partition the set H1 ... Hn in two sets,
based on whether Hi is in F or not - let's call these sets Hf(H1 ... Hn)
and H!f(H1 ... Hn) = H1 ... Hn - Hf(H1 ... Hn). It follows that, if a
symbol is defined in a header in Hf(H1 ... Hn) then the dependency is
treated as internal, otherwise it is treated as external. Given the
simplicity of the scheme, this might well be the fallback behavior in
case no extra option is supplied to jextract.
* add back some information about the source files Cij - now, adding
object file treatment to jextract would be brutal, but we can do better:
after all, we are trying to generate/extract interfaces for library A -
so we could piggyback on the dynamic library for A itself! This library
will contain all the symbols defined by the source files Cij available
when the library was built. Again, this induces a partitioning on our
symbols: if a symbol S occurs in the library for A, then S is an
internal dependency, otherwise it's treated as external. Now, this has
some limitations, as (at least w/o debugging info), a library knows
nothing about structs. So we would have to assume that a struct is
treated as internal/external based on the symbol that uses the struct.
In case two symbols, an internal one and an external one, refer to the
same struct, then the struct is promoted to external (e.g. external
dependencies always trump internal ones).
* in the dual of the scheme suggested above, we would classify a symbol
S as internal if and only if it is not defined in any of B1 ... Bk (or
the jextracted artifacts for B1 ... Bk, assuming we can point jextract
at them!)
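
As a rough illustration of the first (folder-based) heuristic, here is a
minimal Java sketch. The class and method names are hypothetical and not
part of jextract; it also assumes that "in folder F" means "directly in
F", which the scheme above does not spell out.

```java
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of jextract: splits headers into Hf and H!f.
final class FolderHeuristic {
    /**
     * Partitions the headers by whether they sit directly in libraryFolder.
     * Key true holds Hf (treated as internal), key false holds H!f (external).
     */
    static Map<Boolean, List<Path>> partition(List<Path> headers, Path libraryFolder) {
        // Assumption: subfolders of F do not count as part of F.
        Path folder = libraryFolder.toAbsolutePath().normalize();
        return headers.stream().collect(Collectors.partitioningBy(
                h -> folder.equals(h.toAbsolutePath().normalize().getParent())));
    }
}
```

A symbol declared in a header under key true would then be treated as
internal, and any other symbol as external.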
With these definitions, we now should have a much clearer idea of which
symbols should end up in the jextract artifact for A.
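
The second scheme, including the struct promotion rule, could be sketched
as follows. This is only an illustration: the Decl record and the method
names are made up for this email, and the set of exported symbols is
assumed to have been obtained separately (for instance by listing the
symbols of the dynamic library with a tool such as nm).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not jextract code.
final class SymbolClassifier {
    enum Kind { INTERNAL, EXTERNAL }

    /** A function or variable seen in the headers, plus the structs it mentions. */
    record Decl(String name, Set<String> structs) {}

    /** S is internal if and only if the dynamic library for A exports it. */
    static Kind classify(String symbol, Set<String> exportedByLibrary) {
        return exportedByLibrary.contains(symbol) ? Kind.INTERNAL : Kind.EXTERNAL;
    }

    /** Structs inherit the kind of the symbols using them; EXTERNAL wins on conflict. */
    static Map<String, Kind> classifyStructs(List<Decl> decls, Set<String> exportedByLibrary) {
        Map<String, Kind> result = new HashMap<>();
        for (Decl d : decls) {
            Kind k = classify(d.name(), exportedByLibrary);
            for (String s : d.structs()) {
                // Promote to EXTERNAL as soon as one external user is seen.
                result.merge(s, k, (a, b) ->
                        a == Kind.EXTERNAL || b == Kind.EXTERNAL ? Kind.EXTERNAL : Kind.INTERNAL);
            }
        }
        return result;
    }
}
```

The third scheme is the mirror image: classify would test membership in
the symbols of B1 ... Bk and swap the two outcomes.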
These definitions are chosen so that the behavior of jextract is likely
to mimic what happens when the native library is built using the C/C++
toolchain - e.g. the set of symbols available in the generated artifact
is the same as the set of symbols defined in the native library.
This property is what allows us to combine multiple jextract runs
together. Let's say that I have a library A2 and A2 depends on A. If I
had to build A2 using the native toolchain, I would probably use some -l
linker flag to point the linker at the required A library, and also
pass the set of include files associated with A.
In the jextract world, we already have an artifact modeling A (see
above). So, if we want to extract A2, what we need is simply a way to
point jextract at such artifact. If the artifact is sufficiently
self-described (see assumption above), it should be possible for
jextract to handle this case just fine. Again, the symbols ending up in
A2's artifact are only the symbols occurring in A2's library. In other
words, the artifacts we get out of jextract can be _composed_ together,
which is a crucial property we're after here.
Another property that we should strive to guarantee is that if an
external dependency is found (according to any of the schemes above) and
jextract cannot resolve such dependency (e.g. because the user forgot to
pass extra options to specify dependent libraries), _an error should be
generated at extraction time_. This is similar to the error you get in C
when you try to link an executable and a definition is missing. We don't
want to wait until runtime for such errors to pop up.
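
To make the fail-fast idea concrete, here is a small Java sketch of such
a check. The Artifact record is a hypothetical stand-in for whatever
self-description the generated artifacts end up carrying; nothing here is
existing jextract API.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not jextract code.
final class DependencyCheck {
    /** Assumed view of an artifact: the library it models and the symbols it defines. */
    record Artifact(String library, Set<String> symbols) {}

    /** Fails at extraction time if an external symbol is defined by none of the dependencies. */
    static void checkResolved(Set<String> externalSymbols, List<Artifact> dependencies) {
        Set<String> missing = new TreeSet<>(externalSymbols);
        for (Artifact dep : dependencies) {
            missing.removeAll(dep.symbols());
        }
        if (!missing.isEmpty()) {
            throw new IllegalStateException("unresolved external symbols: " + missing);
        }
    }
}
```

When extracting A2, the dependencies list would contain the artifact for
A (plus the standard library one), much like the -l flags passed to the
native linker.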
One final consideration on layout resolution context: if we adopt the
approach outlined in this email, it follows that the 'right' resolution
scope is neither a class, nor a package, nor a module (the current
implementation is a bit ambivalent about this). In fact, the right
resolution scope for library A2 is A2 itself - which means when binding
A2 we should be able to refer to all symbols defined by A2 itself, but
also to all the symbols defined in A (and, also, those defined in the
standard library). This seems to hint at the fact that we need some way
to express/reify library dependencies in NativeHeader interfaces, so
that these dependencies can be inspected by the binder, and so that a
correct resolution scope can be constructed.
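
A possible shape for such a library-level resolution scope, as a sketch
only: the Library record below is a hypothetical stand-in for reified
dependency information and says nothing about how NativeHeader would
actually express it. It also assumes that only direct dependencies are
visible, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch, not jextract or binder code.
final class ResolutionScope {
    /** Assumed reified form of a library: its symbols and its declared dependencies. */
    record Library(String name, Set<String> symbols, List<Library> dependencies) {}

    /** Scope for binding root: root itself, then its dependencies, then the standard library. */
    static List<Library> scopeOf(Library root, Library stdlib) {
        List<Library> scope = new ArrayList<>();
        scope.add(root);
        scope.addAll(root.dependencies());
        scope.add(stdlib);
        return scope;
    }

    /** Name of the first library in scope defining the symbol, if any. */
    static Optional<String> definingLibrary(String symbol, Library root, Library stdlib) {
        return scopeOf(root, stdlib).stream()
                .filter(lib -> lib.symbols().contains(symbol))
                .map(Library::name)
                .findFirst();
    }
}
```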
Maurizio
[1] - https://en.wikipedia.org/wiki/Blind_men_and_an_elephant