a model for jextract runs

Thu Aug 30 15:35:36 UTC 2018

Hi,
here are some thoughts on how I envision jextract to work - I think it 
leads to some very desirable properties, but this is also my part of the 
elephant [1] - others might be equally valid and/or applicable. This is 
part of an ongoing effort to make jextract more usable, by avoiding 
common user mistakes, a property that we have internally dubbed 
'battery-included'. You will soon start to see some code changes 
related to this - so this email sets the tone for the kind of 
jextract we are moving towards: as the set of libraries accepted by 
jextract grows, it is time to lay some more solid foundations beneath it.

First, I believe the right angle to look at what jextract does is that 
of dynamic libraries. E.g. think of jextract as a tool that takes a 
bunch of logically related header files and generates an artifact (*) 
which models the dynamic library (e.g. .so/.dylib file) that would be 
obtained by building the native code for real.

(*) In this email I will not dwell on how jextract should represent its 
output artifact; should it be a jar file with some special manifest? A 
jmod? This is largely an orthogonal question and I'm not going to make 
any assumptions here on the shape of the generated artifact. But I'm 
going to assume that the generated artifacts will provide some basic 
info about how they have been obtained, which library they refer to, etc.

Now, since most (all?) of the libraries out there are going to assume 
the availability of some 'standard library', let's also assume that some 
extracted artifact for such library is available and that jextract 
always knows how to find it - this is the equivalent of java.base for 
the module system, or java.lang for the Java import system. This 
addresses the bootstrapping issue.

Now, let's assume I want to extract a header for a library A. For 
simplicity, let's assume that such a library will come with:

* a set of headers H1, H2 ... Hn
* a set of shared or linkable object files L1, L2 … Lm
* a set of required libraries (dependencies) B1, B2, … Bk

where in general n is positive, m is often one or zero, and k is often zero.

Each object file Li is typically backed by one or more source files Cij, 
but such source files, with any internal APIs they may contain, are 
usually disregarded by the user of the library A, and we will ignore 
them in the remainder of this email.

Note that, in general, library A can depend on symbols defined in other 
libraries - A will include all of them via its headers H1 ... Hn. That 
said, this doesn't mean that the library artifact for A contains all 
such dependent symbols. Such symbols won't be defined in any of 
L1 ... Lm, and the linker is responsible for finding these 
symbols (in B1 ... Bk), and linking the generated library artifact to 
them (or, in case of dynamic libraries, simply verifying that they exist 
- as real linking happens at runtime).

This simple example is already enough to illustrate a crucial problem: 
how does jextract distinguish between _external_ dependencies and 
_internal_ ones?

From a C perspective, the difference between an internal and an 
external dependency on some symbol S can be summarized by this question: 
where is S defined?

* S is defined by L1 ... Lm -> internal dependency
* S is not defined by L1 ... Lm -> external dependency

This immediately shows why it is so hard for jextract to classify 
dependencies: in the status quo, jextract is a tool that mostly operates 
on headers (e.g. H1 ... Hn), but not on object files (e.g. L1 ... Lm). 
This means that, while we can see all the symbols a library is going to 
require (by recursively following headers), we have no idea (at 
extraction time) as to whether these symbols should be co-generated in 
the same artifact.

To solve this conundrum there are a few possible ways out:

* use some classification heuristic - (this scheme was first proposed by 
Sundar) all headers in the same folder F belong to the same logical unit 
(e.g. library A) - so, we can partition the set H1 ... Hn in two sets, 
based on whether Hi is in F or not - let's call these sets Hf(H1 ... Hn) 
and H!f(H1 ... Hn) = H1 ... Hn - Hf(H1 ... Hn). It follows that, if a 
symbol is defined in a header in Hf(H1 ... Hn) then the dependency is 
treated as internal, otherwise it is treated as external. Given the 
simplicity of the scheme, this might well be the fallback behavior in 
case no extra option is supplied to jextract.

* add back some information about the source files Cij - now, adding 
object file treatment to jextract would be brutal, but we can do better: 
after all, we are trying to generate/extract interfaces for library A - 
so we could piggy back on the dynamic library for A itself! This library 
will contain all the symbols defined by the source files Cij available 
when the library was built. Again, this induces a partitioning on our 
symbols: if a symbol S occurs in the library for A, then S is an 
internal dependency, otherwise it's treated as external. Now, this has 
some limitations, as (at least w/o debugging info), a library knows 
nothing about structs. So we would have to assume that a struct is 
treated as internal/external based on the symbol that uses the struct. 
In case two symbols, an internal one and an external one, refer to the 
same struct, then the struct is promoted to external (i.e. external 
dependencies always trump internal ones).

* in the dual of the scheme suggested above, we would classify a symbol 
S as internal if and only if it is not defined in any of B1 ... Bk (or 
the jextracted artifacts for B1 ... Bk, assuming we can point jextract 
at them!)
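
To make the first scheme (the folder heuristic) concrete, here is a 
minimal sketch in Java. It is not taken from jextract: the class and 
method names (FolderHeuristic, isInternal, partition) are hypothetical, 
and it assumes that "in the same folder F" means that F is the direct 
parent directory of the header, so headers in subfolders of F count as 
external.

```java
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical sketch of the folder heuristic; not jextract's actual code. */
final class FolderHeuristic {

    /**
     * A header is treated as internal if its parent directory is exactly
     * the library folder F (assumption: subfolders of F do not count).
     */
    static boolean isInternal(Path libraryFolder, Path header) {
        Path folder = libraryFolder.toAbsolutePath().normalize();
        Path parent = header.toAbsolutePath().normalize().getParent();
        return folder.equals(parent);
    }

    /**
     * Partitions the headers H1 ... Hn into Hf (key true, internal)
     * and H!f (key false, external). Both keys are always present.
     */
    static Map<Boolean, List<Path>> partition(Path libraryFolder, List<Path> headers) {
        return headers.stream()
                .collect(Collectors.partitioningBy(h -> isInternal(libraryFolder, h)));
    }
}
```

With F = /usr/include/libA (a made-up path), a header such as 
/usr/include/libA/a.h would land in the internal set, while 
/usr/include/stdio.h would land in the external one.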

With these definitions, we now should have a much clearer idea of which 
symbols should end up in the jextract artifact for A.
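
The second scheme, including the struct promotion rule, could be 
sketched as follows. This is only an illustration under stated 
assumptions: the set of symbols exported by A's dynamic library is 
taken as given (how it is obtained is left open here), the map from 
function names to the structs they mention is assumed to come from the 
header parse, and all names (SymbolClassifier, Kind, classifyStructs) 
are hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch of library-based classification; not jextract's actual code. */
final class SymbolClassifier {

    enum Kind { INTERNAL, EXTERNAL }

    /** A symbol is internal if and only if it occurs in the library for A. */
    static Kind classifySymbol(Set<String> exportedByLibrary, String symbol) {
        return exportedByLibrary.contains(symbol) ? Kind.INTERNAL : Kind.EXTERNAL;
    }

    /**
     * Classifies each struct according to the symbols that use it.
     * If both an internal and an external symbol refer to the same
     * struct, the struct is promoted to external.
     *
     * @param structsUsedBy function name -> names of the structs it mentions
     */
    static Map<String, Kind> classifyStructs(Set<String> exportedByLibrary,
                                             Map<String, Set<String>> structsUsedBy) {
        Map<String, Kind> result = new TreeMap<>();
        for (Map.Entry<String, Set<String>> use : structsUsedBy.entrySet()) {
            Kind userKind = classifySymbol(exportedByLibrary, use.getKey());
            for (String struct : use.getValue()) {
                // External always trumps internal, regardless of visiting order.
                result.merge(struct, userKind,
                        (a, b) -> (a == Kind.EXTERNAL || b == Kind.EXTERNAL)
                                ? Kind.EXTERNAL : Kind.INTERNAL);
            }
        }
        return result;
    }
}
```

For example, if a (made-up) function a_open exported by A uses both 
struct a_handle and struct timeval, and gettimeofday (not exported by 
A) also uses struct timeval, then a_handle is classified as internal 
and timeval as external.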

These definitions are chosen so that the behavior of jextract is likely 
to mimic what happens when the native library is built using the C/C++ 
toolchain - e.g. the set of symbols available in the generated artifact 
is the same as the set of symbols defined in the native library.

This property is what allows us to combine multiple jextract runs 
together. Let's say that I have a library A2 and A2 depends on A. If I 
had to build A2 using the native toolchain, I would probably use some -l 
linker flag to point the linker at the required A library, and also 
pass the set of include files associated with A.

In the jextract world, we already have an artifact modeling A (see 
above). So, if we want to extract A2, what we need is simply a way to 
point jextract at such artifact. If the artifact is sufficiently 
self-described (see assumption above), it should be possible for 
jextract to handle this case just fine. Again, the symbols ending up in 
A2's artifact are only the symbols occurring in A2's library. In other 
words, the artifacts we get out of jextract can be _composed_ together, 
which is a crucial property we're after here.

Another property that we should strive to guarantee is that if an 
external dependency is found (according to any of the schemes above) and 
jextract cannot resolve such dependency (e.g. because the user forgot to 
pass extra options to specify dependent libraries), _an error should be 
generated at extraction time_. This is similar to the error you get in C 
when you try to link an executable and a definition is missing. We don't 
want to wait until runtime for such errors to pop up.
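
As an illustration of this fail-fast behavior, here is a minimal Java 
sketch. It assumes that each previously extracted artifact can report 
the set of symbols it defines (the "self-described" assumption made 
earlier); the representation as a map from artifact name to symbol set, 
and the names ExtractionTimeCheck, resolve and 
UnresolvedSymbolsException, are hypothetical and not part of jextract.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch of an extraction-time dependency check; not jextract's actual code. */
final class ExtractionTimeCheck {

    /** Raised when some external symbols are not defined by any known artifact. */
    static final class UnresolvedSymbolsException extends RuntimeException {
        final Set<String> missing;

        UnresolvedSymbolsException(Set<String> missing) {
            super("unresolved external symbols: " + missing);
            this.missing = missing;
        }
    }

    /**
     * Maps every external symbol to the name of an artifact that defines it.
     * Fails at extraction time, listing all missing symbols at once, much
     * like a native linker reporting undefined references.
     *
     * @param dependencyArtifacts artifact name -> symbols that artifact defines
     */
    static Map<String, String> resolve(Set<String> externalSymbols,
                                       Map<String, Set<String>> dependencyArtifacts) {
        Map<String, String> resolved = new TreeMap<>();
        Set<String> missing = new TreeSet<>();
        for (String symbol : externalSymbols) {
            String owner = null;
            for (Map.Entry<String, Set<String>> artifact : dependencyArtifacts.entrySet()) {
                if (artifact.getValue().contains(symbol)) {
                    owner = artifact.getKey();
                    break; // first artifact defining the symbol wins
                }
            }
            if (owner == null) {
                missing.add(symbol);
            } else {
                resolved.put(symbol, owner);
            }
        }
        if (!missing.isEmpty()) {
            throw new UnresolvedSymbolsException(missing);
        }
        return resolved;
    }
}
```

Collecting all missing symbols before failing is a design choice of 
this sketch: the user sees every unresolved dependency in one run, 
rather than fixing them one at a time.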

One final consideration on layout resolution context: if we adopt the 
approach outlined in this email, it follows that the 'right' resolution 
scope is neither a class, nor a package, nor a module (the current 
implementation is a bit ambivalent about this). In fact, the right 
resolution scope for library A2 is A2 itself - which means when binding 
A2 we should be able to refer to all symbols defined by A2 itself, but 
also to all the symbols defined in A (and, also, those defined in the 
standard library). This seems to hint at the fact that we need some way 
to express/reify library dependencies in NativeHeader interfaces, so 
that these dependencies can be inspected by the binder, and so that a 
correct resolution scope can be constructed.
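
A rough sketch of how such a resolution scope could be built from 
reified library dependencies is shown here in Java. Everything in it is 
an assumption made for illustration: the Library type, its fields and 
the scopeOf method are hypothetical and do not reflect the NativeHeader 
interfaces or the binder. It also assumes that only direct dependencies 
(plus the standard library) are visible; whether dependencies of 
dependencies should be visible as well is left open.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a per-library resolution scope; not jextract's actual code. */
final class ResolutionScope {

    /** A library artifact with the symbols it defines and its declared dependencies. */
    static final class Library {
        final String name;
        final Set<String> symbols;
        final List<Library> dependencies;

        Library(String name, Set<String> symbols, List<Library> dependencies) {
            this.name = name;
            this.symbols = symbols;
            this.dependencies = dependencies;
        }
    }

    /**
     * Symbols visible when binding the given library: its own symbols,
     * those of its direct dependencies, and those of the standard library.
     */
    static Set<String> scopeOf(Library library, Library standardLibrary) {
        Set<String> scope = new LinkedHashSet<>(library.symbols);
        for (Library dependency : library.dependencies) {
            scope.addAll(dependency.symbols);
        }
        scope.addAll(standardLibrary.symbols);
        return scope;
    }
}
```

With A2 declared as depending on A, the scope of A2 contains the 
symbols of A2, A and the standard library, while the scope of A does 
not contain anything defined by A2.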

Maurizio

[1] - https://en.wikipedia.org/wiki/Blind_men_and_an_elephant