jextract on large Windows header files

Mon Feb 1 11:27:44 UTC 2021

Hi Duncan,
thanks for experimenting further - some comments inline

On Sun, 2021-01-31 at 15:54 +0000, Duncan Gittins wrote:
> I've downloaded jdk/panama-foreign and experimented with some changes to
> jextract in order to help with the footprint of jars generated from the
> large Windows header files that I mentioned in the thread: Feedback /
> query on jextract for Windows 10.
> 
> 1) Added command line args: --symbol {symbol-to-include} and
> --symbols {file-of-symbols-to-include} which eliminates all top level
> symbols not in the parameter list or symbol file(s). This avoids the
> need to generate huge amounts of unnecessary methods.
> 
> For example in my application, "jextract shlobj_core.h" gives ~ 11MB jar
> (~ 49k top level symbols), adding -filter for required small group of
> header files cuts this to 880k (5k symbols), and with the --symbols
> filter that is down to a 10K jar (9 symbols) which is much more
> manageable size to use. Also the lack of lazy load in --source output is
> no longer a problem because all static field method handles are
> generally used.

We used to have something similar in the previous version of jextract.
We're a bit on the fence as to whether to re-add it. The rationale
is that any filtering mechanism is imperfect, and might actually
introduce extraction bugs - think of a function A which depends on a
struct S - if you extract A then you have to extract S (and everything
S depends on). Figuring out these dependencies is not trivial - in fact
it's not even possible in the general case; for instance in a library
like OpenGL you might correctly import a set of functions you want to
use - but what about the macro constants? OpenGL is quite unusable w/o
its set of constants - which ones should be imported? The header file
gives us no clue as to which function depends on which constant.
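
To make the dependency problem concrete, here is a minimal sketch of the "extract A, therefore extract S and everything S depends on" step. It assumes a hypothetical map from each symbol name to the names it references directly; this is not a jextract data structure, and the symbol names are placeholders.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model: symbol name -> names it references directly.
// Not part of jextract.
final class DependencyClosure {

    /** Returns the requested symbols plus everything reachable from them. */
    static Set<String> closure(Map<String, List<String>> deps, Set<String> requested) {
        Set<String> result = new LinkedHashSet<>();
        Deque<String> work = new ArrayDeque<>(requested);
        while (!work.isEmpty()) {
            String sym = work.pop();
            if (result.add(sym)) {
                // First time we see this symbol: queue its direct dependencies.
                work.addAll(deps.getOrDefault(sym, List.of()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = Map.of(
            "A", List.of("S"),          // function A takes a struct S
            "S", List.of("T"),          // struct S has a field of type T
            "T", List.of(),
            "SOME_CONSTANT", List.of()  // macro constant: nothing points to it
        );
        System.out.println(closure(deps, Set.of("A"))); // prints [A, S, T]
    }
}
```

Note that `SOME_CONSTANT` is never pulled in: nothing in the declarations links it to A, which is exactly the gap described above for macro constants.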

In addition, what we have noticed is that, with filtering, it is very easy
to start "simple" and to end up with a sprawling web of slightly
inter-related options (include lists, exclude lists, transitive
dependencies, etc.) which generate a lot of complexity while not being,
per se, tremendously general.

For these reasons, we have opted to go for an API-oriented approach,
where, if needed, a client could interact with a jextract API, and add
the required filtering. If you look at the `JextractTool` class, there
are a bunch of static methods which can be used for parsing the C code,
and for generating source/class files from the AST that has been
produced by the parser.

It is possible for a client to insert an extra step in the middle
between parsing and generation, so as to allow for custom filtering
strategies.
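
The shape of such a client can be sketched as follows. This is only an illustration of the parse -> filter -> generate structure: the `Decl` type and the `parse`, `filter` and `generate` methods below are hypothetical stand-ins, not the actual `JextractTool` methods, whose signatures are not shown here.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical stand-ins for a parse -> filter -> generate pipeline.
// None of these types or methods are jextract's real API.
final class FilteringPipeline {

    /** A top-level declaration as a parser might report it (hypothetical). */
    static final class Decl {
        final String kind;   // e.g. "function", "struct", "constant"
        final String name;
        Decl(String kind, String name) { this.kind = kind; this.name = name; }
        @Override public String toString() { return kind + " " + name; }
    }

    /** Stand-in for the parsing step: returns a fixed list of declarations. */
    static List<Decl> parse() {
        return List.of(
            new Decl("function", "SetUserObjectInformationA"),
            new Decl("function", "SetUserObjectInformationW"),
            new Decl("struct", "SomeStruct"),
            new Decl("constant", "SOME_CONSTANT"));
    }

    /** The custom step inserted between parsing and generation. */
    static List<Decl> filter(List<Decl> decls, Predicate<Decl> keep) {
        return decls.stream().filter(keep).collect(Collectors.toList());
    }

    /** Stand-in for the generation step: one output line per declaration. */
    static List<String> generate(List<Decl> decls) {
        return decls.stream()
                    .map(d -> "// binding for " + d)
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<String> wanted = Set.of("SetUserObjectInformationW");
        List<Decl> kept = filter(parse(), d -> wanted.contains(d.name));
        // prints: // binding for function SetUserObjectInformationW
        generate(kept).forEach(System.out::println);
    }
}
```

The point of the design is that the filtering policy lives entirely in the client's predicate, so the tool itself does not need to grow new options.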

That said, the problem of filtering is a common one, and I suspect many
will stumble on similar questions - so perhaps it is worth opening
Pandora's box once more to see if we can agree on a solution that is
general enough, w/o being overly complex.

One thing I'm open to is adding a separate configuration (text) file,
with some ad-hoc syntax, which would allow clients to specify a list of
symbols to import (structs, constants, functions, typedefs, ...). There
would be no magic added by jextract: jextract would simply consult the
lists during extraction and then decide whether to include a given symbol
in the generated AST or not. Jextract might issue warnings in cases where
it detects that certain dependencies are missing (but as stated above,
this analysis could not be deemed "complete" in any sort of way).
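
As a rough sketch of what consulting such a list could look like, assume a made-up file format with one symbol name per line, where blank lines and lines starting with `#` are ignored. Neither the format nor the class below exists in jextract; the dependency map passed to `warnings` is likewise hypothetical, and the check is best-effort only.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical include-list handling. Not a jextract feature.
final class SymbolList {

    /** Parses the list: one name per line, '#' comments and blanks skipped. */
    static Set<String> parse(String text) {
        Set<String> names = new LinkedHashSet<>();
        for (String line : text.split("\\R")) {
            String s = line.trim();
            if (!s.isEmpty() && !s.startsWith("#")) {
                names.add(s);
            }
        }
        return names;
    }

    /** Best-effort check: reports known dependencies absent from the list. */
    static List<String> warnings(Set<String> included,
                                 Map<String, List<String>> knownDeps) {
        List<String> out = new ArrayList<>();
        for (String sym : included) {
            for (String dep : knownDeps.getOrDefault(sym, List.of())) {
                if (!included.contains(dep)) {
                    out.add("warning: " + sym + " depends on " + dep
                            + ", which is not listed");
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> included = parse("# functions\nA\n\nB\n");
        Map<String, List<String>> knownDeps = Map.of("A", List.of("S"));
        // prints: warning: A depends on S, which is not listed
        warnings(included, knownDeps).forEach(System.out::println);
    }
}
```

The list is applied as written: a missing dependency produces a warning rather than being added silently.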

(Another twist on that idea we've considered would be to use C itself
for defining the symbols to import, through a custom header file which
includes all the function symbols to be imported - but while that works
well for functions, I don't think it scales to other things like
structs and constants).

> 
> 2) Changed "jextract --source" so that identical FunctionDescriptor
> declarations are re-used. This makes the compiled source jar a little
> bit smaller than the generated class jar. For example many of the
> Windows functions such as SetUserObjectInformationA /
> SetUserObjectInformationW have the same descriptor, so jextract can emit
> just one FUNC_ declaration and make MH_  / $FUNC() of the second refer
> to the earlier declaration like this:
> 
>      /**  DUPLICATE => SetUserObjectInformationA$FUNC_
>      static final FunctionDescriptor SetUserObjectInformationW$FUNC_ =
>          FunctionDescriptor.of(C_INT,C_POINTER,C_INT,C_POINTER,C_LONG);
>      */
>      static final jdk.incubator.foreign.FunctionDescriptor
>      SetUserObjectInformationW$FUNC() {
>          return SetUserObjectInformationA$FUNC_;
>      }

There's a lot of duplication in the source generation - it's not just
function descriptors, but also struct layouts, etc. In principle we
should de-dup all these, but then another subtle problem arises: you
can't forward reference a static constant. So completely de-duping the
graph of constants requires building a tree of dependencies between the
various constants, doing a topological sort, and then adding constants
starting from the leaves. All this magic happens "for free" when using
classfile generation (and would also happen for free if the language
had some concept of lazy statics).
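
The ordering step can be illustrated with a small depth-first topological sort. This is a sketch under assumptions, not what jextract does: the map from each constant to the constants its initializer references is hypothetical, as are the constant names in `main`.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model: constant name -> constants its initializer refers to.
final class ConstantOrdering {

    /** Returns an order in which every constant follows its dependencies. */
    static List<String> emissionOrder(Map<String, List<String>> deps) {
        List<String> order = new ArrayList<>();
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String name : deps.keySet()) {
            visit(name, deps, done, inProgress, order);
        }
        return order;
    }

    private static void visit(String name, Map<String, List<String>> deps,
                              Set<String> done, Set<String> inProgress,
                              List<String> order) {
        if (done.contains(name)) {
            return;
        }
        if (!inProgress.add(name)) {
            throw new IllegalStateException("cyclic dependency at " + name);
        }
        // Emit the leaves first, then the constants that refer to them.
        for (String dep : deps.getOrDefault(name, List.of())) {
            visit(dep, deps, done, inProgress, order);
        }
        inProgress.remove(name);
        done.add(name);
        order.add(name);
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("foo$FUNC", List.of("Point$LAYOUT", "C_INT"));
        deps.put("Point$LAYOUT", List.of("C_INT"));
        deps.put("C_INT", List.of());
        // prints [C_INT, Point$LAYOUT, foo$FUNC]
        System.out.println(emissionOrder(deps));
    }
}
```

Emitting the static fields in this order means no initializer refers to a field declared later in the file.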

Before tackling this complexity, I'd like to have a better sense of how
far we might be from getting some support for lazy statics - as the
kind of complexity described above to get better source generation
feels like a bit of a dead-end.

Cheers
Maurizio

> 
> If you think either of these will help, I can share the small set of 
> changes I made.
> 
> Duncan