[foreign] RFC: Additional jextract filtering

Tue Mar 12 14:48:39 UTC 2019

Maurizio Cimadamore wrote on 2019-03-12 00:52:
> On 11/03/2019 22:30, Jorn Vernee wrote:
>> Maurizio Cimadamore wrote on 2019-03-11 22:33:
>>> On 11/03/2019 15:44, Jorn Vernee wrote:
>>>> Well, I tried changing the defaults to using the REQUIRED preset for 
>>>> structs, enums and typedefs, and this is making a bunch (11) of the 
>>>> tests fail, including TestJextractFFI, which is not a good sign 
>>>> imho. This seems to be happening because the root set computation 
>>>> cannot deal with the pimpl/opaque pointer idiom.
>>> I'd like to understand more about this failure mode - example please?
>> 
>> I have to look into this more thoroughly...
>> 
>> Basically, Index.h defines a type like this: `typedef struct 
>> CXTranslationUnitImpl *CXTranslationUnit;` but there is no definition 
>> of CXTranslationUnitImpl anywhere. Something like that can be used to 
>> encapsulate an implementation. Currently this type is not being 
>> included in the dependency set, but to be fair that is probably a 
>> fault of the implementation. There are other failing tests because the 
>> test headers in a lot of cases declare only a struct or union, but no 
>> function that uses them, so they get filtered out.
>> 
>> Anyway, the point I failed to make is that the root set is still 
>> just a 'guess' at best, so I don't believe it should be the default.
> 
> Well, that might well be the case; my opinion is that the _whole
> business of inferring an API from a set of headers_ is a guess at best,
> with (suboptimal) choices at every step of the way - how to filter, how
> to name, how to map to carriers, etc. Generally speaking, jextract 
> picks a point in this complex space and hopes that 'it sticks'; but I
> wouldn't read the standard behavior of jextract we happen to have
> right now as fundamentally 'more correct' than the one you were
> looking to replace it with. Both are, in some sense or another, based
> on heuristics.

That's fair.

> So, if we happen to stumble on a heuristic that gives us a more
> compact Java API in most of the non-complicated cases (where by this I
> mostly mean: something other than a system library), I think I'd
> prefer that to what we do now any day of the week.

I think it's also important to keep the heuristic simple, which 
hopefully also makes it easy and straightforward to manipulate. What I 
like about the current approach is that it's so straightforward: you 
give jextract a header file and get back a complete binding. If you 
want to make it smaller, you can add filters.

I think any sort of automatic (guessed) filtering would also need to be 
something the user can turn off.

>> 
>>>> But, let's go back to the underlying goal; we want to create 
>>>> jextract output with the least 'junk' possible. I'd say this is not 
>>>> the job of jextract, but the job of the library maintainer. The 
>>>> header file is the interface for using the library, so it should 
>>>> contain things that are all more or less required to use the 
>>>> library. I don't think we will have much success trying to 
>>>> 'outsmart' the writer of the header file. After all, jextract does 
>>>> its filtering automatically, and the header file is carefully 
>>>> hand-crafted.
>>> 
>>> In principle, I agree - and that's why, longer term, we'll have APIs
>>> that will let you plug in custom filters too. At the same time I think
>>> that the out-of-the-box jextract experience should be good enough in
>>> most 'simple' cases - and it feels we're not exactly there right now.
>> 
>> The current problems seem to mainly come from things that jextract 
>> cannot handle, like intrinsics or flexible arrays, and from not 
>> having the ability to filter them with the current filter options.
> Filtering things jextract cannot handle is one thing; the fact that
> jextract generates very big artifacts, often repeating lots of system
> libraries that the user of the extracted library will never need to
> refer to, is also a problem, and one that I'm also interested in
> finding a solution for.
>> 
>>> Maybe we can't make the set as small as your root approach wants it
>>> to be, but between where we are now (pull in all transitive closure of
>>> headers) and the minimum possible self-contained subset, I have to
>>> believe that there *has* to be some intermediate point that we can
>>> get to.
>> 
>> We could also use the declarations from the explicitly passed headers 
>> as a root set, and then only include things from other headers if 
>> they're needed by the root headers.
> 
> This works and it's not far from where jextract was about a year ago
> - but there are cases (Python is one of them) where you have a main
> header which includes a myriad of other headers in the same folder
> (clang does this too, albeit to a much lesser extent). That's why the
> path-based filtering is so effective, at least in these cases. But yes,
> in principle you could start from there and collect the roots.
> 
> That said, my favorite approach has always been another: start from
> the _library_ and work out which header symbols need to be included.
> To me this is almost always what you need. There are, sure, some gray
> areas - for macros, enums and typedefs, and for those some sensible
> heuristics will need to be applied. But I think the set of symbols in
> a library/ies can be trusted in a way that a bunch of symbols in a
> bunch of headers cannot.

Structs can also fall into that grey area, e.g. if we have some generic 
data structure that uses void* as a type (to be generic) instead of 
'linking' to the explicit struct type.
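
A minimal sketch of that situation, with made-up names (`point`, `node`, 
`list_push` and `list_head` are hypothetical and not taken from any real 
library). The payload struct is declared in the header, but no function 
signature ever mentions it:

```c
#include <stddef.h>

/* Hypothetical payload type: declared in the header, but never named
 * in any function signature below. */
struct point {
    int x;
    int y;
};

/* Generic singly linked list node; the payload is stored as void*. */
struct node {
    void *data;
    struct node *next;
};

/* Prepends 'n' to the list starting at 'head' and returns the new head. */
struct node *list_push(struct node *head, struct node *n, void *data) {
    n->data = data;
    n->next = head;
    return n;
}

/* Returns the payload of the first node, or NULL for an empty list. */
void *list_head(const struct node *head) {
    return head ? head->data : NULL;
}
```

A dependency walk that starts from list_push and list_head reaches 
struct node, but never struct point, even though a caller needs it to 
make sense of the void* it gets back.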

Library symbols can also have junk. We can see this for instance in the 
Windows API, which defines a lot of functions twice, for ASCII and 
Unicode (with an A or W suffix), and then defines a macro as an alias 
for one of the two based on some other pre-processor setting. In that 
case, the library can't tell us which one should be included, only the 
header file can.
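
A rough sketch of that idiom, using invented names (`get_name_A`, 
`get_name_W`, `get_name` and `USE_WIDE` are placeholders; as far as I 
know the real Windows headers key this on the `UNICODE` macro):

```c
#include <string.h>

/* Hypothetical library exports: both variants exist as symbols. */
const char *get_name_A(void) { return "ascii"; }
const char *get_name_W(void) { return "wide"; }

/* Header-only alias: it is resolved by the preprocessor and never
 * shows up in the library's symbol table. */
#ifdef USE_WIDE
#define get_name get_name_W
#else
#define get_name get_name_A
#endif

/* Client code only ever spells the unsuffixed name. */
int name_is_ascii(void) {
    return strcmp(get_name(), "ascii") == 0;
}
```

Listing the library's symbols would show get_name_A and get_name_W, but 
not which of the two the unsuffixed name is supposed to mean.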

> Think about it: from a user perspective, if you want to use e.g.
> clang, which functions do you expect to be available? Probably the
> ones defined here:
> 
> https://clang.llvm.org/doxygen/group__CINDEX.html
> 
> And, if you list all the symbols in the libclang.so/dll you get a list
> of symbols that matches that (modulo structs, which are obviously not
> in the shared lib, unless debug info is also provided) to a great
> degree of fidelity.

Well, you need more than just functions and global variables to use a 
library. In practice the header file gives what you need. We just need 
to find a good heuristic for filtering out the noise that comes with it. 
(I guess that problem also exists in the C/C++ world, maybe it's 
interesting to look at some solutions there?)

Starting from the list of library symbols seems like an interesting 
idea to minimize the output, but imho the header file is the more 
trustworthy source to draw information from.

But I think what we can definitely agree on is that jextract 
transitively including a bunch of system headers in the output is 
undesirable.

>> 
>> Of course, when a library has many headers, having to pass each one of 
>> them is tedious. But I'd say that falls outside of 'simple' case 
>> territory.
>> 
>>> So, this is really two things:
>>> 
>>> 1) finding a good set of heuristics that work in most cases (and I
>>> wouldn't frown on using path-based filtering - I had a similar
>>> reaction, but this has proven to work better than I hoped!)
>>> 
>>> 2) providing a good bunch of low level filtering (include/exclude)
>>> tools, actionable from the command line, when (1) goes wrong.
>> 
>> I've really only tried to solve 2. with this proposal, but tried to 
>> keep the path open for 1. as well. I think 1. is a much more complex 
>> problem to solve, so it seems good to first add a set of fine-grained 
>> filtering options as an escape hatch, and then play around with 
>> different filtering strategies, hopefully easier to implement because 
>> of the work done for 2.
>> 
>> I think for now we can shelve the filter presets (or other automatic 
>> filtering strategies) and focus on adding fine-grained filtering 
>> options?
> 
> That would make me a bit sad because I think the real value of the
> work you did is that you started to tackle some of the problems in
> (1), which we wanted to tackle for a long time. I'm also skeptical
> that (2) can be tackled in isolation: to understand which filtering
> knob to expose, we need to have a clear model of how jextract is going
> to work in the 'default' case.
> 
> 
> I really think that your dependency analysis, and a lot of the choices
> you made around macros and enums look promising, and I'd really like
> to see them investigated a bit more; ideally we should play around
> with multiple strategies, compare their effectiveness and I guess once
> we try 2 or 3, some patterns will start to emerge (for instance we
> know already about the path-clustering of logically related headers,
> noticed by Sundar); at which point we will be able to formulate some
> hopefully robust heuristics, with a hopefully equally robust set of
> 'escape hatches' for when things go wrong.

I don't feel so strongly about the dependency analysis. I think it falls 
short in too many cases, especially when we already have a hand-crafted 
set of dependencies, i.e. header files. I think the dependency analysis 
should really only be used to emit warnings or errors when things that 
are needed to generate a well-formed artifact are missing, and let the 
user decide how to deal with the problem. Though, before we have a good 
mechanism for including dependencies for jextract runs, it seems fine to 
automatically include the dependencies of the root set as well.
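
As a sketch of the 'warn, don't decide' idea (this is not how jextract 
is implemented; `struct decl`, `check_deps` and the flat index-based 
dependency list are all assumptions made up for illustration):

```c
#include <stdio.h>

#define MAX_DEPS 4

/* Hypothetical model of a header declaration and what it refers to. */
struct decl {
    const char *name;
    int included;          /* 1 if it survived filtering */
    int ndeps;             /* number of valid entries in deps */
    int deps[MAX_DEPS];    /* indices into the same decl array */
};

/* Prints a warning for every included declaration that refers to a
 * filtered-out one, and returns the number of warnings. */
int check_deps(const struct decl *decls, int n) {
    int warnings = 0;
    for (int i = 0; i < n; i++) {
        if (!decls[i].included) continue;
        for (int j = 0; j < decls[i].ndeps; j++) {
            const struct decl *dep = &decls[decls[i].deps[j]];
            if (!dep->included) {
                fprintf(stderr,
                        "warning: %s needs %s, which is filtered out\n",
                        decls[i].name, dep->name);
                warnings++;
            }
        }
    }
    return warnings;
}
```

The check only reports what is missing; whether to pull the dependency 
in or drop the dependent declaration is left to the user.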

The path-based filtering to determine a root set seems interesting to 
explore. It should be an overridable default imho. I'll continue 
exploring that.

Jorn

> 
> Maurizio
> 
> 
>> 
>> Jorn
>> 
>>> Maurizio
>>> 
>>>> 
>>>> On the other hand, not everything makes sense to use from a Panama 
>>>> perspective, so we still need some escape hatch to filter out 
>>>> stuff that we can't use or that breaks the binder. But we'd like to 
>>>> go about that in a disciplined way, and make sure we don't filter out 
>>>> things that are required by other things, so we use a dependency set.
>>>> 
>>>> Thoughts?
>>>> 
>>>> Jorn
>>>> 
>>>> Maurizio Cimadamore wrote on 2019-03-11 16:12:
>>>>> On 11/03/2019 13:45, Jorn Vernee wrote:
>>>>>> I can separate the patch into two parts: the filter 
>>>>>> refactor + root set computation, leaving the option changes out 
>>>>>> of it. But those 2 alone do not affect the filtering, since the 
>>>>>> root set is only used when filtering non-symbol/macro elements.
>>>>> 
>>>>> I guess then what I'm suggesting is to automatically filter out
>>>>> elements not in the root set, and see how that works out.
>>>>> 
>>>>> Maurizio