[foreign] RFC: Additional jextract filtering

Mon Mar 11 23:52:41 UTC 2019

On 11/03/2019 22:30, Jorn Vernee wrote:
> Maurizio Cimadamore wrote on 2019-03-11 22:33:
>> On 11/03/2019 15:44, Jorn Vernee wrote:
>>> Well, I tried changing the defaults to using the REQUIRED preset for 
>>> structs, enums and typedefs, and this is making a bunch (11) of the 
>>> tests fail, including TestJextractFFI, which is not a good sign 
>>> imho. This seems to be happening because the root set computation 
>>> cannot deal with the pimpl/opaque pointer idiom.
>> I'd like to understand more about this failure mode - example please?
>
> I have to look into this more thoroughly...
>
> Basically, Index.h defines a type like this: `typedef struct 
> CXTranslationUnitImpl *CXTranslationUnit;` but there is no definition 
> of CXTranslationUnitImpl anywhere. Something like that can be used to 
> encapsulate an implementation. Currently this type is not being 
> included in the dependency set, but to be fair that is probably a 
> fault of the implementation. There are other failing tests because the 
> test headers in a lot of cases declare only a struct or union, but no 
> function that uses them, so they get filtered out.
>
> Anyway, the point I failed to make is that the root set is still 
> just a 'guess' at best, so I don't believe it should be the default.
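
To make the opaque pointer case concrete, here is a minimal sketch of a dependency closure that keeps a type which is referenced but never defined. The class, the method and the map-based model are hypothetical and are not jextract's internal API; only the libclang names come from Index.h.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model for illustration only; not jextract's implementation.
final class RootSetSketch {

    // refs maps a declaration name to the names of the types it refers to.
    // A name that is referenced but has no entry of its own models an
    // incomplete (opaque) type such as CXTranslationUnitImpl.
    static Set<String> closure(Set<String> roots, Map<String, List<String>> refs) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> work = new ArrayDeque<>(roots);
        while (!work.isEmpty()) {
            String name = work.pop();
            if (!seen.add(name)) {
                continue; // already visited
            }
            // No entry means "declared but not defined": keep the name in the
            // set anyway, there is simply nothing further to follow.
            for (String dep : refs.getOrDefault(name, List.of())) {
                work.push(dep);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        Map<String, List<String>> refs = Map.of(
                "clang_parseTranslationUnit", List.of("CXIndex", "CXTranslationUnit"),
                "CXTranslationUnit", List.of("CXTranslationUnitImpl"),
                "CXIndex", List.of());
        System.out.println(closure(Set.of("clang_parseTranslationUnit"), refs));
    }
}
```

The point of the sketch is only that a missing definition does not have to end the walk with an error or drop the typedef: the incomplete struct can be kept as a leaf of the dependency set.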

Well, that might well be the case; my opinion is that the _whole 
business of inferring an API from a set of headers_ is a guess at best, 
with (suboptimal) choices at every step of the way - how to filter, how 
to name, how to map to carriers, etc. Generally speaking, jextract 
picks a point in this complex space and hopes that 'it sticks'; but I 
wouldn't read the standard behavior of jextract we happen to have right 
now as fundamentally 'more correct' than the one you were looking to 
replace it with. Both are, in some sense or another, based on heuristics.

So, if we happen to stumble on a heuristic that gives us a more 
compact Java API in most of the non-complicated cases (where by this I 
mostly mean: something other than a system library), I think I'd prefer 
that to what we do now any day of the week.

>
>>> But, let's go back to the underlying goal; we want to create 
>>> jextract output with the least 'junk' possible. I'd say this is not 
>>> the job of jextract, but the job of the library maintainer. The 
>>> header file is the interface for using the library, so it should 
>>> contain things that are all more or less required to use the 
>>> library. I don't think we will have much success trying to 
>>> 'outsmart' the writer of the header file. After all, jextract does 
>>> its filtering automatically, and the header file is carefully 
>>> hand-crafted.
>>
>> In principle, I agree - and that's why, longer term, we'll have APIs
>> that will let you plug in custom filters too. At the same time I think
>> that the out-of-the-box jextract experience should be good enough in
>> most 'simple' cases - and it feels we're not exactly there right now.
>
> The current problems seem to mainly come from things that jextract 
> cannot handle, like intrinsics or things like flexible arrays, and not 
> having the ability to filter them with the current filter options.

Filtering things jextract cannot handle is one thing; the fact that 
jextract generates very big artifacts, often repeating lots of system 
libraries that the user of the extracted library will never need to 
refer to, is also a problem, and one that I'm also interested in finding 
a solution for.
>
>> Maybe we can't make the set as small as your root approach wants it to
>> be, but from where we are now (pull in all transitive closure of
>> headers) and the minimum possible self-contained subset, I have to
>> believe that there *has* to be some intermediate point that we can get
>> to.
>
> We could also use the declarations from the explicitly passed headers 
> as a root set, and then only include things from other headers if 
> they're needed by the root headers.

This works and it's not far from where jextract was about a year ago - 
but there are cases (Python is one of them) where you have a main header 
which includes a myriad of other headers in the same folder (clang does 
this too, albeit to a much lesser extent). That's why the path-based 
filtering is so effective, at least in these cases. But yes, in principle 
you could start from there and collect the roots.
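
As a rough illustration of the path-based idea, the sketch below treats a declaration as a root when its header lives in the same folder as one of the explicitly passed headers. The class and method names are made up for this example, the paths are only illustrative, and matching on the exact parent directory (rather than, say, subfolders) is an assumption.

```java
import java.nio.file.Path;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of path-based root selection; not jextract code.
final class PathFilterSketch {

    // Folders that contain the headers passed on the command line.
    static Set<Path> rootDirs(List<Path> passedHeaders) {
        return passedHeaders.stream()
                .map(h -> h.toAbsolutePath().normalize().getParent())
                .collect(Collectors.toSet());
    }

    // A declaration is a root if its header sits in one of those folders.
    static boolean isRoot(Path declaringHeader, Set<Path> dirs) {
        return dirs.contains(declaringHeader.toAbsolutePath().normalize().getParent());
    }

    public static void main(String[] args) {
        Set<Path> dirs = rootDirs(List.of(Path.of("/usr/include/python3.7/Python.h")));
        // Same folder as the main header: treated as part of the library.
        System.out.println(isRoot(Path.of("/usr/include/python3.7/object.h"), dirs));
        // System header elsewhere: only pulled in if a root depends on it.
        System.out.println(isRoot(Path.of("/usr/include/stdio.h"), dirs));
    }
}
```

Declarations that fail this test would not be dropped outright; they would be included only when something in the root set depends on them.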

That said, my favorite approach has always been another: start from the 
_library_ and work out which header symbols need to be included. To me 
this is almost always what you need. There are, sure, some gray areas - 
for macros, enums and typedefs, and for those some sensible heuristics 
will need to be applied. But I think the set of symbols in a library/ies 
can be trusted in a way that a bunch of symbols in a bunch of headers 
cannot.

Think about it: from a user perspective, if you want to use e.g. clang, 
which functions do you expect to be available? Probably the ones defined 
here:

https://clang.llvm.org/doxygen/group__CINDEX.html

And, if you list all the symbols in the libclang.so/dll you get a list 
of symbols that matches that (modulo structs, which are obviously not in 
the shared lib, unless debug info is also provided) to a great degree of 
fidelity.
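
A minimal sketch of what I mean, assuming the exported symbols are available as text in the usual `nm` layout (address, type letter, name). The class and method names are hypothetical, the addresses are made up, and only symbols of type `T` are considered here, which is a simplification.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical sketch: use a library's exported symbols as the root set.
final class LibrarySymbolSketch {

    // Collects names from lines such as "0000000000001139 T clang_createIndex".
    // Undefined symbols ("U printf") have no address column and are skipped.
    static Set<String> exportedSymbols(String nmOutput) {
        return nmOutput.lines()
                .map(String::trim)
                .map(line -> line.split("\\s+"))
                .filter(parts -> parts.length == 3 && parts[1].equals("T"))
                .map(parts -> parts[2])
                .collect(Collectors.toCollection(TreeSet::new));
    }

    // Keeps only the header functions that the library actually exports.
    static List<String> keepExported(List<String> headerFunctions, Set<String> exported) {
        return headerFunctions.stream()
                .filter(exported::contains)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String nmOutput = String.join("\n",
                "0000000000001139 T clang_createIndex",
                "0000000000001150 T clang_disposeIndex",
                "                 U printf");
        Set<String> exported = exportedSymbols(nmOutput);
        System.out.println(keepExported(
                List.of("clang_createIndex", "clang_disposeIndex", "printf"), exported));
    }
}
```

Structs, enums, typedefs and macros never show up in such a list, so they would still need the dependency analysis (or some other heuristic) on top of this.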

>
> Of course, when a library has many headers, having to pass each one of 
> them is tedious. But I'd say that falls outside of 'simple' case 
> territory.
>
>> So, this is really two things:
>>
>> 1) finding a good set of heuristics that work in most cases (and I
>> wouldn't frown on using path-based filtering - I had a similar
>> reaction, but this has proven to work better than I hoped!)
>>
>> 2) providing a good bunch of low level filtering (include/exclude)
>> tools, actionable from the command line, when (1) goes wrong.
>
> I've really only tried to solve 2. with this proposal, but tried to 
> keep the path open for 1. as well. I think 1. is a much more complex 
> problem to solve, so it seems good to first add a set of fine-grained 
> filtering options as an escape hatch, and then play around with 
> different filtering strategies, hopefully easier to implement because 
> of work done for 2.
>
> I think for now we can shelve the filter presets (or other automatic 
> filtering strategies) and focus on adding fine-grained filtering options?

That would make me a bit sad because I think the real value of the work 
you did is that you started to tackle some of the problems in (1), which 
we wanted to tackle for a long time. I'm also skeptical that (2) can be 
tackled in isolation: to understand which filtering knob to expose, we 
need to have a clear model of how jextract is going to work in the 
'default' case.

I really think that your dependency analysis, and a lot of the choices 
you made around macros and enums look promising, and I'd really like to 
see them investigated a bit more; ideally we should play around with 
multiple strategies, compare their effectiveness and I guess once we try 
2 or 3, some patterns will start to emerge (for instance we know already 
about the path-clustering of logically related headers, noticed by 
Sundar); at which point we will be able to formulate some hopefully 
robust heuristics, with a hopefully equally robust set of 'escape 
hatches' for when things go wrong.
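
To show how the two pieces could fit together, here is a small sketch in which explicit include/exclude patterns override whatever the default heuristic decides. Everything in it is hypothetical: the class, the method names and the precedence (exclude wins over include, which wins over the heuristic) are assumptions for illustration, not existing jextract options.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical sketch: default heuristic plus include/exclude escape hatches.
final class EscapeHatchSketch {

    static boolean matchesAny(List<Pattern> patterns, String name) {
        return patterns.stream().anyMatch(p -> p.matcher(name).matches());
    }

    // Assumed precedence: exclude first, then include, then the heuristic.
    static Predicate<String> filter(Predicate<String> heuristic,
                                    List<Pattern> includes,
                                    List<Pattern> excludes) {
        return name -> {
            if (matchesAny(excludes, name)) {
                return false;
            }
            if (matchesAny(includes, name)) {
                return true;
            }
            return heuristic.test(name);
        };
    }

    public static void main(String[] args) {
        // Stand-in for the default behaviour: "is the name in the root set?"
        Set<String> rootSet = Set.of("clang_createIndex", "__builtin_va_list");
        Predicate<String> keep = filter(
                rootSet::contains,
                List.of(Pattern.compile("CXString")),        // force-include
                List.of(Pattern.compile("__builtin_.*")));   // force-exclude

        for (String name : List.of("clang_createIndex", "__builtin_va_list", "CXString", "FILE")) {
            System.out.println(name + " -> " + keep.test(name));
        }
    }
}
```

Which knobs make sense, and in which order they apply, depends on what the default heuristic ends up being, which is why I don't think the two can be designed separately.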

Maurizio

>
> Jorn
>
>> Maurizio
>>
>>>
>>> On the other hand, not everything makes sense to use from a Panama 
>>> perspective, so we still need some escape hatch to filter out some 
>>> stuff that we can't use, or that breaks the binder. But we'd like to 
>>> go about that in a disciplined way, and make sure we don't filter out 
>>> things that are required by other things, so we use a dependency set.
>>>
>>> Thoughts?
>>>
>>> Jorn
>>>
>>> Maurizio Cimadamore wrote on 2019-03-11 16:12:
>>>> On 11/03/2019 13:45, Jorn Vernee wrote:
>>>>> I can separate the parts of the patch a little bit into: Filter 
>>>>> refactor + root set compute, and then leave the option changes out 
>>>>> of it. But those 2 alone do not affect the filtering, since the 
>>>>> root set is only used when filtering non-symbol/macro elements.
>>>>
>>>> I guess then what I'm suggesting is to automatically filter out
>>>> elements not in the root set, and see how that works out.
>>>>
>>>> Maurizio