[foreign] RFC: Additional jextract filtering

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Tue Mar 12 19:13:02 UTC 2019


On 12/03/2019 18:28, Jorn Vernee wrote:
> From your response I feel we are getting on the same wavelength :-)
>
>> Here's a possible sketch:
>>
>> 1) collect all function symbols in a given shared library
>>
>> 2) collect the set of headers H defining the functions in (1)
>>
>> 3) for each header file h in H, repeat these steps until the set H is 
>> stable
>>
>> b) for all function symbols, scan the signature of the function and
>> pull in extra headers in H
>> c) for all struct symbols, scan the struct field signatures and pull
>> in extra headers in H
>>
>> 4) add all symbols in H to the result set (including macros, enums,
>> typedefs, ...)
>>
>>
>> What do you think?
>
> 1. is an interesting option, but it might not be possible to implement 
> on all needed platforms. While all operating systems seem to have some 
> API for looking up a function in a shared library by name, there 
> doesn't appear to be an API in POSIX or Windows for collecting all the 
> symbols in a library. The prototype I have for Windows requires 
> loading the library image and then crawling the export table. Some 
> OSes might not support that approach.
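
A per-symbol probe is portable, though: every platform can look up a named symbol, so one option is to take the function names declared in the headers and probe each one against the library. Here is a minimal sketch using today's java.lang.foreign API (which post-dates this thread; the symbol names are just illustrative):

```java
import java.lang.foreign.Linker;
import java.lang.foreign.SymbolLookup;
import java.util.List;

public class SymbolProbe {
    public static void main(String[] args) {
        // The default lookup covers the standard C library; for a real
        // jextract run you would use SymbolLookup.libraryLookup on the
        // target .so/.dll instead.
        SymbolLookup lookup = Linker.nativeLinker().defaultLookup();
        // Probe each function name found in the headers, rather than
        // trying to enumerate the library's export table.
        for (String name : List.of("strlen", "qsort", "no_such_symbol")) {
            System.out.println(name + " -> " + lookup.find(name).isPresent());
        }
    }
}
```

This sidesteps the missing enumeration API at the cost of only ever finding symbols that the headers already declare.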
>
> Also, wouldn't 3b and 3c still pull in too much? e.g. if I declare a 
> field with uintptr_t this would still pull in all of stdint.h? Isn't 
> that basically the same as what we have now?
>
> We could instead:
>
> 1.) Start from the set of all functions and global vars transitively 
> included in the headers passed to jextract.
>
> 2.) Filter out any symbols that don't appear in the shared libraries.
>
> 3.) From the remaining symbols compute a header root set H.
>
> 4.) Include anything found in the headers in H.
>
> 5.) Also include elements (but not entire headers) that are required 
> by something appearing in H.
>
> How does that sound?

This sounds close to what I had in mind - my step (1) is morally 
equivalent to your 1 + 2.

4 and 5 sound good. I agree that my (4) was pulling in too much.
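
The pipeline above can be sketched with plain collections - a toy model with hypothetical names, not the actual jextract code:

```java
import java.util.*;

public class FilterSketch {
    // A declaration lives in a header and may require other declarations
    // (e.g. a function whose signature mentions a typedef).
    record Decl(String name, String header, Set<String> requires) {}

    static Set<String> extract(List<Decl> functions,
                               Set<String> librarySymbols,
                               Map<String, Set<String>> declsByHeader,
                               Map<String, Decl> allDecls) {
        // 2) drop functions with no backing symbol in the shared library
        // 3) the headers declaring the survivors form the root set H
        Set<String> h = new LinkedHashSet<>();
        for (Decl f : functions) {
            if (librarySymbols.contains(f.name())) {
                h.add(f.header());
            }
        }
        // 4) include every element declared in a header in H
        Set<String> result = new LinkedHashSet<>();
        for (String header : h) {
            result.addAll(declsByHeader.getOrDefault(header, Set.of()));
        }
        // 5) pull in required elements individually (not whole headers)
        Deque<String> todo = new ArrayDeque<>(result);
        while (!todo.isEmpty()) {
            Decl d = allDecls.get(todo.pop());
            if (d == null) continue;
            for (String req : d.requires()) {
                if (result.add(req)) todo.push(req);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Decl> fns = List.of(
                new Decl("foo", "mylib.h", Set.of("uintptr_t")),
                new Decl("bar", "mylib.h", Set.of()),
                new Decl("baz", "extra.h", Set.of()));   // not in the library
        Map<String, Decl> decls = new HashMap<>();
        fns.forEach(f -> decls.put(f.name(), f));
        decls.put("uintptr_t", new Decl("uintptr_t", "stdint.h", Set.of()));
        decls.put("int32_t", new Decl("int32_t", "stdint.h", Set.of()));
        Map<String, Set<String>> byHeader = Map.of(
                "mylib.h", Set.of("foo", "bar"),
                "extra.h", Set.of("baz"),
                "stdint.h", Set.of("uintptr_t", "int32_t"));
        Set<String> out = extract(fns, Set.of("foo", "bar"), byHeader, decls);
        System.out.println(new TreeSet<>(out));  // [bar, foo, uintptr_t]
    }
}
```

In this toy run, `baz` has no library symbol, so `extra.h` never enters H; `uintptr_t` is pulled in as an individual element (step 5), while `int32_t` and the rest of stdint.h stay out.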

>
> Maybe at some point we could drop 5. and replace it with the need to 
> pass an explicit dependency.
>
> Also, we probably need a way of forcing a header into H, with an 
> option. e.g. if a header only contains macros (i.e. a 'macro-header') 
> it would always get filtered out otherwise.
Yeah - maybe everything passed on the command line is implicitly added to H?
>
> ---
>
> Also, I have the next iteration of the prototype, which now includes 
> the path-based filtering: 
> http://cr.openjdk.java.net/~jvernee/panama/webrevs/filters/webrev.01/
>
> If I run that over a header file with the following:
>
> ```
> #include <stdint.h>
>
> uintptr_t x = 10;
> ```
>
> The output only contains a class with the global var, and a class with 
> only the typedef annotation for uintptr_t. Typedefs are currently also 
> 'required', but they could be dropped, of course. But it's nice that 
> this seems to "just work" for all our tests. They all pass.

If I understand correctly, this is the same as your previous patch - you 
do the dependency analysis and then you use the result to avoid 
filtering out too much stuff (e.g. something not on the right path might 
be needed after all). If so, this looks like a promising start; I'd 
suggest putting together a simpler webrev with just these changes, for 
ease of review.

Thanks
Maurizio

>
> Jorn
>
> Maurizio Cimadamore schreef op 2019-03-12 18:34:
>> <snip>
>>> I think it's also important to keep the heuristic simple. Which 
>>> hopefully also makes it easy and straightforward to manipulate. What 
>>> I like about the current approach is that it's so straightforward. 
>>> You give jextract a header file, and get back a complete binding. If 
>>> you want to make it smaller you could add filters.
>>>
>>> I think some sort of automatic filtering (guess) would also need the 
>>> ability to be turned off.
>> Yes on both points. On simplicity, I think it's important not just in
>> terms of implementation, but also in pedagogical terms (how hard is it
>> to explain what jextract does?)
>>> <snip>
>>
>>> Well, you need more than just functions and global variables to use 
>>> a library. In practice the header file gives what you need. We just 
>>> need to find a good heuristic for filtering out the noise that comes 
>>> with it. (I guess that problem also exists in the C/C++ world, maybe 
>>> it's interesting to look at some solutions there?)
>>>
>>> Starting from the list of library symbols seems like an interesting 
>>> idea to minimize the output, but imho the header file is the more 
>>> trustworthy source to draw information from.
>> The problem with headers is that they include other headers and so all
>> header-based approaches will have, at some point, to ask: when do I
>> stop following dependencies?
>>>
>>> But, I think what we can definitely agree on is that jextract 
>>> transitively including a bunch of system headers in the output is 
>>> undesirable.
>> Right - and again, while it might be simple to explain _why_ a certain
>> header has been pulled in, it could be totally surprising for a user
>> to see so many symbols being pulled in for even relatively simple
>> libraries.
>> <snip>
>>> I don't feel so strongly about the dependency analysis. I think it 
>>> falls short in too many cases, especially when we already have a 
>>> hand-crafted set of dependencies, i.e. header files. I think the 
>>> dependency analysis should really only be used to emit warnings or 
>>> errors when things that are needed to generate a well-formed 
>>> artifact are missing, and let the user decide how to deal with the 
>>> problem. Though, before we have a good mechanism for including 
>>> dependencies for jextract runs, it seems fine to automatically 
>>> include the dependencies of the root set as well.
>>>
>>> The path-based filtering to determine a root set seems interesting 
>>> to explore. It should be an overridable default imho. I'll continue 
>>> exploring that.
>>
>> Maybe we're using different terms - by dependency analysis I mean
>> finding some root set of symbols to extract, and then use some
>> analysis to pull in the symbols that will be required at runtime.
>>
>> Here's a possible sketch:
>>
>> 1) collect all function symbols in a given shared library
>>
>> 2) collect the set of headers H defining the functions in (1)
>>
>> 3) for each header file h in H, repeat these steps until the set H is 
>> stable
>>
>> b) for all function symbols, scan the signature of the function and
>> pull in extra headers in H
>> c) for all struct symbols, scan the struct field signatures and pull
>> in extra headers in H
>>
>> 4) add all symbols in H to the result set (including macros, enums,
>> typedefs, ...)
>>
>>
>> What do you think?
>>
>> Maurizio
>>
>>
>>> Jorn
>>>
>>>>
>>>> Maurizio
>>>>
>>>>
>>>>>
>>>>> Jorn
>>>>>
>>>>>> Maurizio
>>>>>>
>>>>>>>
>>>>>>> On the other hand, not everything makes sense to use from a 
>>>>>>> Panama perspective, so we still need some escape hatch to filter 
>>>>>>> out some stuff we can't use, or that breaks the binder. But we'd 
>>>>>>> like to go about that in a disciplined way, and make sure we 
>>>>>>> don't filter out things that are required by other things, so we 
>>>>>>> use a dependency set.
>>>>>>>
>>>>>>> Thoughts?
>>>>>>>
>>>>>>> Jorn
>>>>>>>
>>>>>>> Maurizio Cimadamore schreef op 2019-03-11 16:12:
>>>>>>>> On 11/03/2019 13:45, Jorn Vernee wrote:
>>>>>>>>> I can separate the parts of the patch a little bit into the 
>>>>>>>>> filter refactor + root set computation, and leave the option 
>>>>>>>>> changes out of it. But those two alone do not affect the 
>>>>>>>>> filtering, since the root set is only used when filtering 
>>>>>>>>> non-symbol/macro elements.
>>>>>>>>
>>>>>>>> I guess then what I'm suggesting is to automatically filter out
>>>>>>>> elements not in the root set, and see how that works out.
>>>>>>>>
>>>>>>>> Maurizio


More information about the panama-dev mailing list