[foreign] RFC: Additional jextract filtering

Tue Mar 12 21:29:47 UTC 2019

Maurizio Cimadamore wrote on 2019-03-12 20:13:
> On 12/03/2019 18:28, Jorn Vernee wrote:
>> From your response I feel we are getting to the same wavelength :-)
>> 
>>> Here's a possible sketch:
>>> 
>>> 1) collect all function symbols in a given shared library
>>> 
>>> 2) collect the set of headers H defining the functions in (1)
>>> 
>>> 3) for each headerfile h in H, repeat these steps until the set H is 
>>> stable
>>> 
>>> b) for all function symbols, scan the signature of the function and
>>> pull in extra headers in H
>>> c) for all struct symbols, scan the struct field signatures and pull
>>> in extra headers in H
>>> 
>>> 4) add all symbols in H to the result set (including macros, enums,
>>> typedefs, ...)
>>> 
>>> 
>>> What do you think?
>> 
>> 1. is an interesting option, but it might not be possible to implement 
>> on all needed platforms. While all operating systems seem to have some 
>> API for looking up a function in a shared library by name, there 
>> doesn't appear to be an API in POSIX or Windows for collecting all the 
>> symbols in a library. The prototype I have for Windows requires 
>> loading the library image and then crawling the export table. Some 
>> OSes might not support that approach.
>> 
>> Also, wouldn't 3b and 3c still pull in too much? e.g. if I declare a 
>> field with uintptr_t this would still pull in all of stdint.h? Isn't 
>> that basically the same as what we have now?
>> 
>> We could instead:
>> 
>> 1.) Start from the set of all functions and global vars transitively 
>> included in the headers passed to jextract.
>> 
>> 2.) Filter out any symbols that don't appear in the shared libraries.
>> 
>> 3.) From the remaining symbols compute a header root set H.
>> 
>> 4.) Include anything found in the headers in H.
>> 
>> 5.) Also include elements (but not entire headers) that are required 
>> by something appearing in H.
>> 
>> How does that sound?
> 
> This sounds close to what I had in mind - my step (1) is morally
> equivalent to your 1 + 2
> 
> 4 and 5 sound good. I agree that my (4) was pulling in too much

Ok. Then I'll make a ticket at this point and start working on 
implementing this + tests.
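
To make the five steps above concrete, here is a rough sketch of the selection logic in Java. It is not taken from the prototype: the class name, the `Decl` type and its fields, and the `forcedHeaders` parameter (headers that should always end up in H) are made-up names for illustration, and it assumes declaration names are unique.

```java
import java.util.*;

class RootSetSketch {
    /** Hypothetical model of one declaration seen while parsing the headers. */
    static final class Decl {
        final String name;           // e.g. "foo" or "uintptr_t"
        final String header;         // header file that declares it
        final boolean librarySymbol; // true for functions and global vars
        final Set<String> requires;  // names of declarations it refers to

        Decl(String name, String header, boolean librarySymbol, Set<String> requires) {
            this.name = name;
            this.header = header;
            this.librarySymbol = librarySymbol;
            this.requires = requires;
        }
    }

    /** Returns the names of the declarations that would be kept. */
    static Set<String> select(List<Decl> decls, Set<String> exported, Set<String> forcedHeaders) {
        // Steps 1-3: keep functions/vars that the libraries export, and
        // collect the headers that declare them into the root set H.
        Set<String> h = new HashSet<>(forcedHeaders);
        for (Decl d : decls) {
            if (d.librarySymbol && exported.contains(d.name)) {
                h.add(d.header);
            }
        }

        Map<String, Decl> byName = new HashMap<>();
        for (Decl d : decls) {
            byName.put(d.name, d);
        }

        // Step 4: include everything declared in a header in H.
        Set<String> result = new TreeSet<>();
        Deque<Decl> work = new ArrayDeque<>();
        for (Decl d : decls) {
            if (h.contains(d.header) && result.add(d.name)) {
                work.add(d);
            }
        }

        // Step 5: pull in required elements one by one, not their whole headers.
        while (!work.isEmpty()) {
            for (String dep : work.poll().requires) {
                Decl required = byName.get(dep);
                if (required != null && result.add(dep)) {
                    work.add(required);
                }
            }
        }
        return result;
    }
}
```

With a header that declares a function taking a uintptr_t, this keeps uintptr_t itself but none of the other stdint.h typedefs, and a function that the library does not export does not drag its header into H.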

>> 
>> Maybe at some point we could drop 5. and replace it with the need to 
>> pass an explicit dependency.
>> 
>> Also, probably, we also need a way of forcing a header into H, with an 
>> option. e.g. if a header only contains macros (i.e. a 'macro-header') 
>> it would always get filtered out otherwise.
> Yeah - maybe everything passed on the command line is implicitly added 
> to H?

Yeah, was thinking the same.

>> 
>> ---
>> 
>> Also, I have the next iteration of the prototype, which now includes 
>> the path-based filtering: 
>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/filters/webrev.01/
>> 
>> If I run that over a header file with the following:
>> 
>> ```
>> #include <stdint.h>
>> 
>> uintptr_t x = 10;
>> ```
>> 
>> The output only contains a class with the global var, and a class with 
>> only the typedef annotation for uintptr_t. Typedefs are currently also 
>> 'required', but they could be dropped of course. But, it's nice that 
>> this seems to "just work" for all our tests. They all pass.
> 
> If I understand correctly, this is the same as your previous patch -
> you do the dependency analysis and then you use the result to avoid
> filtering out too much stuff (e.g. something not on the right path
> might be needed after all). If so, this looks like a promising start; I'd
> suggest to put together a simpler webrev with just these changes, for
> ease of review.

Well, I also removed the filtering options I added :) I kept a lot of 
the refactoring though, especially the separation between library 
symbols (i.e. functions and vars) and macros is needed. I'll remove the 
other filters for now.

Jorn

> Thanks
> Maurizio
> 
>> 
>> Jorn
>> 
>> Maurizio Cimadamore wrote on 2019-03-12 18:34:
>>> <snip>
>>>> I think it's also important to keep the heuristic simple. Which 
>>>> hopefully also makes it easy and straightforward to manipulate. What 
>>>> I like about the current approach is that it's so straightforward. 
>>>> You give jextract a header file, and get back a complete binding. If 
>>>> you want to make it smaller you could add filters.
>>>> 
>>>> I think some sort of automatic filtering (guess) would also need the 
>>>> ability to be turned off.
>>> Yes on both points. On simplicity, I think it's important not just in
>>> terms of implementation, but also in pedagogical terms (how hard is it
>>> to explain what jextract does?)
>>>> <snip>
>>> 
>>>> Well, you need more than just functions and global variables to use 
>>>> a library. In practice the header file gives what you need. We just 
>>>> need to find a good heuristic for filtering out the noise that comes 
>>>> with it. (I guess that problem also exists in the C/C++ world, maybe 
>>>> it's interesting to look at some solutions there?)
>>>> 
>>>> Starting from the list of library symbols seems like an interesting 
>>>> idea to minimize the output, but imho the header file is the more 
>>>> trustworthy source to draw information from.
>>> The problem with headers is that they include other headers, and so
>>> all header-based approaches will have, at some point, to ask: when do I
>>> stop following dependencies?
>>>> 
>>>> But, I think what we can definitely agree on is that a jextract 
>>>> transitively including a bunch of system headers in the output is 
>>>> undesirable.
>>> Right - and again, while it might be simple to explain _why_ a certain
>>> header has been pulled in, it could be totally surprising for a user
>>> to see so many symbols being pulled in for even relatively simple
>>> libraries.
>>> <snip>
>>>> I don't feel so strongly about the dependency analysis. I think it 
>>>> falls short in too many cases, especially when we already have a 
>>>> hand-crafted set of dependencies, i.e. header files. I think the 
>>>> dependency analysis should really only be used to emit warnings or 
>>>> errors when things that are needed to generate a well-formed 
>>>> artifact are missing, and let the user decide how to deal with the 
>>>> problem. Though, before we have a good mechanism for including 
>>>> dependencies for jextract runs, it seems fine to automatically 
>>>> include the dependencies of the root set as well.
>>>> 
>>>> The path-based filtering to determine a root set seems interesting 
>>>> to explore. It should be an overridable default imho. I'll continue 
>>>> exploring that.
>>> 
>>> Maybe we're using different terms - by dependency analysis I mean
>>> finding some root set of symbols to extract, and then use some
>>> analysis to pull in the symbols that will be required at runtime.
>>> 
>>> Here's a possible sketch:
>>> 
>>> 1) collect all function symbols in a given shared library
>>> 
>>> 2) collect the set of headers H defining the functions in (1)
>>> 
>>> 3) for each headerfile h in H, repeat these steps until the set H is 
>>> stable
>>> 
>>> b) for all function symbols, scan the signature of the function and
>>> pull in extra headers in H
>>> c) for all struct symbols, scan the struct field signatures and pull
>>> in extra headers in H
>>> 
>>> 4) add all symbols in H to the result set (including macros, enums,
>>> typedefs, ...)
>>> 
>>> 
>>> What do you think?
>>> 
>>> Maurizio
>>> 
>>> 
>>>> Jorn
>>>> 
>>>>> 
>>>>> Maurizio
>>>>> 
>>>>> 
>>>>>> 
>>>>>> Jorn
>>>>>> 
>>>>>>> Maurizio
>>>>>>> 
>>>>>>>> 
>>>>>>>> On the other hand, not everything makes sense to use from a 
>>>>>>>> Panama perspective, so we still need some escape hatch to filter 
>>>>>>>> out stuff that we can't use, or that breaks the binder. But we'd 
>>>>>>>> like to go about that in a disciplined way, and make sure we don't 
>>>>>>>> filter out things that are required by other things, so we use a 
>>>>>>>> dependency set.
>>>>>>>> 
>>>>>>>> Thoughts?
>>>>>>>> 
>>>>>>>> Jorn
>>>>>>>> 
>>>>>>>> Maurizio Cimadamore wrote on 2019-03-11 16:12:
>>>>>>>>> On 11/03/2019 13:45, Jorn Vernee wrote:
>>>>>>>>>> I can split the patch up a little: filter refactor + root set 
>>>>>>>>>> computation in one part, with the option changes left out of 
>>>>>>>>>> it. But those two alone do not affect the filtering, since the 
>>>>>>>>>> root set is only used when filtering non-symbol/macro 
>>>>>>>>>> elements.
>>>>>>>>> 
>>>>>>>>> I guess then what I'm suggesting is to automatically filter out
>>>>>>>>> elements not in the root set, and see how that works out.
>>>>>>>>> 
>>>>>>>>> Maurizio