[foreign] RFR 8220544: Jextract; library dependent header filtering

Wed Mar 20 15:10:07 UTC 2019

Yes, but...

I think that implementation dependent behavior is a bit of the point 
here? That is, on a Windows machine you might want both uppercase and 
lowercase to match (given the OS makes no difference), unlike on Linux/Mac?

I'm not sure if we have any precedent for this - the use case for this 
API doesn't seem unreasonable; if we wanted to just use a regular regex 
(not a loose, OS-dependent one), then we probably have to normalize all 
paths coming out from clang, because I suspect we will find surprises in 
there.

Maurizio

On 20/03/2019 15:02, Sundararajan Athijegannathan wrote:
> Hmm..
>
> The javadoc says this:
>
> " For both the glob and regex syntaxes, the matching details, such as 
> whether the matching is case sensitive, are implementation-dependent 
> and therefore not specified."
>
> https://docs.oracle.com/javase/10/docs/api/java/nio/file/FileSystem.html
>
> Are we dependent on implementation dependent behavior?
>
> -Sundar
>
>
> On 20/03/19, 7:21 PM, Jorn Vernee wrote:
>> While trying to get a minimal artifact for the opengl sample I ran 
>> into a weird problem where clang was changing the casing of some of 
>> the system header paths, which was making the regex match fail. 
>> Casing of paths on Windows is mostly irrelevant, so that explains why 
>> this wasn't a problem before.
>>
>> After searching for a solution I arrived at 
>> FileSystem::getPathMatcher, i.e. a utility for matching paths already 
>> exists, and has FileSystem specific implementations (e.g. making the 
>> match case insensitive on Windows).
>>
>> Switching to that solves the problem: 
>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/caps/webrev.00
>>
>> The numbers are:
>>
>> OpenGL:
>>   before: 4244087 bytes
>>   after:  106145 bytes (~40x improvement)
>>
>> Python:
>>   before: 482587 bytes
>>   after:  317240 bytes (~1.5x improvement)
>>
>> Tensorflow:
>>   before: 76915 bytes
>>   after:  41267 bytes (~1.9x improvement)
>>
>> The improvement is most significant for libraries that depend heavily 
>> on the Windows API (OpenGL) since there's a lot that can be filtered 
>> out.
>>
>> Cheers,
>> Jorn
>>
>> Maurizio Cimadamore schreef op 2019-03-19 22:06:
>>> Cool! This is very nice milestone for jextract :-)
>>>
>>> When you have time (and if you can) it would be nice to get some stats
>>> on extracting a library (e.g. Python) on Windows before/after, to get
>>> a sense of what degree of improvement in terms of static footprint we
>>> can get. I tested on Linux and I can get smaller jars (up to 2x) but I
>>> sense that the improvement is much more dramatic on Windows-land.
>>>
>>> Maurizio
>>>
>>> On 19/03/2019 19:58, Jorn Vernee wrote:
>>>> Thanks for the reviews. I've updated the comment and pushed.
>>>>
>>>> Jorn
>>>>
>>>> Sundararajan Athijegannathan schreef op 2019-03-19 18:07:
>>>>> Tests passed. Please go ahead with push.
>>>>>
>>>>> -Sundar
>>>>>
>>>>> On 19/03/19, 7:46 PM, Sundararajan Athijegannathan wrote:
>>>>>> * 47 * A merge between {@link DependencyFinder} and {@link 
>>>>>> ElementFilter}
>>>>>>
>>>>>> in DependencyFilter. Could be removed?
>>>>>>
>>>>>> No comments other than that. +1
>>>>>>
>>>>>> I'm going to submit an internal mach5 build/request for this 
>>>>>> patch. I'll notify you as soon as it is all green.
>>>>>>
>>>>>> -Sundar
>>>>>>
>>>>>> On 19/03/19, 7:34 PM, Maurizio Cimadamore wrote:
>>>>>>> This looks good to me, I have no further comments.
>>>>>>>
>>>>>>> Great work!
>>>>>>>
>>>>>>> (I'll leave it to Sundar for further comments/suggestions)
>>>>>>>
>>>>>>> Maurizio
>>>>>>>
>>>>>>> On 19/03/2019 13:50, Jorn Vernee wrote:
>>>>>>>> Suggestions sound good.
>>>>>>>>
>>>>>>>> Here's the updated webrev: 
>>>>>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8220544/webrev.07/ 
>>>>>>>>
>>>>>>>>
>>>>>>>> * Split library based filtering code off from SymbolFilter into 
>>>>>>>> LibraryLookupFilter.
>>>>>>>> * Merged DependencyFinder & ElementFilter into DependencyFilter 
>>>>>>>> (as nested classes).
>>>>>>>> * Also added 2 more test cases to check that symbols being 
>>>>>>>> filtered out don't have their dependencies added.
>>>>>>>>
>>>>>>>> Jorn
>>>>>>>>
>>>>>>>> Maurizio Cimadamore schreef op 2019-03-19 13:40:
>>>>>>>>> Looks good, I did some tests and it seems to behave according 
>>>>>>>>> to the
>>>>>>>>> expectations - some comments below:
>>>>>>>>>
>>>>>>>>> On 18/03/2019 10:58, Jorn Vernee wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I have implemented this: 
>>>>>>>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8220544/webrev.06/ 
>>>>>>>>>> This adds 2 options --include-headers and --exclude-headers 
>>>>>>>>>> that take regular expression patterns as an argument. The 
>>>>>>>>>> patterns are matched against a cursor's full header file path 
>>>>>>>>>> to determine if it appears in the header root set.
>>>>>>>>>>
>>>>>>>>>> For example, I can use the following to include headers in 
>>>>>>>>>> the Python include director;
>>>>>>>>>>
>>>>>>>>>>     --include-headers "J:\\Python27\\include\\.*"
>>>>>>>>>>
>>>>>>>>>> ---
>>>>>>>>>>
>>>>>>>>>> To recap, we now:
>>>>>>>>>>   1.) Filter functions, vars, and macros by whether they 
>>>>>>>>>> appear in a root header.
>>>>>>>>>>   2.) Filter functions, vars, and macros by the 
>>>>>>>>>> --includes/exclude-symbols patterns.
>>>>>>>>>>   3.) Filter functions and vars, by whether they appear in 
>>>>>>>>>> any shared library.
>>>>>>>>>
>>>>>>>>> I think this is ok, but the way this happens in the code seems 
>>>>>>>>> to be
>>>>>>>>> sort of ad hoc, and probably the result of how the code use to 
>>>>>>>>> be in
>>>>>>>>> previous rounds.
>>>>>>>>>
>>>>>>>>> For instance, SymbolFilter does both (1), (2) and (3).
>>>>>>>>>
>>>>>>>>> Then we have DependencyFinder which also does a bit of (1).
>>>>>>>>>
>>>>>>>>> Finally we have ElementFilter, which drops everything that is not
>>>>>>>>> required (as per 4, 5).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I believe the code could be improved by:
>>>>>>>>>
>>>>>>>>> * separating the library-based filtering from the option-driven
>>>>>>>>> filtering. These are two separate mechanism (the latter being 
>>>>>>>>> optional
>>>>>>>>> on -L being used at all). Of course we should also decide 
>>>>>>>>> which is
>>>>>>>>> used first.
>>>>>>>>>
>>>>>>>>> * bring together DependencyFinder and ElementFilter into
>>>>>>>>> DependencyFilter. You can have a two pass visitor - collect 
>>>>>>>>> deps on a
>>>>>>>>> first pass, then prune on the second pass. This way the state 
>>>>>>>>> remains
>>>>>>>>> 'internal' to the DependencyFilter visitor, no need to pass magic
>>>>>>>>> state between two passes.
>>>>>>>>>
>>>>>>>>> Maurizio
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>   4.) For the remaining trees in the root headers do 
>>>>>>>>>> dependency analysis, and find the required structs, enums, 
>>>>>>>>>> and typedefs.
>>>>>>>>>>   5.) For structs, enums, and typedefs include them if they 
>>>>>>>>>> appear in a root header OR are required as dependency.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Jorn
>>>>>>>>>>
>>>>>>>>>> Maurizio Cimadamore schreef op 2019-03-15 19:29:
>>>>>>>>>>> On 15/03/2019 18:21, Jorn Vernee wrote:
>>>>>>>>>>>> Heh - was writing an email pretty much with this suggestion :P
>>>>>>>>>>>>
>>>>>>>>>>>> I think this is a good idea! If you don't think this adds 
>>>>>>>>>>>> to much complexity to the options?
>>>>>>>>>>>>
>>>>>>>>>>>> So we'd basically have:
>>>>>>>>>>>>
>>>>>>>>>>>>   Without --include/exclude-header -> same behavior we have 
>>>>>>>>>>>> right now.
>>>>>>>>>>>>   With --include/exclude-header -> Add included headers to 
>>>>>>>>>>>> root set + find any dependencies, and put both in the 
>>>>>>>>>>>> generated artifact.
>>>>>>>>>>>>
>>>>>>>>>>>> I like that it does no filtering by default, so ad-hoc 
>>>>>>>>>>>> users don't have to figure out which header files define 
>>>>>>>>>>>> which functions.
>>>>>>>>>>>
>>>>>>>>>>> Yep - I like this too. It goes towards the 'no magic' goal, 
>>>>>>>>>>> and it
>>>>>>>>>>> adds the filter option that makes most sense when you want 
>>>>>>>>>>> to chop
>>>>>>>>>>> down an API for good. This is a filtering mechanism after 
>>>>>>>>>>> all, the
>>>>>>>>>>> fact that it's coming back as a filtering option is IMHO, a 
>>>>>>>>>>> plus, no a
>>>>>>>>>>> minus.
>>>>>>>>>>>
>>>>>>>>>>> Btw, slight revision to what I said previously - let's make the
>>>>>>>>>>> argument of include/exclude regexes too (as for others). 
>>>>>>>>>>> This way it
>>>>>>>>>>> can be even easier to use, most headers do have some 
>>>>>>>>>>> commonalities in
>>>>>>>>>>> them - and I like that, with that,  we can basically do the 
>>>>>>>>>>> path-based
>>>>>>>>>>> heuristics w/o hardwiring that into jextract.
>>>>>>>>>>>
>>>>>>>>>>> Maurizio
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Jorn
>>>>>>>>>>>>
>>>>>>>>>>>> Maurizio Cimadamore schreef op 2019-03-15 19:07:
>>>>>>>>>>>>> On 15/03/2019 18:01, Maurizio Cimadamore wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 15/03/2019 17:48, Jorn Vernee wrote:
>>>>>>>>>>>>>>> I've implemented this, and now doing a clean build based 
>>>>>>>>>>>>>>> on the latest jdk/jdk merge before submitting the next 
>>>>>>>>>>>>>>> webrev.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It's working nicely, with one caveat; Some headers rely 
>>>>>>>>>>>>>>> on pre-processor code defined in a parent header. For 
>>>>>>>>>>>>>>> instance, Python.h defines some pre-processor code used 
>>>>>>>>>>>>>>> in pythonrun.h . The example we have only uses functions 
>>>>>>>>>>>>>>> defined in pythonrun.h, but if we just pass that header 
>>>>>>>>>>>>>>> to jextract Clang throws an error because of the missing 
>>>>>>>>>>>>>>> pre-processor code from Python.h
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In this case we have to pass Python.h first, and then 
>>>>>>>>>>>>>>> pythonrun.h to get everything to work. This also relies 
>>>>>>>>>>>>>>> on the existence of header include guards in the 
>>>>>>>>>>>>>>> pythonrun.h header (since we're basically including it 
>>>>>>>>>>>>>>> twice). A similar caveat exists with the Windows API.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Doh! Maybe I've missed the simplest option after all.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What if the root set was a 'first class' concept in the 
>>>>>>>>>>>>>> extraction run, rather than something we infer from this 
>>>>>>>>>>>>>> or than command line option?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That way we could point jextract at python.h, but then 
>>>>>>>>>>>>>> say "hey, I'm only really interested at stuff that comes 
>>>>>>>>>>>>>> from pythonrun.h".
>>>>>>>>>>>>>
>>>>>>>>>>>>> In other words, right now we have
>>>>>>>>>>>>>
>>>>>>>>>>>>> --include-symbol
>>>>>>>>>>>>>
>>>>>>>>>>>>> and
>>>>>>>>>>>>>
>>>>>>>>>>>>> --exclude-symbol
>>>>>>>>>>>>>
>>>>>>>>>>>>> If not set, everything is included.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Maybe all we need is:
>>>>>>>>>>>>>
>>>>>>>>>>>>> --include-header <header>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --exclude-header <header>
>>>>>>>>>>>>>
>>>>>>>>>>>>> And, again, if none is specified, all headers are part of 
>>>>>>>>>>>>> the 'root set'.
>>>>>>>>>>>>>
>>>>>>>>>>>>> And, to address a concern you had - yes, I'd consider 
>>>>>>>>>>>>> using the
>>>>>>>>>>>>> 'include path' for header names in --include/exclude-header
>>>>>>>>>>>>>
>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So, the guide-line is: Pass the main header first, then 
>>>>>>>>>>>>>>> internal headers. e.g. If we have a main header A.h 
>>>>>>>>>>>>>>> which includes a_impl.h, and another main header B.h 
>>>>>>>>>>>>>>> which includes b_impl.h, headers should be passed to 
>>>>>>>>>>>>>>> jextract in the order: A.h a_impl.h B.h b_impl.h
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think the behavior being dependent on the ordering of 
>>>>>>>>>>>>>>> the headers could be fixed by sorting the headers in 
>>>>>>>>>>>>>>> topological order. But, there's still the requirement to 
>>>>>>>>>>>>>>> pass all of them, or things break.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Any thoughs about that?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Jorn
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Jorn Vernee schreef op 2019-03-15 14:40:
>>>>>>>>>>>>>>>> I've already been using shell scripts mostly when 
>>>>>>>>>>>>>>>> running jextract
>>>>>>>>>>>>>>>> (except for simple examples). I find it very useful to 
>>>>>>>>>>>>>>>> be able to
>>>>>>>>>>>>>>>> split the command over multiple lines, especially long 
>>>>>>>>>>>>>>>> file paths
>>>>>>>>>>>>>>>> become much more readable.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'll start working on this then.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Jorn
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Maurizio Cimadamore schreef op 2019-03-15 14:21:
>>>>>>>>>>>>>>>>> On 15/03/2019 13:15, Jorn Vernee wrote:
>>>>>>>>>>>>>>>>>> I still like this approach, and I think adding 
>>>>>>>>>>>>>>>>>> support for wildcard patterns and/or header filters 
>>>>>>>>>>>>>>>>>> would make it better.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Like you said: It's dead simple. What you pass to 
>>>>>>>>>>>>>>>>>> jextract is what you get. Though, we could still 
>>>>>>>>>>>>>>>>>> apply dependency analysis, which would make sure 
>>>>>>>>>>>>>>>>>> nothing that's needed gets dropped.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Why don't we try this then?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Note that in javac we often use this trick:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> javac `find <path> -name *.java`
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> which works
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> also, note that jextract also accepts a file with the 
>>>>>>>>>>>>>>>>> @ syntax, etc.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> jextract @args.txt
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> where args.txt is the command line (which could list 
>>>>>>>>>>>>>>>>> all the headers you want!).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Maybe this is indeed the simpler approach.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Maurizio