[foreign] RFR 8220544: Jextract; library dependent header filtering

Wed Mar 20 15:02:50 UTC 2019

Hmm..

The javadoc says this:

" For both the glob and regex syntaxes, the matching details, such as 
whether the matching is case sensitive, are implementation-dependent and 
therefore not specified."

https://docs.oracle.com/javase/10/docs/api/java/nio/file/FileSystem.html

Are we dependent on implementation dependent behavior?

-Sundar

On 20/03/19, 7:21 PM, Jorn Vernee wrote:
> While trying to get a minimal artifact for the opengl sample I ran 
> into a weird problem where clang was changing the casing of some of 
> the system header paths, which was making the regex match fail. Casing 
> of paths on Windows is mostly irrelevant, so that explains why this 
> wasn't a problem before.
>
> After searching for a solution I arrived at 
> FileSystem::getPathMatcher, i.e. a utility for matching paths already 
> exists, and has FileSystem specific implementations (e.g. making the 
> match case insensitive on Windows).
>
> Switching to that solves the problem: 
> http://cr.openjdk.java.net/~jvernee/panama/webrevs/caps/webrev.00
>
> The numbers are:
>
> OpenGL:
>   before: 4244087 bytes
>   after:  106145 bytes (~40x improvement)
>
> Python:
>   before: 482587 bytes
>   after:  317240 bytes (~1.5x improvement)
>
> Tensorflow:
>   before: 76915 bytes
>   after:  41267 bytes (~1.9x improvement)
>
> The improvement is most significant for libraries that depend heavily 
> on the Windows API (OpenGL) since there's a lot that can be filtered out.
>
> Cheers,
> Jorn
>
> Maurizio Cimadamore schreef op 2019-03-19 22:06:
>> Cool! This is very nice milestone for jextract :-)
>>
>> When you have time (and if you can) it would be nice to get some stats
>> on extracting a library (e.g. Python) on Windows before/after, to get
>> a sense of what degree of improvement in terms of static footprint we
>> can get. I tested on Linux and I can get smaller jars (up to 2x) but I
>> sense that the improvement is much more dramatic on Windows-land.
>>
>> Maurizio
>>
>> On 19/03/2019 19:58, Jorn Vernee wrote:
>>> Thanks for the reviews. I've updated the comment and pushed.
>>>
>>> Jorn
>>>
>>> Sundararajan Athijegannathan schreef op 2019-03-19 18:07:
>>>> Tests passed. Please go ahead with push.
>>>>
>>>> -Sundar
>>>>
>>>> On 19/03/19, 7:46 PM, Sundararajan Athijegannathan wrote:
>>>>> * 47 * A merge between {@link DependencyFinder} and {@link 
>>>>> ElementFilter}
>>>>>
>>>>> in DependencyFilter. Could be removed?
>>>>>
>>>>> No comments other than that. +1
>>>>>
>>>>> I'm going to submit an internal mach5 build/request for this 
>>>>> patch. I'll notify you as soon as it is all green.
>>>>>
>>>>> -Sundar
>>>>>
>>>>> On 19/03/19, 7:34 PM, Maurizio Cimadamore wrote:
>>>>>> This looks good to me, I have no further comments.
>>>>>>
>>>>>> Great work!
>>>>>>
>>>>>> (I'll leave it to Sundar for further comments/suggestions)
>>>>>>
>>>>>> Maurizio
>>>>>>
>>>>>> On 19/03/2019 13:50, Jorn Vernee wrote:
>>>>>>> Suggestions sound good.
>>>>>>>
>>>>>>> Here's the updated webrev: 
>>>>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8220544/webrev.07/ 
>>>>>>>
>>>>>>>
>>>>>>> * Split library based filtering code off from SymbolFilter into 
>>>>>>> LibraryLookupFilter.
>>>>>>> * Merged DependencyFinder & ElementFilter into DependencyFilter 
>>>>>>> (as nested classes).
>>>>>>> * Also added 2 more test cases to check that symbols being 
>>>>>>> filtered out don't have their dependencies added.
>>>>>>>
>>>>>>> Jorn
>>>>>>>
>>>>>>> Maurizio Cimadamore schreef op 2019-03-19 13:40:
>>>>>>>> Looks good, I did some tests and it seems to behave according 
>>>>>>>> to the
>>>>>>>> expectations - some comments below:
>>>>>>>>
>>>>>>>> On 18/03/2019 10:58, Jorn Vernee wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have implemented this: 
>>>>>>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8220544/webrev.06/ 
>>>>>>>>> This adds 2 options --include-headers and --exclude-headers 
>>>>>>>>> that take regular expression patterns as an argument. The 
>>>>>>>>> patterns are matched against a cursor's full header file path 
>>>>>>>>> to determine if it appears in the header root set.
>>>>>>>>>
>>>>>>>>> For example, I can use the following to include headers in the 
>>>>>>>>> Python include director;
>>>>>>>>>
>>>>>>>>>     --include-headers "J:\\Python27\\include\\.*"
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> To recap, we now:
>>>>>>>>>   1.) Filter functions, vars, and macros by whether they 
>>>>>>>>> appear in a root header.
>>>>>>>>>   2.) Filter functions, vars, and macros by the 
>>>>>>>>> --includes/exclude-symbols patterns.
>>>>>>>>>   3.) Filter functions and vars, by whether they appear in any 
>>>>>>>>> shared library.
>>>>>>>>
>>>>>>>> I think this is ok, but the way this happens in the code seems 
>>>>>>>> to be
>>>>>>>> sort of ad hoc, and probably the result of how the code use to 
>>>>>>>> be in
>>>>>>>> previous rounds.
>>>>>>>>
>>>>>>>> For instance, SymbolFilter does both (1), (2) and (3).
>>>>>>>>
>>>>>>>> Then we have DependencyFinder which also does a bit of (1).
>>>>>>>>
>>>>>>>> Finally we have ElementFilter, which drops everything that is not
>>>>>>>> required (as per 4, 5).
>>>>>>>>
>>>>>>>>
>>>>>>>> I believe the code could be improved by:
>>>>>>>>
>>>>>>>> * separating the library-based filtering from the option-driven
>>>>>>>> filtering. These are two separate mechanism (the latter being 
>>>>>>>> optional
>>>>>>>> on -L being used at all). Of course we should also decide which is
>>>>>>>> used first.
>>>>>>>>
>>>>>>>> * bring together DependencyFinder and ElementFilter into
>>>>>>>> DependencyFilter. You can have a two pass visitor - collect 
>>>>>>>> deps on a
>>>>>>>> first pass, then prune on the second pass. This way the state 
>>>>>>>> remains
>>>>>>>> 'internal' to the DependencyFilter visitor, no need to pass magic
>>>>>>>> state between two passes.
>>>>>>>>
>>>>>>>> Maurizio
>>>>>>>>
>>>>>>>>
>>>>>>>>>   4.) For the remaining trees in the root headers do 
>>>>>>>>> dependency analysis, and find the required structs, enums, and 
>>>>>>>>> typedefs.
>>>>>>>>>   5.) For structs, enums, and typedefs include them if they 
>>>>>>>>> appear in a root header OR are required as dependency.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Jorn
>>>>>>>>>
>>>>>>>>> Maurizio Cimadamore schreef op 2019-03-15 19:29:
>>>>>>>>>> On 15/03/2019 18:21, Jorn Vernee wrote:
>>>>>>>>>>> Heh - was writing an email pretty much with this suggestion :P
>>>>>>>>>>>
>>>>>>>>>>> I think this is a good idea! If you don't think this adds to 
>>>>>>>>>>> much complexity to the options?
>>>>>>>>>>>
>>>>>>>>>>> So we'd basically have:
>>>>>>>>>>>
>>>>>>>>>>>   Without --include/exclude-header -> same behavior we have 
>>>>>>>>>>> right now.
>>>>>>>>>>>   With --include/exclude-header -> Add included headers to 
>>>>>>>>>>> root set + find any dependencies, and put both in the 
>>>>>>>>>>> generated artifact.
>>>>>>>>>>>
>>>>>>>>>>> I like that it does no filtering by default, so ad-hoc users 
>>>>>>>>>>> don't have to figure out which header files define which 
>>>>>>>>>>> functions.
>>>>>>>>>>
>>>>>>>>>> Yep - I like this too. It goes towards the 'no magic' goal, 
>>>>>>>>>> and it
>>>>>>>>>> adds the filter option that makes most sense when you want to 
>>>>>>>>>> chop
>>>>>>>>>> down an API for good. This is a filtering mechanism after 
>>>>>>>>>> all, the
>>>>>>>>>> fact that it's coming back as a filtering option is IMHO, a 
>>>>>>>>>> plus, no a
>>>>>>>>>> minus.
>>>>>>>>>>
>>>>>>>>>> Btw, slight revision to what I said previously - let's make the
>>>>>>>>>> argument of include/exclude regexes too (as for others). This 
>>>>>>>>>> way it
>>>>>>>>>> can be even easier to use, most headers do have some 
>>>>>>>>>> commonalities in
>>>>>>>>>> them - and I like that, with that,  we can basically do the 
>>>>>>>>>> path-based
>>>>>>>>>> heuristics w/o hardwiring that into jextract.
>>>>>>>>>>
>>>>>>>>>> Maurizio
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Jorn
>>>>>>>>>>>
>>>>>>>>>>> Maurizio Cimadamore schreef op 2019-03-15 19:07:
>>>>>>>>>>>> On 15/03/2019 18:01, Maurizio Cimadamore wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 15/03/2019 17:48, Jorn Vernee wrote:
>>>>>>>>>>>>>> I've implemented this, and now doing a clean build based 
>>>>>>>>>>>>>> on the latest jdk/jdk merge before submitting the next 
>>>>>>>>>>>>>> webrev.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It's working nicely, with one caveat; Some headers rely 
>>>>>>>>>>>>>> on pre-processor code defined in a parent header. For 
>>>>>>>>>>>>>> instance, Python.h defines some pre-processor code used 
>>>>>>>>>>>>>> in pythonrun.h . The example we have only uses functions 
>>>>>>>>>>>>>> defined in pythonrun.h, but if we just pass that header 
>>>>>>>>>>>>>> to jextract Clang throws an error because of the missing 
>>>>>>>>>>>>>> pre-processor code from Python.h
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In this case we have to pass Python.h first, and then 
>>>>>>>>>>>>>> pythonrun.h to get everything to work. This also relies 
>>>>>>>>>>>>>> on the existence of header include guards in the 
>>>>>>>>>>>>>> pythonrun.h header (since we're basically including it 
>>>>>>>>>>>>>> twice). A similar caveat exists with the Windows API.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Doh! Maybe I've missed the simplest option after all.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What if the root set was a 'first class' concept in the 
>>>>>>>>>>>>> extraction run, rather than something we infer from this 
>>>>>>>>>>>>> or than command line option?
>>>>>>>>>>>>>
>>>>>>>>>>>>> That way we could point jextract at python.h, but then say 
>>>>>>>>>>>>> "hey, I'm only really interested at stuff that comes from 
>>>>>>>>>>>>> pythonrun.h".
>>>>>>>>>>>>
>>>>>>>>>>>> In other words, right now we have
>>>>>>>>>>>>
>>>>>>>>>>>> --include-symbol
>>>>>>>>>>>>
>>>>>>>>>>>> and
>>>>>>>>>>>>
>>>>>>>>>>>> --exclude-symbol
>>>>>>>>>>>>
>>>>>>>>>>>> If not set, everything is included.
>>>>>>>>>>>>
>>>>>>>>>>>> Maybe all we need is:
>>>>>>>>>>>>
>>>>>>>>>>>> --include-header <header>
>>>>>>>>>>>>
>>>>>>>>>>>> --exclude-header <header>
>>>>>>>>>>>>
>>>>>>>>>>>> And, again, if none is specified, all headers are part of 
>>>>>>>>>>>> the 'root set'.
>>>>>>>>>>>>
>>>>>>>>>>>> And, to address a concern you had - yes, I'd consider using 
>>>>>>>>>>>> the
>>>>>>>>>>>> 'include path' for header names in --include/exclude-header
>>>>>>>>>>>>
>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So, the guide-line is: Pass the main header first, then 
>>>>>>>>>>>>>> internal headers. e.g. If we have a main header A.h which 
>>>>>>>>>>>>>> includes a_impl.h, and another main header B.h which 
>>>>>>>>>>>>>> includes b_impl.h, headers should be passed to jextract 
>>>>>>>>>>>>>> in the order: A.h a_impl.h B.h b_impl.h
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think the behavior being dependent on the ordering of 
>>>>>>>>>>>>>> the headers could be fixed by sorting the headers in 
>>>>>>>>>>>>>> topological order. But, there's still the requirement to 
>>>>>>>>>>>>>> pass all of them, or things break.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Any thoughs about that?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Jorn
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Jorn Vernee schreef op 2019-03-15 14:40:
>>>>>>>>>>>>>>> I've already been using shell scripts mostly when 
>>>>>>>>>>>>>>> running jextract
>>>>>>>>>>>>>>> (except for simple examples). I find it very useful to 
>>>>>>>>>>>>>>> be able to
>>>>>>>>>>>>>>> split the command over multiple lines, especially long 
>>>>>>>>>>>>>>> file paths
>>>>>>>>>>>>>>> become much more readable.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'll start working on this then.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Jorn
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Maurizio Cimadamore schreef op 2019-03-15 14:21:
>>>>>>>>>>>>>>>> On 15/03/2019 13:15, Jorn Vernee wrote:
>>>>>>>>>>>>>>>>> I still like this approach, and I think adding support 
>>>>>>>>>>>>>>>>> for wildcard patterns and/or header filters would make 
>>>>>>>>>>>>>>>>> it better.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Like you said: It's dead simple. What you pass to 
>>>>>>>>>>>>>>>>> jextract is what you get. Though, we could still apply 
>>>>>>>>>>>>>>>>> dependency analysis, which would make sure nothing 
>>>>>>>>>>>>>>>>> that's needed gets dropped.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Why don't we try this then?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Note that in javac we often use this trick:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> javac `find <path> -name *.java`
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> which works
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> also, note that jextract also accepts a file with the @ 
>>>>>>>>>>>>>>>> syntax, etc.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> jextract @args.txt
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> where args.txt is the command line (which could list 
>>>>>>>>>>>>>>>> all the headers you want!).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Maybe this is indeed the simpler approach.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Maurizio