[foreign] RFR 8220544: Jextract; library dependent header filtering

Wed Mar 20 15:23:07 UTC 2019

For the header path matching, we currently do not depend on
implementation-specific behavior. But my suggestion is to be lenient
towards path casing on Windows. We could make the current pattern match
case insensitive, but the case sensitivity of paths depends on the file
system, so using a file system dependent utility seems appropriate.
AFAICT the default Unix file system _is_ case sensitive, so I think we
want to retain case sensitive matching in that case?

Note that we are currently always using the default file system when 
creating paths for SourceLocations, because we rely on Paths::get to 
create them.
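
To make the two alternatives concrete, here is a minimal sketch (not the
code in the webrev). The class HeaderPathMatching and both helper methods
are hypothetical names; the only JDK APIs used are Pattern::compile,
Paths::get and FileSystem::getPathMatcher. Whether the second variant
ignores case is decided by the file system provider, which is exactly the
implementation-dependent part the javadoc mentions.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class HeaderPathMatching {

    // Alternative 1: plain regex over the path string, with the case
    // sensitivity chosen explicitly by the caller.
    static boolean matchesRegex(String regex, String path, boolean ignoreCase) {
        int flags = ignoreCase ? Pattern.CASE_INSENSITIVE : 0;
        return Pattern.compile(regex, flags).matcher(path).matches();
    }

    // Alternative 2: delegate to the default file system, which decides
    // matching details such as case sensitivity itself.
    static boolean matchesOnDefaultFileSystem(String regex, String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("regex:" + regex);
        Path p = Paths.get(path);
        return matcher.matches(p);
    }

    public static void main(String[] args) {
        String regex = ".*include.*";
        String path = "INCLUDE/gl.h";
        System.out.println(matchesRegex(regex, path, false));        // explicit: case sensitive
        System.out.println(matchesRegex(regex, path, true));         // explicit: case insensitive
        System.out.println(matchesOnDefaultFileSystem(regex, path)); // result depends on the platform
    }
}
```

The second variant keeps the decision with the file system that created
the path, at the cost of results that differ between platforms.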

Jorn

Sundararajan Athijegannathan wrote on 2019-03-20 16:02:
> Hmm..
> 
> The javadoc says this:
> 
> " For both the glob and regex syntaxes, the matching details, such as
> whether the matching is case sensitive, are implementation-dependent
> and therefore not specified."
> https://docs.oracle.com/javase/10/docs/api/java/nio/file/FileSystem.html
> 
> Are we dependent on implementation dependent behavior?
> 
> -Sundar
> 
> On 20/03/19, 7:21 PM, Jorn Vernee wrote:
> 
>> While trying to get a minimal artifact for the opengl sample I ran
>> into a weird problem where clang was changing the casing of some of
>> the system header paths, which was making the regex match fail.
>> Casing of paths on Windows is mostly irrelevant, so that explains
>> why this wasn't a problem before.
>> 
>> After searching for a solution I arrived at
>> FileSystem::getPathMatcher, i.e. a utility for matching paths
>> already exists, and has FileSystem specific implementations (e.g.
>> making the match case insensitive on Windows).
>> 
>> Switching to that solves the problem:
>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/caps/webrev.00
>> 
>> The numbers are:
>> 
>> OpenGL:
>> before: 4244087 bytes
>> after:  106145 bytes (~40x improvement)
>> 
>> Python:
>> before: 482587 bytes
>> after:  317240 bytes (~1.5x improvement)
>> 
>> Tensorflow:
>> before: 76915 bytes
>> after:  41267 bytes (~1.9x improvement)
>> 
>> The improvement is most significant for libraries that depend
>> heavily on the Windows API (OpenGL) since there's a lot that can be
>> filtered out.
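
The factors quoted above are simply the "before" size divided by the
"after" size. A small sketch for re-checking them; the class name
FootprintRatios is hypothetical, the byte counts are copied from the
figures above.

```java
// Hypothetical helper: recomputes the improvement factors from the
// byte counts listed in the message.
final class FootprintRatios {

    static double ratio(long before, long after) {
        return (double) before / after;
    }

    public static void main(String[] args) {
        System.out.println("OpenGL:     " + ratio(4244087L, 106145L));
        System.out.println("Python:     " + ratio(482587L, 317240L));
        System.out.println("Tensorflow: " + ratio(76915L, 41267L));
    }
}
```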
>> 
>> Cheers,
>> Jorn
>> 
>> Maurizio Cimadamore wrote on 2019-03-19 22:06:
>> Cool! This is a very nice milestone for jextract :-)
>> 
>> When you have time (and if you can) it would be nice to get some
>> stats on extracting a library (e.g. Python) on Windows before/after,
>> to get a sense of what degree of improvement in terms of static
>> footprint we can get. I tested on Linux and I can get smaller jars
>> (up to 2x) but I sense that the improvement is much more dramatic in
>> Windows-land.
>> 
>> Maurizio
>> 
>> On 19/03/2019 19:58, Jorn Vernee wrote:
>> Thanks for the reviews. I've updated the comment and pushed.
>> 
>> Jorn
>> 
>> Sundararajan Athijegannathan wrote on 2019-03-19 18:07:
>> Tests passed. Please go ahead with push.
>> 
>> -Sundar
>> 
>> On 19/03/19, 7:46 PM, Sundararajan Athijegannathan wrote:
>> * 47 * A merge between {@link DependencyFinder} and {@link
>> ElementFilter}
>> 
>> in DependencyFilter. Could be removed?
>> 
>> No comments other than that. +1
>> 
>> I'm going to submit an internal mach5 build/request for this patch.
>> I'll notify you as soon as it is all green.
>> 
>> -Sundar
>> 
>> On 19/03/19, 7:34 PM, Maurizio Cimadamore wrote:
>> This looks good to me, I have no further comments.
>> 
>> Great work!
>> 
>> (I'll leave it to Sundar for further comments/suggestions)
>> 
>> Maurizio
>> 
>> On 19/03/2019 13:50, Jorn Vernee wrote:
>> Suggestions sound good.
>> 
>> Here's the updated webrev:
>> 
>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8220544/webrev.07/
>> 
>> 
>> * Split library based filtering code off from SymbolFilter into
>> LibraryLookupFilter.
>> * Merged DependencyFinder & ElementFilter into DependencyFilter (as
>> nested classes).
>> * Also added 2 more test cases to check that symbols being filtered
>> out don't have their dependencies added.
>> 
>> Jorn
>> 
>> Maurizio Cimadamore wrote on 2019-03-19 13:40:
>> Looks good, I did some tests and it seems to behave according to
>> expectations - some comments below:
>> 
>> On 18/03/2019 10:58, Jorn Vernee wrote:
>> Hi,
>> 
>> I have implemented this:
>> 
>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8220544/webrev.06/
>> This adds 2 options --include-headers and --exclude-headers that
>> take regular expression patterns as an argument. The patterns are
>> matched against a cursor's full header file path to determine if it
>> appears in the header root set.
>> 
>> For example, I can use the following to include headers in the
>> Python include directory:
>> 
>> --include-headers "J:\\Python27\\include\\.*"
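
A minimal sketch of how such a pattern behaves when matched against a
full header path with java.util.regex. It assumes the shell hands the
backslashes to jextract unchanged and that the whole path has to match
(Matcher::matches); neither is stated in the thread. The class name and
the sample paths are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of matching a header path against the
// --include-headers pattern shown above.
final class IncludeHeadersExample {

    // The pattern as typed on the command line:
    //   J:\\Python27\\include\\.*
    // Each "\\" is a regex escape for one literal backslash. Inside a
    // Java string literal every backslash is doubled once more.
    static final Pattern INCLUDE =
            Pattern.compile("J:\\\\Python27\\\\include\\\\.*");

    // Assumption: the entire path must match the pattern.
    static boolean isRootHeader(String fullHeaderPath) {
        return INCLUDE.matcher(fullHeaderPath).matches();
    }

    public static void main(String[] args) {
        // Made-up sample paths, one inside and one outside the include directory.
        System.out.println(isRootHeader("J:\\Python27\\include\\Python.h"));
        System.out.println(isRootHeader("J:\\Python27\\libs\\python27.lib"));
    }
}
```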
>> 
>> ---
>> 
>> To recap, we now:
>> 1.) Filter functions, vars, and macros by whether they appear in a
>> root header.
>> 2.) Filter functions, vars, and macros by the
>> --include/exclude-symbols patterns.
>> 3.) Filter functions and vars, by whether they appear in any
>> shared library.
>> 
>> I think this is ok, but the way this happens in the code seems to be
>> sort of ad hoc, and probably the result of how the code used to be in
>> previous rounds.
>> 
>> For instance, SymbolFilter does all of (1), (2) and (3).
>> 
>> Then we have DependencyFinder which also does a bit of (1).
>> 
>> Finally we have ElementFilter, which drops everything that is not
>> required (as per 4, 5).
>> 
>> I believe the code could be improved by:
>> 
>> * separating the library-based filtering from the option-driven
>> filtering. These are two separate mechanisms (the latter being
>> optional on -L being used at all). Of course we should also decide
>> which is used first.
>> 
>> * bring together DependencyFinder and ElementFilter into
>> DependencyFilter. You can have a two pass visitor - collect deps on
>> a first pass, then prune on the second pass. This way the state
>> remains 'internal' to the DependencyFilter visitor, no need to pass
>> magic state between two passes.
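
The two-pass idea can be sketched roughly as follows. This is not the
DependencyFilter from the webrev: Decl is a hypothetical, heavily
simplified stand-in for a jextract tree, and dependencies are modelled
as plain names. The point is only that the "required" set is collected
and consumed inside one class.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified stand-in for a declaration tree.
final class Decl {
    final String name;
    final boolean inRootHeader;
    final List<String> dependsOn; // names of the types this declaration uses

    Decl(String name, boolean inRootHeader, String... dependsOn) {
        this.name = name;
        this.inRootHeader = inRootHeader;
        this.dependsOn = Arrays.asList(dependsOn);
    }
}

// Two-pass filter: the "required" set never leaves this class.
final class TwoPassDependencyFilter {
    private final Set<String> required = new HashSet<>();

    List<Decl> filter(List<Decl> decls) {
        required.clear();
        Map<String, Decl> byName = new HashMap<>();
        for (Decl d : decls) {
            byName.put(d.name, d);
        }

        // Pass 1: collect, transitively, everything that declarations
        // in root headers depend on.
        Deque<String> work = new ArrayDeque<>();
        for (Decl d : decls) {
            if (d.inRootHeader) {
                work.addAll(d.dependsOn);
            }
        }
        while (!work.isEmpty()) {
            String name = work.pop();
            if (required.add(name)) { // false if already seen, so cycles terminate
                Decl dep = byName.get(name);
                if (dep != null) {
                    work.addAll(dep.dependsOn);
                }
            }
        }

        // Pass 2: keep root-header declarations and required ones, drop the rest.
        List<Decl> kept = new ArrayList<>();
        for (Decl d : decls) {
            if (d.inRootHeader || required.contains(d.name)) {
                kept.add(d);
            }
        }
        return kept;
    }
}
```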
>> 
>> Maurizio
>> 
>> 4.) For the remaining trees in the root headers do dependency
>> analysis, and find the required structs, enums, and typedefs.
>> 5.) For structs, enums, and typedefs include them if they appear
>> in a root header OR are required as dependency.
>> 
>> Cheers,
>> Jorn
>> 
>> Maurizio Cimadamore wrote on 2019-03-15 19:29:
>> On 15/03/2019 18:21, Jorn Vernee wrote:
>> Heh - was writing an email pretty much with this suggestion :P
>> 
>> I think this is a good idea! If you don't think this adds too much
>> complexity to the options?
>> 
>> So we'd basically have:
>> 
>> Without --include/exclude-header -> same behavior we have right
>> now.
>> With --include/exclude-header -> Add included headers to root set
>> + find any dependencies, and put both in the generated artifact.
>> 
>> I like that it does no filtering by default, so ad-hoc users don't
>> have to figure out which header files define which functions.
>> 
>> Yep - I like this too. It goes towards the 'no magic' goal, and it
>> adds the filter option that makes most sense when you want to chop
>> down an API for good. This is a filtering mechanism after all; the
>> fact that it's coming back as a filtering option is, IMHO, a plus,
>> not a minus.
>> 
>> Btw, slight revision to what I said previously - let's make the
>> arguments of include/exclude regexes too (as for the others). This
>> way it can be even easier to use, since most headers do have some
>> commonalities in them - and I like that, with that, we can basically
>> do the path-based heuristics w/o hardwiring that into jextract.
>> 
>> Maurizio
>> 
>> Jorn
>> 
>> Maurizio Cimadamore wrote on 2019-03-15 19:07:
>> On 15/03/2019 18:01, Maurizio Cimadamore wrote:
>> 
>> On 15/03/2019 17:48, Jorn Vernee wrote:
>> I've implemented this, and now doing a clean build based on the
>> latest jdk/jdk merge before submitting the next webrev.
>> 
>> It's working nicely, with one caveat: some headers rely on
>> pre-processor code defined in a parent header. For instance,
>> Python.h defines some pre-processor code used in pythonrun.h . The
>> example we have only uses functions defined in pythonrun.h, but if
>> we just pass that header to jextract Clang throws an error because
>> of the missing pre-processor code from Python.h
>> 
>> In this case we have to pass Python.h first, and then pythonrun.h to
>> get everything to work. This also relies on the existence of header
>> include guards in the pythonrun.h header (since we're basically
>> including it twice). A similar caveat exists with the Windows API.
>> 
>> Doh! Maybe I've missed the simplest option after all.
>> 
>> What if the root set was a 'first class' concept in the extraction
>> run, rather than something we infer from this or that command line
>> option?
>> 
>> That way we could point jextract at python.h, but then say "hey, I'm
>> only really interested in stuff that comes from pythonrun.h".
> 
> In other words, right now we have
> 
> --include-symbol
> 
> and
> 
> --exclude-symbol
> 
> If not set, everything is included.
> 
> Maybe all we need is:
> 
> --include-header <header>
> 
> --exclude-header <header>
> 
> And, again, if none is specified, all headers are part of the 'root
> set'.
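
A minimal sketch of the proposed semantics, assuming regex arguments (as
suggested later in the thread) and assuming that an exclude match wins
over an include match; the precedence is not settled in this message.
RootSetFilter and its methods are hypothetical names, not jextract code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical model of --include-header / --exclude-header.
final class RootSetFilter {
    private final List<Pattern> includes = new ArrayList<>();
    private final List<Pattern> excludes = new ArrayList<>();

    RootSetFilter include(String regex) {
        includes.add(Pattern.compile(regex));
        return this;
    }

    RootSetFilter exclude(String regex) {
        excludes.add(Pattern.compile(regex));
        return this;
    }

    // With no include patterns, every header that is not excluded is
    // part of the root set.
    boolean isRoot(String headerPath) {
        if (matchesAny(excludes, headerPath)) {
            return false; // assumption: excludes take precedence
        }
        return includes.isEmpty() || matchesAny(includes, headerPath);
    }

    private static boolean matchesAny(List<Pattern> patterns, String s) {
        for (Pattern p : patterns) {
            if (p.matcher(s).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

For example, new RootSetFilter().include(".*pythonrun\\.h") would treat
only pythonrun.h as a root header, while a filter without any patterns
accepts every header.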
> 
> And, to address a concern you had - yes, I'd consider using the
> 'include path' for header names in --include/exclude-header
> 
> Maurizio
> 
>> Maurizio
>> 
>> So, the guide-line is: Pass the main header first, then internal
>> headers. e.g. If we have a main header A.h which includes a_impl.h,
>> and another main header B.h which includes b_impl.h, headers should
>> be passed to jextract in the order: A.h a_impl.h B.h b_impl.h
>> 
>> I think the behavior being dependent on the ordering of the headers
>> could be fixed by sorting the headers in topological order. But,
>> there's still the requirement to pass all of them, or things break.
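
The topological ordering mentioned above could look roughly like this,
using Kahn's algorithm so that every header comes before the headers it
includes. HeaderOrder and the include map are hypothetical; how jextract
would obtain the include graph is not covered here, and cyclic includes
are simply rejected in this sketch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: orders headers so that an including header is
// listed before the headers it includes.
final class HeaderOrder {

    // includes maps each header to the headers it #includes directly.
    static List<String> sort(Map<String, List<String>> includes) {
        // Count, for every header, how many other headers include it.
        Map<String, Integer> includedBy = new LinkedHashMap<>();
        for (String header : includes.keySet()) {
            includedBy.putIfAbsent(header, 0);
        }
        for (List<String> targets : includes.values()) {
            for (String target : targets) {
                includedBy.merge(target, 1, Integer::sum);
            }
        }

        // Start with the headers nobody includes (the "main" headers).
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : includedBy.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String header = ready.poll();
            order.add(header);
            for (String target : includes.getOrDefault(header, List.of())) {
                // A header becomes ready once all its includers are emitted.
                if (includedBy.merge(target, -1, Integer::sum) == 0) {
                    ready.add(target);
                }
            }
        }
        if (order.size() != includedBy.size()) {
            throw new IllegalArgumentException("cyclic includes");
        }
        return order;
    }
}
```

Any order produced this way lists A.h before a_impl.h and B.h before
b_impl.h, which is the property the guideline above relies on.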
>> 
>> Any thoughts about that?
>> 
>> Thanks,
>> Jorn
>> 
>> Jorn Vernee wrote on 2019-03-15 14:40:
>> I've already been using shell scripts mostly when running jextract
>> (except for simple examples). I find it very useful to be able to
>> split the command over multiple lines, especially long file paths
>> become much more readable.
>> 
>> I'll start working on this then.
>> 
>> Jorn
>> 
>> Maurizio Cimadamore wrote on 2019-03-15 14:21:
>> On 15/03/2019 13:15, Jorn Vernee wrote:
>> I still like this approach, and I think adding support for wildcard
>> patterns and/or header filters would make it better.
>> 
>> Like you said: It's dead simple. What you pass to jextract is what
>> you get. Though, we could still apply dependency analysis, which
>> would make sure nothing that's needed gets dropped.
>> 
>> Why don't we try this then?
>> 
>> Note that in javac we often use this trick:
>> 
>> javac `find <path> -name *.java`
>> 
>> which works
>> 
>> also, note that jextract also accepts a file with the @ syntax, etc.
>> 
>> 
>> jextract @args.txt
>> 
>> where args.txt is the command line (which could list all the headers
>> you want!).
>> 
>> Maybe this is indeed the simpler approach.
>> 
>> Maurizio