[foreign] RFR 8220544: Jextract; library dependent header filtering

Wed Mar 20 15:51:43 UTC 2019

Well, no, it's not documented, but it is implemented that way on Windows 
[1]:

     // match in unicode_case_insensitive
     final Pattern pattern = Pattern.compile(expr,
         Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

But that's just how it goes with implementation-dependent APIs, right?

In general, I'd say we want to take maximum advantage of file system 
leniency when matching paths, and on Windows this also makes path 
matching case insensitive.
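
To make the difference concrete, here is a minimal sketch (not the code from the webrev) that puts an explicitly case-insensitive java.util.regex.Pattern next to a matcher obtained from FileSystem::getPathMatcher. The class name, method names and the sample header name are made up for illustration.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.regex.Pattern;

final class PathMatchSketch {
    // Explicit flags: case insensitive on every platform.
    static boolean matchesIgnoringCase(String regex, String path) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
                .matcher(path)
                .matches();
    }

    // Delegates the casing rules to the default file system implementation.
    // The javadoc leaves case sensitivity unspecified, so the result for
    // differently cased input can vary between platforms.
    static boolean matchesOnDefaultFileSystem(String regex, String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("regex:" + regex);
        return matcher.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        // Same pattern, same (hypothetical) header name, differing only in case.
        System.out.println(matchesIgnoringCase("windows\\.h", "Windows.h"));
        // Platform dependent: no expected value is claimed here.
        System.out.println(matchesOnDefaultFileSystem("windows\\.h", "Windows.h"));
    }
}
```

The first call does not depend on the platform; the second is exactly the implementation-dependent part being discussed in this thread.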

Jorn

[1] : 
http://hg.openjdk.java.net/jdk/jdk/file/83cace4142c8/src/java.base/windows/classes/sun/nio/fs/WindowsFileSystem.java#l278

Sundararajan Athijegannathan wrote on 2019-03-20 16:39:
> That API you're using does not seem to guarantee anything about case
> matching - so lenient on Windows does not seem to be guaranteed,
> right?
> 
> PS. "the default Unix file system _is_ case sensitive". No, not Mac -
> as usual mac is different :)
> 
> -Sundar
> 
> On 20/03/19, 8:53 PM, Jorn Vernee wrote:
>> For the header path matching, we are currently not implementation 
>> dependent. But, my suggestion is to be lenient towards path casing on 
>> Windows. We could make the current pattern match case insensitive, but 
>> the case sensitivity of paths depends on the file system, so using a 
>> file system dependent utility seems appropriate. AFAICT the default 
>> Unix file system _is_ case sensitive. So I think we want to retain 
>> case sensitive matching in that case?
>> 
>> Note that we are currently always using the default file system when 
>> creating paths for SourceLocations, because we rely on Paths::get to 
>> create them.
>> 
>> Jorn
>> 
>> Sundararajan Athijegannathan wrote on 2019-03-20 16:02:
>>> Hmm..
>>> 
>>> The javadoc says this:
>>> 
>>> " For both the glob and regex syntaxes, the matching details, such as
>>> whether the matching is case sensitive, are implementation-dependent
>>> and therefore not specified."
>>> https://docs.oracle.com/javase/10/docs/api/java/nio/file/FileSystem.html
>>> 
>>> Are we dependent on implementation dependent behavior?
>>> 
>>> -Sundar
>>> 
>>> On 20/03/19, 7:21 PM, Jorn Vernee wrote:
>>> 
>>>> While trying to get a minimal artifact for the opengl sample I ran
>>>> into a weird problem where clang was changing the casing of some of
>>>> the system header paths, which was making the regex match fail.
>>>> Casing of paths on Windows is mostly irrelevant, so that explains
>>>> why this wasn't a problem before.
>>>> 
>>>> After searching for a solution I arrived at
>>>> FileSystem::getPathMatcher, i.e. a utility for matching paths
>>>> already exists, and has FileSystem specific implementations (e.g.
>>>> making the match case insensitive on Windows).
>>>> 
>>>> Switching to that solves the problem:
>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/caps/webrev.00
>>>> 
>>>> The numbers are:
>>>> 
>>>> OpenGL:
>>>> before: 4244087 bytes
>>>> after:  106145 bytes (~40x improvement)
>>>> 
>>>> Python:
>>>> before: 482587 bytes
>>>> after:  317240 bytes (~1.5x improvement)
>>>> 
>>>> Tensorflow:
>>>> before: 76915 bytes
>>>> after:  41267 bytes (~1.9x improvement)
>>>> 
>>>> The improvement is most significant for libraries that depend
>>>> heavily on the Windows API (OpenGL) since there's a lot that can be
>>>> filtered out.
>>>> 
>>>> Cheers,
>>>> Jorn
>>>> 
>>>> Maurizio Cimadamore wrote on 2019-03-19 22:06:
>>>> Cool! This is a very nice milestone for jextract :-)
>>>> 
>>>> When you have time (and if you can) it would be nice to get some
>>>> stats on extracting a library (e.g. Python) on Windows before/after,
>>>> to get a sense of what degree of improvement in terms of static
>>>> footprint we can get. I tested on Linux and I can get smaller jars
>>>> (up to 2x) but I sense that the improvement is much more dramatic on
>>>> Windows-land.
>>>> 
>>>> Maurizio
>>>> 
>>>> On 19/03/2019 19:58, Jorn Vernee wrote:
>>>> Thanks for the reviews. I've updated the comment and pushed.
>>>> 
>>>> Jorn
>>>> 
>>>> Sundararajan Athijegannathan wrote on 2019-03-19 18:07:
>>>> Tests passed. Please go ahead with push.
>>>> 
>>>> -Sundar
>>>> 
>>>> On 19/03/19, 7:46 PM, Sundararajan Athijegannathan wrote:
>>>> * 47 * A merge between {@link DependencyFinder} and {@link ElementFilter}
>>>> 
>>>> in DependencyFilter. Could be removed?
>>>> 
>>>> No comments other than that. +1
>>>> 
>>>> I'm going to submit an internal mach5 build/request for this patch.
>>>> I'll notify you as soon as it is all green.
>>>> 
>>>> -Sundar
>>>> 
>>>> On 19/03/19, 7:34 PM, Maurizio Cimadamore wrote:
>>>> This looks good to me, I have no further comments.
>>>> 
>>>> Great work!
>>>> 
>>>> (I'll leave it to Sundar for further comments/suggestions)
>>>> 
>>>> Maurizio
>>>> 
>>>> On 19/03/2019 13:50, Jorn Vernee wrote:
>>>> Suggestions sound good.
>>>> 
>>>> Here's the updated webrev:
>>>> 
>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8220544/webrev.07/
>>>> 
>>>> 
>>>> * Split library based filtering code off from SymbolFilter into
>>>> LibraryLookupFilter.
>>>> * Merged DependencyFinder & ElementFilter into DependencyFilter (as
>>>> nested classes).
>>>> * Also added 2 more test cases to check that symbols being filtered
>>>> out don't have their dependencies added.
>>>> 
>>>> Jorn
>>>> 
>>>> Maurizio Cimadamore wrote on 2019-03-19 13:40:
>>>> Looks good, I did some tests and it seems to behave according to
>>>> expectations - some comments below:
>>>> 
>>>> On 18/03/2019 10:58, Jorn Vernee wrote:
>>>> Hi,
>>>> 
>>>> I have implemented this:
>>>> 
>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8220544/webrev.06/
>>>> This adds 2 options --include-headers and --exclude-headers that
>>>> take regular expression patterns as an argument. The patterns are
>>>> matched against a cursor's full header file path to determine if it
>>>> appears in the header root set.
>>>> 
>>>> For example, I can use the following to include headers in the
>>>> Python include directory:
>>>> 
>>>> --include-headers "J:\\Python27\\include\\.*"
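
As an illustration of how such patterns could be applied, here is a minimal sketch under stated assumptions; it is not the code from the webrev. The class and method names are hypothetical, as is the second sample path. It assumes that an empty include list means "no filtering" and that exclude patterns win over include patterns; the thread does not spell out the latter.

```java
import java.util.List;
import java.util.regex.Pattern;

final class HeaderFilterSketch {
    // Hypothetical helper: decides whether a header's full path puts it in the root set.
    static boolean isRootHeader(String headerPath, List<Pattern> includes, List<Pattern> excludes) {
        // Assumption: with no include patterns, every header is a candidate.
        boolean included = includes.isEmpty() || matchesAny(includes, headerPath);
        // Assumption: an exclude match overrides an include match.
        return included && !matchesAny(excludes, headerPath);
    }

    private static boolean matchesAny(List<Pattern> patterns, String headerPath) {
        for (Pattern p : patterns) {
            // matches() requires the whole path to match, not just a substring.
            if (p.matcher(headerPath).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The regex from the example above; backslashes are doubled again for Java source.
        List<Pattern> includes = List.of(Pattern.compile("J:\\\\Python27\\\\include\\\\.*"));
        System.out.println(isRootHeader("J:\\Python27\\include\\pythonrun.h", includes, List.of()));
        System.out.println(isRootHeader("C:\\SomeSdk\\include\\other.h", includes, List.of()));
    }
}
```

Note the two levels of escaping: the regex needs `\\` to match one literal backslash, and a Java string literal needs each of those backslashes doubled once more.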
>>>> 
>>>> ---
>>>> 
>>>> To recap, we now:
>>>> 1.) Filter functions, vars, and macros by whether they appear in a
>>>> root header.
>>>> 2.) Filter functions, vars, and macros by the
>>>> --includes/exclude-symbols patterns.
>>>> 3.) Filter functions and vars, by whether they appear in any
>>>> shared library.
>>>> 
>>>> I think this is ok, but the way this happens in the code seems to be
>>>> sort of ad hoc, and probably the result of how the code used to be in
>>>> previous rounds.
>>>> 
>>>> For instance, SymbolFilter does all of (1), (2) and (3).
>>>> 
>>>> Then we have DependencyFinder which also does a bit of (1).
>>>> 
>>>> Finally we have ElementFilter, which drops everything that is not
>>>> required (as per 4, 5).
>>>> 
>>>> I believe the code could be improved by:
>>>> 
>>>> * separating the library-based filtering from the option-driven
>>>> filtering. These are two separate mechanisms (the latter being
>>>> optional on -L being used at all). Of course we should also decide
>>>> which is used first.
>>>> 
>>>> * bring together DependencyFinder and ElementFilter into
>>>> DependencyFilter. You can have a two pass visitor - collect deps on
>>>> a first pass, then prune on the second pass. This way the state
>>>> remains 'internal' to the DependencyFilter visitor, no need to pass
>>>> magic state between two passes.
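
A rough sketch of the two-pass idea, assuming a drastically simplified model: the `Decl` class below is a hypothetical stand-in for jextract's trees, and none of these names come from the actual patch. Pass one collects the declarations transitively required by anything in a root header; pass two prunes everything else. The `required` set never leaves the filter object.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DependencyFilterSketch {
    /** Hypothetical stand-in for a tree: a named declaration found in some header. */
    static final class Decl {
        final String name;
        final String header;
        final List<String> dependsOn; // names of the types this declaration refers to

        Decl(String name, String header, List<String> dependsOn) {
            this.name = name;
            this.header = header;
            this.dependsOn = dependsOn;
        }
    }

    private final Set<String> rootHeaders;
    private final Set<String> required = new HashSet<>(); // state stays internal

    DependencyFilterSketch(Set<String> rootHeaders) {
        this.rootHeaders = rootHeaders;
    }

    /** Returns the names of the declarations that survive filtering, in input order. */
    List<String> filter(List<Decl> decls) {
        Map<String, Decl> byName = new HashMap<>();
        for (Decl d : decls) {
            byName.put(d.name, d);
        }

        // Pass 1: starting from root-header declarations, collect transitive dependencies.
        required.clear();
        Deque<Decl> work = new ArrayDeque<>();
        for (Decl d : decls) {
            if (rootHeaders.contains(d.header)) {
                work.push(d);
            }
        }
        while (!work.isEmpty()) {
            Decl d = work.pop();
            for (String dep : d.dependsOn) {
                Decl target = byName.get(dep);
                // add() returns false for names already seen, which also breaks cycles.
                if (target != null && required.add(dep)) {
                    work.push(target);
                }
            }
        }

        // Pass 2: keep a declaration if it is in a root header OR required as a dependency.
        List<String> kept = new ArrayList<>();
        for (Decl d : decls) {
            if (rootHeaders.contains(d.header) || required.contains(d.name)) {
                kept.add(d.name);
            }
        }
        return kept;
    }
}
```

Because only root-header declarations seed the first pass, a declaration that gets filtered out never contributes its own dependencies.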
>>>> 
>>>> Maurizio
>>>> 
>>>> 4.) For the remaining trees in the root headers do dependency
>>>> analysis, and find the required structs, enums, and typedefs.
>>>> 5.) For structs, enums, and typedefs include them if they appear
>>>> in a root header OR are required as dependency.
>>>> 
>>>> Cheers,
>>>> Jorn
>>>> 
>>>> Maurizio Cimadamore wrote on 2019-03-15 19:29:
>>>> On 15/03/2019 18:21, Jorn Vernee wrote:
>>>> Heh - was writing an email pretty much with this suggestion :P
>>>> 
>>>> I think this is a good idea! If you don't think this adds too much
>>>> complexity to the options?
>>>> 
>>>> So we'd basically have:
>>>> 
>>>> Without --include/exclude-header -> same behavior we have right
>>>> now.
>>>> With --include/exclude-header -> Add included headers to root set
>>>> + find any dependencies, and put both in the generated artifact.
>>>> 
>>>> I like that it does no filtering by default, so ad-hoc users don't
>>>> have to figure out which header files define which functions.
>>>> 
>>>> Yep - I like this too. It goes towards the 'no magic' goal, and it
>>>> adds the filter option that makes most sense when you want to chop
>>>> down an API for good. This is a filtering mechanism after all, the
>>>> fact that it's coming back as a filtering option is, IMHO, a plus,
>>>> not a minus.
>>>> 
>>>> Btw, slight revision to what I said previously - let's make the
>>>> argument of include/exclude regexes too (as for others). This way it
>>>> can be even easier to use, most headers do have some commonalities
>>>> in them - and I like that, with that, we can basically do the
>>>> path-based heuristics w/o hardwiring that into jextract.
>>>> 
>>>> Maurizio
>>>> 
>>>> Jorn
>>>> 
>>>> Maurizio Cimadamore wrote on 2019-03-15 19:07:
>>>> On 15/03/2019 18:01, Maurizio Cimadamore wrote:
>>>> 
>>>> On 15/03/2019 17:48, Jorn Vernee wrote:
>>>> I've implemented this, and now doing a clean build based on the
>>>> latest jdk/jdk merge before submitting the next webrev.
>>>> 
>>>> It's working nicely, with one caveat: some headers rely on
>>>> pre-processor code defined in a parent header. For instance,
>>>> Python.h defines some pre-processor code used in pythonrun.h . The
>>>> example we have only uses functions defined in pythonrun.h, but if
>>>> we just pass that header to jextract Clang throws an error because
>>>> of the missing pre-processor code from Python.h
>>>> 
>>>> In this case we have to pass Python.h first, and then pythonrun.h to
>>>> get everything to work. This also relies on the existence of header
>>>> include guards in the pythonrun.h header (since we're basically
>>>> including it twice). A similar caveat exists with the Windows API.
>>>> 
>>>> Doh! Maybe I've missed the simplest option after all.
>>>> 
>>>> What if the root set was a 'first class' concept in the extraction
>>>> run, rather than something we infer from this or that command line
>>>> option?
>>>> 
>>>> That way we could point jextract at python.h, but then say "hey, I'm
>>>> only really interested in stuff that comes from pythonrun.h".
>>> 
>>> In other words, right now we have
>>> 
>>> --include-symbol
>>> 
>>> and
>>> 
>>> --exclude-symbol
>>> 
>>> If not set, everything is included.
>>> 
>>> Maybe all we need is:
>>> 
>>> --include-header <header>
>>> 
>>> --exclude-header <header>
>>> 
>>> And, again, if none is specified, all headers are part of the 'root
>>> set'.
>>> 
>>> And, to address a concern you had - yes, I'd consider using the
>>> 'include path' for header names in --include/exclude-header
>>> 
>>> Maurizio
>>> 
>>>> Maurizio
>>>> 
>>>> So, the guideline is: Pass the main header first, then internal
>>>> headers. e.g. If we have a main header A.h which includes a_impl.h,
>>>> and another main header B.h which includes b_impl.h, headers should
>>>> be passed to jextract in the order: A.h a_impl.h B.h b_impl.h
>>>> 
>>>> I think the behavior being dependent on the ordering of the headers
>>>> could be fixed by sorting the headers in topological order. But,
>>>> there's still the requirement to pass all of them, or things break.
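
For reference, a minimal sketch of what such a topological ordering could look like, assuming the include graph is already available as a map from each header to the headers it directly includes. That input format, the class name and the method name are all hypothetical; jextract does not necessarily expose the graph this way. It uses Kahn's algorithm and puts each includer before the headers it includes, which is the "main header first" rule above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HeaderOrderSketch {
    /**
     * Orders headers so that every header comes before the headers it includes.
     * 'includes' maps a header to the headers it directly #includes (assumed input).
     */
    static List<String> includerFirstOrder(Map<String, List<String>> includes) {
        // Count, for each header, how many times it is included by another header.
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : includes.entrySet()) {
            inDegree.putIfAbsent(e.getKey(), 0);
            for (String included : e.getValue()) {
                inDegree.merge(included, 1, Integer::sum);
            }
        }

        // Headers that nothing else includes are the "main" headers; start there.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : inDegree.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String header = ready.poll();
            order.add(header);
            for (String included : includes.getOrDefault(header, List.of())) {
                // A header becomes ready once all of its includers have been emitted.
                if (inDegree.merge(included, -1, Integer::sum) == 0) {
                    ready.add(included);
                }
            }
        }

        if (order.size() != inDegree.size()) {
            // Mutually including headers (legal with include guards) have no such order.
            throw new IllegalArgumentException("include cycle detected");
        }
        return order;
    }
}
```

Any order that keeps A.h ahead of a_impl.h and B.h ahead of b_impl.h is a valid result, so the output need not be identical to the hand-written order above.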
>>>> 
>>>> Any thoughts about that?
>>>> 
>>>> Thanks,
>>>> Jorn
>>>> 
>>>> Jorn Vernee wrote on 2019-03-15 14:40:
>>>> I've already been using shell scripts mostly when running jextract
>>>> (except for simple examples). I find it very useful to be able to
>>>> split the command over multiple lines, especially long file paths
>>>> become much more readable.
>>>> 
>>>> I'll start working on this then.
>>>> 
>>>> Jorn
>>>> 
>>>> Maurizio Cimadamore wrote on 2019-03-15 14:21:
>>>> On 15/03/2019 13:15, Jorn Vernee wrote:
>>>> I still like this approach, and I think adding support for wildcard
>>>> patterns and/or header filters would make it better.
>>>> 
>>>> Like you said: It's dead simple. What you pass to jextract is what
>>>> you get. Though, we could still apply dependency analysis, which
>>>> would make sure nothing that's needed gets dropped.
>>>> 
>>>> Why don't we try this then?
>>>> 
>>>> Note that in javac we often use this trick:
>>>> 
>>>> javac `find <path> -name *.java`
>>>> 
>>>> which works
>>>> 
>>>> also, note that jextract also accepts a file with the @ syntax, etc.
>>>> 
>>>> 
>>>> jextract @args.txt
>>>> 
>>>> where args.txt is the command line (which could list all the headers
>>>> you want!).
>>>> 
>>>> Maybe this is indeed the simpler approach.
>>>> 
>>>> Maurizio