[foreign] RFR 8220544: Jextract; library dependent header filtering

Wed Mar 20 15:39:58 UTC 2019

That API you're using does not seem to guarantee anything case matching 
- so lenient on Windows does not seem to be guaranteed, right?

PS. "the default Unix file system _is_ case sensitive". No, not Mac - as 
usual mac is different :)

-Sundar

On 20/03/19, 8:53 PM, Jorn Vernee wrote:
> For the header path matching, we are currently not implementation 
> dependent. But, my suggestion is to be lenient towards path casing on 
> Windows. We could make the current pattern match case insensitive, but 
> the case sensitivity of paths depends on the file system, so using a 
> file system dependent utility seems appropriate. AFAICT the default 
> Unix file system _is_ case sensitive. So I think we want to retain 
> case sensitive matching in that case?
>
> Note that we are currently always using the default file system when 
> creating paths for SourceLocations, because we rely on Paths::get to 
> create them.
>
> Jorn
>
> Sundararajan Athijegannathan schreef op 2019-03-20 16:02:
>> Hmm..
>>
>> The javadoc says this:
>>
>> " For both the glob and regex syntaxes, the matching details, such as
>> whether the matching is case sensitive, are implementation-dependent
>> and therefore not specified."
>> https://docs.oracle.com/javase/10/docs/api/java/nio/file/FileSystem.html
>>
>> Are we dependent on implementation dependent behavior?
>>
>> -Sundar
>>
>> On 20/03/19, 7:21 PM, Jorn Vernee wrote:
>>
>>> While trying to get a minimal artifact for the opengl sample I ran
>>> into a weird problem where clang was changing the casing of some of
>>> the system header paths, which was making the regex match fail.
>>> Casing of paths on Windows is mostly irrelevant, so that explains
>>> why this wasn't a problem before.
>>>
>>> After searching for a solution I arrived at
>>> FileSystem::getPathMatcher, i.e. a utility for matching paths
>>> already exists, and has FileSystem specific implementations (e.g.
>>> making the match case insensitive on Windows).
>>>
>>> Switching to that solves the problem:
>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/caps/webrev.00
>>>
>>> The numbers are:
>>>
>>> OpenGL:
>>> before: 4244087 bytes
>>> after:  106145 bytes (~40x improvement)
>>>
>>> Python:
>>> before: 482587 bytes
>>> after:  317240 bytes (~1.5x improvement)
>>>
>>> Tensorflow:
>>> before: 76915 bytes
>>> after:  41267 bytes (~1.9x improvement)
>>>
>>> The improvement is most significant for libraries that depend
>>> heavily on the Windows API (OpenGL) since there's a lot that can be
>>> filtered out.
>>>
>>> Cheers,
>>> Jorn
>>>
>>> Maurizio Cimadamore schreef op 2019-03-19 22:06:
>>> Cool! This is very nice milestone for jextract :-)
>>>
>>> When you have time (and if you can) it would be nice to get some
>>> stats
>>> on extracting a library (e.g. Python) on Windows before/after, to
>>> get
>>> a sense of what degree of improvement in terms of static footprint
>>> we
>>> can get. I tested on Linux and I can get smaller jars (up to 2x) but
>>> I
>>> sense that the improvement is much more dramatic on Windows-land.
>>>
>>> Maurizio
>>>
>>> On 19/03/2019 19:58, Jorn Vernee wrote:
>>> Thanks for the reviews. I've updated the comment and pushed.
>>>
>>> Jorn
>>>
>>> Sundararajan Athijegannathan schreef op 2019-03-19 18:07:
>>> Tests passed. Please go ahead with push.
>>>
>>> -Sundar
>>>
>>> On 19/03/19, 7:46 PM, Sundararajan Athijegannathan wrote:
>>> * 47 * A merge between {@link DependencyFinder} and {@link
>>> ElementFilter}
>>>
>>> in DependencyFilter. Could be removed?
>>>
>>> No comments other than that. +1
>>>
>>> I'm going to submit an internal mach5 build/request for this patch.
>>> I'll notify you as soon as it is all green.
>>>
>>> -Sundar
>>>
>>> On 19/03/19, 7:34 PM, Maurizio Cimadamore wrote:
>>> This looks good to me, I have no further comments.
>>>
>>> Great work!
>>>
>>> (I'll leave it to Sundar for further comments/suggestions)
>>>
>>> Maurizio
>>>
>>> On 19/03/2019 13:50, Jorn Vernee wrote:
>>> Suggestions sound good.
>>>
>>> Here's the updated webrev:
>>>
>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8220544/webrev.07/
>>>
>>>
>>> * Split library based filtering code off from SymbolFilter into
>>> LibraryLookupFilter.
>>> * Merged DependencyFinder & ElementFilter into DependencyFilter (as
>>> nested classes).
>>> * Also added 2 more test cases to check that symbols being filtered
>>> out don't have their dependencies added.
>>>
>>> Jorn
>>>
>>> Maurizio Cimadamore schreef op 2019-03-19 13:40:
>>> Looks good, I did some tests and it seems to behave according to the
>>>
>>> expectations - some comments below:
>>>
>>> On 18/03/2019 10:58, Jorn Vernee wrote:
>>> Hi,
>>>
>>> I have implemented this:
>>>
>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8220544/webrev.06/
>>> This adds 2 options --include-headers and --exclude-headers that
>>> take regular expression patterns as an argument. The patterns are
>>> matched against a cursor's full header file path to determine if it
>>> appears in the header root set.
>>>
>>> For example, I can use the following to include headers in the
>>> Python include director;
>>>
>>> --include-headers "J:\\Python27\\include\\.*"
>>>
>>> ---
>>>
>>> To recap, we now:
>>> 1.) Filter functions, vars, and macros by whether they appear in a
>>> root header.
>>> 2.) Filter functions, vars, and macros by the
>>> --includes/exclude-symbols patterns.
>>> 3.) Filter functions and vars, by whether they appear in any
>>> shared library.
>>>
>>> I think this is ok, but the way this happens in the code seems to be
>>>
>>> sort of ad hoc, and probably the result of how the code use to be in
>>>
>>> previous rounds.
>>>
>>> For instance, SymbolFilter does both (1), (2) and (3).
>>>
>>> Then we have DependencyFinder which also does a bit of (1).
>>>
>>> Finally we have ElementFilter, which drops everything that is not
>>> required (as per 4, 5).
>>>
>>> I believe the code could be improved by:
>>>
>>> * separating the library-based filtering from the option-driven
>>> filtering. These are two separate mechanism (the latter being
>>> optional
>>> on -L being used at all). Of course we should also decide which is
>>> used first.
>>>
>>> * bring together DependencyFinder and ElementFilter into
>>> DependencyFilter. You can have a two pass visitor - collect deps on
>>> a
>>> first pass, then prune on the second pass. This way the state
>>> remains
>>> 'internal' to the DependencyFilter visitor, no need to pass magic
>>> state between two passes.
>>>
>>> Maurizio
>>>
>>> 4.) For the remaining trees in the root headers do dependency
>>> analysis, and find the required structs, enums, and typedefs.
>>> 5.) For structs, enums, and typedefs include them if they appear
>>> in a root header OR are required as dependency.
>>>
>>> Cheers,
>>> Jorn
>>>
>>> Maurizio Cimadamore schreef op 2019-03-15 19:29:
>>> On 15/03/2019 18:21, Jorn Vernee wrote:
>>> Heh - was writing an email pretty much with this suggestion :P
>>>
>>> I think this is a good idea! If you don't think this adds to much
>>> complexity to the options?
>>>
>>> So we'd basically have:
>>>
>>> Without --include/exclude-header -> same behavior we have right
>>> now.
>>> With --include/exclude-header -> Add included headers to root set
>>> + find any dependencies, and put both in the generated artifact.
>>>
>>> I like that it does no filtering by default, so ad-hoc users don't
>>> have to figure out which header files define which functions.
>>>
>>> Yep - I like this too. It goes towards the 'no magic' goal, and it
>>> adds the filter option that makes most sense when you want to chop
>>> down an API for good. This is a filtering mechanism after all, the
>>> fact that it's coming back as a filtering option is IMHO, a plus, no
>>> a
>>> minus.
>>>
>>> Btw, slight revision to what I said previously - let's make the
>>> argument of include/exclude regexes too (as for others). This way it
>>>
>>> can be even easier to use, most headers do have some commonalities
>>> in
>>> them - and I like that, with that,  we can basically do the
>>> path-based
>>> heuristics w/o hardwiring that into jextract.
>>>
>>> Maurizio
>>>
>>> Jorn
>>>
>>> Maurizio Cimadamore schreef op 2019-03-15 19:07:
>>> On 15/03/2019 18:01, Maurizio Cimadamore wrote:
>>>
>>> On 15/03/2019 17:48, Jorn Vernee wrote:
>>> I've implemented this, and now doing a clean build based on the
>>> latest jdk/jdk merge before submitting the next webrev.
>>>
>>> It's working nicely, with one caveat; Some headers rely on
>>> pre-processor code defined in a parent header. For instance,
>>> Python.h defines some pre-processor code used in pythonrun.h . The
>>> example we have only uses functions defined in pythonrun.h, but if
>>> we just pass that header to jextract Clang throws an error because
>>> of the missing pre-processor code from Python.h
>>>
>>> In this case we have to pass Python.h first, and then pythonrun.h to
>>> get everything to work. This also relies on the existence of header
>>> include guards in the pythonrun.h header (since we're basically
>>> including it twice). A similar caveat exists with the Windows API.
>>>
>>> Doh! Maybe I've missed the simplest option after all.
>>>
>>> What if the root set was a 'first class' concept in the extraction
>>> run, rather than something we infer from this or than command line
>>> option?
>>>
>>> That way we could point jextract at python.h, but then say "hey, I'm
>>> only really interested at stuff that comes from pythonrun.h".
>>
>> In other words, right now we have
>>
>> --include-symbol
>>
>> and
>>
>> --exclude-symbol
>>
>> If not set, everything is included.
>>
>> Maybe all we need is:
>>
>> --include-header <header>
>>
>> --exclude-header <header>
>>
>> And, again, if none is specified, all headers are part of the 'root
>> set'.
>>
>> And, to address a concern you had - yes, I'd consider using the
>> 'include path' for header names in --include/exclude-header
>>
>> Maurizio
>>
>>> Maurizio
>>>
>>> So, the guide-line is: Pass the main header first, then internal
>>> headers. e.g. If we have a main header A.h which includes a_impl.h,
>>> and another main header B.h which includes b_impl.h, headers should
>>> be passed to jextract in the order: A.h a_impl.h B.h b_impl.h
>>>
>>> I think the behavior being dependent on the ordering of the headers
>>> could be fixed by sorting the headers in topological order. But,
>>> there's still the requirement to pass all of them, or things break.
>>>
>>> Any thoughs about that?
>>>
>>> Thanks,
>>> Jorn
>>>
>>> Jorn Vernee schreef op 2019-03-15 14:40:
>>> I've already been using shell scripts mostly when running jextract
>>> (except for simple examples). I find it very useful to be able to
>>> split the command over multiple lines, especially long file paths
>>> become much more readable.
>>>
>>> I'll start working on this then.
>>>
>>> Jorn
>>>
>>> Maurizio Cimadamore schreef op 2019-03-15 14:21:
>>> On 15/03/2019 13:15, Jorn Vernee wrote:
>>> I still like this approach, and I think adding support for wildcard
>>> patterns and/or header filters would make it better.
>>>
>>> Like you said: It's dead simple. What you pass to jextract is what
>>> you get. Though, we could still apply dependency analysis, which
>>> would make sure nothing that's needed gets dropped.
>>>
>>> Why don't we try this then?
>>>
>>> Note that in javac we often use this trick:
>>>
>>> javac `find <path> -name *.java`
>>>
>>> which works
>>>
>>> also, note that jextract also accepts a file with the @ syntax, etc.
>>>
>>>
>>> jextract @args.txt
>>>
>>> where args.txt is the command line (which could list all the headers
>>> you want!).
>>>
>>> Maybe this is indeed the simpler approach.
>>>
>>> Maurizio