[foreign] RFR 8220544: Jextract; library dependent header filtering

Wed Mar 20 16:45:46 UTC 2019

sounds good. +1

-Sundar

On 20/03/19, 9:21 PM, Jorn Vernee wrote:
> Well, no, it's not documented, but it is implemented that way on 
> Windows [1]:
>
>     // match in unicode_case_insensitive
>     final Pattern pattern = Pattern.compile(expr,
>         Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
>
> But that's just how it goes with implementation dependent APIs right?
>
> In general; I'd say we want to take maximum advantage of file system 
> leniency when matching paths, and on Windows this also makes path 
> matching case insensitive.
>
> Jorn
>
> [1] : 
> http://hg.openjdk.java.net/jdk/jdk/file/83cace4142c8/src/java.base/windows/classes/sun/nio/fs/WindowsFileSystem.java#l278
>
> Sundararajan Athijegannathan schreef op 2019-03-20 16:39:
>> That API you're using does not seem to guarantee anything case
>> matching - so lenient on Windows does not seem to be guaranteed,
>> right?
>>
>> PS. "the default Unix file system _is_ case sensitive". No, not Mac -
>> as usual mac is different :)
>>
>> -Sundar
>>
>> On 20/03/19, 8:53 PM, Jorn Vernee wrote:
>>> For the header path matching, we are currently not implementation 
>>> dependent. But, my suggestion is to be lenient towards path casing 
>>> on Windows. We could make the current pattern match case 
>>> insensitive, but the case sensitivity of paths depends on the file 
>>> system, so using a file system dependent utility seems appropriate. 
>>> AFAICT the default Unix file system _is_ case sensitive. So I think 
>>> we want to retain case sensitive matching in that case?
>>>
>>> Note that we are currently always using the default file system when 
>>> creating paths for SourceLocations, because we rely on Paths::get to 
>>> create them.
>>>
>>> Jorn
>>>
>>> Sundararajan Athijegannathan schreef op 2019-03-20 16:02:
>>>> Hmm..
>>>>
>>>> The javadoc says this:
>>>>
>>>> " For both the glob and regex syntaxes, the matching details, such as
>>>> whether the matching is case sensitive, are implementation-dependent
>>>> and therefore not specified."
>>>> https://docs.oracle.com/javase/10/docs/api/java/nio/file/FileSystem.html 
>>>>
>>>>
>>>> Are we dependent on implementation dependent behavior?
>>>>
>>>> -Sundar
>>>>
>>>> On 20/03/19, 7:21 PM, Jorn Vernee wrote:
>>>>
>>>>> While trying to get a minimal artifact for the opengl sample I ran
>>>>> into a weird problem where clang was changing the casing of some of
>>>>> the system header paths, which was making the regex match fail.
>>>>> Casing of paths on Windows is mostly irrelevant, so that explains
>>>>> why this wasn't a problem before.
>>>>>
>>>>> After searching for a solution I arrived at
>>>>> FileSystem::getPathMatcher, i.e. a utility for matching paths
>>>>> already exists, and has FileSystem specific implementations (e.g.
>>>>> making the match case insensitive on Windows).
>>>>>
>>>>> Switching to that solves the problem:
>>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/caps/webrev.00
>>>>>
>>>>> The numbers are:
>>>>>
>>>>> OpenGL:
>>>>> before: 4244087 bytes
>>>>> after:  106145 bytes (~40x improvement)
>>>>>
>>>>> Python:
>>>>> before: 482587 bytes
>>>>> after:  317240 bytes (~1.5x improvement)
>>>>>
>>>>> Tensorflow:
>>>>> before: 76915 bytes
>>>>> after:  41267 bytes (~1.9x improvement)
>>>>>
>>>>> The improvement is most significant for libraries that depend
>>>>> heavily on the Windows API (OpenGL) since there's a lot that can be
>>>>> filtered out.
>>>>>
>>>>> Cheers,
>>>>> Jorn
>>>>>
>>>>> Maurizio Cimadamore schreef op 2019-03-19 22:06:
>>>>> Cool! This is very nice milestone for jextract :-)
>>>>>
>>>>> When you have time (and if you can) it would be nice to get some
>>>>> stats
>>>>> on extracting a library (e.g. Python) on Windows before/after, to
>>>>> get
>>>>> a sense of what degree of improvement in terms of static footprint
>>>>> we
>>>>> can get. I tested on Linux and I can get smaller jars (up to 2x) but
>>>>> I
>>>>> sense that the improvement is much more dramatic on Windows-land.
>>>>>
>>>>> Maurizio
>>>>>
>>>>> On 19/03/2019 19:58, Jorn Vernee wrote:
>>>>> Thanks for the reviews. I've updated the comment and pushed.
>>>>>
>>>>> Jorn
>>>>>
>>>>> Sundararajan Athijegannathan schreef op 2019-03-19 18:07:
>>>>> Tests passed. Please go ahead with push.
>>>>>
>>>>> -Sundar
>>>>>
>>>>> On 19/03/19, 7:46 PM, Sundararajan Athijegannathan wrote:
>>>>> * 47 * A merge between {@link DependencyFinder} and {@link
>>>>> ElementFilter}
>>>>>
>>>>> in DependencyFilter. Could be removed?
>>>>>
>>>>> No comments other than that. +1
>>>>>
>>>>> I'm going to submit an internal mach5 build/request for this patch.
>>>>> I'll notify you as soon as it is all green.
>>>>>
>>>>> -Sundar
>>>>>
>>>>> On 19/03/19, 7:34 PM, Maurizio Cimadamore wrote:
>>>>> This looks good to me, I have no further comments.
>>>>>
>>>>> Great work!
>>>>>
>>>>> (I'll leave it to Sundar for further comments/suggestions)
>>>>>
>>>>> Maurizio
>>>>>
>>>>> On 19/03/2019 13:50, Jorn Vernee wrote:
>>>>> Suggestions sound good.
>>>>>
>>>>> Here's the updated webrev:
>>>>>
>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8220544/webrev.07/
>>>>>
>>>>>
>>>>> * Split library based filtering code off from SymbolFilter into
>>>>> LibraryLookupFilter.
>>>>> * Merged DependencyFinder & ElementFilter into DependencyFilter (as
>>>>> nested classes).
>>>>> * Also added 2 more test cases to check that symbols being filtered
>>>>> out don't have their dependencies added.
>>>>>
>>>>> Jorn
>>>>>
>>>>> Maurizio Cimadamore schreef op 2019-03-19 13:40:
>>>>> Looks good, I did some tests and it seems to behave according to the
>>>>>
>>>>> expectations - some comments below:
>>>>>
>>>>> On 18/03/2019 10:58, Jorn Vernee wrote:
>>>>> Hi,
>>>>>
>>>>> I have implemented this:
>>>>>
>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8220544/webrev.06/
>>>>> This adds 2 options --include-headers and --exclude-headers that
>>>>> take regular expression patterns as an argument. The patterns are
>>>>> matched against a cursor's full header file path to determine if it
>>>>> appears in the header root set.
>>>>>
>>>>> For example, I can use the following to include headers in the
>>>>> Python include director;
>>>>>
>>>>> --include-headers "J:\\Python27\\include\\.*"
>>>>>
>>>>> ---
>>>>>
>>>>> To recap, we now:
>>>>> 1.) Filter functions, vars, and macros by whether they appear in a
>>>>> root header.
>>>>> 2.) Filter functions, vars, and macros by the
>>>>> --includes/exclude-symbols patterns.
>>>>> 3.) Filter functions and vars, by whether they appear in any
>>>>> shared library.
>>>>>
>>>>> I think this is ok, but the way this happens in the code seems to be
>>>>>
>>>>> sort of ad hoc, and probably the result of how the code use to be in
>>>>>
>>>>> previous rounds.
>>>>>
>>>>> For instance, SymbolFilter does both (1), (2) and (3).
>>>>>
>>>>> Then we have DependencyFinder which also does a bit of (1).
>>>>>
>>>>> Finally we have ElementFilter, which drops everything that is not
>>>>> required (as per 4, 5).
>>>>>
>>>>> I believe the code could be improved by:
>>>>>
>>>>> * separating the library-based filtering from the option-driven
>>>>> filtering. These are two separate mechanism (the latter being
>>>>> optional
>>>>> on -L being used at all). Of course we should also decide which is
>>>>> used first.
>>>>>
>>>>> * bring together DependencyFinder and ElementFilter into
>>>>> DependencyFilter. You can have a two pass visitor - collect deps on
>>>>> a
>>>>> first pass, then prune on the second pass. This way the state
>>>>> remains
>>>>> 'internal' to the DependencyFilter visitor, no need to pass magic
>>>>> state between two passes.
>>>>>
>>>>> Maurizio
>>>>>
>>>>> 4.) For the remaining trees in the root headers do dependency
>>>>> analysis, and find the required structs, enums, and typedefs.
>>>>> 5.) For structs, enums, and typedefs include them if they appear
>>>>> in a root header OR are required as dependency.
>>>>>
>>>>> Cheers,
>>>>> Jorn
>>>>>
>>>>> Maurizio Cimadamore schreef op 2019-03-15 19:29:
>>>>> On 15/03/2019 18:21, Jorn Vernee wrote:
>>>>> Heh - was writing an email pretty much with this suggestion :P
>>>>>
>>>>> I think this is a good idea! If you don't think this adds to much
>>>>> complexity to the options?
>>>>>
>>>>> So we'd basically have:
>>>>>
>>>>> Without --include/exclude-header -> same behavior we have right
>>>>> now.
>>>>> With --include/exclude-header -> Add included headers to root set
>>>>> + find any dependencies, and put both in the generated artifact.
>>>>>
>>>>> I like that it does no filtering by default, so ad-hoc users don't
>>>>> have to figure out which header files define which functions.
>>>>>
>>>>> Yep - I like this too. It goes towards the 'no magic' goal, and it
>>>>> adds the filter option that makes most sense when you want to chop
>>>>> down an API for good. This is a filtering mechanism after all, the
>>>>> fact that it's coming back as a filtering option is IMHO, a plus, no
>>>>> a
>>>>> minus.
>>>>>
>>>>> Btw, slight revision to what I said previously - let's make the
>>>>> argument of include/exclude regexes too (as for others). This way it
>>>>>
>>>>> can be even easier to use, most headers do have some commonalities
>>>>> in
>>>>> them - and I like that, with that,  we can basically do the
>>>>> path-based
>>>>> heuristics w/o hardwiring that into jextract.
>>>>>
>>>>> Maurizio
>>>>>
>>>>> Jorn
>>>>>
>>>>> Maurizio Cimadamore schreef op 2019-03-15 19:07:
>>>>> On 15/03/2019 18:01, Maurizio Cimadamore wrote:
>>>>>
>>>>> On 15/03/2019 17:48, Jorn Vernee wrote:
>>>>> I've implemented this, and now doing a clean build based on the
>>>>> latest jdk/jdk merge before submitting the next webrev.
>>>>>
>>>>> It's working nicely, with one caveat; Some headers rely on
>>>>> pre-processor code defined in a parent header. For instance,
>>>>> Python.h defines some pre-processor code used in pythonrun.h . The
>>>>> example we have only uses functions defined in pythonrun.h, but if
>>>>> we just pass that header to jextract Clang throws an error because
>>>>> of the missing pre-processor code from Python.h
>>>>>
>>>>> In this case we have to pass Python.h first, and then pythonrun.h to
>>>>> get everything to work. This also relies on the existence of header
>>>>> include guards in the pythonrun.h header (since we're basically
>>>>> including it twice). A similar caveat exists with the Windows API.
>>>>>
>>>>> Doh! Maybe I've missed the simplest option after all.
>>>>>
>>>>> What if the root set was a 'first class' concept in the extraction
>>>>> run, rather than something we infer from this or than command line
>>>>> option?
>>>>>
>>>>> That way we could point jextract at python.h, but then say "hey, I'm
>>>>> only really interested at stuff that comes from pythonrun.h".
>>>>
>>>> In other words, right now we have
>>>>
>>>> --include-symbol
>>>>
>>>> and
>>>>
>>>> --exclude-symbol
>>>>
>>>> If not set, everything is included.
>>>>
>>>> Maybe all we need is:
>>>>
>>>> --include-header <header>
>>>>
>>>> --exclude-header <header>
>>>>
>>>> And, again, if none is specified, all headers are part of the 'root
>>>> set'.
>>>>
>>>> And, to address a concern you had - yes, I'd consider using the
>>>> 'include path' for header names in --include/exclude-header
>>>>
>>>> Maurizio
>>>>
>>>>> Maurizio
>>>>>
>>>>> So, the guide-line is: Pass the main header first, then internal
>>>>> headers. e.g. If we have a main header A.h which includes a_impl.h,
>>>>> and another main header B.h which includes b_impl.h, headers should
>>>>> be passed to jextract in the order: A.h a_impl.h B.h b_impl.h
>>>>>
>>>>> I think the behavior being dependent on the ordering of the headers
>>>>> could be fixed by sorting the headers in topological order. But,
>>>>> there's still the requirement to pass all of them, or things break.
>>>>>
>>>>> Any thoughs about that?
>>>>>
>>>>> Thanks,
>>>>> Jorn
>>>>>
>>>>> Jorn Vernee schreef op 2019-03-15 14:40:
>>>>> I've already been using shell scripts mostly when running jextract
>>>>> (except for simple examples). I find it very useful to be able to
>>>>> split the command over multiple lines, especially long file paths
>>>>> become much more readable.
>>>>>
>>>>> I'll start working on this then.
>>>>>
>>>>> Jorn
>>>>>
>>>>> Maurizio Cimadamore schreef op 2019-03-15 14:21:
>>>>> On 15/03/2019 13:15, Jorn Vernee wrote:
>>>>> I still like this approach, and I think adding support for wildcard
>>>>> patterns and/or header filters would make it better.
>>>>>
>>>>> Like you said: It's dead simple. What you pass to jextract is what
>>>>> you get. Though, we could still apply dependency analysis, which
>>>>> would make sure nothing that's needed gets dropped.
>>>>>
>>>>> Why don't we try this then?
>>>>>
>>>>> Note that in javac we often use this trick:
>>>>>
>>>>> javac `find <path> -name *.java`
>>>>>
>>>>> which works
>>>>>
>>>>> also, note that jextract also accepts a file with the @ syntax, etc.
>>>>>
>>>>>
>>>>> jextract @args.txt
>>>>>
>>>>> where args.txt is the command line (which could list all the headers
>>>>> you want!).
>>>>>
>>>>> Maybe this is indeed the simpler approach.
>>>>>
>>>>> Maurizio