[foreign] RFR 8220544: Jextract; library dependent header filtering
Jorn Vernee
jbvernee at xs4all.nl
Wed Mar 20 15:36:52 UTC 2019
Maurizio Cimadamore schreef op 2019-03-20 16:10:
> Yes, but...
>
> I think that implementation dependent behavior is a bit of the point
> here? That is, on a Windows machine you might want both uppercase and
> lowercase to match (given the OS makes no difference), unlike on
> Linux/Mac?
Exactly.
> I'm not sure if we have any precedent for this - the use case for this
> API doesn't seem unreasonable; if we wanted to just use a regular
> regex (not a loose, OS-dependent one), then we probably have to
> normalize all paths coming out from clang, because I suspect we will
> find surprises in there.
I think for OpenGL it's trying to match some standard so that includes
can be more portable;
#include <GL/gl.h> // protable?
#include <gl/GL.h> // Windows native
This is not the only case I ran into though, as the casing for the
Windows SDK path was also partially changed (e.g. Include -> include).
Jorn
> Maurizio
>
> On 20/03/2019 15:02, Sundararajan Athijegannathan wrote:
>> Hmm..
>>
>> The javadoc says this:
>>
>> " For both the glob and regex syntaxes, the matching details, such as
>> whether the matching is case sensitive, are implementation-dependent
>> and therefore not specified."
>>
>> https://docs.oracle.com/javase/10/docs/api/java/nio/file/FileSystem.html
>>
>> Are we dependent on implementation dependent behavior?
>>
>> -Sundar
>>
>>
>> On 20/03/19, 7:21 PM, Jorn Vernee wrote:
>>> While trying to get a minimal artifact for the opengl sample I ran
>>> into a weird problem where clang was changing the casing of some of
>>> the system header paths, which was making the regex match fail.
>>> Casing of paths on Windows is mostly irrelevant, so that explains why
>>> this wasn't a problem before.
>>>
>>> After searching for a solution I arrived at
>>> FileSystem::getPathMatcher, i.e. a utility for matching paths already
>>> exists, and has FileSystem specific implementations (e.g. making the
>>> match case insensitive on Windows).
>>>
>>> Switching to that solves the problem:
>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/caps/webrev.00
>>>
>>> The numbers are:
>>>
>>> OpenGL:
>>> before: 4244087 bytes
>>> after: 106145 bytes (~40x improvement)
>>>
>>> Python:
>>> before: 482587 bytes
>>> after: 317240 bytes (~1.5x improvement)
>>>
>>> Tensorflow:
>>> before: 76915 bytes
>>> after: 41267 bytes (~1.9x improvement)
>>>
>>> The improvement is most significant for libraries that depend heavily
>>> on the Windows API (OpenGL) since there's a lot that can be filtered
>>> out.
>>>
>>> Cheers,
>>> Jorn
>>>
>>> Maurizio Cimadamore schreef op 2019-03-19 22:06:
>>>> Cool! This is very nice milestone for jextract :-)
>>>>
>>>> When you have time (and if you can) it would be nice to get some
>>>> stats
>>>> on extracting a library (e.g. Python) on Windows before/after, to
>>>> get
>>>> a sense of what degree of improvement in terms of static footprint
>>>> we
>>>> can get. I tested on Linux and I can get smaller jars (up to 2x) but
>>>> I
>>>> sense that the improvement is much more dramatic on Windows-land.
>>>>
>>>> Maurizio
>>>>
>>>> On 19/03/2019 19:58, Jorn Vernee wrote:
>>>>> Thanks for the reviews. I've updated the comment and pushed.
>>>>>
>>>>> Jorn
>>>>>
>>>>> Sundararajan Athijegannathan schreef op 2019-03-19 18:07:
>>>>>> Tests passed. Please go ahead with push.
>>>>>>
>>>>>> -Sundar
>>>>>>
>>>>>> On 19/03/19, 7:46 PM, Sundararajan Athijegannathan wrote:
>>>>>>> * 47 * A merge between {@link DependencyFinder} and {@link
>>>>>>> ElementFilter}
>>>>>>>
>>>>>>> in DependencyFilter. Could be removed?
>>>>>>>
>>>>>>> No comments other than that. +1
>>>>>>>
>>>>>>> I'm going to submit an internal mach5 build/request for this
>>>>>>> patch. I'll notify you as soon as it is all green.
>>>>>>>
>>>>>>> -Sundar
>>>>>>>
>>>>>>> On 19/03/19, 7:34 PM, Maurizio Cimadamore wrote:
>>>>>>>> This looks good to me, I have no further comments.
>>>>>>>>
>>>>>>>> Great work!
>>>>>>>>
>>>>>>>> (I'll leave it to Sundar for further comments/suggestions)
>>>>>>>>
>>>>>>>> Maurizio
>>>>>>>>
>>>>>>>> On 19/03/2019 13:50, Jorn Vernee wrote:
>>>>>>>>> Suggestions sound good.
>>>>>>>>>
>>>>>>>>> Here's the updated webrev:
>>>>>>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8220544/webrev.07/
>>>>>>>>> * Split library based filtering code off from SymbolFilter into
>>>>>>>>> LibraryLookupFilter.
>>>>>>>>> * Merged DependencyFinder & ElementFilter into DependencyFilter
>>>>>>>>> (as nested classes).
>>>>>>>>> * Also added 2 more test cases to check that symbols being
>>>>>>>>> filtered out don't have their dependencies added.
>>>>>>>>>
>>>>>>>>> Jorn
>>>>>>>>>
>>>>>>>>> Maurizio Cimadamore schreef op 2019-03-19 13:40:
>>>>>>>>>> Looks good, I did some tests and it seems to behave according
>>>>>>>>>> to the
>>>>>>>>>> expectations - some comments below:
>>>>>>>>>>
>>>>>>>>>> On 18/03/2019 10:58, Jorn Vernee wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I have implemented this:
>>>>>>>>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8220544/webrev.06/
>>>>>>>>>>> This adds 2 options --include-headers and --exclude-headers
>>>>>>>>>>> that take regular expression patterns as an argument. The
>>>>>>>>>>> patterns are matched against a cursor's full header file path
>>>>>>>>>>> to determine if it appears in the header root set.
>>>>>>>>>>>
>>>>>>>>>>> For example, I can use the following to include headers in
>>>>>>>>>>> the Python include director;
>>>>>>>>>>>
>>>>>>>>>>> --include-headers "J:\\Python27\\include\\.*"
>>>>>>>>>>>
>>>>>>>>>>> ---
>>>>>>>>>>>
>>>>>>>>>>> To recap, we now:
>>>>>>>>>>> 1.) Filter functions, vars, and macros by whether they
>>>>>>>>>>> appear in a root header.
>>>>>>>>>>> 2.) Filter functions, vars, and macros by the
>>>>>>>>>>> --includes/exclude-symbols patterns.
>>>>>>>>>>> 3.) Filter functions and vars, by whether they appear in
>>>>>>>>>>> any shared library.
>>>>>>>>>>
>>>>>>>>>> I think this is ok, but the way this happens in the code seems
>>>>>>>>>> to be
>>>>>>>>>> sort of ad hoc, and probably the result of how the code use to
>>>>>>>>>> be in
>>>>>>>>>> previous rounds.
>>>>>>>>>>
>>>>>>>>>> For instance, SymbolFilter does both (1), (2) and (3).
>>>>>>>>>>
>>>>>>>>>> Then we have DependencyFinder which also does a bit of (1).
>>>>>>>>>>
>>>>>>>>>> Finally we have ElementFilter, which drops everything that is
>>>>>>>>>> not
>>>>>>>>>> required (as per 4, 5).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I believe the code could be improved by:
>>>>>>>>>>
>>>>>>>>>> * separating the library-based filtering from the
>>>>>>>>>> option-driven
>>>>>>>>>> filtering. These are two separate mechanism (the latter being
>>>>>>>>>> optional
>>>>>>>>>> on -L being used at all). Of course we should also decide
>>>>>>>>>> which is
>>>>>>>>>> used first.
>>>>>>>>>>
>>>>>>>>>> * bring together DependencyFinder and ElementFilter into
>>>>>>>>>> DependencyFilter. You can have a two pass visitor - collect
>>>>>>>>>> deps on a
>>>>>>>>>> first pass, then prune on the second pass. This way the state
>>>>>>>>>> remains
>>>>>>>>>> 'internal' to the DependencyFilter visitor, no need to pass
>>>>>>>>>> magic
>>>>>>>>>> state between two passes.
>>>>>>>>>>
>>>>>>>>>> Maurizio
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> 4.) For the remaining trees in the root headers do
>>>>>>>>>>> dependency analysis, and find the required structs, enums,
>>>>>>>>>>> and typedefs.
>>>>>>>>>>> 5.) For structs, enums, and typedefs include them if they
>>>>>>>>>>> appear in a root header OR are required as dependency.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Jorn
>>>>>>>>>>>
>>>>>>>>>>> Maurizio Cimadamore schreef op 2019-03-15 19:29:
>>>>>>>>>>>> On 15/03/2019 18:21, Jorn Vernee wrote:
>>>>>>>>>>>>> Heh - was writing an email pretty much with this suggestion
>>>>>>>>>>>>> :P
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think this is a good idea! If you don't think this adds
>>>>>>>>>>>>> to much complexity to the options?
>>>>>>>>>>>>>
>>>>>>>>>>>>> So we'd basically have:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Without --include/exclude-header -> same behavior we have
>>>>>>>>>>>>> right now.
>>>>>>>>>>>>> With --include/exclude-header -> Add included headers to
>>>>>>>>>>>>> root set + find any dependencies, and put both in the
>>>>>>>>>>>>> generated artifact.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I like that it does no filtering by default, so ad-hoc
>>>>>>>>>>>>> users don't have to figure out which header files define
>>>>>>>>>>>>> which functions.
>>>>>>>>>>>>
>>>>>>>>>>>> Yep - I like this too. It goes towards the 'no magic' goal,
>>>>>>>>>>>> and it
>>>>>>>>>>>> adds the filter option that makes most sense when you want
>>>>>>>>>>>> to chop
>>>>>>>>>>>> down an API for good. This is a filtering mechanism after
>>>>>>>>>>>> all, the
>>>>>>>>>>>> fact that it's coming back as a filtering option is IMHO, a
>>>>>>>>>>>> plus, no a
>>>>>>>>>>>> minus.
>>>>>>>>>>>>
>>>>>>>>>>>> Btw, slight revision to what I said previously - let's make
>>>>>>>>>>>> the
>>>>>>>>>>>> argument of include/exclude regexes too (as for others).
>>>>>>>>>>>> This way it
>>>>>>>>>>>> can be even easier to use, most headers do have some
>>>>>>>>>>>> commonalities in
>>>>>>>>>>>> them - and I like that, with that, we can basically do the
>>>>>>>>>>>> path-based
>>>>>>>>>>>> heuristics w/o hardwiring that into jextract.
>>>>>>>>>>>>
>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jorn
>>>>>>>>>>>>>
>>>>>>>>>>>>> Maurizio Cimadamore schreef op 2019-03-15 19:07:
>>>>>>>>>>>>>> On 15/03/2019 18:01, Maurizio Cimadamore wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 15/03/2019 17:48, Jorn Vernee wrote:
>>>>>>>>>>>>>>>> I've implemented this, and now doing a clean build based
>>>>>>>>>>>>>>>> on the latest jdk/jdk merge before submitting the next
>>>>>>>>>>>>>>>> webrev.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It's working nicely, with one caveat; Some headers rely
>>>>>>>>>>>>>>>> on pre-processor code defined in a parent header. For
>>>>>>>>>>>>>>>> instance, Python.h defines some pre-processor code used
>>>>>>>>>>>>>>>> in pythonrun.h . The example we have only uses functions
>>>>>>>>>>>>>>>> defined in pythonrun.h, but if we just pass that header
>>>>>>>>>>>>>>>> to jextract Clang throws an error because of the missing
>>>>>>>>>>>>>>>> pre-processor code from Python.h
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In this case we have to pass Python.h first, and then
>>>>>>>>>>>>>>>> pythonrun.h to get everything to work. This also relies
>>>>>>>>>>>>>>>> on the existence of header include guards in the
>>>>>>>>>>>>>>>> pythonrun.h header (since we're basically including it
>>>>>>>>>>>>>>>> twice). A similar caveat exists with the Windows API.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Doh! Maybe I've missed the simplest option after all.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What if the root set was a 'first class' concept in the
>>>>>>>>>>>>>>> extraction run, rather than something we infer from this
>>>>>>>>>>>>>>> or than command line option?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> That way we could point jextract at python.h, but then
>>>>>>>>>>>>>>> say "hey, I'm only really interested at stuff that comes
>>>>>>>>>>>>>>> from pythonrun.h".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In other words, right now we have
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --include-symbol
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --exclude-symbol
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If not set, everything is included.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Maybe all we need is:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --include-header <header>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --exclude-header <header>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And, again, if none is specified, all headers are part of
>>>>>>>>>>>>>> the 'root set'.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And, to address a concern you had - yes, I'd consider
>>>>>>>>>>>>>> using the
>>>>>>>>>>>>>> 'include path' for header names in
>>>>>>>>>>>>>> --include/exclude-header
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So, the guide-line is: Pass the main header first, then
>>>>>>>>>>>>>>>> internal headers. e.g. If we have a main header A.h
>>>>>>>>>>>>>>>> which includes a_impl.h, and another main header B.h
>>>>>>>>>>>>>>>> which includes b_impl.h, headers should be passed to
>>>>>>>>>>>>>>>> jextract in the order: A.h a_impl.h B.h b_impl.h
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think the behavior being dependent on the ordering of
>>>>>>>>>>>>>>>> the headers could be fixed by sorting the headers in
>>>>>>>>>>>>>>>> topological order. But, there's still the requirement to
>>>>>>>>>>>>>>>> pass all of them, or things break.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Any thoughs about that?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Jorn
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Jorn Vernee schreef op 2019-03-15 14:40:
>>>>>>>>>>>>>>>>> I've already been using shell scripts mostly when
>>>>>>>>>>>>>>>>> running jextract
>>>>>>>>>>>>>>>>> (except for simple examples). I find it very useful to
>>>>>>>>>>>>>>>>> be able to
>>>>>>>>>>>>>>>>> split the command over multiple lines, especially long
>>>>>>>>>>>>>>>>> file paths
>>>>>>>>>>>>>>>>> become much more readable.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'll start working on this then.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Jorn
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Maurizio Cimadamore schreef op 2019-03-15 14:21:
>>>>>>>>>>>>>>>>>> On 15/03/2019 13:15, Jorn Vernee wrote:
>>>>>>>>>>>>>>>>>>> I still like this approach, and I think adding
>>>>>>>>>>>>>>>>>>> support for wildcard patterns and/or header filters
>>>>>>>>>>>>>>>>>>> would make it better.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Like you said: It's dead simple. What you pass to
>>>>>>>>>>>>>>>>>>> jextract is what you get. Though, we could still
>>>>>>>>>>>>>>>>>>> apply dependency analysis, which would make sure
>>>>>>>>>>>>>>>>>>> nothing that's needed gets dropped.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Why don't we try this then?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Note that in javac we often use this trick:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> javac `find <path> -name *.java`
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> which works
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> also, note that jextract also accepts a file with the
>>>>>>>>>>>>>>>>>> @ syntax, etc.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> jextract @args.txt
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> where args.txt is the command line (which could list
>>>>>>>>>>>>>>>>>> all the headers you want!).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Maybe this is indeed the simpler approach.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Maurizio
More information about the panama-dev
mailing list