[foreign] RFR 8220544: Jextract; library dependent header filtering

Jorn Vernee jbvernee at xs4all.nl
Wed Mar 20 15:36:52 UTC 2019


Maurizio Cimadamore schreef op 2019-03-20 16:10:
> Yes, but...
> 
> I think that implementation dependent behavior is a bit of the point
> here? That is, on a Windows machine you might want both uppercase and
> lowercase to match (given the OS makes no difference), unlike on
> Linux/Mac?

Exactly.

> I'm not sure if we have any precedent for this - the use case for this
> API doesn't seem unreasonable; if we wanted to just use a regular
> regex (not a loose, OS-dependent one), then we probably have to
> normalize all paths coming out from clang, because I suspect we will
> find surprises in there.

I think for OpenGL it's trying to match some standard so that includes 
can be more portable;

     #include <GL/gl.h> // protable?
     #include <gl/GL.h> // Windows native

This is not the only case I ran into though, as the casing for the 
Windows SDK path was also partially changed (e.g. Include -> include).

Jorn

> Maurizio
> 
> On 20/03/2019 15:02, Sundararajan Athijegannathan wrote:
>> Hmm..
>> 
>> The javadoc says this:
>> 
>> " For both the glob and regex syntaxes, the matching details, such as 
>> whether the matching is case sensitive, are implementation-dependent 
>> and therefore not specified."
>> 
>> https://docs.oracle.com/javase/10/docs/api/java/nio/file/FileSystem.html
>> 
>> Are we dependent on implementation dependent behavior?
>> 
>> -Sundar
>> 
>> 
>> On 20/03/19, 7:21 PM, Jorn Vernee wrote:
>>> While trying to get a minimal artifact for the opengl sample I ran 
>>> into a weird problem where clang was changing the casing of some of 
>>> the system header paths, which was making the regex match fail. 
>>> Casing of paths on Windows is mostly irrelevant, so that explains why 
>>> this wasn't a problem before.
>>> 
>>> After searching for a solution I arrived at 
>>> FileSystem::getPathMatcher, i.e. a utility for matching paths already 
>>> exists, and has FileSystem specific implementations (e.g. making the 
>>> match case insensitive on Windows).
>>> 
>>> Switching to that solves the problem: 
>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/caps/webrev.00
>>> 
>>> The numbers are:
>>> 
>>> OpenGL:
>>>   before: 4244087 bytes
>>>   after:  106145 bytes (~40x improvement)
>>> 
>>> Python:
>>>   before: 482587 bytes
>>>   after:  317240 bytes (~1.5x improvement)
>>> 
>>> Tensorflow:
>>>   before: 76915 bytes
>>>   after:  41267 bytes (~1.9x improvement)
>>> 
>>> The improvement is most significant for libraries that depend heavily 
>>> on the Windows API (OpenGL) since there's a lot that can be filtered 
>>> out.
>>> 
>>> Cheers,
>>> Jorn
>>> 
>>> Maurizio Cimadamore schreef op 2019-03-19 22:06:
>>>> Cool! This is very nice milestone for jextract :-)
>>>> 
>>>> When you have time (and if you can) it would be nice to get some 
>>>> stats
>>>> on extracting a library (e.g. Python) on Windows before/after, to 
>>>> get
>>>> a sense of what degree of improvement in terms of static footprint 
>>>> we
>>>> can get. I tested on Linux and I can get smaller jars (up to 2x) but 
>>>> I
>>>> sense that the improvement is much more dramatic on Windows-land.
>>>> 
>>>> Maurizio
>>>> 
>>>> On 19/03/2019 19:58, Jorn Vernee wrote:
>>>>> Thanks for the reviews. I've updated the comment and pushed.
>>>>> 
>>>>> Jorn
>>>>> 
>>>>> Sundararajan Athijegannathan schreef op 2019-03-19 18:07:
>>>>>> Tests passed. Please go ahead with push.
>>>>>> 
>>>>>> -Sundar
>>>>>> 
>>>>>> On 19/03/19, 7:46 PM, Sundararajan Athijegannathan wrote:
>>>>>>> * 47 * A merge between {@link DependencyFinder} and {@link 
>>>>>>> ElementFilter}
>>>>>>> 
>>>>>>> in DependencyFilter. Could be removed?
>>>>>>> 
>>>>>>> No comments other than that. +1
>>>>>>> 
>>>>>>> I'm going to submit an internal mach5 build/request for this 
>>>>>>> patch. I'll notify you as soon as it is all green.
>>>>>>> 
>>>>>>> -Sundar
>>>>>>> 
>>>>>>> On 19/03/19, 7:34 PM, Maurizio Cimadamore wrote:
>>>>>>>> This looks good to me, I have no further comments.
>>>>>>>> 
>>>>>>>> Great work!
>>>>>>>> 
>>>>>>>> (I'll leave it to Sundar for further comments/suggestions)
>>>>>>>> 
>>>>>>>> Maurizio
>>>>>>>> 
>>>>>>>> On 19/03/2019 13:50, Jorn Vernee wrote:
>>>>>>>>> Suggestions sound good.
>>>>>>>>> 
>>>>>>>>> Here's the updated webrev: 
>>>>>>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8220544/webrev.07/ 
>>>>>>>>> * Split library based filtering code off from SymbolFilter into 
>>>>>>>>> LibraryLookupFilter.
>>>>>>>>> * Merged DependencyFinder & ElementFilter into DependencyFilter 
>>>>>>>>> (as nested classes).
>>>>>>>>> * Also added 2 more test cases to check that symbols being 
>>>>>>>>> filtered out don't have their dependencies added.
>>>>>>>>> 
>>>>>>>>> Jorn
>>>>>>>>> 
>>>>>>>>> Maurizio Cimadamore schreef op 2019-03-19 13:40:
>>>>>>>>>> Looks good, I did some tests and it seems to behave according 
>>>>>>>>>> to the
>>>>>>>>>> expectations - some comments below:
>>>>>>>>>> 
>>>>>>>>>> On 18/03/2019 10:58, Jorn Vernee wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>> 
>>>>>>>>>>> I have implemented this: 
>>>>>>>>>>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/8220544/webrev.06/ 
>>>>>>>>>>> This adds 2 options --include-headers and --exclude-headers 
>>>>>>>>>>> that take regular expression patterns as an argument. The 
>>>>>>>>>>> patterns are matched against a cursor's full header file path 
>>>>>>>>>>> to determine if it appears in the header root set.
>>>>>>>>>>> 
>>>>>>>>>>> For example, I can use the following to include headers in 
>>>>>>>>>>> the Python include director;
>>>>>>>>>>> 
>>>>>>>>>>>     --include-headers "J:\\Python27\\include\\.*"
>>>>>>>>>>> 
>>>>>>>>>>> ---
>>>>>>>>>>> 
>>>>>>>>>>> To recap, we now:
>>>>>>>>>>>   1.) Filter functions, vars, and macros by whether they 
>>>>>>>>>>> appear in a root header.
>>>>>>>>>>>   2.) Filter functions, vars, and macros by the 
>>>>>>>>>>> --includes/exclude-symbols patterns.
>>>>>>>>>>>   3.) Filter functions and vars, by whether they appear in 
>>>>>>>>>>> any shared library.
>>>>>>>>>> 
>>>>>>>>>> I think this is ok, but the way this happens in the code seems 
>>>>>>>>>> to be
>>>>>>>>>> sort of ad hoc, and probably the result of how the code use to 
>>>>>>>>>> be in
>>>>>>>>>> previous rounds.
>>>>>>>>>> 
>>>>>>>>>> For instance, SymbolFilter does both (1), (2) and (3).
>>>>>>>>>> 
>>>>>>>>>> Then we have DependencyFinder which also does a bit of (1).
>>>>>>>>>> 
>>>>>>>>>> Finally we have ElementFilter, which drops everything that is 
>>>>>>>>>> not
>>>>>>>>>> required (as per 4, 5).
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> I believe the code could be improved by:
>>>>>>>>>> 
>>>>>>>>>> * separating the library-based filtering from the 
>>>>>>>>>> option-driven
>>>>>>>>>> filtering. These are two separate mechanism (the latter being 
>>>>>>>>>> optional
>>>>>>>>>> on -L being used at all). Of course we should also decide 
>>>>>>>>>> which is
>>>>>>>>>> used first.
>>>>>>>>>> 
>>>>>>>>>> * bring together DependencyFinder and ElementFilter into
>>>>>>>>>> DependencyFilter. You can have a two pass visitor - collect 
>>>>>>>>>> deps on a
>>>>>>>>>> first pass, then prune on the second pass. This way the state 
>>>>>>>>>> remains
>>>>>>>>>> 'internal' to the DependencyFilter visitor, no need to pass 
>>>>>>>>>> magic
>>>>>>>>>> state between two passes.
>>>>>>>>>> 
>>>>>>>>>> Maurizio
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>>   4.) For the remaining trees in the root headers do 
>>>>>>>>>>> dependency analysis, and find the required structs, enums, 
>>>>>>>>>>> and typedefs.
>>>>>>>>>>>   5.) For structs, enums, and typedefs include them if they 
>>>>>>>>>>> appear in a root header OR are required as dependency.
>>>>>>>>>>> 
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Jorn
>>>>>>>>>>> 
>>>>>>>>>>> Maurizio Cimadamore schreef op 2019-03-15 19:29:
>>>>>>>>>>>> On 15/03/2019 18:21, Jorn Vernee wrote:
>>>>>>>>>>>>> Heh - was writing an email pretty much with this suggestion 
>>>>>>>>>>>>> :P
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I think this is a good idea! If you don't think this adds 
>>>>>>>>>>>>> to much complexity to the options?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> So we'd basically have:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>   Without --include/exclude-header -> same behavior we have 
>>>>>>>>>>>>> right now.
>>>>>>>>>>>>>   With --include/exclude-header -> Add included headers to 
>>>>>>>>>>>>> root set + find any dependencies, and put both in the 
>>>>>>>>>>>>> generated artifact.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I like that it does no filtering by default, so ad-hoc 
>>>>>>>>>>>>> users don't have to figure out which header files define 
>>>>>>>>>>>>> which functions.
>>>>>>>>>>>> 
>>>>>>>>>>>> Yep - I like this too. It goes towards the 'no magic' goal, 
>>>>>>>>>>>> and it
>>>>>>>>>>>> adds the filter option that makes most sense when you want 
>>>>>>>>>>>> to chop
>>>>>>>>>>>> down an API for good. This is a filtering mechanism after 
>>>>>>>>>>>> all, the
>>>>>>>>>>>> fact that it's coming back as a filtering option is IMHO, a 
>>>>>>>>>>>> plus, no a
>>>>>>>>>>>> minus.
>>>>>>>>>>>> 
>>>>>>>>>>>> Btw, slight revision to what I said previously - let's make 
>>>>>>>>>>>> the
>>>>>>>>>>>> argument of include/exclude regexes too (as for others). 
>>>>>>>>>>>> This way it
>>>>>>>>>>>> can be even easier to use, most headers do have some 
>>>>>>>>>>>> commonalities in
>>>>>>>>>>>> them - and I like that, with that,  we can basically do the 
>>>>>>>>>>>> path-based
>>>>>>>>>>>> heuristics w/o hardwiring that into jextract.
>>>>>>>>>>>> 
>>>>>>>>>>>> Maurizio
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Jorn
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Maurizio Cimadamore schreef op 2019-03-15 19:07:
>>>>>>>>>>>>>> On 15/03/2019 18:01, Maurizio Cimadamore wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 15/03/2019 17:48, Jorn Vernee wrote:
>>>>>>>>>>>>>>>> I've implemented this, and now doing a clean build based 
>>>>>>>>>>>>>>>> on the latest jdk/jdk merge before submitting the next 
>>>>>>>>>>>>>>>> webrev.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> It's working nicely, with one caveat; Some headers rely 
>>>>>>>>>>>>>>>> on pre-processor code defined in a parent header. For 
>>>>>>>>>>>>>>>> instance, Python.h defines some pre-processor code used 
>>>>>>>>>>>>>>>> in pythonrun.h . The example we have only uses functions 
>>>>>>>>>>>>>>>> defined in pythonrun.h, but if we just pass that header 
>>>>>>>>>>>>>>>> to jextract Clang throws an error because of the missing 
>>>>>>>>>>>>>>>> pre-processor code from Python.h
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> In this case we have to pass Python.h first, and then 
>>>>>>>>>>>>>>>> pythonrun.h to get everything to work. This also relies 
>>>>>>>>>>>>>>>> on the existence of header include guards in the 
>>>>>>>>>>>>>>>> pythonrun.h header (since we're basically including it 
>>>>>>>>>>>>>>>> twice). A similar caveat exists with the Windows API.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Doh! Maybe I've missed the simplest option after all.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> What if the root set was a 'first class' concept in the 
>>>>>>>>>>>>>>> extraction run, rather than something we infer from this 
>>>>>>>>>>>>>>> or than command line option?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> That way we could point jextract at python.h, but then 
>>>>>>>>>>>>>>> say "hey, I'm only really interested at stuff that comes 
>>>>>>>>>>>>>>> from pythonrun.h".
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> In other words, right now we have
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> --include-symbol
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> --exclude-symbol
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> If not set, everything is included.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Maybe all we need is:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> --include-header <header>
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> --exclude-header <header>
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> And, again, if none is specified, all headers are part of 
>>>>>>>>>>>>>> the 'root set'.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> And, to address a concern you had - yes, I'd consider 
>>>>>>>>>>>>>> using the
>>>>>>>>>>>>>> 'include path' for header names in 
>>>>>>>>>>>>>> --include/exclude-header
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> So, the guide-line is: Pass the main header first, then 
>>>>>>>>>>>>>>>> internal headers. e.g. If we have a main header A.h 
>>>>>>>>>>>>>>>> which includes a_impl.h, and another main header B.h 
>>>>>>>>>>>>>>>> which includes b_impl.h, headers should be passed to 
>>>>>>>>>>>>>>>> jextract in the order: A.h a_impl.h B.h b_impl.h
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I think the behavior being dependent on the ordering of 
>>>>>>>>>>>>>>>> the headers could be fixed by sorting the headers in 
>>>>>>>>>>>>>>>> topological order. But, there's still the requirement to 
>>>>>>>>>>>>>>>> pass all of them, or things break.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Any thoughs about that?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Jorn
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Jorn Vernee schreef op 2019-03-15 14:40:
>>>>>>>>>>>>>>>>> I've already been using shell scripts mostly when 
>>>>>>>>>>>>>>>>> running jextract
>>>>>>>>>>>>>>>>> (except for simple examples). I find it very useful to 
>>>>>>>>>>>>>>>>> be able to
>>>>>>>>>>>>>>>>> split the command over multiple lines, especially long 
>>>>>>>>>>>>>>>>> file paths
>>>>>>>>>>>>>>>>> become much more readable.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I'll start working on this then.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Jorn
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Maurizio Cimadamore schreef op 2019-03-15 14:21:
>>>>>>>>>>>>>>>>>> On 15/03/2019 13:15, Jorn Vernee wrote:
>>>>>>>>>>>>>>>>>>> I still like this approach, and I think adding 
>>>>>>>>>>>>>>>>>>> support for wildcard patterns and/or header filters 
>>>>>>>>>>>>>>>>>>> would make it better.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Like you said: It's dead simple. What you pass to 
>>>>>>>>>>>>>>>>>>> jextract is what you get. Though, we could still 
>>>>>>>>>>>>>>>>>>> apply dependency analysis, which would make sure 
>>>>>>>>>>>>>>>>>>> nothing that's needed gets dropped.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Why don't we try this then?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Note that in javac we often use this trick:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> javac `find <path> -name *.java`
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> which works
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> also, note that jextract also accepts a file with the 
>>>>>>>>>>>>>>>>>> @ syntax, etc.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> jextract @args.txt
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> where args.txt is the command line (which could list 
>>>>>>>>>>>>>>>>>> all the headers you want!).
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Maybe this is indeed the simpler approach.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Maurizio


More information about the panama-dev mailing list