[foreign] RFR 8215502: Jextract output should be self-contained

Mon Dec 17 18:44:36 UTC 2018

Hi,
following external and internal feedback on jextract usability, we have 
been working hard on a partial jextract rewrite which significantly 
simplifies how jextract works and how the various pieces are connected 
together. The idea is that each extraction run should generate a 
self-contained artifact; this is _not_ what is happening now, as 
jextract will only extract things that belong to the header being 
extracted. This means that if the header depends on symbols defined in 
some other included header, in a different folder, then the resulting 
jar artifact will contain dandling refs and will fail to bind. To 
overcome this issues user have in the past been forced to 'chase 
headers' around, and add them to the jextract command line, which is a 
very frustrating (and system-dependent) process.

Also, few months ago we embarked on a project to parse info obtained 
from clang API into an internal representation (see [1]). But we've only 
gone half-way with this - and the code is in a place where sometimes the 
tree is used, sometimes the clang types are used, which can be very 
unpredictable and hard to follow. This patch fixes all that and makes 
trees the single source of truth; this means that some issues in tree 
parsing had to be rectified, to avoid some common issues (see end of 
this email for a list).

Here's the webrev:

https://cr.openjdk.java.net/~mcimadamore/panama/8215502/

(Not sure if it's better to look at the diffs or to just look at the new 
code)

As it can be seen, classes are much smaller now, each doing a fairly 
simple, focussed task:

* Main: this is where we process command line args; we check them and we 
set up a Context

* Context: this is where we hold the shared state - mostly holder for 
command line args (after validation)

* JextractTool: this is the main jextract tool frontend - it has a 
method called 'processHeaders' which does all the job. This work is 
divided in the following steps:

    - parse all headers into trees
    - run a bunch of visitors onto the trees (symbol filter, typedef 
handler, etc.)
    - extract all decls from header trees
    - get rid of duplicate decls
    - group decls by header file
    - for each headerfile
           + enter types into dictionary (TypeEnter)
           + generate bytecode using appropriate code factory
    - return a Writer object

* Writer: helper object used to write classes/jars

* HeaderResolver: helper object which keeps track of path/package 
mappings, and which, given a path, returns the appropriate header file 
object. Note: before this patch, all paths were flattened - e.g. if a 
header was found under /usr/include/linux/foo.h, jextract would emit a 
classfile called usr_include_foo.class. The new patch changes this 
behavior and generates proper nested package names, so that the header 
class will have now name "foo" and package "usr.include.linux".

* TypeDictionary: helper object which, with the help of the 
HeaderResolver, allows to go from clang types to Java types

Overall, this feels like a big improvement in terms of how easy it is to 
tweak the behavior of jextract - each of the main abstraction is simple 
enough that it is relatively easy to tweak things to get a different 
result, which is a huge step forward compared to where the code is ATM 
(at least in my opinion, which could be biased of course :-)). Following 
this change, we will probably do two followup changes:

I think the new logic is very deterministic, and easy to follow; there 
are two items where this is still suboptimal:

1) we allow multiple headerfiles as input, and this sometimes doesn't 
play well with the parser; we have discussed many times how clang API 
behaves better with single input which includes everything. It would 
probably be better to restrict jextract to work on a single header input 
file only (which, thanks to the new resolution logic/self-contained 
model, now works all the times).

2) related to the above, as we discussed, the -t option doesn't make a  
lot of sense when multiple input headers are involved. I think it would 
be best again to restrict input to be just one headerfile (as per 1), 
and also transform -t into a proper 'package prefix' option instead, 
which adds given prefix to all package names generated by jextract. This 
is very useful to package up self-contained artifacts in way that do no 
clash when loaded by same classloader.

As for the issues in tree parsing that were addressed as part of this 
patch, here's a list:

* StructTree::nestedTypes is generating a transitive closure of all 
nested types; this doesn't play well with recursive visitors, as it 
leads trees to be visited multiple times; I changed the code to expand 
only the innermost level of nested types (this affects the 
EmptyNameHandler visitor)

* Tree::equals/hashcode - this code is doing the comparison using 
Cursor::equals, which is good, but doesn't work at all across 
compilation units (e.g. when multiple headers are specified). For this 
reason I changed the comparison code to rely on source location info.

* TypedefHandler has some logic to exclude multiple declaration w/o 
definition, where the definition occurs elsewhere in the tree. But this 
logic doesn't visit structs recursively and, as a result, it fails to 
prune duplicates nested within structs.

P.S.
A big thank to Sundar for all the help troubleshooting failing tests on 
Mac, and for updating some of the samples instruction in our markdown 
file :-)

Cheers
Maurizio

[1] - 
http://mail.openjdk.java.net/pipermail/panama-dev/2018-August/002564.html