hand-edited adjustment of header files

John Rose john.r.rose at oracle.com
Sat Apr 21 21:26:21 UTC 2018

We know that layout specifications and function specifications sometimes
need ad hoc annotations to be gathered or inferred from header sources
(and other config. information) and passed through jextract to the output
metadata bundle.

Here's a "poster child" use case:  In many APIs, most or all occurrences
of the type "char*" or "const char*" denote strings, in some agreed-on
encoding (UTF-8, ISO-Latin-1, locale-dependent, etc.) and length convention
(usually null termination).

In some APIs, a select few occurrences work like strchr, where a char*
value denotes a position in a larger sequence of bytes, but most denote
null-terminated strings, and would operate the same regardless of which
copy of a given character sequence was presented.

We want the user to be able to *optionally* request that some or all
char* occurrences be bound to a carrier type like String or CharSequence,
and (again optionally) with additional information about sizing and encoding.

More generally, beyond the use case of C strings, the user sometimes has a
legitimate need to specify extra semantic information about API points.
We need to provide a way to pass this information through the jextract
and binding pipeline.  Classfile annotations, and layout annotations nested
inside them, are our chosen technique.

But how is the user to introduce API annotations in the first place?  We
could define an ad hoc side-file format like the Annotation File Utilities
of JSR 308.  Those are hard to use well.  We could require the user
to hand-edit the metadata produced by jextract: After all, they are
just Java interfaces with user-readable annotations.  Both of these
solutions have the smell of hand-editing machine-generated sources,
which is usually a mistake.

Happily, front ends like clang allow flexible definition and placement of
both #pragmas and declaration-specific annotations/attributes.  If we
can find a way to attach such extra information where jextract can find
it in the C/C++ input code, then this information can be pushed through
the extraction phase to help control the binding phase.  Newer C++ even
has a standard syntax for first-class attributes.

One obvious way to do this would be to edit the original C/C++ headers
to add annotations in various places.  A global default to direct all
char* occurrences to be treated as null-terminated locale strings could
be placed as a #pragma at the top of the header file defining the
API.  Declaration-specific annotations or nested #pragmas could
adjust this default for specific API points.

This technique only works for header file APIs which are co-designed
with Panama.  Maybe some day this could be a normal design step
for C APIs, but in most use cases a jextract user must treat the headers
as read-only givens.

But, we can put attributes on existing C API points by repeating those
API points.  This works fine because C (and C++ to a useful extent)
allows API points to be repeated, as long as they are mutually
consistent.  Thus, a jextract user could supply an additional header
file which repeats selected API points to be extracted, but with
annotations for jextract to pass through.

We also want to sprinkle attributes in "default mode" across larger
scopes that encompass multiple API points.  In C++ those scopes
are often classes or namespaces.  In C they might be structs,
but probably are not; C API scopes are header files.  So we need
a #pragma or similar hack to attach default annotations across
a span of C headers.

Some tools handle this problem by having #pragmas that "push"
and "pop" special settings, like turning off selected warnings.
Perhaps the jextract user will wrap *two* header files around
the API to be extracted, a leading one to "push" defaults and
a trailing one to "pop" them and/or perform final adjustments.

Adding small hand-written header files to a jextract run is
often a robust way to feed extra information to jextract and
thence to the binder.  The reason it works (when it works)
is that C/C++ is obviously a good language to define
and refine C/C++ API points.

This technique can break down in several ways.  First, it may
be inconvenient (or impossible) to repeat an API element.
An inline function inside a class might be difficult to re-declare
separately from the class itself.  An API point with a complicated
type might be impractical to repeat, even as a forward declaration.

As a second problem, a default annotation to be applied to
most of an API might require many declarations to be repeated.

As a third problem, it might be difficult to locate all the places
where a default annotation is to be applied, and then hand-edit
the appropriate annotations.

All of these problems seem to require a notation for referring
to groups of C/C++ APIs, which pushes outside the strict
notations of C/C++.  For example, if I want to declare that
(by default) all function results of type char* are to be treated
as strings, I might want to smuggle a wildcard notation inside
of the C/C++ language:

   #pragma jextract_name_wildcard("JXANY")
   __attribute__((__JXCSTRING__)) char* some_api_JXANY();

This might mean "for any function which returns char* and whose
name begins with the string 'some_api_', add the __JXCSTRING__
attribute to that function's return type."  (The __JXCSTRING__
token would probably expand to something more granular,
such as a specific layout annotation directive.)

I don't have a strong opinion on the details of such a mechanism,
but I do think it is likely that we want to inject it into the native
C/C++ syntax, rather than have a side file with a specifically
engineered syntax.  The problem with a side file is that, however
great the syntax would be, it would require effort from the programmer
to learn, with special difficulty in managing "path-like" expressions
which refer, remotely, to selected declarations within the header
files being processed.

Perhaps the above kind of hack would work well as a pre-pass
to jextract, where the oddly-encoded wildcard queries are executed
to produce edited header files, which can be visually checked and
*then* fed to jextract *in place of* the original header files.
Or the pre-pass could generate just the smaller extra header
files which redeclare selected API points, and both the original
files and the generated extra text would be passed to jextract,
after a visual check.  This needs experimentation to find the
sweet spots.

Hand editing also brings out the problem of API points defined as
macros.  This requires a separate set of techniques, since macros
are less structured than C/C++ declarations.  But it seems likely
that the user can make use of the basic framework of adding
small hand-edited header files to the jextract run.  In the case of
macros, the small hand-edited files could contain user-selected
invocations of the macros, wrapped in function definitions,
with the functions named by the user (and/or by jextract using
something like the #pragma hack above).

Eventually for C++ jextract must be able to handle functions
with bodies, and so C macros are only the first example of a
larger set of use cases with C++.

It is a legitimate question how jextract should handle functions
with bodies.  The basic requirement is pretty clear:  The function
should be broken into declaration and definition, and the declaration
part processed by jextract as yet another native API point to wrap
inside a Java API point.  The usual concerns of carrier type and
carrier name apply.  The definition of the function must also be
made available to the eventual user of the Java API.

This can be done manually by compiling the extra source file into
a small object file, and loading it along with the main library file.
But such a manual process is error-prone and should be assisted
by jextract.  In some cases, jextract may be able to invoke a
C/C++ compiler directly to compile the extra function definitions
to object code.  The object code could be a resource file
packaged in the metadata JAR and dynamically loaded
by the binder.

In other cases, jextract may need to generate C/C++ source
code, and associate it with the metadata, with the expectation
that a later invocation of the compiler will render the source
code down to loadable object code.

If jextract does not produce object code directly, then the
source code might be built as a separate tool invocation
step (maybe under control of a makefile).  In that case,
of course, the same environmental parameters (#ifdefs,
include paths, etc.) must be present in the tool invocation
as in the jextract invocation.  Or, perhaps jextract handles
the tool invocation; maybe this is an argument for giving
jextract an option which points to a compiler toolchain
to use for producing binaries.

Most aggressively, the JVM itself, under control of the
runtime binder, might run the C compiler on the fly and then
immediately load the resulting binary code.  This approach
works well in interactive environments, and/or when the
compiler toolchain is very specialized, as with Theano.

One advantage of using source code might seem to be
portability, since a single metadata JAR equipped with
portable source code might be bindable on multiple
platforms.  For Panama, this is usually a siren song,
which we should listen to only with care, since C/C++
code is often *mostly* portable but also often contains
unexpected bits of platform dependency.  The safest
thing to do is rebuild from source on every platform,
and that includes re-running jextract on each platform.

Another, more solid advantage of using source code
is debuggability.  If the source file is available to a native
debugger, it might be possible to single-step through it.
Doing this would be a nice-to-have, requiring more
careful coordination between jextract and the toolchain.

In any case, the technique of packaging function definitions
with extracted APIs is a useful way to bridge over portions
of the extracted language which cannot be easily encoded
in a standard ABI.  The C language (except for macros)
can usually be extracted without the need for object code.

But C++ inline functions cannot be expressed at all, except
by AST, IR, or object code, and the JVM would prefer to
operate only on object code.  (Dynamic loading of LLVM
bitcode would be an intriguing possibility, of course.)
So in the long run (and not just for C++) the jextract tool
needs to have a way to package up executable code
for the binder to use.

A final note about naming:  When we extract a Java
API from a native API (of whatever language) we must
bridge the gap between the types and names of the
native API and the types and names allowed by Java
and its foreign-API infrastructure.  We use the terms
"carrier type" and "carrier name" to express the mapped
types and names from the native API as they show up
in Java.  A key job of the annotations and helper functions
that we have been discussing here is to give the user
a say in how jextract maps native API points to Java
API points.  The carrier name is an important Panama
annotation, and is sometimes necessary when a C/C++
name turns out to be an illegal Java identifier (like
"native" or "operator int").

(A more elaborate default annotation might try to adjust all names
with underbar components to Java names that use camel case,
or to strip a common prefix from a set of function or type names.
Such information would be used by jextract, not the binder.  I'm
pointing these tricks out as extreme examples, since it is not
the current philosophy of Panama to extract APIs with extensive
transformations.  But such things must be kept in mind as
possibilities.)

In the case of an object file produced by jextract, it is often
the case that the original native names of the API points
are non-existent (or assigned by the jextract user).
In those cases, jextract will sometimes need to assign
a name which has no external presence at all in the native
API.  And it may need to plug the generated function
into a place which doesn't really have a Java name,
but instead is some kind of "hook", such as the
setter side of a method handle pair which provides
access to a struct field.  In those cases, jextract may
assign "mangled" names at random, or make the
"hooks" be static functions for which the binder
obtains function pointers using an ad hoc linkage API.

(See Maurizio's very cool forthcoming write-up for more
information about method handle pairs for managing transcoding
between Java and native data.)

Here is an example of wrapping the fileno macro from
stdio.h, according to the sketches above:

$ jextract stdio.h jx_stdio.h …
$ cat jx_stdio.h
#pragma JX_WRAPPER_PREFIX "jx_"
int jx_fileno(FILE* fp) { return fileno(fp); }

Here's what the "hook" function might look like:

jx_fileno_2943:
	movl	12(%rdi), %eax
	ret

The "2943" suffix is assigned by jextract to avoid conflicts with
other names; it might be a larger hash code, or perhaps the
function is static and obtained in a special way by the binder.
The metadata would mention two names, "jx_fileno_2943"
for the native name (perhaps with a note where to find it)
and "fileno" for the Java carrier name, used as the name
of a method in a bindable interface.

As a further refinement on such hook functions, it seems
plausible that many of them (but not all!) will be simple
instruction sequences, perhaps even as simple as the
above example.  It is likely that we can arrange to do
special favors for hook functions which are very simple,
including suppressing the JVM's customary Java-to-native
handshake, or even inlining the instructions directly into
Java methods.  For this reason, even C++ APIs, though
they must be hook-rich, have a prospect of performing
very well under Panama wrappers.

— John

P.S. References

Here is a worked example of a suite of binder hooks, to be
generated by jextract in the case of a C++ class API:

Documentation of C/C++ attributes is here:
An easy introduction to Theano (mentioned above) is here:
(Note the sentence "Behind the scene, f was being compiled into C code.")
