hand-edited adjustment of header files

Sun Apr 22 02:33:45 UTC 2018

Hi, John,

JavaCPP uses a "side file" for this purpose, and none of the concerns 
that I had and that you have also expressed here has ever occurred in 
practice. I have successfully provided side information for all the 
C/C++ libraries listed at http://bytedeco.org/ without any issues 
whatsoever.

It just works, no issues, none. (If you do see issues though, please do 
point them out.) Please consider doing something similar for jextract. 
Forcing users to modify header files would bring us back to SWIG... 
Please don't do that! :)

Function-like macros and functions with bodies are also handled without 
any issues! The native C++ compiler compiles those just fine. When I 
spoke to Maurizio last time, he did not seem to mind using the native 
C++ compiler. Why is this an issue for you, John? I think that's 
something we should clear up right away.

BTW, learning how those #pragmas work would be in itself an entirely new 
language, whereas JavaCPP uses 100% normal Java and C++ syntax (plus 
some regex) without anything really new or weird to learn.

Samuel

On 04/22/2018 06:26 AM, John Rose wrote:
> We know that layout specifications and function specifications sometimes
> need ad hoc annotations to be gathered or inferred from header sources
> (and other config. information) and passed through jextract to the output
> metadata bundle.
> 
> Here's a "poster child" use case:  In many APIs, most or all occurrences
> of the type "char*" or "const char*" denote strings, in some agreed-on
> encoding (UTF-8, ISO-Latin-1, locale-dependent, etc.) and length convention
> (usually null termination).
> 
> In some APIs, a select few occurrences work like strchr, where a char*
> value denotes a position in a larger sequence of bytes, but most denote
> null-terminated strings, and would operate the same regardless of which
> copy of a given character sequence was presented.
> 
> We want the user to be able to *optionally* request that some or all
> char* occurrences be bound to a carrier type like String or CharSequence,
> and (again optionally) with additional information about sizing and encoding.
> 
> More generally, beyond the use case of C strings, the user sometimes has a
> legitimate need to specify extra semantic information about API points.
> We need to provide a way to pass this information through the jextract
> and binding pipeline.  Classfile annotations, and layout annotations nested
> inside them, are our chosen technique.
> 
> But how is the user to introduce API annotations in the first place?  We
> could define an ad hoc side-file format like the Annotation File Utilities
> of JSR 308.  Those are hard to use well.  We could require the user
> to hand-edit the metadata produced by jextract: After all, they are
> just Java interfaces with user-readable annotations.  Both of these
> solutions have the smell of hand-editing machine-generated sources,
> which is usually a mistake.
> 
> Happily, front ends like clang allow flexible definition and placement of
> both #pragmas and declaration-specific annotations/attributes.  If we
> can find a way to attach such extra information where jextract can find
> it in the C/C++ input code, then this information can be pushed through
> the extraction phase to help control the binding phase.  Newer C++ even
> has a standard syntax for first-class attributes.
> 
> One obvious way to do this would be to edit the original C/C++ headers
> to add annotations in various places.  A global default to direct all
> char* occurrences to be treated as null-terminated locale strings could
> be placed as a #pragma at the top of the header file defining the
> API.  Declaration-specific annotations or nested #pragmas could
> adjust this default for specific API points.
> 
> This technique only works for header file APIs which are co-designed
> with Panama.  Maybe some day this could be a normal design step
> for C APIs, but in most use cases a jextract user must treat the headers
> as read-only givens.
> 
> But, we can put attributes on existing C API points by repeating those
> API points.  This works fine because C (and C++ to a useful extent)
> allows API points to be repeated, as long as they are mutually
> consistent.  Thus, a jextract user could supply an additional header
> file which repeats selected API points to be extracted, but with
> annotations for jextract to pass through.
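
A minimal sketch of the repeated-declaration idea, in C. The function
lib_get_name, the macro JX_CSTRING, and the annotation string
"jx_cstring" are all made up for illustration; none of them is an
existing jextract feature. The macro uses clang's annotate attribute
where the compiler reports support for it and expands to nothing
elsewhere:

```c
#include <assert.h>
#include <string.h>

/* Stand-in for a declaration from the original, read-only header. */
const char *lib_get_name(void);

/* Hypothetical marker macro. "jx_cstring" is an invented annotation string. */
#if defined(__has_attribute)
#  if __has_attribute(annotate)
#    define JX_CSTRING __attribute__((annotate("jx_cstring")))
#  endif
#endif
#ifndef JX_CSTRING
#  define JX_CSTRING /* compiler has no annotate attribute: expand to nothing */
#endif

/* Contents of the small extra header: the same API point, repeated with an
   annotation. The types match, so this is a legal redeclaration. */
JX_CSTRING const char *lib_get_name(void);

/* Stand-in for the library's definition, so the sketch links on its own. */
const char *lib_get_name(void) { return "panama"; }

/* Usage: callers see one function, whichever declaration they included. */
void jx_redeclare_demo(void)
{
    assert(strcmp(lib_get_name(), "panama") == 0);
}
```

The original header is untouched; only the second declaration carries
the extra information a tool could pick up.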
> 
> We also want to sprinkle attributes in "default mode" across larger
> scopes that encompass multiple API points.  In C++ those scopes
> are often classes or namespaces.  In C they might be structs,
> but probably are not; C API scopes are header files.  So we need
> a #pragma or similar hack to attach default annotations across
> a span of C headers.
> 
> Some tools handle this problem by having #pragmas that "push"
> and "pop" special settings, like turning off selected warnings.
> Perhaps the jextract user will wrap *two* header files around
> the API to be extracted, a leading one to "push" defaults and
> a trailing one to "pop" them and/or perform final adjustments.
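
For comparison, here is how the existing push/pop idiom looks with the
diagnostic pragmas that GCC and clang already provide. This only
illustrates the wrapping pattern; a jextract-specific pragma would be a
new invention, and lib_ignore is a made-up stand-in for the wrapped API.
The three sections would normally live in three separate files:

```c
#include <assert.h>

/* --- leading wrapper header: "push" a setting --- */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wunused-parameter"

/* --- stand-in for the library header being wrapped --- */
int lib_ignore(int unused) { return 0; }

/* --- trailing wrapper header: "pop" restores the previous state --- */
#pragma GCC diagnostic pop

/* Usage: the wrapped code behaves as usual; only the setting changed. */
void jx_push_pop_demo(void)
{
    assert(lib_ignore(7) == 0);
}
```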
> 
> Adding small hand-written header files to a jextract run is
> often a robust way to feed extra information to jextract and
> thence to the binder.  The reason it works (when it works)
> is that C/C++ is obviously a good language to define
> and refine C/C++ API points.
> 
> This technique can break down in several ways.  First, it may
> be inconvenient (or impossible) to repeat an API element.
> An inline function inside a class might be difficult to re-declare
> separately from the class itself.  An API point with a complicated
> type might be impractical to repeat, even as a forward declaration.
> 
> As a second problem, a default annotation to be applied to
> most of an API might require many declarations to be repeated.
> 
> As a third problem, it might be difficult to locate all the places
> where a default annotation is to be applied, and then hand-edit
> the appropriate annotations.
> 
> All of these problems seem to require a notation for referring
> to groups of C/C++ APIs, which pushes outside the strict
> notations of C/C++.  For example, if I want to declare that
> (by default) all function results of type char* are to be treated
> as strings, I might want to smuggle a wildcard notation inside
> of the C/C++ language:
> 
>     #pragma jextract_name_wildcard("JXANY")
>     __attribute__((__JXCSTRING__)) char* some_api_JXANY();
> 
> This might mean "for any function which returns char* and whose
> name begins with the string 'some_api_', add the __JXCSTRING__
> attribute to that function's return type."  (The __JXCSTRING__
> token would probably expand to something more granular,
> such as a specific layout annotation directive.)
> 
> I don't have a strong opinion on the details of such a mechanism,
> but I do think it is likely that we want to inject it into the native
> C/C++ syntax, rather than have a side file with a specifically
> engineered syntax.  The problem with a side file is that, however
> great the syntax would be, it would require effort from the programmer
> to learn, with special difficulty in managing "path-like" expressions
> which refer, remotely, to selected declarations within the header
> files being processed.
> 
> Perhaps the above kind of hack would work well as a pre-pass
> to jextract, where the oddly-encoded wildcard queries are executed
> to produce edited header files, which can be visually checked and
> *then* fed to jextract *in place of* the original header files.
> Or the pre-pass could generate just the smaller extra header
> files which redeclare selected API points, and both the original
> files and the generated extra text would be passed to jextract,
> after a visual check.  This needs experimentation to find the
> sweet spots.
> 
> Hand editing also brings out the problem of API points defined as
> macros.  This requires a separate set of techniques, since macros
> are less structured than C/C++ declarations.  But it seems likely
> that the user can make use of the basic framework of adding
> small hand-edited header files to the jextract run.  In the case of
> macros, the small hand-edited files could contain user-selected
> invocations of the macros, wrapped in function definitions,
> with the functions named by the user (and/or by jextract using
> something like the #pragma hack above).
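
A small sketch of wrapping macros in user-named functions. The macros
LIB_VERSION and LIB_CLAMP stand in for a read-only library header, and
the jx_ names are chosen by hand here; all of them are hypothetical:

```c
#include <assert.h>

/* Stand-ins for macros from a read-only library header (hypothetical). */
#define LIB_VERSION 42
#define LIB_CLAMP(x, lo, hi) ((x) < (lo) ? (lo) : ((x) > (hi) ? (hi) : (x)))

/* Contents of the hand-written extra header: each macro is invoked inside
   an ordinary function, which has a symbol and a fixed signature. */
int jx_lib_version(void) { return LIB_VERSION; }

int jx_lib_clamp(int x, int lo, int hi)
{
    assert(lo <= hi);  /* the wrapper commits to one type and one contract */
    return LIB_CLAMP(x, lo, hi);
}
```

For example, jx_lib_clamp(5, 0, 3) evaluates to 3. Note that the wrapper
has to pick concrete argument types, which the macro itself never did.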
> 
> Eventually for C++ jextract must be able to handle functions
> with bodies, and so C macros are only the first example of a
> larger set of use cases with C++.
> 
> It is a legitimate question how jextract should handle functions
> with bodies.  The basic requirement is pretty clear:  The function
> should be broken into declaration and definition, and the declaration
> part processed by jextract as yet another native API point to wrap
> inside a Java API point.  The usual concerns of carrier type and
> carrier name apply.  The definition of the function must also be
> made available to the eventual user of the Java API.
> 
> This can be done manually by compiling the extra source file into
> a small object file, and loading it along with the main library file.
> But such a manual process is error-prone and should be assisted
> by jextract.  In some cases, jextract may be able to invoke a
> C/C++ compiler directly to compile the extra function definitions
> to object code.  The object code could be a resource file
> packaged in the metadata JAR and dynamically loaded
> by the binder.
> 
> In other cases, jextract may need to generate C/C++ source
> code, and associate it with the metadata, with the expectation
> that a later invocation of the compiler will render the source
> code down to loadable object code.
> 
> If jextract does not produce object code directly, then the
> source code might be built as a separate tool invocation
> step (maybe under control of a makefile).  In that case,
> of course, the same environmental parameters (#ifdefs,
> include paths, etc.) must be present in the tool invocation
> as in the jextract invocation.  Or, perhaps jextract handles
> the tool invocation; maybe this is an argument for giving
> jextract an option which points to a compiler toolchain
> to use for producing binaries.
> 
> Most aggressively, the JVM itself, under control of the
> runtime binder, might run the C compiler on the fly and then
> immediately load the resulting binary code.  This approach
> works well in interactive environments, and/or when the
> compiler toolchain is very specialized, as with Theano.
> 
> One advantage of using source code might seem to be
> portability, since a single metadata JAR equipped with
> portable source code might be bindable on multiple
> platforms.  For Panama, this is usually a siren song,
> which we should listen to only with care, since C/C++
> code is often *mostly* portable but also often contains
> unexpected bits of platform dependency.  The safest
> thing to do is rebuild from source on every platform,
> and that includes re-running jextract on each platform.
> 
> Another, more solid advantage of using source code
> is debuggability.  If the source file is available to a native
> debugger, it might be possible to single-step through it.
> Doing this would be a nice-to-have, requiring more
> careful coordination between jextract and the toolchain.
> 
> In any case, the technique of packaging function definitions
> with extracted APIs is a useful way to bridge over portions
> of the extracted language which cannot be easily encoded
> in a standard ABI.  The C language (except for macros)
> can usually be extracted without the need for object code.
> 
> But C++ inline functions cannot be expressed at all, except
> by AST, IR, or object code, and the JVM would prefer to
> operate only on object code.  (Dynamic loading of LLVM
> bitcode would be an intriguing possibility, of course.)
> So in the long run (and not just for C++) the jextract tool
> needs to have a way to package up executable code
> for the binder to use.
> 
> A final note about naming:  When we extract a Java
> API from a native API (of whatever language) we must
> bridge the gap between the types and names of the
> native API and the types and names allowed by Java
> and its foreign-API infrastructure.  We use the terms
> "carrier type" and "carrier name" to express the mapped
> types and names from the native API as they show up
> in Java.  A key job of the annotations and helper functions
> that we have been discussing here is to give the user
> a say in how jextract maps native API points to Java
> API points.  The carrier name is an important Panama
> annotation, and is sometimes necessary when a C/C++
> name turns out to be an illegal Java identifier (like
> "native" or "operator int").
> 
> (A more elaborate default annotation might try to adjust all names
> with underbar components to Java names that use camel case,
> or to strip a common prefix from a set of function or type names.
> Such information would be used by jextract, not the binder.  I'm
> pointing these tricks out as extreme examples, since it is not
> the current philosophy of Panama to extract APIs with extensive
> transformations.  But such things must be kept in mind as
> possibilities.)
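
To make the naming discussion concrete, here is a sketch of one possible
carrier-name rule. The function jx_carrier_name and both of its rules
(camel-casing underscore components, appending "_" to a Java reserved
word) are assumptions made for illustration, not Panama policy, and the
reserved-word list is deliberately incomplete. Names such as
"operator int" would need more than this:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A few Java reserved words that are legal C identifiers (not exhaustive). */
static const char *const JAVA_RESERVED[] = {
    "native", "class", "final", "import", "package", "new", "this", "throw"
};

static int is_java_reserved(const char *name)
{
    size_t n = sizeof JAVA_RESERVED / sizeof JAVA_RESERVED[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(name, JAVA_RESERVED[i]) == 0)
            return 1;
    return 0;
}

/* Writes a carrier name for `native` into out. Returns 0 on success,
   -1 if the buffer is too small. */
int jx_carrier_name(const char *native, char *out, size_t cap)
{
    assert(native != NULL && out != NULL);
    if (cap < strlen(native) + 2)      /* worst case: name + '_' + NUL */
        return -1;

    size_t len = 0;
    int upper_next = 0;
    for (const char *p = native; *p != '\0'; p++) {
        if (*p == '_' && len > 0) {    /* drop non-leading underscores */
            upper_next = 1;
            continue;
        }
        out[len++] = upper_next ? (char)toupper((unsigned char)*p) : *p;
        upper_next = 0;
    }
    out[len] = '\0';

    if (is_java_reserved(out)) {       /* e.g. "native" becomes "native_" */
        out[len++] = '_';
        out[len] = '\0';
    }
    return 0;
}
```

With these rules, "some_api_get_name" maps to "someApiGetName" and
"native" maps to "native_".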
> 
> In the case of an object file produced by jextract, it is often
> the case that the original native names of the API points
> are non-existent (or assigned by the jextract user).
> In those cases, jextract will sometimes need to assign
> a name which has no external presence at all in the native
> API.  And it may need to plug the generated function
> into a place which doesn't really have a Java name,
> but instead is some kind of "hook", such as the
> setter side of a method handle pair which provides
> access to a struct field.  In those cases, jextract may
> assign "mangled" names at random, or make the
> "hooks" be static functions for which the binder
> obtains function pointers using an ad hoc linkage API.
> 
> (See Maurizio's very cool forthcoming write-up for more
> information about method handle pairs for managing transcoding
> between Java and native data.)
> 
> Here is an example of wrapping the fileno macro from
> stdio.h, according to the sketches above:
> 
> $ jextract stdio.h jx_stdio.h …
> $ cat jx_stdio.h
> #pragma JX_WRAPPER_PREFIX "jx_"
> int jx_fileno(FILE* fp) { return fileno(fp); }
> 
> Here's what the "hook" function might look like:
> jx_fileno_2943:
> 	movl	12(%rdi), %eax
> 	retq
> 
> The "2943" suffix is assigned by jextract to avoid conflicts with
> other names; it might be a larger hash code, or perhaps the
> function is static and obtained in a special way by the binder.
> The metadata would mention two names, "jx_fileno_2943"
> for the native name (perhaps with a note where to find it)
> and "fileno" for the Java carrier name, used as the name
> of a method in a bindable interface.
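
A sketch of how the two names might be produced and recorded together.
The struct jx_hook_name, the function jx_name_hook, and the choice of a
four-digit FNV-1a suffix are all hypothetical; the "2943" in the example
above is not claimed to come from this scheme:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One metadata entry: the symbol the binder must look up, and the name the
   Java interface exposes. Field names are hypothetical. */
struct jx_hook_name {
    char native_name[64];
    char carrier_name[64];
};

/* 32-bit FNV-1a over the bytes of s; any stable hash would do. */
static unsigned long jx_hash(const char *s)
{
    unsigned long h = 2166136261UL;
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h = (h * 16777619UL) & 0xFFFFFFFFUL;
    }
    return h;
}

/* Fills in both names for a wrapper of `macro`. Returns 0 on success,
   -1 if either name does not fit. */
int jx_name_hook(const char *prefix, const char *macro,
                 struct jx_hook_name *out)
{
    assert(prefix != NULL && macro != NULL && out != NULL);
    int n = snprintf(out->native_name, sizeof out->native_name,
                     "%s%s_%04lu", prefix, macro, jx_hash(macro) % 10000UL);
    if (n < 0 || (size_t)n >= sizeof out->native_name)
        return -1;
    n = snprintf(out->carrier_name, sizeof out->carrier_name, "%s", macro);
    if (n < 0 || (size_t)n >= sizeof out->carrier_name)
        return -1;
    return 0;
}
```

Calling jx_name_hook("jx_", "fileno", &h) yields a native name of the
form "jx_fileno_NNNN" and the carrier name "fileno". A four-digit suffix
can still collide, so a real tool would want a larger hash or a
collision check.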
> 
> As a further refinement on such hook functions, it seems
> plausible that many of them (but not all!) will be simple
> instruction sequences, perhaps even as simple as the
> above example.  It is likely that we can arrange to do
> special favors for hook functions which are very simple,
> including suppressing the JVM's customary Java-to-native
> handshake, or even inlining the instructions directly into
> Java methods.  For this reason, even C++ APIs, though
> they must be hook-rich, have a prospect of performing
> very well under Panama wrappers.
> 
> — John
> 
> P.S. References
> 
> Here is a worked example of a suite of binder hooks, to be
> generated by jextract in the case of a C++ class API:
>    http://cr.openjdk.java.net/~jrose/panama/cppapi.cpp.txt
> 
> Documentation of C/C++ attributes is here:
>    https://gcc.gnu.org/onlinedocs/gcc/Attribute-Syntax.html
>    
> An easy introduction to Theano (mentioned above) is here:
>    http://deeplearning.net/software/theano/tutorial/adding.html
> (Note the sentence "Behind the scene, f was being compiled into C code.")
>