notes on binding C++

Tue Jan 30 04:37:06 UTC 2018

Hello, everyone,

On a private channel, Maurizio pointed me back to the notes from John 
Rose quoted below. This approach does look quite close to what JavaCPP 
is already doing. It works quite well, but it is not very efficient 
because every call goes through a stub. Still, for things like C++ 
virtual method calls, JNI does not add undue overhead.

After some discussion, and based on my experience, I can see things 
happening this way:
1. Use a working C++ parser like Clang to get the AST, etc. of the 
header files, as jextract is already doing.
2. Try to map everything to Java interfaces, picking up additional 
information from the user, as with InfoMap in JavaCPP.
3. Generate two kinds of C++ stubs: a) one kind that simply makes the 
call or returns the requested data, where JNI on its own does the job 
just fine, and b) another kind that returns metadata that the JVM can 
use to do things by itself more efficiently, such as the offsetof() and 
sizeof() values already returned to JavaCPP, which let users access 
memory more efficiently, but more awkwardly in the absence of something 
like data layouts.
4. Pass them to the C++ compiler pointed to by the user, so we get all 
our layouts, mangling, vtables, etc. right. I am not as much in favor of 
using debug information as Stephen Kell is, because C++ programmers do 
not rely on that information to /use/ a library: they use the C++ 
compiler required by the library.
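
To make the two kinds of stubs in step 3 concrete, here is a minimal 
sketch. The Point type and every stub_* name are hypothetical, made up 
for illustration; this is not what JavaCPP or jextract actually 
generates.

```cpp
#include <cstddef>  // offsetof, std::size_t

// Hypothetical library type, standing in for something found in a parsed header.
struct Point {
    int x;
    double y;
    // Defined inline, so there may be no linker symbol to call directly.
    double norm2() const { return x * x + y * y; }
};

extern "C" {

// Kind (a): a stub that simply performs the call and returns the data.
double stub_Point_norm2(const Point* p) { return p->norm2(); }

// Kind (b): stubs that return layout metadata, computed by the same C++
// compiler that builds the library, for the JVM side to use on its own.
std::size_t stub_Point_sizeof() { return sizeof(Point); }
std::size_t stub_Point_offsetof_x() { return offsetof(Point, x); }
std::size_t stub_Point_offsetof_y() { return offsetof(Point, y); }

}  // extern "C"
```

With stubs of kind (b), the Java side only has to call each one once 
and can then read or write fields through a direct buffer without any 
further native calls.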

I can see things progressing from a JNI-only approach (SWIG), to one 
where we get some information back that we can use with, say, direct NIO 
buffers (anything using libffi, JavaCPP), then on to data layouts (with 
jextract), to more complicated function calls (somehow decoding non-POD 
C++ classes with vtables and mangled names, possibly with tools like 
c++filt/undname.exe), and finally to much more complicated cases like 
virtual inheritance. But virtual inheritance works just fine with JNI, 
and I do not currently see a more efficient approach. Is there one? In 
any case, in my opinion, we will always need the help of a C++ compiler 
for templates, so we might as well use it. For template-only libraries, 
it could make sense to simply use an embedded version of LLVM with Clang 
to JIT everything, but most C++ libraries are hybrid, and some require 
their C++ headers to be compiled with a given version of one very 
specific C++ compiler.
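
As an illustration of why templates need the compiler, here is a small 
sketch: a template has no machine code until something instantiates it, 
so a stub file has to name each specialization it wants to expose. The 
Accumulator template and the stub_* names are hypothetical.

```cpp
// Hypothetical header-only template, as found in a template-only library.
template <typename T>
class Accumulator {
public:
    void add(T value) { sum_ += value; ++count_; }
    T mean() const { return count_ ? sum_ / static_cast<T>(count_) : T(); }
private:
    T sum_ = T();
    long count_ = 0;
};

// Explicit instantiation: ask the C++ compiler to emit code for this one
// specialization. Without an instantiation, nothing exists to bind to.
template class Accumulator<double>;

extern "C" {

// Plain C entry points around the instantiated specialization.
Accumulator<double>* stub_Accumulator_double_new() {
    return new Accumulator<double>();
}
void stub_Accumulator_double_add(Accumulator<double>* a, double value) {
    a->add(value);
}
double stub_Accumulator_double_mean(const Accumulator<double>* a) {
    return a->mean();
}
void stub_Accumulator_double_delete(Accumulator<double>* a) { delete a; }

}  // extern "C"
```

Which specializations to instantiate is exactly the kind of additional 
information that has to come from the user.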

BTW, if the C API of libclang exposes all the features we need, it is 
already possible to use it from Java: 
https://github.com/bytedeco/javacpp-presets/tree/master/llvm  Is 
jextract in good enough shape to offer us such an interface to work with 
and do some dogfooding? If so, it might be a good place to start a 
testbed for jextract too. That is exactly what the JavaCPP Presets are: 
a testbed for JavaCPP. This is what makes JavaCPP actually work with 
(a few) C++ libraries out there in the wild--unlike SWIG, CppSharp, 
rust-bindgen, etc. Thoughts?

Samuel

On 10/31/2016 11:27 AM, John Rose wrote:
> On Oct 28, 2016, at 10:49 PM, John Rose <john.r.rose at oracle.com> wrote:
>>
>> Mikael and I have had a few good conversations about binding C++ to Java interfaces.
>>
>> The following notes are FTR, TBD, NYI, and every other TLA which implies "tentative".
> 
> 
> Here are a few more thoughts about C++ binding in Java.
> These notes are also captured FTR in this file:
>    http://cr.openjdk.java.net/~jrose/panama/cppapi.cpp.txt
> 
> API point linkage stubs, as generated by jextract.
> 
> Any type has a number of _API points_ that may be applied to values
> of that type.  For example, C++ classes may supply API points for
> field access, method call, implicit conversions, etc.  Making a
> subclass is a complex API point.  Fortran arrays may be read,
> written, sliced, and aliased with other arrays.
> 
> Some API points are defined in terms of an OS-specific ABI, which
> means that on any given system there is a specific series of
> machine instructions that operate the API point.  For ANSI C, all
> API points, except macros, are defined by an ABI.  For C++, ABI
> support may be partial and/or unstable.
> 
> ABI-defined API points are data access (structs and arrays) and
> function calls (both named and via a function pointer).  On some
> systems the ABI may also specify the mechanics of name mangling,
> virtual function calls, and subclass layout.
> 
> A C++ inline function consists of code that is replicated into
> client uses of that function.  Unix-like ABIs do not directly
> represent the action of an inline function, and so API features
> built from function inlining are not supported by those ABIs.
> 
> An ABI-defined API point can be operated by a metadata-driven
> mechanism, such as libffi, or the JVM's native call generator.
> Other API points require a real compiler to directly emit code, at compile
> time, to operate a particular API point on a particular variable.
> 
> If an ABI could include enough AST or IR capabilities to represent
> a function body, that function could be exported to applications
> without direct inlining at compile time.  The inlining would take
> place during linking or JIT compilation.  This in fact is what the
> JVM does, since its ABI can encode most methods using bytecodes.
> This more powerful representation allows more optimizations to
> occur after link time.
> 
> On Unix-like systems, nearly all API points can be supported at
> least indirectly by the system ABI.  One simple way to do this is
> by wrapping the essential action of each API point (for each type)
> into a _machine code stub_ which contains the code that the
> compiler would generate to operate that API point.  The stub itself
> must be callable using the ABI; typically it is a function with
> arguments drawn from a limited set of types (pointers and other
> scalars).  If the type being operated on is complex, the stub
> requires the caller to put the type's value in memory first, and
> then pass a pointer to the stub.  In this way, a wide variety of
> non-ABI-capable operations can be expressed using little snippets
> of binary code wrapped in ABI-capable entry points.  These little
> snippets are called out-of-line, and so may cost performance and
> prevent some optimizations.  But they are convenient and often good
> enough.
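
A minimal sketch of such a machine code stub, assuming a hypothetical 
Vec3 type and cross() function (neither is from the notes): the inline 
function takes and returns a struct by value, while the stub only 
passes pointers, so the caller puts the values in memory first.

```cpp
// Hypothetical library type and inline function. The function takes and
// returns a struct by value and may not exist as a symbol in any library.
struct Vec3 {
    double x, y, z;
};

inline Vec3 cross(Vec3 a, Vec3 b) {
    return Vec3{a.y * b.z - a.z * b.y,
                a.z * b.x - a.x * b.z,
                a.x * b.y - a.y * b.x};
}

// Out-of-line stub: only pointers cross the boundary. The caller places
// the operands in memory and receives the result in memory it provides.
extern "C" void stub_cross(const Vec3* a, const Vec3* b, Vec3* out) {
    *out = cross(*a, *b);
}
```

The body of the stub is whatever the compiler would have emitted at an 
ordinary call site; only the entry point is restricted to scalar and 
pointer arguments.
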
> 
> The jextract tool scans a header file (or other API specification)
> and finds API points to make available to a Java programmer.  It
> emits metadata in Java native form, which is to say it emits a
> bundle (JAR) of class-files.  The classes are purely abstract
> interfaces describing the shape of the APIs, not their contents.
> Annotations are used to bind ABI parameters to particular names.
> For example, a struct field might be annotated with its type, name,
> and offset, and a function might be annotated with its type, name,
> and linker symbol.  Elements that can be easily computed from the
> Java types and names need not be repeated in annotations.
> 
> When the Java application runs, it loads the extracted metadata and
> runs a _binder_ on it, which gives implementations to all the
> interfaces, implementations which are consistent with the ABI
> requirements.  For example, a struct field might be accessed with
> a call to a "get" or "put" operation from the "Unsafe" facility,
> computing the address using the offset associated with the field.
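
For illustration only, here is a C++ analogue of that offset-based 
access. In Java the binder would use Unsafe instead; the Sample and 
FieldInfo types and the helper names below are hypothetical and only 
mimic what annotation-carried metadata might look like.

```cpp
#include <cstddef>  // offsetof, std::size_t
#include <cstring>  // std::memcpy

// Hypothetical native struct described by the extracted metadata.
struct Sample {
    int id;
    double value;
};

// Hypothetical metadata record: what an annotation on the field might carry.
struct FieldInfo {
    const char* name;
    std::size_t offset;
};

const FieldInfo kSampleValue = {"value", offsetof(Sample, value)};

// "get": compute base + offset and copy the bytes out of the struct.
double get_double(const void* base, const FieldInfo& field) {
    double result;
    std::memcpy(&result,
                static_cast<const unsigned char*>(base) + field.offset,
                sizeof result);
    return result;
}

// "put": compute base + offset and copy the bytes into the struct.
void put_double(void* base, const FieldInfo& field, double v) {
    std::memcpy(static_cast<unsigned char*>(base) + field.offset, &v, sizeof v);
}
```

No stub is called for the access itself: once the offset is known, the 
access is plain address arithmetic.
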
> 
> An inline function cannot (in the general case) be represented
> fully using metadata, so the jextract tool must also emit a machine
> code stub which wraps the function (as if it were out-of-line).
> The jextract tool must also leave enough "clues" in the metadata to
> enable the binder to associate each API point with the correct
> stub.  These stubs should be emitted in two forms: First, as C++
> code, for purposes of debugging and porting.  Second, as a DLL to
> be loaded into the JVM with the associated library.
> 
> Here are some examples of C++ classes and associated suites of
> machine code stubs.
> 
> http://cr.openjdk.java.net/~jrose/panama/cppapi.cpp.txt
>