Importing native APIs to Java

Mon Jan 5 13:53:27 UTC 2015

Hello, John,

On 01/01/2015 01:19 PM, John Rose wrote:
> [Resend; one more try with flat, correct text.  Please ignore previous 2 messages.  Happy New Year!]

Happy New Year! Great to hear that things are moving forward with this 
project.

> I've posted some thoughts about importing C and C++ native APIs to Java, including some detailed ideas about what the jextract tool should emit.
>
> To some extent, the paper bears on the vexed question of whether to emphasize references or values in APIs.  The default position is "both, explicitly".  But it calls out where different choices could be made.

From my experience with JavaCPP, we need to provide mappings for both
in almost all places where the native side could use them
interchangeably. But sometimes we do only have values. Besides, we have
to think about more than lvalues and rvalues: C++11 now also defines
prvalues, xvalues, and glvalues. What's the plan for those? :)

In this sense, I still feel that the discussion is not leaning enough 
toward C++. It might be that we shouldn't worry too much about C++, but 
since it's one of the main languages used to develop the JDK itself, it 
makes sense to be able to use it as easily as possible from within Java.

In general though, it's starting to sound promising! (Although I'm not
sure about the awkward names with $ in them...) Anyway, below, I'll
put in my 2 cents about issues that I feel might occur down the road
with the proposed approach, based on my experience with JavaCPP.

> **ISSUE:** Annotation metadata can be spread as piecemeal annotations on
> the individual methods, or else rolled up onto the enclosing header
> file.  In the extreme, it can be rolled up into resource files
> associated with imported header files.  Which organization gives the
> best mix of compactness, ease of access, and expressiveness?
> Provisional answer: Spread annotations are the most natural, and will
> be used in the examples below.

Spread annotations work great indeed, but some of the metadata is
going to be global in nature, such as preprocessor options, the names of
the header files to extract, and the native library files to link with.
We should especially provide some space for those somewhere at the "top"
because, for example, it's not always desirable to have 3 nested
interfaces just because some C++ header declares something in 3 nested
namespaces... Alternatively, some of these options could be handled by
the build tools, such as jextract.
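
To illustrate what I mean by metadata at the "top", here is a rough
sketch. The @NativeUnit annotation and all of its element names are
made up for this example; they come from neither jextract nor JavaCPP.
The idea is only that global options sit once on the top-level
interface and can be read back from there.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical annotation: none of these names exist in jextract or JavaCPP.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface NativeUnit {
    String[] headers();                        // header files to extract
    String[] link() default {};                // native libraries to link with
    String[] defines() default {};             // preprocessor options
    boolean flattenNamespaces() default false; // avoid one nested interface per namespace
}

// Global options are declared once, on the top-level extracted interface.
@NativeUnit(headers = {"sys/stat.h"}, link = {"c"}, defines = {"_GNU_SOURCE"},
            flattenNamespaces = true)
interface sys_stat {
    // extracted members would go here
}

class NativeUnitDemo {
    // Builds a one-line summary of the global metadata on an extracted interface.
    static String describe(Class<?> extracted) {
        NativeUnit unit = extracted.getAnnotation(NativeUnit.class);
        if (unit == null) {
            return extracted.getSimpleName() + ": no global metadata";
        }
        return extracted.getSimpleName()
                + ": headers=" + String.join(",", unit.headers())
                + " link=" + String.join(",", unit.link())
                + " defines=" + String.join(",", unit.defines())
                + " flatten=" + unit.flattenNamespaces();
    }

    public static void main(String[] args) {
        System.out.println(describe(sys_stat.class));
    }
}
```

The same information could just as well live in a resource file or in
jextract command-line options; the annotation form is only one option.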

> ### Import header files to top-level interfaces
>
> An import unit is a header file.  (It could also be a group of headers.)
> The header file gives its name to a top-level extracted interface.
> Additional extracted interfaces are nested inside it as member types.
> Elements which occur textually in the header file are rendered in the extracted interfaces.
>
> Example:
>
>      $ jextract sys/stat.h
>      C: # line 1 "sys/stat.h" \n …
>      J: interface sys_stat { … }

Basing the import unit on the header file keeps the awkward dichotomy 
between header files and library files. Most of the time, they don't 
even have the same basename!

In this case, for example, users would need to "import something.stat"
in their Java files, but would need to bring along a file named
something like libc.dll. That's far from intuitive, and it's why I
base the import unit on library files in JavaCPP. I am
wondering, are there any advantages in bringing this distinction between
header files and library files into Java?

> ### Import typedefs as nested annotation definitions
>
> A new type name introduced in a header file is imported as a member of
> the enclosing extracted interface.  The member itself is an annotation.
>
>      C: typedef int count_t;
>      J: @C.Typedef(int$.class) @interface count_t { }

Typedefs in C/C++ don't create new types. They only provide synonyms to 
existing types. If we are to create new types for those in Java, we need 
to offer some way to convert between all those types, and there can be 
many, many synonyms. And we don't necessarily know of them at compile 
time, but only at runtime, between unrelated libraries...
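
One way to keep synonyms interchangeable might be to record the typedef
name purely as metadata on the base type. The sketch below assumes Java 8
type annotations; the @Typedef meta-annotation and the unistd interface
are hypothetical names for this example only. Because the Java type stays
"int", no conversion between synonyms is needed.

```java
import java.lang.annotation.Annotation;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.AnnotatedType;
import java.lang.reflect.Method;

// Hypothetical meta-annotation that records the base type of a typedef.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.ANNOTATION_TYPE)
@interface Typedef {
    Class<?> value();
}

// C: typedef int pid_t;  (kept as a name only, not as a carrier type)
@Typedef(int.class)
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE_USE)
@interface pid_t { }

// C: pid_t getpid();  (the Java return type is still plain int)
interface unistd {
    @pid_t int getpid();
}

class TypedefDemo {
    // Reports the typedef name and base type of a no-argument method's return type.
    static String describeReturn(Class<?> owner, String methodName) {
        Method m;
        try {
            m = owner.getMethod(methodName);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(e);
        }
        AnnotatedType returnType = m.getAnnotatedReturnType();
        for (Annotation a : returnType.getAnnotations()) {
            Typedef alias = a.annotationType().getAnnotation(Typedef.class);
            if (alias != null) {
                return a.annotationType().getSimpleName() + " -> " + alias.value().getName();
            }
        }
        return m.getReturnType().getName(); // no typedef recorded
    }

    public static void main(String[] args) {
        System.out.println(describeReturn(unistd.class, "getpid"));
    }
}
```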

> In most cases, the getters and setters can be derived automatically from
> the reference functions.  As with global variables, they could be supplied
> as default methods, or else (with some loss of usability) omitted.
>
>      J: interface gauint {
>        abstract C.int$ref re$ref();
>        int re() { return re$ref().get(); }
>        void re$set(int x) { re$ref().$set(x); }
>      }
>
> The set of field names can be derived from the set of method names on
> the rendered interface.  This is not enough to determine a layout,
> however, because Java methods are _not_ reliably ordered, as viewed by
> reflection.  (And _that_ is _sad_.)  Therefore, we must have
> additional metadata that supplies the order of fields, and/or assigns
> their various offsets.

That's not necessarily the case. If we make the header file a
requirement for compilation, the ordering and offset metadata can be
added by the compiler and hidden from the user. It all depends on what
gets done in what order and how. :)
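
As a sketch of what tool-generated layout metadata could look like: the
@Offset annotation is hypothetical, the "im" field is my own addition to
the gauint example, and the offsets assume a 4-byte int. The user would
never write these annotations by hand; the point is only that field
order can be recovered from them regardless of reflection order.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical annotation that a tool could emit from the header file.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Offset {
    int value(); // byte offset of the field within the struct
}

// What a tool might generate for: struct gauint { int re; int im; };
// (the "im" field and the 4-byte int size are assumptions)
interface gauint {
    @Offset(4) int im();
    @Offset(0) int re();
}

class LayoutDemo {
    // Recovers the field order from the offsets, not from reflection order.
    static List<String> fieldOrder(Class<?> struct) {
        List<Method> fields = new ArrayList<>();
        for (Method m : struct.getDeclaredMethods()) {
            if (m.isAnnotationPresent(Offset.class)) {
                fields.add(m);
            }
        }
        fields.sort(Comparator.comparingInt(
                (Method m) -> m.getAnnotation(Offset.class).value()));
        List<String> names = new ArrayList<>();
        for (Method m : fields) {
            names.add(m.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(fieldOrder(gauint.class));
    }
}
```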

> ### Import C++ statics to a companion interface
>
> A C/C++ static field, constant, or member of a class `Foo` will be
> imported into an extracted interface which is different from the
> extracted interface that represents the type `Foo` and its instances.
>
>      C: class Foo { static int get_errno(); }
>      J: interface Foo$static { int get_errno(); }

Why is a separate companion interface required? It works just fine from 
within the same class with JavaCPP. I must be missing something here.

> ### Import C++ inline definitions using an extra DLL
>
> In some cases, an inline function definition may be unavailable to a DLL-based binder.
> Such a function can be emitted to an extra DLL at import time and made available to the binder.
>
> (Some macros will also benefit from this treatment.)

How will those get compiled? JavaCPP explicitly requires the user to 
provide a C++ compiler that supports the C++ ABI and runtime used by the 
library...

> ### Typedef preservation
>
> Occurrences of typedefs in C are mapped to their base types during
> import, but the original typedef is (if possible) preserved.
>
>      C: pid_t getpid();
>      J: @pid_t int getpid();
>
> *Road not taken:* It is possible that we could nominalize C typedefs,
> so that they show up as first class Java types.
>
>      C: typedef int pid_t;
>      J: interface pid_t { int $baseValue(); }
>      C: pid_t getpid();
>      J: pid_t getpid();
>
> In that case, the nominalized carrier type would be a wrapper for its base type.

"Road not taken": Sounds good given my comment above :)

> The widest integral type in Java is `long`, the signed 64-bit type.
> Integral types, such as `unsigned long long` which have values outside
> this range are represented by boxed `Number` objects.
>
>      C: unsigned long long giant();
>      J: C.uint64_t giant();

That's not very efficient. Java code already often treats primitive
integer values as "unsigned" by convention! Or can a boxed type like
this somehow be JIT compiled into a primitive type in the JVM?
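
For reference, Java 8 added unsigned helper methods on Long, so a
primitive long can carry all 64 bits of an `unsigned long long`. In the
sketch below, giant() is only a stand-in for the native function, not a
real binding.

```java
import java.math.BigInteger;

class UnsignedDemo {
    // Stand-in for a native "unsigned long long giant()" returning 2^64 - 1.
    static long giant() {
        return 0xFFFFFFFFFFFFFFFFL; // same 64 bits as -1L
    }

    public static void main(String[] args) {
        long g = giant();

        // Interpret the same bits as an unsigned value, with no boxing.
        System.out.println(Long.toUnsignedString(g));        // 18446744073709551615
        System.out.println(Long.compareUnsigned(g, 1L) > 0); // true
        System.out.println(Long.divideUnsigned(g, 2L));      // 9223372036854775807

        // Boxing is only needed when the value must leave the 64-bit world.
        BigInteger big = new BigInteger(Long.toUnsignedString(g));
        System.out.println(big.bitLength());                 // 64
    }
}
```

The caller has to remember which longs are unsigned, so this trades some
type safety for speed.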

> There are various reasons for distinguishing pointers from references,
> such as the distinctions made in the C and C++ languages themselves.
>
> But the biggest reason to make a distinction is the fact that C
> pointers, in their basic semantics, are much less safe than references,
> because they allow casting and pointer arithmetic (`plus`, `minus`)
> without reliability checks.

But this brings along the complexity of C++, which might not be welcomed
by Java users who expect simple interfaces. I only provide one base
Pointer class in JavaCPP for this very reason: Simplicity.

> Likewise it is useful to distinguish pure
> values (r-values) from their references (l-values).  It is mostly
> possible to force the types to coincide, but the semantics are
> disjoint.  In particular, a pure value cannot change due to
> race conditions, while the value of a reference can.  The conversion
> from l-value to r-value is therefore semantically significant.
> Disregarding it leads to programs with race conditions.
>
> So there is a three-way distinction between value, reference, and pointer.
> One might say, as a rule of thumb, "When in doubt, make a distinction."
>
> **ISSUE:** What about multi-level pointers like `int***`?  The notation `int$ptr.ptr.ptr` doesn’t scale well.  Provisional answer:  Pick a number between 1 and 2, and always pre-define that many levels of pointer and reference.
>
> In any case, there must be a pointer type factory to cover multi-level pointer types.

Or we could provide a ptr$ptr type that can dereference itself: That's 
what I do in JavaCPP with the PointerPointer class. Maybe not safe, but 
simple, fast, etc.
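
To show the idea of a pointer type that dereferences to itself, here is
a toy sketch over simulated memory. It is not JavaCPP's actual
PointerPointer implementation: the PtrPtr class, the ByteBuffer standing
in for native memory, and the 4-byte "addresses" are all assumptions for
this example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical self-dereferencing pointer over simulated native memory.
final class PtrPtr {
    private final ByteBuffer memory; // stands in for native memory
    private final int address;       // byte offset this pointer points at

    PtrPtr(ByteBuffer memory, int address) {
        this.memory = memory;
        this.address = address;
    }

    // Dereference once: read the stored address and wrap it in the same type.
    PtrPtr deref() {
        return new PtrPtr(memory, memory.getInt(address));
    }

    // Final level of indirection: read the int this pointer points at.
    int getInt() {
        return memory.getInt(address);
    }
}

class PtrPtrDemo {
    // Models: int x = 42; int* p = &x; int** pp = &p; int*** ppp = &pp;
    static int readThroughThreeLevels() {
        ByteBuffer mem = ByteBuffer.allocate(24).order(ByteOrder.nativeOrder());
        mem.putInt(0, 42);  // x  lives at address 0
        mem.putInt(8, 0);   // p  lives at address 8 and holds &x
        mem.putInt(16, 8);  // pp lives at address 16 and holds &p
        PtrPtr ppp = new PtrPtr(mem, 16); // ppp holds &pp
        // ***ppp: two pointer hops, then one int read. No checks are made.
        return ppp.deref().deref().getInt();
    }

    public static void main(String[] args) {
        System.out.println(readThroughThreeLevels());
    }
}
```

A single type covers any depth, at the cost of not tracking how many
levels remain.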

> **ISSUE:** How much does all this help us with the next language, such as C#, Lisp, etc.?  Some, hopefully.

Personally, I don't see this as language interop, but as a way to access 
low-level functionality. The JDK contains C++ code, just like a C++ 
compiler contains assembly code, and a user of either of those can drop 
to the lower level when required. AFAIK, the JDK doesn't contain code in 
C# or Lisp so it does not make sense to support those explicitly. 
Interestingly, CPython and Ruby MRI are written in C, so it makes sense
that they provide an interface to C APIs, but not to C++ ones.

I understand that we still have much room to cover, but I hope that what 
I wrote above can help a bit! Comments? Questions?

Samuel