notes on binding C++

Sat Oct 29 02:49:50 UTC 2016

Mikael and I have had a few good conversations about binding C++ to Java interfaces.

The following notes are FTR, TBD, NYI, and every other TLA which implies "tentative".

The job is constrained on one hand by the many degrees of freedom of C++ APIs, and on
the other by the relative simplicity of Java interfaces.  As a really simple example, a
"source name" in C++ can often, but not always, be rendered directly as a "carrier
name" of a method in a Java interface.  The source name "throws" has to be
perturbed to "fit" into the Java language.  Similarly, source types and source scopes
map to Java types and Java scopes in a complex way.  Yet we think the usual result can
be made to feel useful and even natural to the C++ programmer coding in Java syntax.

Key principle:  To express access ("bindings") to native APIs,  we use Java interfaces only.
No concrete types will appear to the end-user, since those would constrain the implementation
underneath.  (For example, we want to use value types when the time comes!)  We don't
even use the full range of behaviors interfaces can have:  Default methods will not be used,
or at most for simple "macro-like" patterns to provide abbreviations to end-users.  (E.g.,
compose "getter" and "setter" methods on top of a single "get-address" method, for
fields which can be addressed that way.)  Static constant fields will be used rarely
or never, even for C constants like EOF.  (Such static constants can be created in a
post-processing phase, by a "civilizer" tool.)

What about interface subtyping?  Can we make use of that?  Consider a simple use-case:

class A { public: virtual void vm(); };  // "Virtual Method"
class B : public A { public: virtual void vm(); virtual void vm2(); };

In this case, in C++ every instance of B can be treated as if it were of type A.
So it would make sense to have a pair of extracted Java interfaces with subtyping:

interface A extends ObjectReference<A> { void vm(); }
interface B extends A, ObjectReference<B> { void vm(); void vm2(); }

(Here, the common super ObjectReference contains all methods relevant to
managing C++ object references.  It is TBD and NYI.  It is probably a subtype
of Reference.)

There are two ways a native object of type B can be encapsulated in Java.
If it is presented via an extracted API to Java using the static type B, then
Java will wrap it in a pointer of type B, and the 'vm2' method will be present
in Java.  But if the native object is presented to Java via a static API type of
A (which is perfectly valid in C++), then Java will wrap it in a pointer of type A.
In that case, even though it is a native object of type B, there will be no visible
clue to this fact, and no access to 'vm2' will be given.  If Java executes the 'vm'
method, B::vm will of course be executed, not because of Java dispatch, but
because of C++ dispatch from a virtual call to vm in A.

Suppose a Java pointer of type A really points to a native object of type B.  Can
the B type be recovered?  Yes, in either of two ways:  First, the user can issue
an unsafe down-cast from A to B.  This requires special knowledge, and/or boldness.
Second, if C++ provides RTTI, and the jextract tool arranges (somehow, TBD, NYI)
to consult this information, then a safe exception-throwing downcast can be supplied
to the Java user, which would re-wrap the A pointer as a B pointer (after verifying
correctness using RTTI).  Note that this has to be a method call, not a Java cast
from A to B.  The wrapping of the B pointer in an A wrapper does not automatically
include the ability to cast to the B type.
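
To make the "method call, not a Java cast" point concrete, here is a rough sketch.  It is a
simulation only:  FakeRtti, AWrapper, BWrapper and Downcasts.toB are hypothetical names, and a
Java map stands in for whatever RTTI lookup jextract might eventually arrange (TBD, NYI).

```java
import java.util.Map;

// Extracted interfaces, with B extending A as in the example above.
interface A { long address(); String vm(); }
interface B extends A { String vm2(); }

// Simulated native heap: address -> dynamic C++ type name (a stand-in for RTTI).
final class FakeRtti {
    static final Map<Long, String> DYNAMIC_TYPE = Map.of(0x1000L, "A", 0x2000L, "B");
    static String typeOf(long address) { return DYNAMIC_TYPE.get(address); }
}

// Wrapper created when the native API hands out a pointer of static type A*.
class AWrapper implements A {
    private final long address;
    AWrapper(long address) { this.address = address; }
    @Override public long address() { return address; }
    // C++ virtual dispatch, simulated: the dynamic type selects the override.
    @Override public String vm() { return FakeRtti.typeOf(address) + "::vm"; }
}

// Wrapper created when the static type is B*.
final class BWrapper extends AWrapper implements B {
    BWrapper(long address) { super(address); }
    @Override public String vm2() { return "B::vm2"; }
}

final class Downcasts {
    // Safe downcast: verify with the (simulated) RTTI, then re-wrap the same address.
    static B toB(A a) {
        if (!"B".equals(FakeRtti.typeOf(a.address()))) {
            throw new ClassCastException("native object is not a B");
        }
        return new BWrapper(a.address());
    }
}
```

An AWrapper over address 0x2000 runs B::vm when asked for vm, yet it is not a Java instance
of B, so a plain Java cast to B fails.  Only Downcasts.toB can produce the B view.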

Confusing?  Yes.  The confusion increases as we attempt to model C++ type system
relations with Java type system relations.  A way to simplify the situation, then,
would be to remove the relation between Java types A and B, making them disjoint:

interface A extends ObjectReference<A> { void vm(); }
interface B extends ObjectReference<B> { void vm(); void vm2(); }

The common reference type includes viewing operations which can be used to recover
the A-reference view from a B-reference.  Unless the binder (for some reason) merges
the implementations for A and B object references, a direct Java cast won't work:

  B myb = …;
  A aview = (A) myb;  // FAIL with ClassCastException

What if I, as a Panama programmer, create an instance of B (native B wrapped in
a B interface), and then try to pass it to a C++ API that accepts A?

  interface API { void foo(A obj); }

If the interface types do not have a sub/super relation, that will fail to compile,
won't it?

  API api = …;
  B myb = …;
  api.foo(myb);  //FAIL if A/B disjoint
  A aview = myb;    //FAIL if A/B disjoint

Sounds like a flaw in the user model.  We can fix this in part by asking the user
to explicitly perform a C++ up-cast from B to A, using a view-as operation:

  api.foo(myb.viewAs(A.class));  //WIN even if A/B disjoint
  A aview = myb.viewAs(A.class);    //WIN even if A/B disjoint
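
For what it's worth, here is one way a view-as operation over disjoint interfaces could be
mocked up in plain Java, using java.lang.reflect.Proxy.  This is a sketch under assumptions,
not the binder:  the shape of ObjectReference (address, viewAs) and the Binder class are
invented for illustration, and the "native" B object is faked with string results.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

// Hypothetical common super; the real ObjectReference is TBD.
interface ObjectReference<T> {
    long address();
    <V extends ObjectReference<V>> V viewAs(Class<V> view);
}

// Disjoint extracted interfaces: neither extends the other.
interface A extends ObjectReference<A> { String vm(); }
interface B extends ObjectReference<B> { String vm(); String vm2(); }

final class Binder {
    // Wraps a pretend native address in an object implementing exactly one view.
    static <V extends ObjectReference<V>> V wrap(long address, Class<V> view) {
        return view.cast(wrapRaw(address, view));
    }

    private static Object wrapRaw(long address, Class<?> view) {
        InvocationHandler handler = (proxy, method, args) -> {
            switch (method.getName()) {
                case "address": return address;
                // Same address, different view type.
                case "viewAs":  return wrapRaw(address, (Class<?>) args[0]);
                // The pretend native object is a B, so virtual calls land in B.
                case "vm":      return "B::vm";
                case "vm2":     return "B::vm2";
                default: throw new UnsupportedOperationException(method.getName());
            }
        };
        return Proxy.newProxyInstance(view.getClassLoader(), new Class<?>[] { view }, handler);
    }
}
```

A B view made by Binder.wrap(addr, B.class) is not a Java instance of A, so casting it to A
throws ClassCastException, while myb.viewAs(A.class) hands back an A view over the same address.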

The view-as operation can be pushed into the bound API, so it can be made
automatic, if the extracted API is more weakly typed:

  interface API { void foo(/*A*/ ObjectReference<?> obj); }

But that seems too weak.  Shall we go back to having B extend A?

Wait a moment; there are more problems with that.  In Java, interface methods
are always virtual, but in C++ methods do not have to be virtual.  Moreover, C++
has other constructs, besides plain methods, which are not virtual.  These include
fields, constructors, qualified method invocations, and statics.

What do we mean when we say "non-virtual construct" for a C++ type A?
Given C++ types A and B extending A, a feature involving A is "virtual" if it passes
this test:  We create an object of type B and then apply the construct to the object
in two ways, once via the static type B, and once (after assigning its reference
to an A-reference) via the static type A.  The test succeeds if the two applications
of the same construct perform exactly the same actions.  The test fails if the
static type affects the semantics of the construct.  For a given A/B pair, a construct
is called "virtual" if it passes that test and "non-virtual" otherwise.

(For convenience, let's say that constructs which apply to the type alone, and
not to the instance, are also non-virtual, since the given test cannot be applied.
Thus, statics and constructors are non-virtual.)

Put a field of the same name in both A and B, and apply the virtuality test:

  class A { public: int f; };
  class B : public A { public: int f; };
  B myb = …;
  int x1 = myb.f;
  A& mya = myb;
  int x2 = mya.f;

What happens?  Since B shadows A::f with its own B::f, and since the C++ language
does not make fields virtual, it follows that the f-field is not virtual in A and B.

The same thing happens for C++ methods which are not declared virtual, since they
also shadow instead of override:

  class A { public: void pm(); };  // "Plain Method"
  class B : public A { public: void pm(); };

What happens if we model non-virtual features of C++ classes using Java
interfaces?  If the interfaces are disjoint, it would seem there is no ambiguity
between which method is invoked:

  interface A extends ObjectReference<A> { void pm(); }
  interface B extends ObjectReference<B> { void pm(); }

  B myb = …;
  myb.pm();  // => B::pm
  A aview = myb.viewAs(A.class);    //WIN even if A/B disjoint
  aview.pm();  // => A::pm

Can anything go wrong here?  Just the usual thing to annoy a Panama programmer:
Since A is not a super of B, we have to make an explicit view-as call instead of a
cast or implicit conversion.  What if we put back in the type relation (B extends A)?

  interface A extends ObjectReference<A> { void pm(); }
  interface B extends A, ObjectReference<B> { void pm(); }

Notice what happens:  The non-virtual method becomes partially virtualized,
depending on the ambiguity mentioned before.  Let's grab a native B object and
wrap it in a B pointer and then an A pointer:

  // api.h:  inline B* make_B() { return new B(); }
  B myb = api.make_B();  // ad hoc wrapper over operator new
  myb.pm();  // => B::pm, so far so good
  A aview1 = myb;  // Java ref-cast
  aview1.pm();  // => same B::pm, per rules of Java ref-casting
  A aview2 = myb.viewAs(A.class);
  aview2.pm();  // => A::pm, per rules of C++

In C++ the same object, of type both A and B, can respond in two different ways
to the invocation of the method name "pm".  In Java the same thing can happen,
but the rules for selection depend on the wrapped pointer's dynamic type, not
on the static type (as in C++).  If you have a list of A's (List<A>) in Java, and you
invoke "pm" on each, you will get a mix of "A::pm" and "B::pm".  For any object in
the list which is a true A, only "A::pm" is reachable, of course.  But for those which
are true B's, either method can be reached, depending on which B's were wrapped
as A's and which were wrapped as B's.

(The same point applies to the field 'f' above.  In effect, the field becomes
partially virtualized, if B extends A.)
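
The List<A> scenario is easy to reproduce in pure Java.  In this sketch the wrapper classes
AView and BView and the helper viewAsA are hypothetical stand-ins for whatever the binder would
generate, and the strings merely name the C++ function that would run.

```java
import java.util.ArrayList;
import java.util.List;

interface A { String pm(); }
interface B extends A { @Override String pm(); }

// Wrapper for a pointer of static type A*: binds pm to A::pm.
final class AView implements A {
    @Override public String pm() { return "A::pm"; }
}

// Wrapper for a pointer of static type B*: binds pm to B::pm, and is also a Java A.
final class BView implements B {
    @Override public String pm() { return "B::pm"; }
    A viewAsA() { return new AView(); }   // explicit C++-style up-cast
}

final class MixDemo {
    static List<String> callAll(List<A> as) {
        List<String> out = new ArrayList<>();
        for (A a : as) out.add(a.pm());
        return out;
    }

    public static void main(String[] args) {
        BView b1 = new BView();
        BView b2 = new BView();
        // Both elements stand for native B objects, but they were wrapped differently.
        List<A> as = List.of(b1, b2.viewAsA());
        System.out.println(callAll(as));   // prints [B::pm, A::pm]
    }
}
```

Both list elements have static type A, and both stand for native B's, yet the method reached
depends on how each one happened to get wrapped.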

There is a trick to keep interface subtyping, but prevent the unpredictable
virtualization of non-virtuals.  The trick is to mangle the method descriptors
so that any non-virtual construct is represented by a name which will not be
repeated in a subtype (for any construct at all).  Descriptors can be mangled
by name:

  interface A extends ObjectReference<A> { void A$pm(); }  // Exact translate of C++ A::pm!
  interface B extends A, ObjectReference<B> { void B$pm(); }

More subtly, they can be mangled by type, which is sometimes useful:

  interface A extends ObjectReference<A> { void pm(A which); }
  interface B extends A, ObjectReference<B> { void pm(B which); }

The argument "which" contributes only a static type, and can be a null.
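
Here is a small sketch of the mangle-by-type idea, relying only on ordinary Java overload
resolution.  BImpl is a hypothetical stand-in for the bound implementation.

```java
interface A { String pm(A which); }
interface B extends A { String pm(B which); }   // an overload, not an override

// Stand-in for the binder's implementation of a B pointer.
final class BImpl implements B {
    @Override public String pm(A which) { return "A::pm"; }
    @Override public String pm(B which) { return "B::pm"; }
}

final class WhichDemo {
    public static void main(String[] args) {
        B myb = new BImpl();
        A aview = myb;                              // plain Java widening is now harmless
        System.out.println(myb.pm((B) null));       // prints B::pm
        System.out.println(myb.pm((A) null));       // prints A::pm
        System.out.println(aview.pm((A) null));     // prints A::pm
    }
}
```

The static type of the null argument picks the C++ method, so the answer no longer depends on
how the receiver was wrapped.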

(When binding such interfaces, the binder should detect non-virtual features
that accidentally alias in their descriptors, and signal an error.  This error
checking can be helped if the methods which are truly virtual are distinguished
from other methods.  A "@Virtual" annotation would help a lot.
Maybe also @NonVirtual, but then everything gets that annotation.)
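
A rough idea of what such a binder check could look like, using core reflection.  Everything
here is an assumption for illustration:  the @Virtual annotation is declared locally, and
AliasCheck is a made-up helper, not part of any existing tool.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Virtual {}

interface A { @Virtual void vm(); void pm(); }
interface B extends A { @Virtual void vm(); void pm(); }       // pm aliases A.pm
interface B2 extends A { @Virtual void vm(); void B2$pm(); }   // mangled, no alias

final class AliasCheck {
    // Reports non-virtual methods declared in 'sub' whose name and parameter
    // types repeat a method declared in any of its super-interfaces.
    static List<String> aliases(Class<?> sub) {
        List<String> found = new ArrayList<>();
        for (Method m : sub.getDeclaredMethods()) {
            if (m.isAnnotationPresent(Virtual.class)) continue;   // true virtuals may repeat
            collect(sub, m, found);
        }
        return found;
    }

    private static void collect(Class<?> type, Method m, List<String> found) {
        for (Class<?> sup : type.getInterfaces()) {
            for (Method s : sup.getDeclaredMethods()) {
                if (s.getName().equals(m.getName())
                        && Arrays.equals(s.getParameterTypes(), m.getParameterTypes())) {
                    found.add(m.getDeclaringClass().getSimpleName() + "." + m.getName()
                            + " aliases " + sup.getSimpleName() + "." + s.getName());
                }
            }
            collect(sup, m, found);   // walk the rest of the spine
        }
    }
}
```

AliasCheck.aliases(B.class) reports the accidental alias on pm, while B2, which mangles its
plain method, comes back clean.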

The mangling can be one-sided:

  interface A extends ObjectReference<A> { void pm(); }  // => A::pm
  interface B extends A, ObjectReference<B> { void B$pm(); }  // => B::pm

(One-sided mangling in the super only makes sense if you can enumerate
all its subs!)

With such mangling, either one-sided or two-sided, the random virtualization
goes away.  What remains is that one or both of the non-virtual methods
has a surprising name to the Java programmer.

To me, all this is surprising and irregular, enough so to motivate a ban on Java
interface subtyping, for interfaces that express any non-virtual constructs.

A graceful way to balance these concerns is to allow each C++ class to import
as (at least) *two* interfaces, the "virtual-friendly" interface and the "plain"
interface.  Call these A$v and A$p, B$v and B$p.  (Please think of those extra letters
as superscripts, when writing on paper or whiteboard.)  The virtual-friendly interfaces
can safely represent the true C++ relation, but should only contain virtuals.

  interface A$v extends ObjectReference<? extends A$v> { @Virtual void vm(); }
  interface B$v extends A$v, ObjectReference<? extends B$v> { @Virtual void vm(); }

(These interfaces can also contain mangled non-virtuals, if needed.)

The non-virtual friendly interfaces would contain the other members:

  interface A$p extends ObjectReference<A$p> { void pm(); }
  interface B$p extends ObjectReference<B$p> { void pm(); }

These are truly disjoint interfaces.  No Java object would ever implement both
of them, since it would be unable to provide a unique binding for its "pm" method.

The four types {A,B}${v,p} can be thought of as four quadrants of a square
containing all the methods of B.  In the upper left and right are the virtual and
non-virtual methods of A, while the lower left and right have the virtual and
non-virtual methods of B.  If you need to call vm, you can use either of the
left-hand quadrants.  But if you need to call pm, first you need to mentally
qualify it as A::pm or B::pm, and then select the proper right-hand quadrant.

Changing quadrants requires a manual view-as operation, except for the
case of moving from B$v to its supertype A$v.

How can we make this more user-friendly?  Well, it would do no harm for the
non-virtual friendly interface to extend the virtual ones:

  interface A$p extends A$v { void pm(); }
  interface B$p extends B$v { void pm(); }

This produces a pattern we can call "spine and barb", by analogy with feathers.
The A$v types have a deep inheritance chain.  (That is the spine of the feather.)
Each A$p type sticks off from its corresponding A$v type.  (That is a barb.)

When you have a B$p in your hand, you can access all methods except
those in the A$p quadrant.  To get to A::pm, you need to do a view-as A$p.
By contrast, if you have an A$p (more rare, I think), you have to do view-as
B$p to get B::pm instead of A::pm.  You also have to do a view-as to B$p
or B$v in order to see any new virtuals in B (not already in A).
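
The spine-and-barb shape can be tried out directly.  In the sketch below the ObjectReference
super is left out for brevity, and BPlain, APlainOfB and viewAsAp are hypothetical stand-ins
for bound implementations and the view-as operation.

```java
// Spine: the virtual-friendly interfaces follow the C++ inheritance chain.
interface A$v { String vm(); }
interface B$v extends A$v { String vm2(); }

// Barbs: each plain interface hangs off its own spine type and nothing else.
interface A$p extends A$v { String pm(); }
interface B$p extends B$v { String pm(); }

// Pretend binding of a native B object, seen as B$p.
final class BPlain implements B$p {
    @Override public String vm()  { return "B::vm"; }
    @Override public String vm2() { return "B::vm2"; }
    @Override public String pm()  { return "B::pm"; }
    A$p viewAsAp() { return new APlainOfB(); }   // stand-in for view-as A$p
}

// The same native B object, seen as A$p.
final class APlainOfB implements A$p {
    @Override public String vm() { return "B::vm"; }   // still virtual, still B's override
    @Override public String pm() { return "A::pm"; }   // plain method of the A quadrant
}
```

A BPlain can be assigned to A$v or B$v with no ceremony, since it sits on the spine, but
reaching A::pm takes the explicit hop to the A$p barb.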

Note that C++ allows virtual methods to be devirtualized.  For example,
a B object can call A::vm, even though B::vm overrides it.  For Panama, this
can be modeled in B$v, B$p, or in a fresh disjoint interface B$q.  Here it is
for B$v:

  interface A$v extends ObjectReference<? extends A$v>
    { @Virtual void vm(); void A$vm(); }  // virtual vm, A::vm
  interface B$v extends A$v, ObjectReference<? extends B$v>
    { @Virtual void vm(); void B$vm(); }  // virtual vm, B::vm

Except for the disjoint case, B$v with its virtual methods is a super, so mangling
("A$vm") is needed to avoid colliding with the truly virtual method ("vm").

Are there other versions of a type besides B$v and B$p?  Well, you could split out
every different kind of class feature into its own separate interface.  (Indeed, you
could split every individual feature into its own interface, but that's clearly overkill.)
It seems natural to consider three interfaces:  B$v for the virtuals of B, B$s for every
feature that does *not* operate on an instance of B, and B$p for everything else.
Here B$s includes static fields and methods.  Constructors are also in B$s, if we
model them as static factory methods, or they could be in some B$c.  Fields
or qualified method references could be put into B$p, or their own B$f or B$q.

So we could have up to half a dozen or more interfaces per C++ class.  Or we
could have as few as one interface, at the cost of heavy mangling.

All of these types use a single underlying representation, which is a C pointer plus
some metadata about type (and scope).  Different views of the same object would
keep the same pointer, but shift the metadata.  For any given view, the access methods
would bind through to the appropriate jextracted entry point.  (There is machine
code in there, but that's a different story!)
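
As a toy model of "same pointer, shifted metadata", consider the sketch below.  NativeView and
its string-keyed table of entry points are invented for illustration.  A real binder would
bind to jextracted stubs, not to Java lambdas.

```java
import java.util.Map;
import java.util.function.LongFunction;

final class NativeView {
    // Pretend entry points keyed by "Type::member"; each receives the raw 'this' pointer.
    private static final Map<String, LongFunction<String>> ENTRY_POINTS = Map.of(
            "A::pm", addr -> "A::pm(this=0x" + Long.toHexString(addr) + ")",
            "B::pm", addr -> "B::pm(this=0x" + Long.toHexString(addr) + ")");

    final long address;      // the C pointer, identical across all views of one object
    final String viewType;   // the metadata which shifts from view to view

    NativeView(long address, String viewType) {
        this.address = address;
        this.viewType = viewType;
    }

    // A different view of the same object: keep the pointer, change the metadata.
    NativeView viewAs(String newType) { return new NativeView(address, newType); }

    // Bind through to the entry point selected by the current view.
    String call(String member) {
        LongFunction<String> entry = ENTRY_POINTS.get(viewType + "::" + member);
        if (entry == null) {
            throw new IllegalArgumentException("no binding for " + viewType + "::" + member);
        }
        return entry.apply(address);
    }
}
```

Calling pm on the B view and again on its A view reaches two different entry points, each
handed the same address.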

When extracting native APIs from a header file, if there is more than one interface
per class, we should carefully pick which interface to use at any given point.
As Mikael has observed, it is useful in such cases to accept generically and
produce specifically.  ("Be liberal in what you accept and conservative in what
you produce", or some such.)  Function parameters should be of the form
A$v and function returns of the form A$p.  For read-write features (non-const
fields), assuming reads are more common than writes, A$p (like a function
return) seems the right choice.
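
A sketch of that rule applied to the earlier API example.  The header shown in the comment,
the FakeApi class, and the String return from foo (added only so the result can be observed)
are assumptions for illustration.

```java
interface A$v { String vm(); }
interface B$v extends A$v { }
interface A$p extends A$v { String pm(); }
interface B$p extends B$v { String pm(); }

// Extracted from, say:  B* make_B();  void foo(A* obj);
interface API {
    B$p make_B();          // produce specifically: the barb type
    String foo(A$v obj);   // accept generically: the spine type
}

// Pretend binding, so the sketch can run without native code.
final class FakeApi implements API {
    @Override public B$p make_B() {
        return new B$p() {
            @Override public String vm() { return "B::vm"; }
            @Override public String pm() { return "B::pm"; }
        };
    }
    @Override public String foo(A$v obj) { return "foo saw " + obj.vm(); }
}
```

With these signatures, api.foo(api.make_B()) compiles with no view-as call, because B$p
reaches A$v along the spine, and the caller still gets the simple name pm on the returned B$p.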

Why have even two interfaces for one class?  If we mangle enough, we can
get it down to one A$v.  The problem with this is users will have to mangle
all the time for every plain, non-virtual feature.  That will get old, won't it?
That's why the plain-friendly A$p interface seems to earn its keep.  (But let's
try it both ways and see which is easier overall.)

On the other hand, we could put every non-virtual feature (except perhaps
statics and constructors) into A$v, with mangling everywhere.  That has an
appealingly comprehensive feel:  Everything is in one place, even if the
labels are rather ornate.  Then, for ease of use, define all the one-off "barb"
interfaces A$p, which simply bridge from friendly non-qualified names (like "pm")
into the qualified names in the main interface A$v (like "A$pm").  In that
case, one interface contains everything, while another one provides some
sugar for the user.  Here's an example:

  class C : public A { public:
    int f;
    void pm();  // "plain" = non-virtual, non-static
    virtual void vm();
    virtual void vm2() = 0;
  };

  interface C$v extends A$v, ObjectReference<? extends C$v> {
    IntReference f$ref();
    default int f$get() { return f$ref().getAsInt(); }
    default void f$set(int x) { f$ref().set(x); }
    void C$pm();  // C::pm only
    @Virtual void vm();  // true virtual
    void C$vm();  // qualified ref to C::vm
    @Virtual void vm2();
    // no C$vm2, since it's abstract
  }
  interface C$p extends C$v {
    // we do not have to mangle stuff here, since nobody subclasses C$p
    default int f() { return f$get(); }  // maybe?
    default void f(int x) { f$set(x); }  // maybe?
    default void pm() { C$pm(); }
    default void pm2() { A$pm2(); }  // A::pm2, via its mangled name in A$v
  }

Note the final access to A::pm2 under its simple name.  Since C$p will never be
extended by subclass interfaces, there is no danger of accidental override if we
"sugar up" all the names available via the super-type.  So if class A contains
plain methods "pm" and "pm2", both of those will be in the view A$p, but only
the second can be in C$p, since C::pm is shadowing A::pm, and C$p must
pick which one to bind to the name "pm".

The A$v interface can be maximized if we can find a way to include statics and
constructors.  This could be done (for example) by reifying a C++ null reference
for each distinct object type.  Then we would have something vaguely similar
to JavaScript, where a prototype object is the parent of a class of children.

  class C { public:
    …
    static void sm();
    static int sf;
    C();
    virtual ~C();
    C(const C& that);
  };

  interface C$v extends ObjectReference<? extends C$v> {
    …
    void C$sm();
    IntReference C$sf$ref();
    default int C$sf$get() …
    C$p C$new_(Scope s);  // returns the new object, matching new_ in C$p
    @Virtual void delete();
    C$p C$copy_(Scope s, C$v that);
  }
  interface C$p extends C$v {
    …
    default void sm() { C$sm(); }
    default IntReference sf$ref() { return C$sf$ref(); }
    default int sf() { return C$sf$get(); } // maybe?
    default C$p new_(Scope s) { return C$new_(s); }
    default C$p copy_(Scope s, C$v that) { return C$copy_(s, that); }
  }

Assuming we have to have at least two viewing interfaces per class, how should
the interfaces be named?  We don't need to mangle the names for the user if we
use nested classes.  Given a header file with two classes A, B, any of the following
configurations would work:

interface thehdr {  // take #1
  interface A {  // Ap
    interface virtuals { } // Av
  }
  interface B {  // Bp
    interface virtuals extends A.virtuals { } // Bv
  }
}

interface thehdr {  // take #2
  interface A {  // Av
    interface statics { } // Ap
  }
  interface B extends A {  // Bv
    interface statics { } // Bp
  }
}

interface thehdr {  // take #3
  interface virtuals { interface A { } interface B extends A { } }
  interface statics { interface A { } interface B { } }
}

Finally, which version is the "real C"?  If you go by convenience, it is C$p,
since that is where you can find the simple version of every local symbol.
If you go by completeness and interoperability, it is C$v.  For that reason,
I like take #1 above.  The extracted header file will show occurrences of
"C.virtuals" where one might expect "C", and the reader can simply nod
and say "something about scoping C which I don't have to remember".

— John