From john.r.rose at oracle.com  Sat Oct 29 02:49:50 2016
From: john.r.rose at oracle.com (John Rose)
Date: Fri, 28 Oct 2016 19:49:50 -0700
Subject: notes on binding C++
Message-ID:

Mikael and I have had a few good conversations about binding C++ to Java interfaces. The following notes are FTR, TBD, NYI, and every other TLA which implies "tentative".

The job is constrained on one hand by the many degrees of freedom of C++ APIs, and on the other by the relative simplicity of Java interfaces. As a really simple example, a "source name" in C++ can often, but not always, be rendered directly as a "carrier name" of a method in a Java interface. The source name "throws" has to be perturbed to "fit" into the Java language. Similarly, source types and source scopes map to Java types and Java scopes in a complex way. Yet we think the usual result can be made to feel useful and even natural to the C++ programmer coding in Java syntax.

Key principle: To express access ("bindings") to native APIs, we use Java interfaces only. No concrete types will appear to the end-user, since those would constrain the implementation underneath. (For example, we want to use value types when the time comes!)

We don't even use the full range of behaviors interfaces can have: Default methods will not be used, or at most for simple "macro-like" patterns to provide abbreviations to end-users. (E.g., compose "getter" and "setter" methods on top of a single "get-address" method, for fields which can be addressed that way.) Static constant fields will be used rarely or never, even for C constants like EOF. (Such static constants can be created in a post-processing phase, by a "civilizer" tool.)

What about interface subtyping? Can we make use of that? Consider a simple use-case:

    class A { virtual void vm(); };  // "Virtual Method"
    class B : public A { virtual void vm(); virtual void vm2(); };

In this case, in C++ every instance of B can be treated as if it were of type A.
So it would make sense to have a pair of extracted Java interfaces with subtyping:

    interface A extends ObjectReference { void vm(); }
    interface B extends A, ObjectReference { void vm(); void vm2(); }

(Here, the common super ObjectReference contains all methods relevant to managing C++ object references. It is TBD and NYI. It is probably a subtype of Reference.)

There are two ways a native object of type B can be encapsulated in Java. If it is presented via an extracted API to Java using the static type B, then Java will wrap it in a pointer of type B, and the 'vm2' method will be present in Java. But if the native object is presented to Java via a static API type of A (which is perfectly valid in C++), then Java will wrap it in a pointer of type A. In that case, even though it is a native object of type B, there will be no visible clue to this fact, and no access to 'vm2' will be given. If Java executes the 'vm' method, B::vm will of course be executed, not because of Java dispatch, but because of C++ dispatch from a virtual call to vm in A.

Suppose a Java pointer of type A really points to a native object of type B. Can the B type be recovered? Yes, in either of two ways: First, the user can issue an unsafe down-cast from A to B. This requires special knowledge, and/or boldness. Second, if C++ provides RTTI, and the jextract tool arranges (somehow, TBD, NYI) to consult this information, then a safe exception-throwing downcast can be supplied to the Java user, which would re-wrap the A pointer as a B pointer (after verifying correctness using RTTI).

Note that this has to be a method call, not a Java cast from A to B. The wrapping of the B pointer in an A wrapper does not automatically include the ability to cast to the B type.

Confusing? Yes. The confusion increases as we attempt to model C++ type system relations with Java type system relations.
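
To make the "method call, not a Java cast" point concrete, here is a minimal Java sketch. Everything in it is hypothetical: NativeObject stands in for a C pointer plus RTTI, the wrapper classes stand in for whatever the binder would generate, and downcastToB is an invented name for the safe downcast. The methods return strings only so the dispatch can be observed.

```java
final class CheckedDowncastSketch {
    // Stand-in for a native C++ object; only its dynamic type is modeled.
    // (Assumption: a real binder would hold a C pointer plus metadata.)
    static final class NativeObject {
        final String dynamicType;
        NativeObject(String dynamicType) { this.dynamicType = dynamicType; }
    }

    interface ObjectReference { NativeObject target(); }
    interface A extends ObjectReference { String vm(); }
    interface B extends A { String vm2(); }

    // Wrapper used when the native API presents the object with static type A.
    static final class AWrapper implements A {
        private final NativeObject obj;
        AWrapper(NativeObject obj) { this.obj = obj; }
        public NativeObject target() { return obj; }
        // Models C++ virtual dispatch: the dynamic type picks the override.
        public String vm() { return obj.dynamicType + "::vm"; }
    }

    // Wrapper used when the static type is B.
    static final class BWrapper implements B {
        private final NativeObject obj;
        BWrapper(NativeObject obj) { this.obj = obj; }
        public NativeObject target() { return obj; }
        public String vm() { return obj.dynamicType + "::vm"; }
        public String vm2() { return "B::vm2"; }
    }

    // Hypothetical RTTI-backed downcast: check the dynamic type, then re-wrap.
    static B downcastToB(A a) {
        if (!"B".equals(a.target().dynamicType)) {
            throw new ClassCastException("native object is not a B");
        }
        return new BWrapper(a.target());
    }

    public static void main(String[] args) {
        A aview = new AWrapper(new NativeObject("B"));
        System.out.println(aview.vm());          // B::vm, via (modeled) C++ dispatch
        System.out.println(aview instanceof B);  // false: the wrapper is not a B
        B bview = downcastToB(aview);            // re-wrap after the check
        System.out.println(bview.vm2());         // B::vm2
    }
}
```

The A wrapper never becomes a B on the Java side; the downcast produces a second wrapper around the same native object.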

A way to simplify the situation, then, would be to remove the relation between Java types A and B, making them disjoint:

    interface A extends ObjectReference { void vm(); }
    interface B extends ObjectReference { void vm(); void vm2(); }

The common reference type includes viewing operations which can be used to recover the A-reference view from a B-reference. Unless the binder (for some reason) merges the implementations for A and B object references, a direct Java cast won't work:

    B myb = ...;
    A aview = (A) myb;  // FAIL with ClassCastException

What if I, as a Panama programmer, create an instance of B (native B wrapped in a B interface), and then try to pass it to a C++ API that accepts A?

    interface API { void foo(A obj); }

If the interface types do not have a sub/super relation, that will fail to compile, won't it?

    API api = ...;
    B myb = ...;
    api.foo(myb);   //FAIL if A/B disjoint
    A aview = myb;  //FAIL if A/B disjoint

Sounds like a flaw in the user model. We can fix this in part by asking the user to explicitly perform a C++ up-cast from B to A, using a view-as operation:

    api.foo(myb.viewAs(A.class));   //WIN even if A/B disjoint
    A aview = myb.viewAs(A.class);  //WIN even if A/B disjoint

The view-as operation can be pushed into the bound API, so it can be made automatic, if the extracted API is more weakly typed:

    interface API { void foo(/*A*/ ObjectReference obj); }

But that seems too weak. Shall we go back to having B extend A?

Wait a moment; there are more problems with that. In Java, interface methods are always virtual, but in C++ methods do not have to be virtual. Moreover, C++ has non-method constructs which are not virtual. These include fields, constructors, qualified method invocations, and statics.

What do we mean when we say "non-virtual construct" for a C++ type A?
Given C++ types A and B extending A, a feature involving A is "virtual" if it passes this test: We create an object of type B and then apply the construct to the object in two ways, once via the static type B, and once (after assigning its reference to an A-reference) via the static type A. The test succeeds if the two applications of the same construct perform exactly the same actions. The test fails if the static type affects the semantics of the construct. For a given A/B pair, a construct is called "virtual" if it passes that test and "non-virtual" otherwise.

(For convenience, let's say that constructs which apply to the type alone, and not to the instance, are also non-virtual, since the given test cannot be applied. Thus, statics and constructors are non-virtual.)

Put a field of the same name in both A and B, and apply the virtuality test:

    class A { int f; };
    class B : public A { int f; };
    B myb = ...;
    int x1 = myb.f;
    A& mya = myb;
    int x2 = mya.f;

What happens? Since B shadows A::f with its own B::f, and since the C++ language does not make fields virtual, it follows that the f-field is not virtual in A and B.

The same thing happens for C++ methods which are not declared virtual, since they also shadow instead of override:

    class A { void pm(); };  // "Plain Method"
    class B : public A { void pm(); };

What happens if we model non-virtual features of C++ classes using Java interfaces? If the interfaces are disjoint, it would seem there is no ambiguity about which method is invoked:

    interface A extends ObjectReference { void pm(); }
    interface B extends ObjectReference { void pm(); }
    B myb = ...;
    myb.pm();                       // => B::pm
    A aview = myb.viewAs(A.class);  //WIN even if A/B disjoint
    aview.pm();                     // => A::pm

Can anything go wrong here? Just the usual thing to annoy a Panama programmer: Since A is not a super of B, we have to make an explicit view-as call instead of a cast or implicit conversion.

What if we put back in the type relation (B extends A)?

    interface A extends ObjectReference { void pm(); }
    interface B extends A, ObjectReference { void pm(); }

Notice what happens: The non-virtual method becomes partially virtualized, depending on the ambiguity mentioned before. Let's grab a native B object and wrap it in a B pointer and then an A pointer:

    // api.h: inline B* make_B() { return new B(); }
    B myb = api.make_B();            // ad hoc wrapper over operator new
    myb.pm();                        // => B::pm, so far so good
    A aview1 = myb;                  // Java ref-cast
    aview1.pm();                     // => same B::pm, per rules of Java ref-casting
    A aview2 = myb.viewAs(A.class);
    aview2.pm();                     // => A::pm, per rules of C++

In C++ the same object, of type both A and B, can respond in two different ways to the invocation of the method name "pm". In Java the same thing can happen, but the rules for selection depend on the wrapped pointer's dynamic type, not on the static type (as in C++).

If you have a list of A's (List<A>) in Java, and you invoke "pm" on each, you will get a mix of "A::pm" and "B::pm". If any of the objects in the list are true A's, then only "A::pm" will be reachable, of course. But if some are true B's, then a mix of both methods can be reached, depending on which B's were wrapped as A's and which were wrapped as B's.

(The same point applies to the field 'f' above. In effect, the field becomes partially virtualized, if B extends A.)

There is a trick to keep interface subtyping, but prevent the unpredictable virtualization of non-virtuals. The trick is to mangle the method descriptors so that any non-virtual construct is represented by a name which will not be repeated in a subtype (for any construct at all). Descriptors can be mangled by name:

    interface A extends ObjectReference { void A$pm(); }  // Exact translate of C++ A::pm!
    interface B extends A, ObjectReference { void B$pm(); }

More subtly, they can be mangled by type, which is sometimes useful:

    interface A extends ObjectReference { void pm(A which); }
    interface B extends A, ObjectReference { void pm(B which); }

The argument "which" contributes only a static type, and can be a null.

(When binding such interfaces, the binder should detect non-virtual features that accidentally alias in their descriptors, and signal an error. This error checking can be helped if the methods which are truly virtual are distinguished from other methods. A "@Virtual" annotation would help a lot. Maybe also @NonVirtual, but then everything gets that annotation.)

The mangling can be one-sided:

    interface A extends ObjectReference { void pm(); }       // => A::pm
    interface B extends A, ObjectReference { void B$pm(); }  // => B::pm

(One-sided mangling in the super only makes sense if you can enumerate all its subs!)

With such mangling, either one-sided or two-sided, the random virtualization goes away. What remains is that one or both of the non-virtual methods has a name that is surprising to the Java programmer.

To me, all this is surprising and irregular, enough so to motivate a ban on Java interface subtyping, for interfaces that express any non-virtual constructs.

A graceful way to balance these concerns is to allow each C++ class to import as (at least) *two* interfaces, the "virtual-friendly" interface and the "plain" interface. Call these A$v and A$p, B$v and B$p. (Please think of those extra letters as superscripts, when writing on paper or whiteboard.)

The virtual-friendly interfaces can safely represent the true C++ relation, but should only contain virtuals.

    interface A$v extends ObjectReference { @Virtual void vm(); }
    interface B$v extends A$v, ObjectReference { @Virtual void vm(); }

(These interfaces can also contain mangled non-virtuals, if needed.)
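
The partial virtualization of "pm" under plain subtyping, and the way by-name mangling removes it, can be sketched in Java as follows. This is only an illustration under stated assumptions: the wrap* factories are hypothetical stand-ins for what a binder would produce, the mangled pair is renamed MA/MB so both variants fit in one file, and the methods return strings so the selected C++ method is visible.

```java
import java.util.ArrayList;
import java.util.List;

final class ManglingSketch {
    interface ObjectReference { }

    // Unmangled, with B extends A: "pm" ends up partially virtualized.
    interface A extends ObjectReference { String pm(); }
    interface B extends A { String pm(); }

    // Hypothetical binder output for a native B object presented as A or as B.
    // Each wrapper binds "pm" according to the C++ static type it was created for.
    static A wrapAsA() { return () -> "A::pm"; }
    static B wrapAsB() { return () -> "B::pm"; }

    // Mangled by name: a non-virtual never shares a descriptor with a subtype.
    interface MA extends ObjectReference { String A$pm(); }
    interface MB extends MA { String B$pm(); }

    static MB wrapMangledB() {
        return new MB() {
            public String A$pm() { return "A::pm"; }
            public String B$pm() { return "B::pm"; }
        };
    }

    // Invokes "pm" on every element, as a user holding a List<A> would.
    static List<String> callPmOnAll(List<A> list) {
        List<String> out = new ArrayList<>();
        for (A a : list) {
            out.add(a.pm());
        }
        return out;
    }

    public static void main(String[] args) {
        // Both elements stand for native B objects, wrapped in two different ways.
        List<A> as = List.of(wrapAsA(), wrapAsB());
        System.out.println(callPmOnAll(as));  // [A::pm, B::pm]

        // With mangling, the name alone selects the C++ method.
        MA m = wrapMangledB();
        System.out.println(m.A$pm());         // A::pm
    }
}
```

In the unmangled case the result depends on how each element happened to be wrapped; in the mangled case it depends only on the name the caller wrote.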

The non-virtual friendly interfaces would contain the other members:

    interface A$p extends ObjectReference { void pm(); }
    interface B$p extends ObjectReference { void pm(); }

These are truly disjoint interfaces. No Java object would ever implement both of them, since it would be unable to provide a unique binding for its "pm" method.

The four types {A,B}${v,p} can be thought of as four quadrants of a square containing all the methods of B. In the upper left and right are the virtual and non-virtual methods of A, while the lower left and right have the virtual and non-virtual methods of B. If you need to call vm, you can use either of the left-hand quadrants. But if you need to call pm, first you need to mentally qualify it as A::pm or B::pm, and then select the proper right-hand quadrant. Changing quadrants requires a manual view-as operation, except for the case of moving from B$v to its supertype A$v.

How can we make this more user-friendly? Well, it would do no harm for the non-virtual friendly interface to extend the virtual ones:

    interface A$p extends A$v { void pm(); }
    interface B$p extends B$v { void pm(); }

This produces a pattern we can call "spine and barb", by analogy with feathers. The A$v types have a deep inheritance chain. (That is the spine of the feather.) Each A$p type sticks off from its corresponding A$v type. (That is a barb.)

When you have a B$p in your hand, you can access all methods except those in the A$p quadrant. To get to A::pm, you need to do a view-as A$p. By contrast, if you have an A$p (more rare, I think), you have to do view-as B$p to get B::pm instead of A::pm. You also have to do a view-as to B$p or B$v in order to see any new virtuals in B (not already in A).

Note that C++ allows virtual methods to be devirtualized. For example, a B object can call A::vm, even though B::vm overrides it. For Panama, this can be modeled in B$v, B$p, or in a fresh disjoint interface B$q.
Here it is for B$v:

    interface A$v extends ObjectReference { @Virtual void vm(); void A$vm(); }       // virtual vm, A::vm
    interface B$v extends A$v, ObjectReference { @Virtual void vm(); void B$vm(); }  // virtual vm, B::vm

Except for the disjoint case, B$v with its virtual methods is a super, so mangling ("A$vm") is needed to avoid colliding with the truly virtual method ("vm").

Are there other versions of a type besides B$v and B$p? Well, you could split out every different kind of class feature into its own separate interface. (Indeed, you could split every individual feature into its own interface, but that's clearly overkill.)

It seems natural to consider three interfaces: B$v for the virtuals of B, B$s for every feature that does *not* operate on an instance of B, and B$p for everything else. Here B$s includes static fields and methods. Constructors are also in B$s, if we model them as static factory methods, or they could be in some B$c. Fields or qualified method references could be put into B$p, or their own B$f or B$q.

So we could have up to half a dozen or more interfaces per C++ class. Or we could have as few as one interface, at the cost of heavy mangling.

All of these types use a single underlying representation, which is a C pointer plus some metadata about type (and scope). Different views of the same object would keep the same pointer, but shift the metadata. For any given view, the access methods would bind through to the appropriate jextracted entry point. (There is machine code in there, but that's a different story!)

When extracting native APIs from a header file, if there is more than one interface per class, we should carefully pick which interface to use at any given point. As Mikael has observed, it is useful in such cases to accept generically and produce specifically. ("Be liberal in what you accept and conservative in what you produce", or some such.) Function parameters should be of the form A$v and function returns of the form A$p.
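
As a rough illustration of "accept generically, produce specifically" on top of the spine-and-barb layout, here is a minimal Java sketch. The API interface, its foo and make_B members, and the anonymous implementations are all hypothetical stand-ins for binder output, and the methods return strings purely so the calls can be observed.

```java
final class SpineAndBarbSketch {
    interface ObjectReference { }

    // Spine: virtual-friendly interfaces mirror the C++ inheritance chain.
    interface A$v extends ObjectReference { String vm(); }
    interface B$v extends A$v { String vm2(); }

    // Barbs: plain interfaces hang off the spine and stay disjoint from each other.
    interface A$p extends A$v { String pm(); }
    interface B$p extends B$v { String pm(); }

    // Hypothetical extracted API: parameters use the $v view, returns use the $p view.
    interface API {
        String foo(A$v obj);
        B$p make_B();
    }

    static final API api = new API() {
        public String foo(A$v obj) { return "foo saw " + obj.vm(); }
        public B$p make_B() {
            return new B$p() {
                public String vm() { return "B::vm"; }
                public String vm2() { return "B::vm2"; }
                public String pm() { return "B::pm"; }
            };
        }
    };

    public static void main(String[] args) {
        B$p myb = api.make_B();            // produced specifically
        System.out.println(myb.pm());      // B::pm
        System.out.println(api.foo(myb));  // foo saw B::vm; B$p is an A$v via the spine
        // A$p wrong = myb;                // would not compile: the barbs are disjoint
    }
}
```

No view-as call is needed to pass the B$p to foo, since B$p reaches A$v through B$v, while getting at A::pm would still need an explicit change of view.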

For read-write features (non-const fields), assuming reads are more common than writes, A$p (like a function return) seems the right choice.

Why have even two interfaces for one class? If we mangle enough, we can get it down to one A$v. The problem with this is users will have to mangle all the time for every plain, non-virtual feature. That will get old, won't it? That's why the plain-friendly A$p interface seems to earn its keep. (But let's try it both ways and see which is easier overall.)

On the other hand, we could put every non-virtual feature (except perhaps statics and constructors) into A$v, with mangling everywhere. That has an appealingly comprehensive feel: Everything is in one place, even if the labels are rather ornate. Then, for ease of use, define all the one-off "barb" interfaces A$p, which simply bridge from friendly non-qualified names (like "pm") into the qualified names in the main interface A$v (like "A$pm"). In that case, one interface contains everything, while another one provides some sugar for the user.

Here's an example:

    class C : public A {
    public:
      int f;
      void pm();  // "plain" = non-virtual, non-static
      virtual void vm();
      virtual void vm2() = 0;
    };
    interface C$v extends A$v, ObjectReference {
      IntReference f$ref();
      default int f$get() { return f$ref().getAsInt(); }
      default void f$set(int x) { f$ref().set(x); }
      void C$pm();          // C::pm only
      @Virtual void vm();   // true virtual
      void C$vm();          // qualified ref to C::vm
      @Virtual void vm2();  // no C$vm2, since it's abstract
    }
    interface C$p extends C$v {
      // we do not have to mangle stuff here, since nobody subclasses C$p
      default int f() { return f$get(); }  // maybe?
      default void f(int x) { f$set(x); }  // maybe?
      default void pm() { C$pm(); }
      default void pm2() { A$pm2(); }
    }

Note the final access to A::pm2 under its simple name. Since C$p will never be extended by subclass interfaces, there is no danger of accidental override if we "sugar up" all the names available via the super-type.
So if class A contains plain methods "pm" and "pm2", both of those will be in the view A$p, but only the second can be in C$p, since C::pm is shadowing A::pm, and C$p must pick which one to bind to the name "pm".

The A$v interface can be maximized if we can find a way to include statics and constructors. This could be done (for example) by reifying a C++ null reference for each distinct object type. Then we would have something vaguely similar to JavaScript, where a prototype object is the parent of a class of children.

    class C {
    public:
      ...
      static void sm();
      static int sf;
      C();
      virtual ~C();
      C(const C& that);
    };
    interface C$v extends ObjectReference {
      ...
      void C$sm();
      IntReference C$sf$ref();
      default int C$sf$get() ...
      void C$new_(Scope s);
      @Virtual void delete();
      void C$copy_(Scope s, C$v that);
    }
    interface C$p extends C$v {
      ...
      default void sm() { C$sm(); }
      default IntReference sf$ref() { return C$sf$ref(); }
      default int sf() { return C$sf$get(); }  // maybe?
      default C$p new_(Scope s) { return C$new_(s); }
      default C$p copy_(Scope s, C$v that) { return C$copy_(s, that); }
    }

Assuming we have to have at least two viewing interfaces per class, how should the interfaces be named? We don't need to mangle the names for the user if we use nested classes. Given a header file with two classes A, B, any of the following configurations would work:

    interface thehdr {  // take #1
      interface A {  // Ap
        interface virtuals { }  // Av
      }
      interface B {  // Bp
        interface virtuals extends A.virtuals { }  // Bv
      }
    }

    interface thehdr {  // take #2
      interface A {  // Av
        interface statics { }  // Ap
      }
      interface B extends A {  // Bv
        interface statics { }  // Bp
      }
    }

    interface thehdr {  // take #3
      interface virtuals {
        interface A { }
        interface B extends A { }
      }
      interface statics {
        interface A { }
        interface B { }
      }
    }

Finally, which version is the "real C"? If you go by convenience, it is C$p, since that is where you can find the simple version of every local symbol.
If you go by completeness and interoperability, it is C$v. For that reason, I like take #1 above. The extracted header file will show occurrences of "C.virtuals" where one might expect "C", and the reader can simply nod and say "something about scoping C which I don't have to remember".

-- John