[foreign] road to posix
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Wed May 30 11:09:55 UTC 2018
Thanks for the writeup John.
I was aware of the multiple layers involved here; what I wanted to say
is that our basic strategy for dealing with (1) runs afoul of several
common cases. I've done a bit of classification of some of the most
common idioms of object-like macros found in system headers, and I found
the following sub-classification:
A) Simple values:
# define SCHAR_MAX 127
B) Commented values
# define M_PIl 3.141592653589793238462643383279502884L /* pi */
C) Parenthesized values (both positive and negative)
# define INTPTR_MAX (9223372036854775807L)
D) Cast-converted values
# define RTLD_DEFAULT ((void *) 0)
E) Composition of macro values (typically in simple binary expressions)
# define LONG_MIN (-LONG_MAX - 1L)
F) Struct/array init
#define PTHREAD_COND_INITIALIZER { { 0, 0, 0, 0, 0, (void *) 0, 0, 0 } }
G) Strings init
/usr/include/cpio.h:#define MAGIC "070707"
Of these categories, (D), (F) and (G) fall into the uncommon category
(cumulatively they account for 6% of cases, with (F) the least common
and (G) the most common).
Of the remaining, commented values (B) are overwhelmingly common (~40%);
(A) of course is also common (roughly at 20%); my classification is not
sharp enough to discriminate between (C) and (E), but both account for
around 15% of cases.
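To make the classification concrete, here is a rough sketch of how an extraction-time pass might sort a macro replacement text into these buckets. Everything in it is hypothetical (the enum, the function names, the heuristics); it is not what jextract does. Note that (B) collapses into (A) once the trailing comment is stripped, and a cast is recognized only when a built-in type keyword appears, so something like `((size_t) -1)` would land in (E).

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical labels mirroring categories (A)-(G) above. */
enum macro_kind {
    MACRO_SIMPLE,   /* (A), and (B) once the comment is stripped */
    MACRO_PAREN,    /* (C) */
    MACRO_CAST,     /* (D) */
    MACRO_COMPOSED, /* (E) */
    MACRO_INIT,     /* (F) */
    MACRO_STRING,   /* (G) */
    MACRO_EMPTY     /* nothing left after stripping blanks and comments */
};

/* Returns 1 if the n-character token at s is a built-in C type keyword. */
static int is_type_keyword(const char *s, size_t n)
{
    static const char *const kw[] = {
        "void", "char", "short", "int", "long",
        "unsigned", "signed", "float", "double"
    };
    for (size_t i = 0; i < sizeof kw / sizeof kw[0]; i++)
        if (strlen(kw[i]) == n && strncmp(kw[i], s, n) == 0)
            return 1;
    return 0;
}

/* Classify the replacement text of an object-like macro (heuristic). */
enum macro_kind classify_macro_body(const char *body)
{
    while (*body == ' ' || *body == '\t')
        body++;

    /* Ignore a trailing block comment and any blanks before it. */
    size_t len = strlen(body);
    const char *comment = strstr(body, "/*");
    if (comment != NULL)
        len = (size_t)(comment - body);
    while (len > 0 && isspace((unsigned char)body[len - 1]))
        len--;

    if (len == 0)
        return MACRO_EMPTY;
    if (body[0] == '{')
        return MACRO_INIT;
    if (body[0] == '"')
        return MACRO_STRING;

    int saw_identifier = 0;
    size_t i = 0;
    while (i < len) {
        unsigned char ch = (unsigned char)body[i];
        if (isdigit(ch)) {
            /* Skip a whole numeric literal, including suffixes such as L. */
            while (i < len && (isalnum((unsigned char)body[i]) || body[i] == '.'))
                i++;
        } else if (isalpha(ch) || ch == '_') {
            size_t start = i;
            while (i < len && (isalnum((unsigned char)body[i]) || body[i] == '_'))
                i++;
            if (is_type_keyword(body + start, i - start))
                return MACRO_CAST;
            saw_identifier = 1; /* refers to another macro or name */
        } else {
            i++;
        }
    }
    if (saw_identifier)
        return MACRO_COMPOSED;
    return body[0] == '(' ? MACRO_PAREN : MACRO_SIMPLE;
}
```

With these heuristics, `127` and the commented `M_PIl` value both come out as simple, `(-LONG_MAX - 1L)` as composed, and `((void *) 0)` as a cast.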
Now, I think we can conclude that our support for object-like macros
should probably target (A) and (B), as they account for the majority of
the instances found. (C) would be nice to cover, if for no other reason
than to have some kind of regularity (if we support `#define A 1`, it's
not clear why we would not support `#define A (1)`, I believe). (E) is
a stretch goal; it's the hardest of all the cases, given that the
constant value might depend on other constants. Nevertheless, again, from
a regularity perspective, it's kind of important to have it in,
otherwise the results of jextract would be somewhat erratic - for
instance, this is my favourite example:
# define LONG_MAX 2147483647L
# define LONG_MIN (-LONG_MAX - 1L)
One could argue that if jextract deals with LONG_MAX, LONG_MIN should
also be dealt with, from an ease of use perspective.
Together, (A), (B), (C) and (E) account for some 75% of the use cases
analyzed, which seems like a healthy proportion. Of course we can't get
to them all using simple extraction-time analysis, but I think there is
some argument in favor of special casing the most common idioms.
In terms of the translation strategy, I fully agree that we should stay
clear of static fields; while in most cases, the constant value will be
some primitive type, which might suggest that static constants are good
enough, in some of the more advanced use cases (cast to pointer,
struct/array init, strings) there is some _allocation_ involved, which
is almost always a red flag for something that should happen inside
the binder.
What I was noting yesterday is that our simple extraction analysis only
deals with (A). In doing so, the success ratio is very low, and the
resulting user experience is quite erratic. So, while I agree that we
should invest in the other idioms for supporting macros (e.g.
user-aided), I believe we should also improve the story for object-like
macros up to a decent level.
Maurizio
On 29/05/18 22:51, John Rose wrote:
> TL;DR: We need a variety of binding strategies to cover the variety
> of macro types and complexities. Complexities range from simple
> constants to wacky throwaways; strategies range from pure metadata
> to machine code snippets. All of this is in reach, and gives us a big
> step towards C++. Also, beware of false friends.
>
> On May 29, 2018, at 7:51 AM, Maurizio Cimadamore
> <maurizio.cimadamore at oracle.com> wrote:
>>
>> So, to close the loop, it seems like the problem I had with
>> RTLD_DEFAULT was _not_ caused by jextract not supporting constant
>> macros - but, rather, with jextract not recognizing that specific
>> kind of macro constant.
>
> So, macro constants are a special case of what C/++ calls "object-like
> macros", as opposed to "function-like macros".
>
> The general case is intractable, of course, because it allows wacky
> stuff like:
>
> #define BEGIN {
> #define END }
>
> C programs sometimes pull stunts like this to hack the language,
> including basic notions of block structure. Programmer cleverness is
> unbounded and unrestrained, isn't it?
>
> A real and inconvenient example of this programmer cleverness is the
> traditional treatment of fields of struct stat. They are often defined
> with object-like macros as follows:
>
> #define st_atime st_atimespec.tv_sec
> /* source: /usr/include/sys/stat.h on MacOS */
>
> We can't capture this in a systematic way. Following Maurizio's write-up,
> the plan of record for dealing with such things is to ask the jextract
> user to define function-like macros and/or true functions which are
> canonical uses of such oddly-shaped macros, and fold those into the
> jextract run, as auxiliary functions (or macros) explicitly requested by
> the jextract operator, in a side-file.
>
> (Sometimes the existing header files feature such macros as well;
> sys/stat.h has S_ISBLK(m) for example. In that case, if the user can
> supply just the function type, and link it to the macro, jextract could
> make the wrapper. This is the way I handled S_ISBLK once upon a time, but
> I think for Panama it is an unnecessary refinement. See discussion below
> of level 2 vs. level 3.)
>
> Capturing such irregular standard macros, as well as auxiliary functions
> added by the jextract user, requires jextract to wrap such expressions
> in properly-typed code snippets, compile them, and deliver them as
> machine code resources to the binder. This is something we're just
> beginning to get into with Panama, although I and others like Samuel have
> pulled this trick on other projects. This trick works for object-like
> macros as well as function-like macros; simply (as we are discussing)
> wrap the object-like expression in a no-argument code snippet, and invoke
> it appropriately from the binder (either once statically or once per
> use).
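For illustration, the kind of side-file this suggests might look roughly like the sketch below. The `snippet_*` names are made up, and how such functions would be compiled and handed to the binder is left open; the sketch only shows the wrapping itself, using the EOF, st_atime and S_ISBLK examples from this thread.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>

/* Object-like macro wrapped as a no-argument snippet (hypothetical name). */
int snippet_EOF(void)
{
    return EOF;
}

/* st_atime may be a macro expanding to a nested member access, so the
 * accessor gives it an ordinary, properly typed function shape. */
time_t snippet_stat_atime(const struct stat *st)
{
    return st->st_atime;
}

/* Function-like macro from sys/stat.h given an explicit function type. */
int snippet_S_ISBLK(mode_t mode)
{
    return S_ISBLK(mode) ? 1 : 0;
}
```

The binder could call `snippet_EOF` once and cache the result, or call it on every use; the wrapper does not care which.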
>
> But most object-like macros are much simpler, and we can be
> correspondingly more automatic in handling them. A canonical case is EOF
> from stdio.h, which is usually just (-1).
>
> (I think stdio.h is rich with canonical examples, including varargs
> functions like printf and macro versions of fileno. That, stat.h, and
> qsort in stdlib.h are IMO the trifecta of tricky initial demos for header
> extraction.)
>
> The special case of an integral constant expression is tractable (although
> there are limitations and pitfalls). The C preprocessor actually defines
> a sub-language for C expressions, which is evaluated by the #if directive.
> It would be reasonable for us to handle that sub-language specially and
> precisely, although the sub-language includes recursive calls to
> object-like and function-like macros (because the cpp macro-expands #if
> arguments). For example:
> For example:
>
> #define MASK(n) ((1<<(n))-1)
> #define THREE 3
> #define MASK3 MASK(THREE)
> #if (MASK3 == 7) /* yes, it's true! */
>
> The #if argument evaluates at cpp-time (and jextract time!) to 1.
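A compilable variant of this example, assuming THREE is meant to be 3, lets the preprocessor do the folding and then exposes the outcome through two small functions. The names `MASK3_IS_SEVEN`, `mask3_is_seven` and `mask3_value` are hypothetical and exist only for this sketch.

```c
#define MASK(n) ((1 << (n)) - 1)
#define THREE 3
#define MASK3 MASK(THREE)

/* The #if operand is macro-expanded and folded by the preprocessor. */
#if (MASK3 == 7)
#define MASK3_IS_SEVEN 1
#else
#define MASK3_IS_SEVEN 0
#endif

/* Reports which #if branch was taken at preprocessing time. */
int mask3_is_seven(void)
{
    return MASK3_IS_SEVEN;
}

/* The same macro expanded in ordinary code, folded by the compiler. */
int mask3_value(void)
{
    return MASK3;
}
```

Both routes agree here, which is the property an extraction-time evaluator for this sub-language would need to preserve.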
>
> When we get to non-integral types like NULL and RTLD_DEFAULT we have
> a choice: Try to decide if we can extend the constant-folding logic to
> deduce a pointer bit-pattern plus a type, which we can then capture.
>
> There are three levels of possible support for borderline cases like
> RTLD_DEFAULT:
>
> 1. Inferred metadata: Infer the type and the bitwise value in jextract,
> and store it all in metadata (annotations).
>
> 2. Inferred snippet: Infer the type, don't infer the value, but wrap the
> expression in a snippet for the binder to invoke.
>
> 3. Explicit snippet: Take advice from the user, in the form of an
> auxiliary function, which is wrapped in a snippet for the binder, and
> accompanied by metadata derived from the type of the explicit auxiliary
> function.
>
> (The fourth possibility of explicit metadata might be useful if there were
> a hard requirement to specify that an expression like (-1) needed a
> surprising type that couldn't be inferred. But I think that corner case
> can be handled well enough as an explicit snippet; just ask the user to
> specify the whole thing and over-package it as machine code, even if we
> could transmit it as metadata.)
>
> Inferring a type is tricky; sometimes there is ambiguity and sometimes
> you just have to guess. Guessing is OK for very simple macros like max.
> Sometimes you want to be allowed to make several guesses and transmit
> them all (as overloads). Here's max:
>
> #define max(a, b) ((a) > (b) ? (a) : (b)) // int32? int64? void*? double?
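One way to picture "several guesses transmitted as overloads" is a set of typed wrappers around the same macro, one per guessed type. This is only a sketch: the `max_*` wrapper names are invented, and the pointer variant is left out because ordering unrelated pointers is not well defined in C.

```c
#include <stdint.h>

#define max(a, b) ((a) > (b) ? (a) : (b))

/* One wrapper per guessed type; each fixes the macro to one signature. */
int32_t max_i32(int32_t a, int32_t b)
{
    return max(a, b);
}

int64_t max_i64(int64_t a, int64_t b)
{
    return max(a, b);
}

double max_f64(double a, double b)
{
    return max(a, b);
}
```

On the Java side these could then surface as overloads of a single `max` method, with the binder picking the snippet that matches the argument types.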
>
> Currently we do the inferred metadata (level 1), in a limited way, for
> simple expressions. The next thing we should do, I would say, is the
> other extreme, the explicit snippet (level 3). If that is usable, then we
> don't need to fiddle with the hybrid inferred snippet, for macros.
>
> The middle level (inferred snippet) will really come into its own with
> C++, where almost every API point, for classes rich in inline access
> functions, will have to be an inferred snippet.
>
> Basically, each level is a distinct binding strategy, covering more or
> less general cases, requiring more or less help from the jextract
> operator, and passing information to the binder using a mix of strategies.
>
> A final topic: There needs to be some continuity between levels, in terms
> of how jextract allocates them to carrier interfaces in Java, and how the
> binder attaches them to those interfaces. Small platform configuration
> changes like -U__GNUC__ will shift C API points from one level to
> another, and it would be best if jextract and the binder could hide those
> effects.
>
> This has a very practical implication: We sometimes need to favor
> continuity over clever ad hoc use of Java features that are peculiar
> to one level (one binding strategy).
>
> I think the most common case where this trade-off is felt is with a
> simple level 1 object-like macro constant. We might want to use a binding
> strategy of a static interface constant (via ConstantValue attribute in
> the classfile). But this will disrupt Java APIs if there is any chance
> that the macro might be re-extracted as a level 2 or level 3 value
> (a snippet), since there is no way to bind the result of executing the
> snippet to a ConstantValue attribute. Being clever with an ad hoc
> level-specific binding strategy risks making the Java API irregular.
>
> In the case of ConstantValue attributes, we've repeatedly examined
> them and found them inappropriate for Panama. The biggest problem
> is that Java client code might be compiled against one version of an
> extracted interface, and then be used at runtime against a different
> version (on a different platform). In that case, the ConstantValue from
> the first version would be used with code from the second version,
> which would risk very subtle bugs. The cure is to avoid the temptation
> to be too clever in binding C constants to Java constants.
>
> This trade-off tends to push *all* extracted constant values to be
> function-like at the JVM level instead of value-like. I think this is OK;
> we don't have to use every last Java language feature to carry C APIs.
>
> Some Java language features are "false friends"; they look good when
> you first meet them but if you start to work with them, they cause
> problems.
>
> Examples of false friends spring easily to mind: C enums are so different
> from Java enums that we cannot translate the former into the latter;
> the most we could do is have a very special, rarely used switch for
> the jextract operator to explicitly opt into Java enums. The same
> point holds for C arrays vs. Java arrays. (Maurizio's write-up deals
> with these cases nicely.) In C++ there are other false friends: C++
> constructors are very different from Java constructors, etc.
>
> I think that the very special Java feature of static constants (backed
> by the ConstantValue attribute, a very special JVM feature) is that
> kind of false friend. The observation that switch statements in both
> languages tie nicely to the feature adds to the attraction, but doesn't
> cure the root problem, which is that ConstantValue attributes are an
> out-of-band channel for class APIs that is totally outside of the normal
> runtime linkage paradigm we are using in Panama.
>
> — John
>
> P.S. Maurizio's write-up is here:
> http://cr.openjdk.java.net/~mcimadamore/panama/panama-binder-v3.html
>
> Basic info on object-like vs. function-like macros can be found here:
> https://gcc.gnu.org/onlinedocs/cpp/Object-like-Macros.html
>
More information about the panama-dev
mailing list