[foreign] road to posix

Wed May 30 20:18:28 UTC 2018

Your cases A,B,C,E correspond to the built-in expression sub-language
I mentioned.  It's described here:
  https://gcc.gnu.org/onlinedocs/cpp/If.html#If
From reading this page, I realize there are a couple of ways this sub-language
is actually a distinct language of its own:
 - all values are 64 bits, never 32 bits
 - unknown identifiers evaluate to zero (with optional warning)
 - the syntax defined(x) is allowed
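
A minimal sketch of these differences, written so that a C compiler can check them. The `CPP_*` macro names, `NO_SUCH_MACRO` and `ENUM_FIVE` are made up for illustration; the only assumption is a C99-or-later preprocessor, where #if arithmetic is done in the widest integer type (at least 64 bits).

```c
/* Each #if below records its outcome in a macro so it can be inspected. */

/* 1. Wide arithmetic: in C proper, 2147483647 + 1 overflows a 32-bit int,
 *    but #if evaluates it in the widest integer type, so the sum is positive. */
#if 2147483647 + 1 > 0
#define CPP_ARITH_IS_WIDE 1
#else
#define CPP_ARITH_IS_WIDE 0
#endif

/* 2. Unknown identifiers: NO_SUCH_MACRO is (by assumption) never defined,
 *    so #if reads it as 0.  GCC and clang warn only under -Wundef. */
#if NO_SUCH_MACRO == 0
#define CPP_UNKNOWN_IS_ZERO 1
#else
#define CPP_UNKNOWN_IS_ZERO 0
#endif

/* 3. defined(x) is an operator that exists only inside #if/#elif. */
#if defined(CPP_ARITH_IS_WIDE) && !defined(NO_SUCH_MACRO)
#define CPP_DEFINED_WORKS 1
#else
#define CPP_DEFINED_WORKS 0
#endif

/* 4. A consequence of (2): the preprocessor cannot see C-level names.
 *    ENUM_FIVE is 5 to the compiler but an unknown identifier, i.e. 0,
 *    to #if. */
enum { ENUM_FIVE = 5 };
#if ENUM_FIVE == 0
#define CPP_IGNORES_ENUMS 1
#else
#define CPP_IGNORES_ENUMS 0
#endif
```

All four `CPP_*` macros should come out as 1, while `ENUM_FIVE` is still 5 in ordinary C code. An evaluator that reuses the compiler's constant folder would disagree with cpp on cases like (4).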

Still, it's an important design feature for two reasons:
 - Programmers have a habit of aiming at it, if they think about #if at all.
 - We might be able to use the built-in cpp expression evaluator inside of clang.

But my "gut" tells me we will have to code our own evaluator.

— John

On May 30, 2018, at 4:09 AM, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
> 
> Thanks for the writeup John.
> 
> I was aware of the multiple layers involved here; what I wanted to say is that our basic strategy for dealing with (1) runs afoul of several common cases. I've done a bit of classification of some of the most common idioms of object-like macros found in system headers, and I found the following sub-classification:
> 
> A) Simple values:
> 
> #  define SCHAR_MAX       127
> 
> B) Commented values
> 
> # define M_PIl              3.141592653589793238462643383279502884L /* pi */
> 
> C) Parenthesized values (both positive and negative)
> 
> #  define INTPTR_MAX              (9223372036854775807L)
> 
> D) Cast-converted values
> 
> # define RTLD_DEFAULT      ((void *) 0)
> 
> E) Composition of macro values (typically in simple binary expressions)
> 
> #  define LONG_MIN    (-LONG_MAX - 1L)
> 
> F) Struct/array init
> 
> #define PTHREAD_COND_INITIALIZER { { 0, 0, 0, 0, 0, (void *) 0, 0, 0 } }
> 
> G) Strings init
> 
> /usr/include/cpio.h:#define MAGIC       "070707"
> 
> Of these categories, (D), (F) and (G) fall into the uncommon category (cumulatively they account for 6% of cases, with (F) the least common and (G) the most common).
> Of the remaining, commented values (B) are overwhelmingly common (~40%); (A) of course is also common (roughly at 20%); my classification is not sharp enough to discriminate between (C) and (E), but both account for around 15% of cases.
> 
> 
> Now, I think we can conclude that our support for object-like macros should probably target (A) and (B), as they account for the majority of the instances found. (C) would be nice to cover, if for no other reason than to have some kind of regularity (if we support `#define A 1`, it's not clear why we would not support `#define A (1)`). (E) is a stretch goal; it's the hardest of all the cases, given that the constant value might depend on other constants. Nevertheless, again, from a regularity perspective, it's kind of important to have it in, otherwise the results of jextract would be somewhat erratic. For instance, this is my favourite example:
> 
> #   define LONG_MAX     2147483647L
> #  define LONG_MIN      (-LONG_MAX - 1L)
> 
> One could argue that if jextract deals with LONG_MAX, LONG_MIN should also be dealt with, from an ease of use perspective.
> 
> Together, (A), (B), (C) and (E) account for some 75% of the use cases analyzed, which seems like a healthy proportion. Of course we can't get to them all using simple extraction-time analysis, but I think there is some argument in favor of special casing the most common idioms.
> 
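A rough sketch of what special-casing idioms (A), (B) and (C) could look like at extraction time. `parse_simple_macro_body` is a hypothetical helper, not part of jextract; it is assumed to receive the macro's replacement text as a string and handles integer literals only, so floating-point bodies such as M_PIl and the composed case (E) are deliberately rejected.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Strip leading and trailing whitespace in place; returns the new start. */
static char *trim(char *s)
{
    char *end;
    while (isspace((unsigned char)*s))
        s++;
    end = s + strlen(s);
    while (end > s && isspace((unsigned char)end[-1]))
        end--;
    *end = '\0';
    return s;
}

/* Hypothetical helper: read a macro body as one integer literal.
 * Returns 1 and stores the value in *out on success, 0 otherwise. */
int parse_simple_macro_body(const char *body, long long *out)
{
    char buf[256];
    char *p, *rest, *comment;
    size_t len = strlen(body);
    long long value;

    if (len >= sizeof buf)
        return 0;
    memcpy(buf, body, len + 1);

    /* (B) drop a trailing block comment or line comment */
    if ((comment = strstr(buf, "/*")) != NULL)
        *comment = '\0';
    if ((comment = strstr(buf, "//")) != NULL)
        *comment = '\0';

    /* (C) peel outer parentheses: "(1)", "((-1))" */
    p = trim(buf);
    len = strlen(p);
    while (len >= 2 && p[0] == '(' && p[len - 1] == ')') {
        p[len - 1] = '\0';
        p = trim(p + 1);
        len = strlen(p);
    }

    /* (A) a single literal; base 0 accepts decimal, octal and hex */
    errno = 0;
    value = strtoll(p, &rest, 0);
    if (rest == p || errno == ERANGE)
        return 0;

    /* accept integer suffixes such as L, UL, LL; anything else is rejected */
    while (*rest == 'u' || *rest == 'U' || *rest == 'l' || *rest == 'L')
        rest++;
    if (*rest != '\0')
        return 0;

    *out = value;
    return 1;
}
```

With this sketch, bodies like `127`, `(9223372036854775807L)` and `0x10 /* sixteen */` are accepted, while `(-LONG_MAX - 1L)` and `((void *) 0)` are rejected and would need an expression evaluator or a snippet.
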
> 
> In terms of the translation strategy, I fully agree that we should steer clear of static fields; while in most cases the constant value will be of some primitive type, which might suggest that static constants are good enough, some of the more advanced use cases (cast to pointer, struct/array init, strings) involve some _allocation_, which is almost always a red flag for something that should happen inside the binder.
> 
> 
> What I was noting yesterday is that our simple extraction analysis only deals with (A). As a result, the success ratio is very low, and the resulting user experience quite erratic. So, while I agree that we should invest in the other idioms for supporting macros (e.g. user-aided), I believe we should also improve the story for object-like macros up to a decent level.
> 
> Maurizio
> 
> 
> 
> On 29/05/18 22:51, John Rose wrote:
>> TL;DR: We need a variety of binding strategies to cover the variety
>> of macro types and complexities.  Complexities range from simple
>> constants to wacky throwaways; strategies range from pure metadata
>> to machine code snippets.  All of this is in reach, and gives us a big
>> step towards C++.  Also, beware of false friends.
>> 
>> On May 29, 2018, at 7:51 AM, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>>> 
>>> So, to close the loop, it seems like the problem I had with RTLD_DEFAULT was _not_ caused by jextract not supporting constant macros but, rather, by jextract not recognizing that specific kind of macro constant.
>> 
>> So, macro constants are a special case of what C/++ calls "object-like
>> macros", as opposed to "function-like macros".
>> 
>> The general case is intractable, of course, because it allows wacky stuff like:
>> 
>>   #define BEGIN {
>>   #define END }
>> 
>> C programs sometimes pull stunts like this to hack the language, including basic
>> notions of block structure.  Programmer cleverness is unbounded and unrestrained,
>> isn't it?
>> 
>> A real and inconvenient example of this programmer cleverness is the traditional
>> treatment of fields of struct stat.  They are often defined with object-like macros
>> as follows:
>> 
>>   #define st_atime st_atimespec.tv_sec
>>   /* source: /usr/include/sys/stat.h on MacOS */
>> 
>> We can't capture this in a systematic way.  Following Maurizio's write-up,
>> the plan of record for dealing with such things is to ask the jextract user to define
>> function-like macros and/or true functions which are canonical uses of such
>> oddly-shaped macros, and fold those into the jextract run, as auxiliary functions
>> (or macros) explicitly requested by the jextract operator, in a side-file.
>> 
>> (Sometimes the existing header files feature such macros as well; sys/stat.h
>> has S_ISBLK(m) for example.  In that case, if the user can supply just the
>> function type, and link it to the macro, jextract could make the wrapper.  This
>> is the way I handled S_ISBLK once upon a time, but I think for Panama it
>> is an unnecessary refinement.  See discussion below of level 2 vs. level 3.)
>> 
>> Capturing such irregular standard macros, as well as auxiliary functions
>> added by the jextract user, requires jextract to wrap such expressions in
>> properly-typed code snippets, compile them, and deliver them as machine
>> code resources to the binder.  This is something we're just beginning to get
>> into with Panama, although I and others like Samuel have pulled this trick
>> on other projects.  This trick works for object-like macros as well as function-like
>> macros; simply (as we are discussing) wrap the object-like expression in a
>> no-argument code snippet, and invoke it appropriately from the binder
>> (either once statically or once per use).
>> 
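A small sketch of that wrapping for two object-like macros from standard headers. The `jx_` prefix and the function names are assumptions made for illustration, not something jextract is known to emit; the point is only that each macro becomes a properly typed, no-argument function that can be compiled ahead of time and called by a binder.

```c
#include <stddef.h>  /* NULL */
#include <stdio.h>   /* EOF */

/* Hypothetical snippet for EOF: the return type has to be chosen by
 * whoever writes (or infers) the wrapper; the macro does not state it. */
int jx_EOF(void)
{
    return EOF;
}

/* Hypothetical snippet for NULL: a pointer-valued macro is wrapped the
 * same way, so the value never has to be computed at extraction time. */
void *jx_NULL(void)
{
    return NULL;
}
```

The binder could call `jx_EOF()` once and cache the result, or call it on every use; either way the C compiler, not the extractor, decides what the macro expands to.
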
>> But most object-like macros are much simpler, and we can be correspondingly
>> more automatic in handling them.  A canonical case is EOF from stdio.h,
>> which is usually just (-1).
>> 
>> (I think stdio.h is rich with canonical examples, including varargs functions
>> like printf and macro versions of fileno.  That, stat.h, and qsort in stdlib.h
>> are IMO the trifecta of tricky initial demos for header extraction.)
>> 
>> The special case of an integral constant expression is tractable (although
>> there are limitations and pitfalls).  The C preprocessor actually defines a
>> sub-language for C expressions, which is evaluated by the #if directive.
>> It would be reasonable for us to handle that sub-language specially and
>> precisely, although the sub-language includes recursive calls to object-like
>> and function-like macros (because the cpp macro-expands #if arguments).
>> For example:
>> 
>>   #define MASK(n) ((1<<(n))-1)
>>   #define THREE 3
>>   #define MASK3 MASK(THREE)
>>   #if (MASK3 == 7)  /* yes, it's true! */
>> 
>> The #if argument evaluates at cpp-time (and jextract time!) to 1.
>> 
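The same example as a self-checking fragment, assuming THREE is meant to expand to 3. `MASK3_IS_SEVEN` and `MASK3_VALUE` are names added here for illustration; they record what the preprocessor and the compiler each make of MASK3.

```c
#define MASK(n) ((1 << (n)) - 1)
#define THREE 3
#define MASK3 MASK(THREE)

/* cpp macro-expands MASK3 to ((1 << (3)) - 1) before evaluating the #if,
 * so the nested function-like and object-like macros are both resolved. */
#if (MASK3 == 7)
#define MASK3_IS_SEVEN 1
#else
#define MASK3_IS_SEVEN 0
#endif

/* The expansion is also an integral constant expression in C proper,
 * so the compiler can fold it to the same value. */
enum { MASK3_VALUE = MASK3 };
```

Here the preprocessor and the compiler agree (`MASK3_IS_SEVEN` is 1 and `MASK3_VALUE` is 7), which is what makes this class of macro tractable.
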
>> When we get to non-integral types like NULL and RTLD_DEFAULT we have
>> a choice:  Try to decide if we can extend the constant-folding logic to deduce
>> a pointer bit-pattern plus a type, which we can then capture.
>> 
>> There are three levels of possible support for borderline cases like
>> RTLD_DEFAULT:
>> 
>> 1. Inferred metadata:  Infer the type and the bitwise value in jextract, and
>> store it all in metadata (annotations).
>> 
>> 2. Inferred snippet:  Infer the type, don't infer the value, but wrap the
>> expression in a snippet for the binder to invoke.
>> 
>> 3. Explicit snippet:  Take advice from the user, in the form of an auxiliary
>> function, which is wrapped in a snippet for the binder, and accompanied
>> by metadata derived from the type of the explicit auxiliary function.
>> 
>> (The fourth possibility of explicit metadata might be useful if there were
>> a hard requirement to specify that an expression like (-1) needed a surprising
>> type that couldn't be inferred.  But I think that corner case can be handled well
>> enough as an explicit snippet; just ask the user to specify the whole thing
>> and over-package it as machine code, even if we could transmit it as
>> metadata.)
>> 
>> Inferring a type is tricky; sometimes there is ambiguity and sometimes
>> you just have to guess.  Guessing is OK for very simple macros like max.
>> Sometimes you want to be allowed to make several guesses and transmit
>> them all (as overloads).  Here's max:
>> 
>>   #define max(a, b) ((a) > (b) ? (a) : (b))  // int32? int64? void*? double?
>> 
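A sketch of "several guesses transmitted as overloads" for max, written as explicit auxiliary functions. The function names and the particular set of types are assumptions for illustration; each wrapper pins the macro to one type.

```c
#include <stdint.h>

#define max(a, b) ((a) > (b) ? (a) : (b))

/* One hypothetical wrapper per guessed type.  The macro body is the same
 * in each; only the parameter and return types differ. */
int32_t  max_i32(int32_t a, int32_t b)   { return max(a, b); }
uint32_t max_u32(uint32_t a, uint32_t b) { return max(a, b); }
int64_t  max_i64(int64_t a, int64_t b)   { return max(a, b); }
double   max_f64(double a, double b)     { return max(a, b); }
```

The guess matters: the all-ones 32-bit pattern is -1 to `max_i32` and loses against 1, but it is `UINT32_MAX` to `max_u32` and wins.
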
>> Currently we do the inferred metadata (level 1), in a limited way, for simple
>> expressions.  The next thing we should do, I would say, is the other extreme,
>> the explicit snippet (level 3).  If that is usable, then we don't need to fiddle
>> with the hybrid inferred snippet, for macros.
>> 
>> The middle level (inferred snippet) will really come into its own with C++,
>> where almost every API point, for classes rich in inline access functions,
>> will have to be an inferred snippet.
>> 
>> Basically, each level is a distinct binding strategy, covering more or less
>> general cases, requiring more or less help from the jextract operator,
>> and passing information to the binder using a mix of strategies.
>> 
>> A final topic:  There needs to be some continuity between levels, in terms
>> of how jextract allocates them to carrier interfaces in Java, and how the
>> binder attaches them to those interfaces.  Small platform configuration
>> changes like -U__GNUC__ will shift C API points from one level to
>> another, and it would be best if jextract and the binder could hide those
>> effects.
>> 
>> This has a very practical implication:  We sometimes need to favor
>> continuity over clever ad hoc use of Java features that are peculiar
>> to one level (one binding strategy).
>> 
>> I think the most common case where this trade-off is felt is with a simple
>> level 1 object-like macro constant.  We might want to use a binding
>> strategy of a static interface constant (via ConstantValue attribute in
>> the classfile).  But this will disrupt Java APIs if there is any chance
>> that the macro might be re-extracted as a level 2 or level 3 value
>> (a snippet), since there is no way to bind the result of executing the
>> snippet to a ConstantValue attribute.  Being clever with an ad hoc
>> level-specific binding strategy risks making the Java API irregular.
>> 
>> In the case of ConstantValue attributes, we've repeatedly examined
>> them and found them inappropriate for Panama.  The biggest problem
>> is that Java client code might be compiled against one version of an
>> extracted interface, and then be used at runtime against a different
>> version (on a different platform).  In that case, the ConstantValue from
>> the first version would be used with code from the second version,
>> which would risk very subtle bugs.  The cure is to avoid the temptation
>> to be too clever in binding C constants to Java constants.
>> 
>> This trade-off tends to push *all* extracted constant values to be
>> function-like at the JVM level instead of value-like.  I think this is OK;
>> we don't have to use every last Java language feature to carry C APIs.
>> 
>> Some Java language features are "false friends"; they look good when
>> you first meet them but if you start to work with them, they cause problems.
>> 
>> Examples of false friends spring easily to mind:  C enums are so different
>> from Java enums that we cannot translate the former into the latter;
>> the most we could do is have a very special, rarely used switch for
>> the jextract operator to explicitly opt into Java enums.  The same
>> point holds for C arrays vs. Java arrays.  (Maurizio's write-up deals
>> with these cases nicely.) In C++ there are other false friends: C++
>> constructors are very different from Java constructors, etc.
>> 
>> I think that the very special Java feature of static constants (backed
>> by the ConstantValue attribute, a very special JVM feature) is that
>> kind of false friend.  The observation that switch statements in both
>> languages tie nicely to the feature adds to the attraction, but doesn't
>> cure the root problem, which is that ConstantValue attributes are an
>> out-of-band channel for class APIs that is totally outside of the normal
>> runtime linkage paradigm we are using in Panama.
>> 
>> — John
>> 
>> P.S. Maurizio's write-up is here:
>>   http://cr.openjdk.java.net/~mcimadamore/panama/panama-binder-v3.html
>> 
>> Basic info on object-like vs. function-like macros can be found here:
>>   https://gcc.gnu.org/onlinedocs/cpp/Object-like-Macros.html
>> 
>