[foreign] some JMH benchmarks

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Tue Sep 18 09:43:43 UTC 2018


No discernible difference with static (in fact, getpid was already 
static; it's exp that I had forgotten to mark as such). I would, in any 
case, expect the JIT to optimize that (given that the receiver is 
always the same value, created once at the beginning of the class's 
lifecycle).

Maurizio


On 18/09/18 01:51, Samuel Audet wrote:
> Thanks! Also, the native declaration for exp() isn't "static". That 
> might impose a large overhead...
>
> Tue, Sep 18, 2018, 0:59 Maurizio Cimadamore <maurizio.cimadamore at oracle.com>:
>
>     For the record, here's what I get for all three benchmarks if I
>     compile the JNI code with -O3:
>
>     Benchmark                          Mode  Cnt         Score        Error  Units
>     PanamaBenchmark.testJNIExp        thrpt    5  28575269.294 ± 1907726.710  ops/s
>     PanamaBenchmark.testJNIJavaQsort  thrpt    5    372148.433 ±   27178.529  ops/s
>     PanamaBenchmark.testJNIPid        thrpt    5  59240069.011 ±  403881.697  ops/s
>
>     The first and second benchmarks get faster, and very close to the
>     'direct' optimization numbers in [1]. Surprisingly, the last benchmark
>     (getpid) is quite a bit slower. I've been able to reproduce this
>     across multiple runs; for that benchmark, omitting -O3 seems to
>     achieve the best results, though I'm not sure why. It starts off
>     faster in the first couple of warmup iterations, but then it gets
>     slower in all the other runs - presumably it interacts badly with the
>     C2-generated code. For instance, this is a run with -O3 enabled:
>
>     # Run progress: 66.67% complete, ETA 00:01:40
>     # Fork: 1 of 1
>     # Warmup Iteration   1: 65182202.653 ops/s
>     # Warmup Iteration   2: 64900639.094 ops/s
>     # Warmup Iteration   3: 59314945.437 ops/s  <---------------------------------
>     # Warmup Iteration   4: 59269007.877 ops/s
>     # Warmup Iteration   5: 59239905.163 ops/s
>     Iteration   1: 59300748.074 ops/s
>     Iteration   2: 59249666.044 ops/s
>     Iteration   3: 59268597.051 ops/s
>     Iteration   4: 59322074.572 ops/s
>     Iteration   5: 59059259.317 ops/s
>
>     And this is a run with -O3 disabled:
>
>     # Run progress: 0.00% complete, ETA 00:01:40
>     # Fork: 1 of 1
>     # Warmup Iteration   1: 55882128.787 ops/s
>     # Warmup Iteration   2: 53102361.751 ops/s
>     # Warmup Iteration   3: 66964755.699 ops/s  <---------------------------------
>     # Warmup Iteration   4: 66414428.355 ops/s
>     # Warmup Iteration   5: 65328475.276 ops/s
>     Iteration   1: 64229192.993 ops/s
>     Iteration   2: 65191719.319 ops/s
>     Iteration   3: 65352022.471 ops/s
>     Iteration   4: 65152090.426 ops/s
>     Iteration   5: 65320545.712 ops/s
>
>
>     In both cases, the 3rd warmup iteration sees a performance jump -
>     with -O3 the jump is backwards, while without -O3 the jump is
>     forward, which is quite typical for a JMH benchmark, as C2
>     optimizations start to kick in.
>
>     For these reasons, I'm reluctant to update my benchmark numbers to
>     reflect the -O3 behavior (although I agree that, since the HotSpot
>     code is compiled with that optimization, it would make more sense to
>     use it as a reference).
>
>     Maurizio
>
>     [1] - http://cr.openjdk.java.net/~mcimadamore/panama/foreign-jmh.txt
>
>
>
>     On 17/09/18 16:18, Maurizio Cimadamore wrote:
>     >
>     >
>     > On 17/09/18 15:08, Samuel Audet wrote:
>     >> Yes, neither the blackhole nor the random number makes any
>     >> difference, but not calling gcc with -O3 does. Running the compiler
>     >> with optimizations on is pretty common, but they are not enabled by
>     >> default.
>     > A bit better:
>     >
>     > PanamaBenchmark.testMethod  thrpt    5  28018170.076 ± 8491668.248  ops/s
>     >
>     > But not much of a difference (I did not expect much, as the body of
>     > the native method is extremely simple).
>     >
>     > Maurizio
>     >>
>     >> Samuel
>     >>
>     >>
>     >> Mon, Sep 17, 2018, 21:37 Maurizio Cimadamore
>     >> <maurizio.cimadamore at oracle.com>:
>     >>
>     >>     Hi Samuel,
>     >>     I was planning to upload the benchmark IDE project in the near
>     >>     future (I need to clean it up a bit, so that it can be opened
>     >>     with ease).
>     >>
>     >>     My getpid example is like this; this is the Java decl:
>     >>
>     >>     public class GetPid {
>     >>
>     >>          static {
>     >>              System.loadLibrary("getpid");
>     >>          }
>     >>
>     >>          native static long getpid();
>     >>
>     >>          native double exp(double base);
>     >>     }
>     >>
>     >>     This is the JNI code:
>     >>
>     >>     JNIEXPORT jlong JNICALL Java_org_panama_GetPid_getpid
>     >>        (JNIEnv *env, jobject recv) {
>     >>         return getpid();
>     >>     }
>     >>
>     >>     JNIEXPORT jdouble JNICALL Java_org_panama_GetPid_exp
>     >>        (JNIEnv *env, jobject recv, jdouble arg) {
>     >>         return exp(arg);
>     >>     }
>     >>
>     >>     And this is the benchmark:
>     >>
>     >>     class PanamaBenchmark {
>     >>          static GetPid pidlib = new GetPid();
>     >>
>     >>          @Benchmark
>     >>          public long testJNIPid() {
>     >>              return pidlib.getpid();
>     >>          }
>     >>
>     >>          @Benchmark
>     >>          public double testJNIExp() {
>     >>              return pidlib.exp(10d);
>     >>          }
>     >>     }
>     >>
>     >>
>     >>     I think this should be rather standard?
>     >>
>     >>     I'm on Ubuntu 16.04.1, and using GCC 5.4.0. The command I use
>     >>     to compile the C lib is this:
>     >>
>     >>     gcc -I<path to jni.h> -l<path to jni lib> -shared -o libgetpid.so -fPIC GetPid.c
>     >>
>     >>     One difference I see between our two examples is the use of
>     >>     Blackhole. In my bench, I'm just returning the result of the
>     >>     call to 'exp' - which should be equivalent and, actually,
>     >>     preferred, as described here:
>     >>
>     >>     http://hg.openjdk.java.net/code-tools/jmh/file/3769055ad883/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_09_Blackholes.java#l51
>     >>
>     >>     Another minor difference I see is that I pass a constant
>     >>     argument, while you generate a random number on each iteration.
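>     >>
>     >>     For completeness, here's a sketch combining the two styles -
>     >>     returning the value (so that JMH blackholes it implicitly)
>     >>     while still using a non-constant argument; class and method
>     >>     names are hypothetical:
>     >>
>     >>     import org.openjdk.jmh.annotations.*;
>     >>
>     >>     @State(Scope.Thread)
>     >>     public class ExpBenchmark {
>     >>          static GetPid pidlib = new GetPid();
>     >>          double x;
>     >>
>     >>          @Setup(Level.Iteration)
>     >>          public void setup() { x = Math.random(); }
>     >>
>     >>          @Benchmark
>     >>          public double testExp() {
>     >>              // the returned value is consumed by JMH's implicit blackhole
>     >>              return pidlib.exp(x);
>     >>          }
>     >>     }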
>     >>
>     >>     I tried to cut and paste your benchmark and I got this:
>     >>
>     >>     Benchmark                    Mode  Cnt         Score        Error  Units
>     >>     PanamaBenchmark.testMethod  thrpt    5  26362701.827 ± 1357012.981  ops/s
>     >>
>     >>
>     >>     Which looks exactly the same as what I've got. So, for whatever
>     >>     reason, my machine seems to be slower than the one you are
>     >>     using. For what it's worth, this website [1] seems to confirm
>     >>     the difference. While clock speeds are similar, your machine has
>     >>     a higher Turbo Boost frequency and is 3-4 years newer than mine,
>     >>     so I'd expect that to make a difference in terms of internal
>     >>     optimizations etc. Note that I'm able to beat the numbers of my
>     >>     workstation using my laptop, which sports a slightly higher
>     >>     frequency and only has 2 cores and 8G of RAM.
>     >>
>     >>     [1] -
>     >>
>     https://www.cpubenchmark.net/compare/Intel-Xeon-E5-2673-v4-vs-Intel-Xeon-E5-2665/2888vs1439
>     >>
>     >>     Maurizio
>     >>
>     >>
>     >>
>     >>     On 17/09/18 11:00, Samuel Audet wrote:
>     >>     > Thanks for the figures Maurizio! It's finally good to be
>     >>     > speaking in numbers. :)
>     >>     >
>     >>     > However, you're not providing a lot of details about how you
>     >>     > actually ran the experiments. So I've decided to run a JMH
>     >>     > benchmark on what we get by default with JavaCPP and this
>     >>     > declaration:
>     >>     >
>     >>     >     @Platform(include = "math.h")
>     >>     >     public class MyBenchmark {
>     >>     >         static { Loader.load(); }
>     >>     >
>     >>     >         @NoException
>     >>     >         public static native double exp(double x);
>     >>     >
>     >>     >         @State(Scope.Thread)
>     >>     >         public static class MyState {
>     >>     >             double x;
>     >>     >
>     >>     >             @Setup(Level.Iteration)
>     >>     >             public void setupMethod() {
>     >>     >                 x = Math.random();
>     >>     >             }
>     >>     >         }
>     >>     >
>     >>     >         @Benchmark
>     >>     >         public void testMethod(MyState s, Blackhole bh) {
>     >>     >             bh.consume(exp(s.x));
>     >>     >         }
>     >>     >     }
>     >>     >
>     >>     > The relevant portion of generated JNI looks like this:
>     >>     >
>     >>     >     JNIEXPORT jdouble JNICALL Java_org_sample_MyBenchmark_exp(
>     >>     >             JNIEnv* env, jclass cls, jdouble arg0) {
>     >>     >         jdouble rarg = 0;
>     >>     >         double rval = exp(arg0);
>     >>     >         rarg = (jdouble)rval;
>     >>     >         return rarg;
>     >>     >     }
>     >>     >
>     >>     > And with access to just 2 virtual cores of an Intel(R)
>     >>     > Xeon(R) CPU E5-2673 v4 @ 2.30GHz and 8 GB of RAM on the cloud
>     >>     > (so probably slower than your E5-2665 @ 2.40GHz), running
>     >>     > Ubuntu 14.04 with GCC 4.9 and OpenJDK 8, I get these numbers:
>     >>     > Benchmark                Mode  Cnt         Score       Error  Units
>     >>     > MyBenchmark.testMethod  thrpt   25  37183556.094 ± 460795.746  ops/s
>     >>     >
>     >>     > I'm not sure how that compares with your numbers exactly, but
>     >>     > it does seem to me that what you get for JNI is a bit low. If
>     >>     > you could provide more details about how to reproduce your
>     >>     > results, that would be great.
>     >>     >
>     >>     > Samuel
>     >>     >
>     >>     >
>     >>     > On 09/14/2018 10:19 PM, Maurizio Cimadamore wrote:
>     >>     >> Hi,
>     >>     >> over the last few weeks I've been busy playing with Panama
>     >>     >> and assessing performance with JMH. For those just interested
>     >>     >> in raw numbers, the results of my explorations can be found
>     >>     >> here [1]. But as with all benchmarks, I think it's better to
>     >>     >> spend a few words understanding what these numbers actually
>     >>     >> _mean_.
>     >>     >>
>     >>     >> To evaluate the performance of Panama I first created a
>     >>     >> baseline using JNI - more specifically, I wanted to assess
>     >>     >> the performance of three calls (all part of the C std
>     >>     >> library), namely `getpid`, `exp` and `qsort`.
>     >>     >>
>     >>     >> The first example is the de facto benchmark for FFIs - since
>     >>     >> it does relatively little computation, it is a good test to
>     >>     >> measure the 'latency' of the FFI approach (e.g. how long it
>     >>     >> takes to go to native). The second example is also relatively
>     >>     >> simple, but this time the function takes a double argument.
>     >>     >> The third test is akin to an FFI torture test, since not only
>     >>     >> does it pass substantially more arguments (4), but one of
>     >>     >> these arguments is also a callback - a pointer to a function
>     >>     >> that is used to sort the contents of the input array.
>     >>     >>
>     >>     >> The first batch of JNI results confirms our expectations:
>     >>     >> `getpid` is the fastest, followed by `exp`, and then by
>     >>     >> `qsort`. Note that qsort is not even close, in terms of raw
>     >>     >> numbers, to the other two tests - that's because, to sort the
>     >>     >> array, we need to do (N * log N) upcalls into Java. In the
>     >>     >> benchmark, N = 8 and we do the upcalls using the JNI function
>     >>     >> JNIEnv::CallIntMethod.
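>     >>     >>
>     >>     >> For reference, a minimal sketch of what the Java side of such
>     >>     >> a test could look like (names are hypothetical): the JNI side
>     >>     >> passes a comparator function pointer to qsort, whose body
>     >>     >> upcalls into 'compare' via JNIEnv::CallIntMethod:
>     >>     >>
>     >>     >> public class QsortLib {
>     >>     >>      // upcalled from the native comparator, ~(N * log N) times
>     >>     >>      static int compare(int a, int b) {
>     >>     >>          return Integer.compare(a, b);
>     >>     >>      }
>     >>     >>
>     >>     >>      // JNI wrapper around the C qsort; sorts in place via 'compare'
>     >>     >>      native static void sort(int[] array);
>     >>     >> }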
>     >>     >>
>     >>     >> Now let's examine the second batch of results; these call
>     >>     >> `getpid`, `exp` and `qsort` using Panama. The numbers here are
>     >>     >> considerably lower than the JNI ones for all three benchmarks
>     >>     >> - although the first two seem to be the most problematic. To
>     >>     >> explain these results we need to peek under the hood. Panama
>     >>     >> implements foreign calls through a so-called 'universal
>     >>     >> adapter' which, given a calling scheme and a bunch of
>     >>     >> arguments (machine words), shuffles these arguments into the
>     >>     >> right registers/stack slots and then jumps to the target
>     >>     >> native function - after which another round of adaptation must
>     >>     >> be performed (e.g. to recover the return value from the right
>     >>     >> register/memory location).
>     >>     >>
>     >>     >> Needless to say, all this generality comes at a cost. Some of
>     >>     >> the cost is in Java - e.g. all arguments have to be packaged
>     >>     >> up into a long array (although this component doesn't seem to
>     >>     >> show up much in the generated JVM compiled code). A lot of the
>     >>     >> cost is in the adapter logic itself - which has to look at the
>     >>     >> 'call recipe' and move arguments around accordingly. More
>     >>     >> specifically, in order to call the native function, the
>     >>     >> adapter creates a bunch of helper C++ objects and structs
>     >>     >> which model the CPU state (e.g. in the ShuffleDowncallContext
>     >>     >> struct, we find a field for each register to be modeled in the
>     >>     >> target architecture). The adapter first has to move the values
>     >>     >> coming from the Java world (stored in the aforementioned long
>     >>     >> array) into the right context fields (and it needs to do so by
>     >>     >> looking at the recipe, which involves iterating over the
>     >>     >> recipe elements). After that's done, we can jump into the
>     >>     >> assembly stub that does the native call - this stub takes as
>     >>     >> input one of those ShuffleDowncallContext structs and loads
>     >>     >> the corresponding registers/creates the necessary stack slots
>     >>     >> ahead of the call.
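>     >>     >>
>     >>     >> To give a flavour of the Java-side packaging step, here's a
>     >>     >> minimal sketch (hypothetical names; the real recipe-driven
>     >>     >> logic is considerably richer):
>     >>     >>
>     >>     >> // box each Java argument into a machine word for the adapter
>     >>     >> static long[] toWords(Object... args) {
>     >>     >>      long[] words = new long[args.length];
>     >>     >>      for (int i = 0; i < args.length; i++) {
>     >>     >>          if (args[i] instanceof Double) {
>     >>     >>              // FP values travel as their raw bit pattern
>     >>     >>              words[i] = Double.doubleToRawLongBits((Double) args[i]);
>     >>     >>          } else {
>     >>     >>              // integral values (and pointer addresses) travel as longs
>     >>     >>              words[i] = ((Number) args[i]).longValue();
>     >>     >>          }
>     >>     >>      }
>     >>     >>      return words;
>     >>     >> }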
>     >>     >>
>     >>     >> As you can see, there's quite a lot of action going on here,
>     >>     >> and this explains the benchmark numbers; of course, if you are
>     >>     >> calling a native function that does a lot of computation, this
>     >>     >> adaptation cost will wash out - but for relatively quick calls
>     >>     >> such as 'getpid' and 'exp' the latency dominates the picture.
>     >>     >>
>     >>     >> Digression: the callback logic suffers from pretty much the
>     >>     >> same issues, albeit in reverse order - this time it's the Java
>     >>     >> code which receives a 'snapshot' of the register values from a
>     >>     >> generated assembly adapter; the Java code can then read such
>     >>     >> values (using the Pointer API), turn them into Java objects,
>     >>     >> call the target Java method and store the results (after
>     >>     >> another conversion) in the right location of the snapshot. The
>     >>     >> assembly adapter will then pick up the values set onto the
>     >>     >> snapshot by the Java code, store them into the corresponding
>     >>     >> registers and return control to the native callee. In the
>     >>     >> remainder of this email we will not discuss callbacks in
>     >>     >> detail - we will just posit that for any optimization
>     >>     >> technique that can be defined, there exists a _dual_ strategy
>     >>     >> that works with callbacks.
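>     >>     >>
>     >>     >> A rough sketch of this upcall direction (all names here are
>     >>     >> hypothetical stand-ins; the real code reads and writes the
>     >>     >> snapshot through the Pointer API):
>     >>     >>
>     >>     >> // stand-in for the register snapshot filled by the asm adapter
>     >>     >> interface RegisterSnapshot {
>     >>     >>      long getIntegerArg(int index);      // read a saved register
>     >>     >>      void setIntegerReturn(long value);  // set the return register
>     >>     >> }
>     >>     >>
>     >>     >> static int targetMethod(int x) { return x + 1; }
>     >>     >>
>     >>     >> static void upcallHandler(RegisterSnapshot snapshot) {
>     >>     >>      int arg = (int) snapshot.getIntegerArg(0); // unmarshal argument
>     >>     >>      int result = targetMethod(arg);            // call the Java target
>     >>     >>      snapshot.setIntegerReturn(result);         // marshal the result
>     >>     >> }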
>     >>     >>
>     >>     >> How can we make sensible native calls go faster? Well, one
>     >>     >> obvious way would be to optimize the universal adapter so that
>     >>     >> we get a specialized assembly stub for each code shape. If we
>     >>     >> do that, we can move pretty much all of the computation
>     >>     >> described above from execution time to stub generation time,
>     >>     >> so that, by the time we have to call the native function, we
>     >>     >> just have to populate the right registers (the specialized
>     >>     >> stub knows where to find them) and jump. While this sounds
>     >>     >> like a good approach, it feels like there's also a role for
>     >>     >> the JIT somewhere in there - after all, the JVM knows which
>     >>     >> calls are hot and in need of optimization, so perhaps this
>     >>     >> specialization process (some or all of it) could happen
>     >>     >> dynamically. And this is indeed an approach we'd like to aim
>     >>     >> for in the long run.
>     >>     >>
>     >>     >> Now, a few years ago, Vlad put together a patch which now
>     >>     >> lives in the 'linkToNative' branch [6, 7] - the goal of this
>     >>     >> patch is to implement the approach described above: generate a
>     >>     >> specialized assembly adapter for a given native signature, and
>     >>     >> then leverage the JIT to optimize it away, turning the adapter
>     >>     >> into a bare, direct native method call. As you can see from
>     >>     >> the third batch of benchmarks, if we tweak Panama to use the
>     >>     >> linkToNative machinery, the speedup is really impressive, and
>     >>     >> we end up being much faster than JNI (up to 4x for getpid).
>     >>     >>
>     >>     >> Unfortunately, the technology in the linkToNative branch is
>     >>     >> not ready for prime time (yet) - first, it doesn't cover some
>     >>     >> useful cases (e.g. varargs, multiple returns via registers,
>     >>     >> arguments passed in memory). That is, the logic assumes
>     >>     >> there's a 1-1 mapping between a Java signature and the native
>     >>     >> function to be called - and that the arguments passed from
>     >>     >> Java will be either longs or doubles. While we can work around
>     >>     >> this limitation and define the necessary marshalling logic in
>     >>     >> Java (as I have done to run this benchmark), some of the
>     >>     >> limitations (multiple returns, structs passed by value which
>     >>     >> are too big) cannot simply be worked around. But that's fine:
>     >>     >> we can still have a fast path for those calls which have
>     >>     >> certain characteristics, and a slow path (through the
>     >>     >> universal adapter) for all the other calls.
>     >>     >>
>     >>     >> But there's a second, more serious issue lurking: as you can
>     >>     >> see in the benchmark, I was not able to get the qsort
>     >>     >> benchmark running when using the linkToNative backend. The
>     >>     >> reason is that the linkToNative code is still pretty raw, and
>     >>     >> it doesn't fully adhere to the JVM's internal conventions -
>     >>     >> e.g. there are missing thread state transitions which, in the
>     >>     >> case of upcalls into Java, create issues when it comes to
>     >>     >> garbage collection, as the GC cannot parse the native stack in
>     >>     >> the correct way.
>     >>     >>
>     >>     >> This means that, while there's a clear shining path ahead of
>     >>     >> us, it is simply too early to just use the linkToNative
>     >>     >> backend from Panama. For this reason, I've been looking into
>     >>     >> some kind of stopgap solution - another way of optimizing
>     >>     >> native calls (and upcalls into Java) that doesn't require too
>     >>     >> much VM magic. Now, a crucial observation is that, in many
>     >>     >> native calls, there is indeed a 1-1 mapping between Java
>     >>     >> arguments and native arguments (and back, for return values).
>     >>     >> That is, we can think of calling a native function as a
>     >>     >> process that takes a bunch of Java arguments, turns them into
>     >>     >> native arguments (either doubles or longs), calls the native
>     >>     >> method, and then turns the result back into a Java value.
>     >>     >>
>     >>     >> The mapping between Java arguments and C values is quite
>     >>     >> simple (see the sketch after this list):
>     >>     >>
>     >>     >> * primitives: either long or double, depending on whether they
>     >>     >> describe an integral value or a floating point one
>     >>     >> * pointers: they convert to a long
>     >>     >> * callbacks: they also convert to a long
>     >>     >> * structs: they are recursively decomposed into fields, and
>     >>     >> each field is marshalled separately (assuming the struct is
>     >>     >> not too big, in which case it is passed in memory)
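>     >>     >>
>     >>     >> And here's the promised sketch; it shows the struct rule, the
>     >>     >> only recursive one (Struct and Field are hypothetical
>     >>     >> stand-ins, not the real API):
>     >>     >>
>     >>     >> interface Struct { java.util.List<Field> fields(); }
>     >>     >>
>     >>     >> interface Field {
>     >>     >>      boolean isStruct();
>     >>     >>      boolean isFloatingPoint();
>     >>     >>      Struct asStruct();
>     >>     >>      long longValue();       // integral/pointer/callback payload
>     >>     >>      double doubleValue();   // FP payload
>     >>     >> }
>     >>     >>
>     >>     >> // recursively decompose a (small enough) struct into words
>     >>     >> static void lower(Struct s, java.util.List<Object> out) {
>     >>     >>      for (Field f : s.fields()) {
>     >>     >>          if (f.isStruct()) {
>     >>     >>              lower(f.asStruct(), out);  // recurse into nested struct
>     >>     >>          } else if (f.isFloatingPoint()) {
>     >>     >>              out.add(f.doubleValue());  // FP field -> double
>     >>     >>          } else {
>     >>     >>              out.add(f.longValue());    // everything else -> long
>     >>     >>          }
>     >>     >>      }
>     >>     >> }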
>     >>     >>
>     >>     >> So, in principle, we could define a bunch of native entry
>     >>     >> points in the VM, one per shape, each taking a bunch of longs
>     >>     >> and doubles and calling an underlying function with those
>     >>     >> arguments. For instance, let's consider the case of a native
>     >>     >> function which is modelled in Java as:
>     >>     >>
>     >>     >> int m(Pointer<Foo>, double)
>     >>     >>
>     >>     >> To call this native function, we have to first turn the Java
>     >>     >> arguments into a (long, double) pair. Then we need to call a
>     >>     >> native adapter that looks like the following:
>     >>     >>
>     >>     >> jlong NI_invokeNative_J_JD(JNIEnv *env, jobject _unused,
>     >>     >>                            jlong addr, jlong arg0, jdouble arg1) {
>     >>     >>      return ((jlong (*)(jlong, jdouble))addr)(arg0, arg1);
>     >>     >> }
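>     >>     >>
>     >>     >> On the Java side, such an entry point could be surfaced with
>     >>     >> a declaration along these lines (a sketch; the actual binding
>     >>     >> mechanism may well differ):
>     >>     >>
>     >>     >> class NativeInvokers {
>     >>     >>      // 'addr' is the address of the target C function, e.g.
>     >>     >>      // obtained by looking up the symbol in the loaded library
>     >>     >>      native static long invokeNative_J_JD(long addr,
>     >>     >>                                           long arg0, double arg1);
>     >>     >> }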
>     >>     >>
>     >>     >> And this will take care of calling the native function and
>     >>     >> returning the value back. This is, admittedly, a very simple
>     >>     >> solution; of course there are limitations: we have to define a
>     >>     >> bunch of specialized native entry points (and Java entry
>     >>     >> points, for callbacks). But here we can play a trick: most
>     >>     >> modern ABIs pass arguments in registers; for instance, the
>     >>     >> System V ABI [5] uses up to 6 (!!) integer registers and 8
>     >>     >> (!!) SSE registers for FP values - this gives us a total of 14
>     >>     >> registers available for argument passing, which covers quite a
>     >>     >> lot of cases. Now, if we have a call where _all_ arguments are
>     >>     >> passed in registers, then the order in which these arguments
>     >>     >> are declared in the adapter doesn't matter! That is, since FP
>     >>     >> values will always be passed in different registers from
>     >>     >> integral values, we can just define entry points which look
>     >>     >> like these:
>     >>     >>
>     >>     >> invokeNative_V_DDDDD
>     >>     >> invokeNative_V_JDDDD
>     >>     >> invokeNative_V_JJDDD
>     >>     >> invokeNative_V_JJJDD
>     >>     >> invokeNative_V_JJJJD
>     >>     >> invokeNative_V_JJJJJ
>     >>     >>
>     >>     >> That is, for a given arity (5 in this case), we can just put
>     >>     >> all the long arguments in front, and the double arguments
>     >>     >> after them. We don't need to generate all possible
>     >>     >> permutations of J/D in all positions - the adapter will always
>     >>     >> do the same thing (read: load from the same registers) for all
>     >>     >> equivalent combinations. This keeps the number of entry points
>     >>     >> in check - though it also poses some challenges for the Java
>     >>     >> logic in charge of marshalling/unmarshalling, as there's an
>     >>     >> extra permutation step involved (although that is not
>     >>     >> something super-hard to address); see the sketch below.
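>     >>     >>
>     >>     >> Here's the promised sketch of that permutation step
>     >>     >> (hypothetical helper): reorder the lowered arguments so that
>     >>     >> longs come first, then derive the entry point name for the
>     >>     >> resulting shape:
>     >>     >>
>     >>     >> static String permute(Object[] lowered, java.util.List<Object> out) {
>     >>     >>      int longs = 0;
>     >>     >>      for (Object a : lowered) {
>     >>     >>          if (a instanceof Long) {
>     >>     >>              out.add(longs++, a);  // longs go to the front, in order
>     >>     >>          } else {
>     >>     >>              out.add(a);           // doubles stay after the longs
>     >>     >>          }
>     >>     >>      }
>     >>     >>      StringBuilder shape = new StringBuilder("invokeNative_V_");
>     >>     >>      for (int i = 0; i < lowered.length; i++) {
>     >>     >>          shape.append(i < longs ? 'J' : 'D');
>     >>     >>      }
>     >>     >>      return shape.toString();  // e.g. "invokeNative_V_JJDDD"
>     >>     >> }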
>     >>     >>
>     >>     >> You can see the performance numbers associated with this
>     >>     >> invocation scheme (which I've dubbed 'direct') in the 4th
>     >>     >> batch of the benchmark results. These numbers are on par with
>     >>     >> (and slightly better than) JNI in all three cases considered,
>     >>     >> which is, I think, a very positive result, given that to write
>     >>     >> these benchmarks I did not have to write a single line of JNI
>     >>     >> code. In other words, this optimization gives you the same
>     >>     >> speed as JNI, with improved ease of use (**).
>     >>     >>
>     >>     >> Now, since the 'direct' optimization builds on top of the VM
>     >>     >> native call adapters, this approach is significantly more
>     >>     >> robust than linkToNative, and I have not run into any weird VM
>     >>     >> crashes when playing with it. The downside is that, for
>     >>     >> obvious reasons, this approach cannot get much faster than JNI
>     >>     >> - that is, it cannot get close to the numbers obtained with
>     >>     >> the linkToNative backend, which features much deeper
>     >>     >> optimizations. But I think that, despite its limitations, it's
>     >>     >> still a good opportunistic improvement that is worth pursuing
>     >>     >> in the short term (while we sort out the linkToNative story).
>     >>     >> For this reason, I will soon be submitting a review which
>     >>     >> incorporates the changes for the 'direct' invocation scheme.
>     >>     >>
>     >>     >> Cheers
>     >>     >> Maurizio
>     >>     >>
>     >>     >> [1] - http://cr.openjdk.java.net/~mcimadamore/panama/foreign-jmh.txt
>     >>     >> [2] - https://github.com/jnr/jnr-ffi
>     >>     >> [3] - https://github.com/jnr/jffi
>     >>     >> [4] - https://sourceware.org/libffi/
>     >>     >> [5] - https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf
>     >>     >>
>     >>     >> [6] - http://cr.openjdk.java.net/~jrose/panama/native-call-primitive.html
>     >>     >> [7] - http://hg.openjdk.java.net/panama/dev/shortlog/b9ebb1bb8354
>     >>     >>
>     >>     >> (**) the benchmark also contains a 5th row, in which I
>     >>     >> repeated the same tests, this time using JNR [2]. JNR is built
>     >>     >> on top of libjffi [3], a JNI library in turn built on top of
>     >>     >> the popular libffi [4]. I wanted to have some numbers for JNR
>     >>     >> because that's another solution that allows for better ease of
>     >>     >> use, taking care of marshalling Java values into C and back;
>     >>     >> since the goals of JNR are similar in spirit to some of the
>     >>     >> goals of the Panama/foreign work, I thought it would be worth
>     >>     >> having a comparison of these approaches. For the record, I
>     >>     >> think the JNR numbers are very respectable, given that JNR had
>     >>     >> to do all the hard work outside of the JDK!
>     >>     >>
>     >>     >>
>     >>     >
>     >>
>     >
>


