[foreign] some JMH benchmarks

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Mon Sep 17 15:18:39 UTC 2018



On 17/09/18 15:08, Samuel Audet wrote:
> Yes, the blackhole or the random number doesn't make any difference, 
> but not calling gcc with -O3 does. Running the compiler with 
> optimizations on is pretty common, but they are not enabled by default.
A bit better:

PanamaBenchmark.testMethod  thrpt    5  28018170.076 ± 8491668.248 ops/s

But not much of a difference (I did not expect much, as the body of
the native method is extremely simple).

Maurizio
>
> Samuel
>
>
> On Mon, Sep 17, 2018 at 21:37 Maurizio Cimadamore
> <maurizio.cimadamore at oracle.com> wrote:
>
>     Hi Samuel,
>     I was planning to upload the benchmark IDE project in the near
>     future (I need to clean it up a bit, so that it can be opened
>     with ease).
>
>     My getpid example is like this; this is the Java decl:
>
>     public class GetPid {
>
>          static {
>              System.loadLibrary("getpid");
>          }
>
>          native long getpid();
>
>          native double exp(double base);
>     }
>
>     This is the JNI code:
>
>     #include <jni.h>
>     #include <unistd.h>   /* getpid */
>     #include <math.h>     /* exp */
>
>     JNIEXPORT jlong JNICALL Java_org_panama_GetPid_getpid
>        (JNIEnv *env, jobject recv) {
>         return getpid();
>     }
>
>     JNIEXPORT jdouble JNICALL Java_org_panama_GetPid_exp
>        (JNIEnv *env, jobject recv, jdouble arg) {
>         return exp(arg);
>     }
>
>     And this is the benchmark:
>
>     import org.openjdk.jmh.annotations.Benchmark;
>
>     public class PanamaBenchmark {
>          static GetPid pidlib = new GetPid();
>
>          @Benchmark
>          public long testJNIPid() {
>              return pidlib.getpid();
>          }
>
>          @Benchmark
>          public double testJNIExp() {
>              return pidlib.exp(10d);
>          }
>     }
>
>
>     I think this should be rather standard?
>
>     I'm on Ubuntu 16.04.1, and using GCC 5.4.0. The command I use to
>     compile the C lib is this:
>
>     gcc -I<path to jni.h> -L<path to jni lib> -shared -o libgetpid.so
>     -fPIC GetPid.c
>
>     One difference I see between our two examples is the use of
>     Blackhole. In my bench, I'm just returning the result of the call
>     to 'exp' - which should be equivalent and is, actually, the
>     preferred style, as described here:
>     http://hg.openjdk.java.net/code-tools/jmh/file/3769055ad883/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_09_Blackholes.java#l51
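>
>     For illustration, these two shapes should be equivalent as far as
>     JMH is concerned (a minimal sketch; 'lib' stands for whichever
>     binding is under test):
>
>          @Benchmark
>          public double viaReturn() {
>              return lib.exp(10d);      // JMH consumes the returned value
>          }
>
>          @Benchmark
>          public void viaBlackhole(Blackhole bh) {
>              bh.consume(lib.exp(10d)); // explicit consumption
>          }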
>
>     Another minor difference I see is that I pass a constant argument,
>     while you generate a random number on each iteration.
>
>     I tried to cut and paste your benchmark and I got this:
>
>     Benchmark                    Mode  Cnt         Score         Error  Units
>     PanamaBenchmark.testMethod  thrpt    5  26362701.827 ± 1357012.981  ops/s
>
>
>     This looks exactly the same as what I've got. So, for whatever
>     reason, my machine seems to be slower than the one you are using.
>     For what it's worth, this website [1] seems to confirm the
>     difference: while base clock speeds are similar, your machine has
>     a higher Turbo Boost frequency and is 3-4 years newer than mine,
>     so I'd expect that to make a difference in terms of internal
>     optimizations etc. Note that I'm able to beat the numbers of my
>     workstation using my laptop, which sports a slightly higher
>     frequency and only has 2 cores and 8G of RAM.
>
>     [1] -
>     https://www.cpubenchmark.net/compare/Intel-Xeon-E5-2673-v4-vs-Intel-Xeon-E5-2665/2888vs1439
>
>     Maurizio
>
>
>
>     On 17/09/18 11:00, Samuel Audet wrote:
>     > Thanks for the figures, Maurizio! It's finally good to be
>     > speaking in numbers. :)
>     >
>     > However, you're not providing a lot of details about how you
>     > actually ran the experiments. So I've decided to run a JMH
>     > benchmark on what we get by default with JavaCPP and this
>     > declaration:
>     >
>     >     @Platform(include = "math.h")
>     >     public class MyBenchmark {
>     >         static { Loader.load(); }
>     >
>     >         @NoException
>     >         public static native double exp(double x);
>     >
>     >         @State(Scope.Thread)
>     >         public static class MyState {
>     >             double x;
>     >
>     >             @Setup(Level.Iteration)
>     >             public void setupMethod() {
>     >                 x = Math.random();
>     >             }
>     >         }
>     >
>     >         @Benchmark
>     >         public void testMethod(MyState s, Blackhole bh) {
>     >             bh.consume(exp(s.x));
>     >         }
>     >     }
>     >
>     > The relevant portion of generated JNI looks like this:
>     >
>     >     JNIEXPORT jdouble JNICALL Java_org_sample_MyBenchmark_exp
>     >       (JNIEnv* env, jclass cls, jdouble arg0) {
>     >         jdouble rarg = 0;
>     >         double rval = exp(arg0);
>     >         rarg = (jdouble)rval;
>     >         return rarg;
>     >     }
>     >
>     > And with access to just 2 virtual cores of an Intel(R) Xeon(R)
>     > CPU E5-2673 v4 @ 2.30GHz and 8 GB of RAM on the cloud (so
>     > probably slower than your E5-2665 @ 2.40GHz), running Ubuntu
>     > 14.04 with GCC 4.9 and OpenJDK 8, I get these numbers:
>     >
>     > Benchmark               Mode  Cnt         Score       Error  Units
>     > MyBenchmark.testMethod  thrpt   25  37183556.094 ± 460795.746  ops/s
>     >
>     > I'm not sure how that compares with your numbers exactly, but it
>     > does seem to me that what you get for JNI is a bit low. If you
>     > could provide more details about how to reproduce your results,
>     > that would be great.
>     >
>     > Samuel
>     >
>     >
>     > On 09/14/2018 10:19 PM, Maurizio Cimadamore wrote:
>     >> Hi,
>     >> over the last few weeks I've been busy playing with Panama and
>     >> assessing performances with JMH. For those just interested in
>     >> raw numbers, the results of my explorations can be found here
>     >> [1]. But as with all benchmarks, I think it's worth spending a
>     >> few words to understand what these numbers actually _mean_.
>     >>
>     >> To evaluate the performances of Panama I have first created a
>     >> baseline using JNI - more specifically, I wanted to assess the
>     >> performances of three calls (all part of the C std library),
>     >> namely `getpid`, `exp` and `qsort`.
>     >>
>     >> The first example is the de facto benchmark for FFIs - since it
>     >> does relatively little computation, it is a good test to measure
>     >> the 'latency' of the FFI approach (e.g. how long does it take to
>     >> go to native). The second example is also relatively simple, but
>     >> this time the function takes a double argument. The third test
>     >> is akin to an FFI torture test, since not only does it pass
>     >> substantially more arguments (4), but one of these arguments is
>     >> also a callback - a pointer to a function that is used to sort
>     >> the contents of the input array.
>     >>
>     >> The first batch of JNI results confirms our expectations:
>     >> `getpid` is the fastest, followed by `exp`, and then by
>     >> `qsort`. Note that qsort is not even close in terms of raw
>     >> numbers to the other two tests - that's because, to sort the
>     >> array, we need to do (N * log N) upcalls into Java. In the
>     >> benchmark, N = 8 and we do the upcalls using the JNI function
>     >> JNIEnv::CallIntMethod.
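>     >>
>     >> To make the torture test concrete, here is a sketch of what the
>     >> Java side of such a qsort binding could look like (illustrative
>     >> names, not the actual benchmark code); the JNI side performs one
>     >> CallIntMethod upcall into compare() for each comparison:
>     >>
>     >> public class QsortLib {
>     >>      interface IntComparator {
>     >>          int compare(long leftAddr, long rightAddr);
>     >>      }
>     >>
>     >>      // sorts `nelems` elements of `width` bytes starting at
>     >>      // `addr`, upcalling into `cmp` once per comparison
>     >>      native static void qsort(long addr, long nelems, long width,
>     >>                               IntComparator cmp);
>     >> }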
>     >>
>     >> Now let's examine the second batch of results; these call
>     >> `getpid`, `exp` and `qsort` using Panama. The numbers here are
>     >> considerably lower than the JNI ones for all three benchmarks -
>     >> although the first two seem to be the most problematic. To
>     >> explain these results we need to peek under the hood. Panama
>     >> implements foreign calls through a so-called 'universal adapter'
>     >> which, given a calling scheme and a bunch of arguments (machine
>     >> words), shuffles these arguments into the right registers/stack
>     >> slots and then jumps to the target native function - after which
>     >> another round of adaptation must be performed (e.g. to recover
>     >> the return value from the right register/memory location).
>     >>
>     >> Needless to say, all this generality comes at a cost - some of
>     >> the cost is in Java - e.g. all arguments have to be packaged up
>     >> into a long array (although this component doesn't seem to show
>     >> up much in the generated JVM compiled code). A lot of the cost
>     >> is in the adapter logic itself - which has to look at the 'call
>     >> recipe' and move arguments around accordingly - more
>     >> specifically, in order to call the native function, the adapter
>     >> creates a bunch of helper C++ objects and structs which model
>     >> the CPU state (e.g. in the ShuffleDowncallContext struct, we
>     >> find a field for each register to be modeled in the target
>     >> architecture). The adapter has to first move the values coming
>     >> from the Java world (stored in the aforementioned long array)
>     >> into the right context fields (and it needs to do so by looking
>     >> at the recipe, which involves iteration over the recipe
>     >> elements). After that's done, we can jump into the assembly stub
>     >> that does the native call - this stub will take as input one of
>     >> those ShuffleDowncallContext structures and will load the
>     >> corresponding registers/create the necessary stack slots ahead
>     >> of the call.
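>     >>
>     >> In heavily simplified, Java-flavored pseudocode, the per-call
>     >> work of the adapter looks roughly like this (the real logic
>     >> lives in C++ and assembly; all names here are illustrative):
>     >>
>     >> // move each Java-provided word into the context slot that the
>     >> // call recipe prescribes for it
>     >> for (RecipeElement e : recipe.elements()) {
>     >>      long word = javaArgs[e.sourceIndex()];
>     >>      switch (e.storageKind()) {
>     >>          case INTEGER_REG: context.intRegs[e.targetIndex()] = word; break;
>     >>          case VECTOR_REG:  context.vectorRegs[e.targetIndex()] = word; break;
>     >>          case STACK:       context.stackSlots[e.targetIndex()] = word; break;
>     >>      }
>     >> }
>     >> // then jump to the assembly stub, which loads registers from
>     >> // the context and performs the actual native call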
>     >>
>     >> As you can see, there's quite a lot of action going on here,
>     >> and this explains the benchmark numbers; of course, if you are
>     >> calling a native function that does a lot of computation, this
>     >> adaptation cost will wash out - but for relatively quick calls
>     >> such as 'getpid' and 'exp' the latency dominates the picture.
>     >>
>     >> Digression: the callback logic suffers from pretty much the
>     >> same issues, albeit in a reversed order - this time it's the
>     >> Java code which receives a 'snapshot' of the register values
>     >> from a generated assembly adapter; the Java code can then read
>     >> such values (using the Pointer API), turn them into Java
>     >> objects, call the target Java method and store the results
>     >> (after another conversion) in the right location of the
>     >> snapshot. The assembly adapter will then pick up the values set
>     >> onto the snapshot by the Java code, store them into the
>     >> corresponding registers and return control to the native
>     >> caller. In the remainder of this email we will not discuss
>     >> callbacks in detail - we will just posit that for any
>     >> optimization technique that can be defined, there exists a
>     >> _dual_ strategy that works with callbacks.
>     >>
>     >> How can we make sensible native calls go faster? Well, one
>     >> obvious way would be to optimize the universal adapter so that
>     >> we get a specialized assembly stub for each code shape. If we do
>     >> that, we can move pretty much all of the computation described
>     >> above from execution time to stub generation time, so that, by
>     >> the time we have to call the native function, we just have to
>     >> populate the right registers (the specialized stub knows where
>     >> to find them) and jump. While this sounds like a good approach,
>     >> it feels like there's also a role for the JIT somewhere in there
>     >> - after all, the JVM knows which calls are hot and in need of
>     >> optimization, so perhaps this specialization process (some or
>     >> all of it) could happen dynamically. And this is indeed an
>     >> approach we'd like to aim for in the long run.
>     >>
>     >> Now, a few years ago, Vlad put together a patch which now lives
>     >> in the 'linkToNative' branch [6, 7] - the goal of this patch is
>     >> to implement the approach described above: generate a
>     >> specialized assembly adapter for a given native signature, and
>     >> then leverage the JIT to optimize it away, turning the adapter
>     >> into a bare, direct, native method call. As you can see from the
>     >> third batch of benchmarks, if we tweak Panama to use the
>     >> linkToNative machinery, the speedup is really impressive, and we
>     >> end up being much faster than JNI (up to 4x for getpid).
>     >>
>     >> Unfortunately, the technology in the linkToNative branch is not
>     >> ready for prime time (yet) - first, it doesn't cover some useful
>     >> cases (e.g. varargs, multiple returns via registers, arguments
>     >> passed in memory). That is, the logic assumes there's a 1-1
>     >> mapping between a Java signature and the native function to be
>     >> called - and that the arguments passed from Java will either be
>     >> longs or doubles. While we can work around this limitation and
>     >> define the necessary marshalling logic in Java (as I have done
>     >> to run this benchmark), some of the limitations (multiple
>     >> returns, structs passed by value which are too big) cannot
>     >> simply be worked around. But that's fine, we can still have a
>     >> fast path for those calls which have certain characteristics and
>     >> a slow path (through the universal adapter) for all the other
>     >> calls.
>     >>
>     >> But there's a second and more serious issue lurking: as you can
>     >> see in the benchmark, I was not able to get the qsort benchmark
>     >> running when using the linkToNative backend. The reason is that
>     >> the linkToNative code is still pretty raw, and it doesn't fully
>     >> adhere to the JVM internal conventions - e.g. there are missing
>     >> thread state transitions which, in the case of upcalls into
>     >> Java, create issues when it comes to garbage collection, as the
>     >> GC cannot parse the native stack in the correct way.
>     >>
>     >> This means that, while there's a clear shining path ahead of
>     >> us, it is simply too early to just use the linkToNative backend
>     >> from Panama. For this reason, I've been looking into some kind
>     >> of stopgap solution - another way of optimizing native calls
>     >> (and upcalls into Java) that doesn't require too much VM magic.
>     >> Now, a crucial observation is that, in many native calls, there
>     >> is indeed a 1-1 mapping between Java arguments and native
>     >> arguments (and back, for return values). That is, we can think
>     >> of calling a native function as a process that takes a bunch of
>     >> Java arguments, turns them into native arguments (either longs
>     >> or doubles), calls the native method and then turns the result
>     >> back into a Java value.
>     >>
>     >> The mapping between Java arguments and C values is quite simple:
>     >>
>     >> * primitives: either long or double, depending on whether they
>     >> describe an integral value or a floating-point one
>     >> * pointers: they convert to a long
>     >> * callbacks: they also convert to a long
>     >> * structs: they are recursively decomposed into fields and each
>     >> field is marshalled separately (assuming the struct is not too
>     >> big, in which case it is passed in memory)
>     >>
>     >> So, in principle, we could define a bunch of native entry
>     >> points in the VM, one per shape, which take a bunch of longs and
>     >> doubles and call an underlying function with those arguments.
>     >> For instance, let's consider the case of a native function which
>     >> is modelled in Java as:
>     >>
>     >> int m(Pointer<Foo>, double)
>     >>
>     >> To call this native function we have to first turn the Java
>     >> arguments into a (long, double) pair. Then we need to call a
>     >> native adapter that looks like the following:
>     >>
>     >> jlong NI_invokeNative_J_JD(JNIEnv *env, jobject _unused,
>     >>                            jlong addr, jlong arg0, jdouble arg1) {
>     >>      return ((jlong (*)(jlong, jdouble))addr)(arg0, arg1);
>     >> }
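>     >>
>     >> On the Java side, the call could then be lowered to something
>     >> like this (a hedged sketch: lookupSymbol and Pointer.addr are
>     >> illustrative stand-ins, and NI_invokeNative_J_JD is assumed to
>     >> be exposed as a native method):
>     >>
>     >> long addr = lookupSymbol("m");  // address of the native function
>     >> long arg0 = fooPtr.addr();      // Pointer<Foo> lowers to a long
>     >> double arg1 = 42.0;             // primitives pass through
>     >> int res = (int) NI_invokeNative_J_JD(addr, arg0, arg1);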
>     >>
>     >> Together, these two pieces take care of calling the native
>     >> function and returning the value back. This is, admittedly, a
>     >> very simple solution; of course there are limitations: we have
>     >> to define a bunch of specialized native entry points (and Java
>     >> entry points, for callbacks). But here we can play a trick: most
>     >> modern ABIs pass arguments in registers; for instance, the
>     >> System V ABI [5] uses up to 6 (!!) integer registers and 8 (!!)
>     >> SSE registers for FP values - this gives us a total of 14
>     >> registers available for argument passing, which covers quite a
>     >> lot of cases. Now, if we have a call where _all_ arguments are
>     >> passed in registers, then the order in which these arguments are
>     >> declared in the adapter doesn't matter! That is, since FP values
>     >> will always be passed in different registers from integral
>     >> values, we can just define entry points which look like these:
>     >>
>     >> invokeNative_V_DDDDD
>     >> invokeNative_V_JDDDD
>     >> invokeNative_V_JJDDD
>     >> invokeNative_V_JJJDD
>     >> invokeNative_V_JJJJD
>     >> invokeNative_V_JJJJJ
>     >>
>     >> That is, for a given arity (5 in this case), we can just put
>     >> all the long arguments in front, and the double arguments after
>     >> that. This means we don't need to generate all possible
>     >> permutations of J/D in all positions - the adapter will always
>     >> do the same thing (read: load from the same registers) for all
>     >> equivalent combinations. This keeps the number of entry points
>     >> in check - though it also poses some challenges for the Java
>     >> logic in charge of marshalling/unmarshalling, as there's an
>     >> extra permutation step involved (although that is not something
>     >> super-hard to address).
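>     >>
>     >> As an illustration, that permutation step could be computed
>     >> with something like the following (a hedged sketch, not the
>     >> actual Panama code):
>     >>
>     >> // given the J/D shape of a call, return the order in which the
>     >> // Java-provided words must be handed to the canonical entry
>     >> // point (all longs first, then all doubles)
>     >> static int[] canonicalOrder(String shape) {
>     >>      int[] order = new int[shape.length()];
>     >>      int next = 0;
>     >>      for (int i = 0; i < shape.length(); i++)
>     >>          if (shape.charAt(i) == 'J') order[next++] = i;
>     >>      for (int i = 0; i < shape.length(); i++)
>     >>          if (shape.charAt(i) == 'D') order[next++] = i;
>     >>      return order;
>     >> }
>     >> // e.g. canonicalOrder("JDJDD") == {0, 2, 1, 3, 4}: the call is
>     >> // routed to invokeNative_V_JJDDD with its arguments permuted
>     >> // accordingly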
>     >>
>     >> You can see the performance numbers associated with this
>     >> invocation scheme (which I've dubbed 'direct') in the 4th batch
>     >> of the benchmark results. These numbers are on par with (and
>     >> slightly better than) JNI in all three cases considered, which
>     >> is, I think, a very positive result, given that to write these
>     >> benchmarks I did not have to write a single line of JNI code. In
>     >> other words, this optimization gives you the same speed as JNI,
>     >> with improved ease of use (**).
>     >>
>     >> Now, since the 'direct' optimization builds on top of the VM
>     >> native call adapters, this approach is significantly more robust
>     >> than linkToNative, and I have not run into any weird VM crashes
>     >> when playing with it. The downside is that, for obvious reasons,
>     >> this approach cannot get much faster than JNI - that is, it
>     >> cannot get close to the numbers obtained with the linkToNative
>     >> backend, which features much deeper optimizations. But I think
>     >> that, despite its limitations, it's still a good opportunistic
>     >> improvement that is worth pursuing in the short term (while we
>     >> sort out the linkToNative story). For this reason, I will soon
>     >> be submitting a review which incorporates the changes for the
>     >> 'direct' invocation scheme.
>     >>
>     >> Cheers
>     >> Maurizio
>     >>
>     >> [1] - http://cr.openjdk.java.net/~mcimadamore/panama/foreign-jmh.txt
>     >> [2] - https://github.com/jnr/jnr-ffi
>     >> [3] - https://github.com/jnr/jffi
>     >> [4] - https://sourceware.org/libffi/
>     >> [5] - https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf
>     >> [6] - http://cr.openjdk.java.net/~jrose/panama/native-call-primitive.html
>     >> [7] - http://hg.openjdk.java.net/panama/dev/shortlog/b9ebb1bb8354
>     >>
>     >> (**) the benchmark also contains a 5th row in which I repeated
>     >> the same tests, this time using JNR [2]. JNR is built on top of
>     >> libjffi [3], a JNI library in turn built on top of the popular
>     >> libffi [4]. I wanted to have some numbers about JNR because
>     >> that's another solution that allows for better ease of use,
>     >> taking care of marshalling Java values into C and back; since
>     >> the goals of JNR are similar in spirit to some of the goals of
>     >> the Panama/foreign work, I thought it would be worth having a
>     >> comparison of these approaches. For the record, I think the JNR
>     >> numbers are very respectable, given that JNR had to do all the
>     >> hard work outside of the JDK!
>     >>
>     >>
>     >
>


