[foreign] some JMH benchmarks

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Mon Sep 17 15:58:59 UTC 2018


For the record, here's what I get for all three benchmarks if I 
compile the JNI code with -O3:

Benchmark                          Mode  Cnt         Score         Error  Units
PanamaBenchmark.testJNIExp        thrpt    5  28575269.294 ± 1907726.710  ops/s
PanamaBenchmark.testJNIJavaQsort  thrpt    5    372148.433 ±   27178.529  ops/s
PanamaBenchmark.testJNIPid        thrpt    5  59240069.011 ±  403881.697  ops/s

The first and second benchmarks get faster and very close to the 
'direct' optimization numbers in [1]. Surprisingly, the last benchmark 
(getpid) is quite a bit slower. I've been able to reproduce this across 
multiple runs; for that benchmark, omitting -O3 seems to achieve the best 
results, not sure why. It starts off faster in the first couple of warmup 
iterations, but then it goes slower in all the other runs - presumably it 
interacts badly with the C2-generated code. For instance, this is a run 
with -O3 enabled:

# Run progress: 66.67% complete, ETA 00:01:40
# Fork: 1 of 1
# Warmup Iteration   1: 65182202.653 ops/s
# Warmup Iteration   2: 64900639.094 ops/s
# Warmup Iteration   3: 59314945.437 ops/s  <---------------------------------
# Warmup Iteration   4: 59269007.877 ops/s
# Warmup Iteration   5: 59239905.163 ops/s
Iteration   1: 59300748.074 ops/s
Iteration   2: 59249666.044 ops/s
Iteration   3: 59268597.051 ops/s
Iteration   4: 59322074.572 ops/s
Iteration   5: 59059259.317 ops/s

And this is a run with -O3 disabled:

# Run progress: 0.00% complete, ETA 00:01:40
# Fork: 1 of 1
# Warmup Iteration   1: 55882128.787 ops/s
# Warmup Iteration   2: 53102361.751 ops/s
# Warmup Iteration   3: 66964755.699 ops/s  <---------------------------------
# Warmup Iteration   4: 66414428.355 ops/s
# Warmup Iteration   5: 65328475.276 ops/s
Iteration   1: 64229192.993 ops/s
Iteration   2: 65191719.319 ops/s
Iteration   3: 65352022.471 ops/s
Iteration   4: 65152090.426 ops/s
Iteration   5: 65320545.712 ops/s


In both cases, the 3rd warmup iteration sees a performance jump - with 
-O3 the jump is backwards, without -O3 the jump is forward, which is quite 
typical for a JMH benchmark as C2 optimizations start to kick in.
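
Judging from the output above (1 fork, 5 warmup and 5 measurement
iterations, throughput mode), the runs appear to use a configuration roughly
like the following - the annotation values are inferred from the logs, not
taken from the actual benchmark sources:

import org.openjdk.jmh.annotations.*;

@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@BenchmarkMode(Mode.Throughput)
public class PanamaBenchmark {
    // benchmark methods as quoted further down in this thread
}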

For these reasons, I'm reluctant to update my benchmark numbers to 
reflect the -O3 behavior (although I agree that, since the HotSpot code 
is compiled with that optimization, it would make more sense to use that 
as a reference).

Maurizio

[1] - http://cr.openjdk.java.net/~mcimadamore/panama/foreign-jmh.txt



On 17/09/18 16:18, Maurizio Cimadamore wrote:
>
>
> On 17/09/18 15:08, Samuel Audet wrote:
>> Yes, the blackhole or the random number doesn't make any difference, 
>> but not calling gcc with -O3 does. Running the compiler with 
>> optimizations on is pretty common, but they are not enabled by default.
> A bit better
>
> PanamaBenchmark.testMethod  thrpt    5  28018170.076 ± 8491668.248 ops/s
>
> But not much of a difference (I did not expect much, as the body of 
> the native method is extremely simple).
>
> Maurizio
>>
>> Samuel
>>
>>
>> On Mon, 17 Sep 2018 at 21:37, Maurizio Cimadamore
>> <maurizio.cimadamore at oracle.com>:
>>
>>     Hi Samuel,
>>     I was planning to upload the benchmark IDE project in the near
>>     future (I need to clean it up a bit, so that it can be opened
>>     with ease).
>>
>>     My getpid example is like this; this is the Java decl:
>>
>>     public class GetPid {
>>
>>          static {
>>              System.loadLibrary("getpid");
>>          }
>>
>>          native static long getpid();
>>
>>          native double exp(double base);
>>     }
>>
>>     This is the JNI code:
>>
>>     #include <jni.h>
>>     #include <unistd.h>   /* getpid */
>>     #include <math.h>     /* exp */
>>
>>     JNIEXPORT jlong JNICALL Java_org_panama_GetPid_getpid
>>        (JNIEnv *env, jobject recv) {
>>         return getpid();
>>     }
>>
>>     JNIEXPORT jdouble JNICALL Java_org_panama_GetPid_exp
>>        (JNIEnv *env, jobject recv, jdouble arg) {
>>         return exp(arg);
>>     }
>>
>>     And this is the benchmark:
>>
>>     import org.openjdk.jmh.annotations.Benchmark;
>>
>>     public class PanamaBenchmark {
>>          static GetPid pidlib = new GetPid();
>>
>>          @Benchmark
>>          public long testJNIPid() {
>>              return pidlib.getpid();
>>          }
>>
>>          @Benchmark
>>          public double testJNIExp() {
>>              return pidlib.exp(10d);
>>          }
>>     }
>>
>>
>>     I think this should be rather standard?
>>
>>     I'm on Ubuntu 16.04.1, and using GCC 5.4.0. The command I use to
>>     compile the C lib is this:
>>
>>     gcc -I<path to jni.h> -l<path to jni lib> -shared -o libgetpid.so
>>     -fPIC GetPid.c
>>
>>     One difference I see between our two examples is the use of
>>     Blackhole. In my bench, I'm just returning the result of the call
>>     to 'exp' - which should be equivalent and, actually, preferred, as
>>     described here:
>>
>> http://hg.openjdk.java.net/code-tools/jmh/file/3769055ad883/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_09_Blackholes.java#l51
>>
>>     Another minor difference I see is that I pass a constant argument,
>>     while you generate a random number on each iteration.
>>
>>     I tried to cut and paste your benchmark and I got this:
>>
>>     Benchmark                    Mode  Cnt         Score         Error  Units
>>     PanamaBenchmark.testMethod  thrpt    5  26362701.827 ± 1357012.981  ops/s
>>
>>
>>     Which looks exactly the same as what I've got. So, for whatever
>>     reason, my machine seems to be slower than the one you are using.
>>     For what it's worth, this website [1] seems to confirm the
>>     difference. While clock speeds are similar, your machine has a
>>     higher clock in Turbo Boost mode and it's 3-4 years newer than
>>     mine, so I'd expect that to make a difference in terms of internal
>>     optimizations etc. Note that I'm able to beat the numbers of my
>>     workstation using my laptop, which sports a slightly higher
>>     frequency and only has 2 cores and 8G of RAM.
>>
>>     [1] -
>> https://www.cpubenchmark.net/compare/Intel-Xeon-E5-2673-v4-vs-Intel-Xeon-E5-2665/2888vs1439
>>
>>     Maurizio
>>
>>
>>
>>     On 17/09/18 11:00, Samuel Audet wrote:
>>     > Thanks for the figures Maurizio! It's finally good to be
>>     > speaking in numbers. :)
>>     >
>>     > However, you're not providing a lot of details about how you
>>     > actually ran the experiments. So I've decided to run a JMH
>>     > benchmark on what we get by default with JavaCPP and this
>>     > declaration:
>>     >
>>     >     import org.bytedeco.javacpp.Loader;
>>     >     import org.bytedeco.javacpp.annotation.NoException;
>>     >     import org.bytedeco.javacpp.annotation.Platform;
>>     >     import org.openjdk.jmh.annotations.*;
>>     >     import org.openjdk.jmh.infra.Blackhole;
>>     >
>>     >     @Platform(include = "math.h")
>>     >     public class MyBenchmark {
>>     >         static { Loader.load(); }
>>     >
>>     >         @NoException
>>     >         public static native double exp(double x);
>>     >
>>     >         @State(Scope.Thread)
>>     >         public static class MyState {
>>     >             double x;
>>     >
>>     >             @Setup(Level.Iteration)
>>     >             public void setupMethod() {
>>     >                 x = Math.random();
>>     >             }
>>     >         }
>>     >
>>     >         @Benchmark
>>     >         public void testMethod(MyState s, Blackhole bh) {
>>     >             bh.consume(exp(s.x));
>>     >         }
>>     >     }
>>     >
>>     > The relevant portion of generated JNI looks like this:
>>     >
>>     >     JNIEXPORT jdouble JNICALL Java_org_sample_MyBenchmark_exp(
>>     >             JNIEnv* env, jclass cls, jdouble arg0) {
>>     >         jdouble rarg = 0;
>>     >         double rval = exp(arg0);
>>     >         rarg = (jdouble)rval;
>>     >         return rarg;
>>     >     }
>>     >
>>     > And with access to just 2 virtual cores of an Intel(R) Xeon(R) CPU
>>     > E5-2673 v4 @ 2.30GHz and 8 GB of RAM on the cloud (so probably
>>     > slower than your E5-2665 @ 2.40GHz) running Ubuntu 14.04 with GCC
>>     > 4.9 and OpenJDK 8, I get these numbers:
>>     >
>>     > Benchmark                Mode  Cnt         Score       Error  Units
>>     > MyBenchmark.testMethod  thrpt   25  37183556.094 ± 460795.746  ops/s
>>     >
>>     > I'm not sure how that compares with your numbers exactly, but it
>>     > does seem to me that what you get for JNI is a bit low. If you
>>     > could provide more details about how to reproduce your results,
>>     > that would be great.
>>     >
>>     > Samuel
>>     >
>>     >
>>     > On 09/14/2018 10:19 PM, Maurizio Cimadamore wrote:
>>     >> Hi,
>>     >> over the last few weeks I've been busy playing with Panama and
>>     >> assessing performances with JMH. For those just interested in raw
>>     >> numbers, the results of my explorations can be found here [1]. But
>>     >> as with all benchmarks, I think it's better to spend a few words to
>>     >> understand what these numbers actually _mean_.
>>     >>
>>     >> To evaluate the performances of Panama I have first created a
>>     >> baseline using JNI - more specifically, I wanted to assess the
>>     >> performances of three calls (all part of the C standard library),
>>     >> namely `getpid`, `exp` and `qsort`.
>>     >>
>>     >> The first example is the de facto benchmark for FFIs - since it
>>     >> does relatively little computation, it is a good test to measure
>>     >> the 'latency' of the FFI approach (e.g. how long does it take to go
>>     >> to native). The second example is also relatively simple, but this
>>     >> time the function takes a double argument. The third test is akin
>>     >> to an FFI torture test, since not only does it pass substantially
>>     >> more arguments (4), but one of these arguments is also a callback -
>>     >> a pointer to a function that is used to sort the contents of the
>>     >> input array.
>>     >>
>>     >> As expected, the first batch of JNI results confirms our
>>     >> expectations: `getpid` is the fastest, followed by `exp`, and then
>>     >> followed by `qsort`. Note that qsort is not even close in terms of
>>     >> raw numbers to the other two tests - that's because, to sort the
>>     >> array, we need to do (N * log N) upcalls into Java. In the
>>     >> benchmark, N = 8 and we do the upcalls using the JNI function
>>     >> JNIEnv::CallIntMethod.
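>>     >>
>>     >> For illustration, the Java side of that JNI qsort binding can be
>>     >> pictured roughly like this (hypothetical names, not the actual
>>     >> benchmark sources); the native implementation sorts the array with
>>     >> qsort(3) and invokes compareInts via CallIntMethod for every
>>     >> comparison, which is where the upcall cost comes from:
>>     >>
>>     >> public class QsortLib {
>>     >>     static { System.loadLibrary("qsort"); }
>>     >>
>>     >>     // upcall target, invoked from the native comparator
>>     >>     int compareInts(int a, int b) {
>>     >>         return Integer.compare(a, b);
>>     >>     }
>>     >>
>>     >>     // native wrapper around qsort(3), sorting 'arr' via compareInts
>>     >>     native void sort(int[] arr);
>>     >> }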
>>     >>
>>     >> Now let's examine the second batch of results; these call
>>     >> `getpid`, `exp` and `qsort` using Panama. The numbers here are
>>     >> considerably lower than the JNI ones for all three benchmarks -
>>     >> although the first two seem to be the most problematic. To explain
>>     >> these results we need to peek under the hood. Panama implements
>>     >> foreign calls through a so-called 'universal adapter' which, given
>>     >> a calling scheme and a bunch of arguments (machine words), shuffles
>>     >> these arguments into the right registers/stack slots and then jumps
>>     >> to the target native function - after which another round of
>>     >> adaptation must be performed (e.g. to recover the return value from
>>     >> the right register/memory location).
>>     >>
>>     >> Needless to say, all this generality comes at a cost - some of the
>>     >> cost is in Java - e.g. all arguments have to be packaged up into a
>>     >> long array (although this component doesn't seem to show up much in
>>     >> the generated JVM compiled code). A lot of the cost is in the
>>     >> adapter logic itself - which has to look at the 'call recipe' and
>>     >> move arguments around accordingly - more specifically, in order to
>>     >> call the native function, the adapter creates a bunch of helper C++
>>     >> objects and structs which model the CPU state (e.g. in the
>>     >> ShuffleDowncallContext struct, we find a field for each register to
>>     >> be modeled in the target architecture). The adapter has to first
>>     >> move the values coming from the Java world (stored in the
>>     >> aforementioned long array) into the right context fields (and it
>>     >> needs to do so by looking at the recipe, which involves iteration
>>     >> over the recipe elements). After that's done, we can jump into the
>>     >> assembly stub that does the native call - this stub will take as
>>     >> input one of those ShuffleDowncallContext structs and will load the
>>     >> corresponding registers/create the necessary stack slots ahead of
>>     >> the call.
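>>     >>
>>     >> Conceptually, the recipe-driven shuffle boils down to something
>>     >> like this (a sketch with made-up names and types, not the actual
>>     >> Panama code):
>>     >>
>>     >> // Walk the call recipe and copy each Java-supplied word into the
>>     >> // context slot modelling its target register or stack slot, then
>>     >> // jump into the assembly stub with the populated context.
>>     >> static void shuffleAndInvoke(long[] javaArgs, Recipe recipe,
>>     >>                              ShuffleDowncallContext ctx, long addr) {
>>     >>     for (int i = 0; i < javaArgs.length; i++) {
>>     >>         RecipeElement e = recipe.elementFor(i);
>>     >>         switch (e.kind()) {
>>     >>             case INTEGER_REGISTER: ctx.intRegs[e.index()]    = javaArgs[i]; break;
>>     >>             case VECTOR_REGISTER:  ctx.vectorRegs[e.index()] = javaArgs[i]; break;
>>     >>             default:               ctx.stackSlots[e.index()] = javaArgs[i]; break;
>>     >>         }
>>     >>     }
>>     >>     invokeNativeStub(ctx, addr);   // assembly stub loads registers and calls addr
>>     >> }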
>>     >>
>>     >> As you can see, there's quite a lot of action going on here, and
>>     >> this explains the benchmark numbers; of course, if you are calling
>>     >> a native function that does a lot of computation, this adaptation
>>     >> cost will wash out - but for relatively quick calls such as
>>     >> 'getpid' and 'exp' the latency dominates the picture.
>>     >>
>>     >> Digression: the callback logic suffers from pretty much the same
>>     >> issues, albeit in reversed order - this time it's the Java code
>>     >> which receives a 'snapshot' of the register values from a generated
>>     >> assembly adapter; the Java code can then read such values (using
>>     >> the Pointer API), turn them into Java objects, call the target Java
>>     >> method and store the results (after another conversion) in the
>>     >> right location of the snapshot. The assembly adapter will then pick
>>     >> up the values set onto the snapshot by the Java code, store them
>>     >> into the corresponding registers and return control to the native
>>     >> callee. In the remainder of this email we will not discuss
>>     >> callbacks in detail - we will just posit that for any optimization
>>     >> technique that can be defined, there exists a _dual_ strategy that
>>     >> works with callbacks.
>>     >>
>>     >> How can we make sensible native calls go faster? Well, one obvious
>>     >> way would be to optimize the universal adapter so that we get a
>>     >> specialized assembly stub for each code shape. If we do that, we
>>     >> can move pretty much all of the computation described above from
>>     >> execution time to stub generation time, so that, by the time we
>>     >> have to call the native function, we just have to populate the
>>     >> right registers (the specialized stub knows where to find them) and
>>     >> jump. While this sounds like a good approach, it feels like there's
>>     >> also a move for the JIT somewhere in there - after all, the JVM
>>     >> knows which calls are hot and in need of optimization, so perhaps
>>     >> this specialization process (some or all of it) could happen
>>     >> dynamically. And this is indeed an approach we'd like to aim for in
>>     >> the long run.
>>     >>
>>     >> Now, a few years ago, Vlad put together a patch which now lives in
>>     >> the 'linkToNative' branch [6, 7] - the goal of this patch is to
>>     >> implement the approach described above: generate a specialized
>>     >> assembly adapter for a given native signature, and then leverage
>>     >> the JIT to optimize it away, turning the adapter into a bare,
>>     >> direct native method call. As you can see from the third batch of
>>     >> benchmarks, if we tweak Panama to use the linkToNative machinery,
>>     >> the speedup is really impressive, and we end up being much faster
>>     >> than JNI (up to 4x for getPid).
>>     >>
>>     >> Unfortunately, the technology in the linkToNative branch is not
>>     >> ready for prime time (yet) - first, it doesn't cover some useful
>>     >> cases (e.g. varargs, multiple returns via registers, arguments
>>     >> passed in memory). That is, the logic assumes there's a 1-1 mapping
>>     >> between a Java signature and the native function to be called - and
>>     >> that the arguments passed from Java will either be longs or
>>     >> doubles. While we can work around this limitation and define the
>>     >> necessary marshalling logic in Java (as I have done to run this
>>     >> benchmark), some of the limitations (multiple returns, structs
>>     >> passed by value which are too big) cannot simply be worked around.
>>     >> But that's fine, we can still have a fast path for those calls
>>     >> which have certain characteristics and a slow path (through the
>>     >> universal adapter) for all the other calls.
>>     >>
>>     >> But there's a second and more serious issue lurking: as you can
>>     >> see in the benchmark, I was not able to get the qsort benchmark
>>     >> running when using the linkToNative backend. The reason is that the
>>     >> linkToNative code is still pretty raw, and it doesn't fully adhere
>>     >> to the JVM internal conventions - e.g. there are missing thread
>>     >> state transitions which, in the case of upcalls into Java, create
>>     >> issues when it comes to garbage collection, as the GC cannot parse
>>     >> the native stack in the correct way.
>>     >>
>>     >> This means that, while there's a clear shining path ahead of us,
>>     >> it is simply too early to just use the linkToNative backend from
>>     >> Panama. For this reason, I've been looking into some kind of
>>     >> stopgap solution - another way of optimizing native calls (and
>>     >> upcalls into Java) that doesn't require too much VM magic. Now, a
>>     >> crucial observation is that, in many native calls, there is indeed
>>     >> a 1-1 mapping between Java arguments and native arguments (and
>>     >> back, for return values). That is, we can think of calling a native
>>     >> function as a process that takes a bunch of Java arguments, turns
>>     >> them into native arguments (either doubles or longs), calls the
>>     >> native method and then turns the result back into a Java value.
>>     >>
>>     >> The mapping between Java arguments and C values is quite simple
>>     >> (see the sketch after this list):
>>     >>
>>     >> * primitives: either long or double, depending on whether they
>>     >> describe an integral value or a floating point one
>>     >> * pointers: they convert to a long
>>     >> * callbacks: they also convert to a long
>>     >> * structs: they are recursively decomposed into fields and each
>>     >> field is marshalled separately (assuming the struct is not too big,
>>     >> in which case it is passed in memory)
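>>     >>
>>     >> For illustration only (the Pointer/Callback accessors named here
>>     >> are hypothetical, not actual Panama API), lowering a single non-FP
>>     >> argument can be pictured as:
>>     >>
>>     >> // Integral primitives, pointers and callbacks all end up as one
>>     >> // machine word; FP primitives are instead passed through as
>>     >> // doubles, and structs are decomposed into their fields first.
>>     >> static long lowerToWord(Object arg) {
>>     >>     if (arg instanceof Long)     return (Long) arg;
>>     >>     if (arg instanceof Pointer)  return ((Pointer<?>) arg).addr();
>>     >>     if (arg instanceof Callback) return stubAddressFor((Callback) arg);
>>     >>     throw new IllegalArgumentException("unexpected carrier: " + arg);
>>     >> }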
>>     >>
>>     >> So, in principle, we could define a bunch of native entry points
>>     >> in the VM, one per shape, which take a bunch of longs and doubles
>>     >> and call an underlying function with those arguments. For instance,
>>     >> let's consider the case of a native function which is modelled in
>>     >> Java as:
>>     >>
>>     >> int m(Pointer<Foo>, double)
>>     >>
>>     >> To call this native function we have to first turn the Java
>>     >> arguments into a (long, double) pair. Then we need to call a native
>>     >> adapter that looks like the following:
>>     >>
>>     >> jlong NI_invokeNative_J_JD(JNIEnv *env, jobject _unused,
>>     >>                            jlong addr, jlong arg0, jdouble arg1) {
>>     >>     return ((jlong (*)(jlong, jdouble))addr)(arg0, arg1);
>>     >> }
>>     >>
>>     >> And this will take care of calling the native function and
>>     >> returning the value back. This is, admittedly, a very simple
>>     >> solution; of course there are limitations: we have to define a
>>     >> bunch of specialized native entry points (and Java entry points,
>>     >> for callbacks). But here we can play a trick: most modern ABIs pass
>>     >> arguments in registers; for instance, the System V ABI [5] uses up
>>     >> to 6 (!!) integer registers and 7 (!!) MMXr registers for FP values
>>     >> - this gives us a total of 13 registers available for argument
>>     >> passing, which covers quite a lot of cases. Now, if we have a call
>>     >> where _all_ arguments are passed in registers, then the order in
>>     >> which these arguments are declared in the adapter doesn't matter!
>>     >> That is, since FP values will always be passed in different
>>     >> registers from integral values, we can just define entry points
>>     >> which look like these:
>>     >>
>>     >> invokeNative_V_DDDDD
>>     >> invokeNative_V_JDDDD
>>     >> invokeNative_V_JJDDD
>>     >> invokeNative_V_JJJDD
>>     >> invokeNative_V_JJJJD
>>     >> invokeNative_V_JJJJJ
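>>     >>
>>     >> On the Java side, each such entry point would presumably surface
>>     >> as a native method taking the target address followed by the
>>     >> canonically ordered words - illustrative declarations only,
>>     >> mirroring the C adapter shown above for the J_JD shape:
>>     >>
>>     >> static native long invokeNative_J_JD(long addr, long arg0, double arg1);
>>     >> static native void invokeNative_V_JJDDD(long addr, long j0, long j1,
>>     >>                                         double d0, double d1, double d2);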
>>     >>
>>     >> That is, for a given arity (5 in this case), we can just put all
>>     >> the long arguments in front, and the double arguments after that.
>>     >> That is, we don't need to generate all possible permutations of J/D
>>     >> in all positions - as the adapter will always do the same thing
>>     >> (read: load from the same registers) for all equivalent
>>     >> combinations. This keeps the number of entry points in check - and
>>     >> it also poses some challenges to the Java logic in charge of
>>     >> marshalling/unmarshalling, as there's an extra permutation step
>>     >> involved (see the sketch below), although that is not something
>>     >> super-hard to address.
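>>     >>
>>     >> The permutation step itself is straightforward - roughly something
>>     >> like this (made-up helper, not the actual implementation): given
>>     >> the lowered argument words and a flag saying which of them are FP
>>     >> values, produce the canonical 'longs first, doubles after' order
>>     >> expected by the shared entry point:
>>     >>
>>     >> static long[] canonicalOrder(long[] words, boolean[] isFp) {
>>     >>     long[] out = new long[words.length];
>>     >>     int next = 0;
>>     >>     for (int i = 0; i < words.length; i++)   // integral arguments first
>>     >>         if (!isFp[i]) out[next++] = words[i];
>>     >>     for (int i = 0; i < words.length; i++)   // FP arguments after
>>     >>         if (isFp[i]) out[next++] = words[i];
>>     >>     return out;
>>     >> }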
>>     >>
>>     >> You can see the performance numbers associated with this
>>     >> invocation scheme (which I've dubbed 'direct') in the 4th batch of
>>     >> the benchmark results. These numbers are on par with (and slightly
>>     >> better than) JNI in all three cases considered, which is, I think,
>>     >> a very positive result, given that to write these benchmarks I did
>>     >> not have to write a single line of JNI code. In other words, this
>>     >> optimization gives you the same speed as JNI, with improved ease of
>>     >> use (**).
>>     >>
>>     >> Now, since the 'direct' optimization builds on top of the VM
>>     >> native call adapters, this approach is significantly more robust
>>     >> than linkToNative, and I have not run into any weird VM crashes
>>     >> when playing with it. The downside is that, for obvious reasons,
>>     >> this approach cannot get much faster than JNI - that is, it cannot
>>     >> get close to the numbers obtained with the linkToNative backend,
>>     >> which features much deeper optimizations. But I think that, despite
>>     >> its limitations, it's still a good opportunistic improvement that
>>     >> is worth pursuing in the short term (while we sort out the
>>     >> linkToNative story). For this reason, I will soon be submitting a
>>     >> review which incorporates the changes for the 'direct' invocation
>>     >> schemes.
>>     >>
>>     >> Cheers
>>     >> Maurizio
>>     >>
>>     >> [1] - http://cr.openjdk.java.net/~mcimadamore/panama/foreign-jmh.txt
>>     >> [2] - https://github.com/jnr/jnr-ffi
>>     >> [3] - https://github.com/jnr/jffi
>>     >> [4] - https://sourceware.org/libffi/
>>     >> [5] - https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf
>>     >> [6] - http://cr.openjdk.java.net/~jrose/panama/native-call-primitive.html
>>     >> [7] - http://hg.openjdk.java.net/panama/dev/shortlog/b9ebb1bb8354
>>     >>
>>     >> (**) the benchmark also contains a 5th row in which I repeated the
>>     >> same tests, this time using JNR [2]. JNR is built on top of libjffi
>>     >> [3], a JNI library in turn built on top of the popular libffi [4].
>>     >> I wanted to have some numbers about JNR because that's another
>>     >> solution that allows for better ease of use, taking care of
>>     >> marshalling Java values into C and back; since the goals of JNR are
>>     >> similar in spirit to some of the goals of the Panama/foreign work,
>>     >> I thought it would be worth having a comparison of these
>>     >> approaches. For the record, I think the JNR numbers are very
>>     >> respectable given that JNR had to do all the hard work outside of
>>     >> the JDK!
>>     >>
>>     >>
>>     >
>>
>


