[foreign] some JMH benchmarks

Samuel Audet samuel.audet at gmail.com
Mon Sep 17 14:08:27 UTC 2018


Yes, neither the blackhole nor the random number makes any difference,
but not calling gcc with -O3 does. Compiling with optimizations enabled
is pretty common, but gcc does not enable them by default.
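
For reference, enabling optimizations is just a matter of adding the
flag to the compile command quoted below, for example:

gcc -O3 -I<path to jni.h> -L<path to jni lib> -shared -o libgetpid.so -fPIC GetPid.c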

Samuel


On Mon, Sep 17, 2018 at 21:37 Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:

> Hi Samuel,
> I was planning to upload the benchmark IDE project in the near future
> (I need to clean it up a bit, so that it can be opened with ease).
>
> My getpid example is like this; this is the Java decl:
>
> package org.panama;
>
> public class GetPid {
>
>      static {
>          System.loadLibrary("getpid");
>      }
>
>      native long getpid();
>
>      native double exp(double base);
> }
>
> This is the JNI code:
>
> #include <jni.h>
> #include <unistd.h>   /* for getpid */
> #include <math.h>     /* for exp */
>
> JNIEXPORT jlong JNICALL Java_org_panama_GetPid_getpid
>    (JNIEnv *env, jobject recv) {
>     return getpid();
> }
>
> JNIEXPORT jdouble JNICALL Java_org_panama_GetPid_exp
>    (JNIEnv *env, jobject recv, jdouble arg) {
>     return exp(arg);
> }
>
> And this is the benchmark:
>
> public class PanamaBenchmark {
>      static GetPid pidlib = new GetPid();
>
>      @Benchmark
>      public long testJNIPid() {
>          return pidlib.getpid();
>      }
>
>      @Benchmark
>      public double testJNIExp() {
>          return pidlib.exp(10d);
>      }
> }
>
>
> I think this should be rather standard?
>
> I'm on Ubuntu 16.04.1, and using GCC 5.4.0. The command I use to compile
> the C lib is this:
>
> gcc -I<path to jni.h> -L<path to jni lib> -shared -o libgetpid.so -fPIC
> GetPid.c
>
> One difference I see between our two examples is the use of Blackhole.
> In my bench, I'm just returning the result of the call to 'exp' - which
> should be equivalent and, actually, preferred, as described here:
>
>
> http://hg.openjdk.java.net/code-tools/jmh/file/3769055ad883/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_09_Blackholes.java#l51
>
> Another minor difference I see is that I pass a constant argument, while
> you generate a random number on each iteration.
>
> I tried to cut and paste your benchmark and I got this:
>
> Benchmark                    Mode  Cnt         Score         Error  Units
> PanamaBenchmark.testMethod  thrpt    5  26362701.827 ± 1357012.981  ops/s
>
>
> Which looks exactly the same as what I got. So, for whatever reason,
> my machine seems to be slower than the one you are using. For what
> it's worth, this website [1] seems to confirm the difference: while
> base clock speeds are similar, your machine has a higher Turbo Boost
> frequency and is 3-4 years newer than mine, so I'd expect that to make
> a difference in terms of internal optimizations etc. Note that I'm
> able to beat the numbers of my workstation using my laptop, which
> sports a slightly higher frequency and only has 2 cores and 8GB of RAM.
>
> [1] -
>
> https://www.cpubenchmark.net/compare/Intel-Xeon-E5-2673-v4-vs-Intel-Xeon-E5-2665/2888vs1439
>
> Maurizio
>
>
>
> On 17/09/18 11:00, Samuel Audet wrote:
> > Thanks for the figures, Maurizio! It's good to finally be speaking
> > in numbers. :)
> >
> > However, you're not providing a lot of details about how you actually
> > ran the experiments. So I've decided to run a JMH benchmark on what we
> > get by default with JavaCPP and this declaration:
> >
> >     import org.bytedeco.javacpp.Loader;
> >     import org.bytedeco.javacpp.annotation.*;
> >     import org.openjdk.jmh.annotations.*;
> >     import org.openjdk.jmh.infra.Blackhole;
> >
> >     @Platform(include = "math.h")
> >     public class MyBenchmark {
> >         static { Loader.load(); }
> >
> >         @NoException
> >         public static native double exp(double x);
> >
> >         @State(Scope.Thread)
> >         public static class MyState {
> >             double x;
> >
> >             @Setup(Level.Iteration)
> >             public void setupMethod() {
> >                 x = Math.random();
> >             }
> >         }
> >
> >         @Benchmark
> >         public void testMethod(MyState s, Blackhole bh) {
> >             bh.consume(exp(s.x));
> >         }
> >     }
> >
> > The relevant portion of generated JNI looks like this:
> >
> >     JNIEXPORT jdouble JNICALL Java_org_sample_MyBenchmark_exp(JNIEnv*
> > env, jclass cls, jdouble arg0) {
> >         jdouble rarg = 0;
> >         double rval = exp(arg0);
> >         rarg = (jdouble)rval;
> >         return rarg;
> >     }
> >
> > And with access to just 2 virtual cores of an Intel(R) Xeon(R) CPU
> > E5-2673 v4 @ 2.30GHz and 8 GB of RAM on the cloud (so probably slower
> > than your E5-2665 @ 2.40GHz) running Ubuntu 14.04 with GCC 4.9 and
> > OpenJDK 8, I get these numbers:
> > Benchmark                Mode  Cnt         Score        Error  Units
> > MyBenchmark.testMethod  thrpt   25  37183556.094 ± 460795.746  ops/s
> >
> > I'm not sure how that compares with your numbers exactly, but it does
> > seem to me that what you get for JNI is a bit low. If you could
> > provide more details about how to reproduce your results, that would
> > be great.
> >
> > Samuel
> >
> >
> > On 09/14/2018 10:19 PM, Maurizio Cimadamore wrote:
> >> Hi,
> >> over the last few weeks I've been busy playing with Panama and
> >> assessing its performance with JMH. For those just interested in raw
> >> numbers, the results of my explorations can be found here [1]. But as
> >> with all benchmarks, I think it's worth spending a few words to
> >> understand what these numbers actually _mean_.
> >>
> >> To evaluate the performance of Panama I first created a baseline
> >> using JNI - more specifically, I wanted to assess the performance of
> >> three calls (all part of the C standard library), namely `getpid`,
> >> `exp` and `qsort`.
> >>
> >> The first example is the de facto benchmark for FFIs - since it does
> >> relatively little computation, it is a good test to measure the
> >> 'latency' of the FFI approach (i.e. how long it takes to get to
> >> native code). The second example is also relatively simple, but this
> >> time the function takes a double argument. The third test is akin to
> >> an FFI torture test, since not only does it pass substantially more
> >> arguments (4), but one of these arguments is also a callback - a
> >> pointer to a function that is used to sort the contents of the input
> >> array.
> >>
> >> The first batch of JNI results confirms our expectations: `getpid`
> >> is the fastest, followed by `exp`, and then by `qsort`. Note that
> >> qsort is not even close in terms of raw numbers to the other two
> >> tests - that's because, to sort the array, we need to do on the
> >> order of N*log(N) upcalls into Java. In the benchmark, N = 8, and we
> >> do the upcalls using the JNI function JNIEnv::CallIntMethod.
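> >>
> >> To make the shape of this test concrete, here is a minimal sketch of
> >> the Java side of such a benchmark - all names here (QsortLib,
> >> IntComparator) are illustrative, not the actual benchmark code:
> >>
> >> public class QsortLib {
> >>     static { System.loadLibrary("qsortlib"); }  // hypothetical lib name
> >>
> >>     // Implemented by a lambda below; the C side invokes compare() via
> >>     // JNIEnv::CallIntMethod, once per comparison, i.e. on the order
> >>     // of N*log(N) times per sort.
> >>     public interface IntComparator {
> >>         int compare(int a, int b);
> >>     }
> >>
> >>     // Native wrapper around the C qsort, passing a comparator upcall.
> >>     static native void qsort(int[] data, IntComparator cmp);
> >>
> >>     public static void main(String[] args) {
> >>         int[] data = {5, 3, 8, 1, 9, 2, 7, 4};  // N = 8, as in the benchmark
> >>         qsort(data, (a, b) -> Integer.compare(a, b));
> >>     }
> >> }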
> >>
> >> Now let's examine the second batch of results; these call `getpid`,
> >> `exp` and `qsort` using Panama. The numbers here are considerably
> >> lower than the JNI ones for all three benchmarks - although the
> >> first two seem to be the most problematic. To explain these results
> >> we need to peek under the hood. Panama implements foreign calls
> >> through a so-called 'universal adapter' which, given a calling scheme
> >> and a bunch of arguments (machine words), shuffles these arguments
> >> into the right registers/stack slots and then jumps to the target
> >> native function - after which another round of adaptation must be
> >> performed (e.g. to recover the return value from the right
> >> register/memory location).
> >>
> >> Needless to say, all this generality comes at a cost - some of the
> >> cost is in Java - e.g. all arguments have to be packaged up into a
> >> long array (although this component doesn't seem to show up much in
> >> the generated JVM compiled code). A lot of the cost is in the adapter
> >> logic itself - which has to look at the 'call recipe' and move
> >> arguments around accordingly - more specifically, in order to call
> >> the native function, the adapter creates a bunch of helper C++
> >> objects and structs which model the CPU state (e.g. in the
> >> ShuffleDowncallContext struct, we find a field for each register to
> >> be modeled in the target architecture). The adapter has to first move
> >> the values coming from the Java world (stored in the aforementioned
> >> long array) into the right context fields (and it needs to do so by
> >> looking at the recipe, which involves iteration over the recipe
> >> elements). After that's done, we can jump into the assembly stub that
> >> does the native call - this stub will take as input one of those
> >> ShuffleDowncallContext structures and will load the corresponding
> >> registers/create the necessary stack slots ahead of the call.
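> >>
> >> To make this a bit more concrete, here is a toy Java model of the
> >> flow just described (the real implementation lives in C++ and
> >> assembly - every name and register choice below is illustrative):
> >>
> >> // Toy model of the universal adapter - illustrative only.
> >> class ShuffleContextModel {
> >>     long rdi, rsi, rdx;    // a few SysV integer argument registers
> >>     double xmm0, xmm1;     // a few SysV FP argument registers
> >> }
> >>
> >> class UniversalAdapterModel {
> >>     enum RecipeOp { TO_RDI, TO_RSI, TO_RDX, TO_XMM0, TO_XMM1 }
> >>
> >>     // Interprets the 'call recipe': moves each machine word from the
> >>     // Java-provided long array into the matching context field; an
> >>     // assembly stub would then load the real registers from the
> >>     // context before jumping to the native function.
> >>     static ShuffleContextModel shuffle(long[] args, RecipeOp[] recipe) {
> >>         ShuffleContextModel ctx = new ShuffleContextModel();
> >>         for (int i = 0; i < recipe.length; i++) {
> >>             switch (recipe[i]) {
> >>                 case TO_RDI:  ctx.rdi  = args[i]; break;
> >>                 case TO_RSI:  ctx.rsi  = args[i]; break;
> >>                 case TO_RDX:  ctx.rdx  = args[i]; break;
> >>                 case TO_XMM0: ctx.xmm0 = Double.longBitsToDouble(args[i]); break;
> >>                 case TO_XMM1: ctx.xmm1 = Double.longBitsToDouble(args[i]); break;
> >>             }
> >>         }
> >>         return ctx;
> >>     }
> >> }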
> >>
> >> As you can see, there's quite a lot of action going on here, and this
> >> explains the benchmark numbers; of course, if you are calling a
> >> native function that does a lot of computation, this adaptation cost
> >> will wash out - but for relatively quick calls such as 'getpid' and
> >> 'exp' the latency dominates the picture.
> >>
> >> Digression: the callback logic suffers pretty much from the same
> >> issues, albeit in a reversed order - this time it's the Java code
> >> which receives a 'snapshot' of the register values from a generated
> >> assembly adapter; the Java code can then read such values (using the
> >> Pointer API), turn them into Java objects, call the target Java
> >> method and store the results (after another conversion) in the right
> >> location of the snapshot. The assembly adapter will then pick up the
> >> values set onto the snapshot by the Java code, store them into the
> >> corresponding registers and return control to the native callee. In
> >> the remainder of this email we will not discuss callbacks in detail
> >> - we will just posit that, for any optimization technique that can
> >> be defined, there exists a _dual_ strategy that works for callbacks.
> >>
> >> How can we make sensible native calls go faster? Well, one obvious
> >> way would be to optimize the universal adapter so that we get a
> >> specialized assembly stub for each code shape. If we do that, we can
> >> move pretty much all of the computation described above from
> >> execution time to the stub generation time, so that, by the time we
> >> have to call the native function, we just have to populate the right
> >> registers (the specialized stub knows where to find them) and jump.
> >> While this sounds like a good approach, it feels like there's also a
> >> role for the JIT somewhere in there - after all, the JVM knows which
> >> calls are hot and in need of optimization, so perhaps this specialization
> >> process (some or all of it) could happen dynamically. And this is
> >> indeed an approach we'd like to aim for in the long run.
> >>
> >> Now, a few years ago, Vlad put together a patch which now lives in the
> >> 'linkToNative' branch [6, 7] - the goal of this patch is to implement
> >> the approach described above: generate a specialized assembly adapter
> >> for a given native signature, and then leverage the JIT to optimize
> >> it away, turning the adapter into a bare, direct, native method call.
> >> As you can see from the third batch of benchmarks, if we tweak Panama
> >> to use the linkToNative machinery, the speed up is really impressive,
> >> and we end up being much faster than JNI (up to 4x for getPid).
> >>
> >> Unfortunately, the technology in the linkToNative branch is not ready
> >> for prime time (yet) - first, it doesn't cover some useful cases
> >> (e.g. varargs, multiple returns via registers, arguments passed in
> >> memory). That is, the logic assumes there's a 1-1 mapping between a
> >> Java signature and the native function to be called - and that the
> >> arguments passed from Java will either be longs or doubles. While we
> >> can work around this limitation and define the necessary marshalling
> >> logic in Java (as I have done to run this benchmark), some of the
> >> limitations (multiple returns, structs passed by value which are too
> >> big) simply cannot be worked around. But that's fine: we can still
> >> have a fast path for those calls which have certain characteristics
> >> and a slow path (through the universal adapter) for all other calls.
> >>
> >> But there's a second and more serious issue lurking: as you can see
> >> in the benchmark, I was not able to get the qsort benchmark running
> >> when using the linkToNative backend. The reason is that the
> >> linkToNative code is still pretty raw, and it doesn't fully adhere to
> >> the JVM internal conventions - e.g. there are missing thread state
> >> transitions which, in the case of upcalls into Java, create issues
> >> when it comes to garbage collection, as the GC cannot parse the
> >> native stack in the correct way.
> >>
> >> This means that, while there's a clear shining path ahead of us, it
> >> is simply too early to just use the linkToNative backend from Panama.
> >> For this reason, I've been looking into some kind of stopgap solution
> >> - another way of optimizing native calls (and upcalls into Java) that
> >> doesn't require too much VM magic. Now, a crucial observation is
> >> that, in many native calls, there is indeed a 1-1 mapping between
> >> Java arguments and native arguments (and back, for return values).
> >> That is, we can think of calling a native function as a process that
> >> takes a bunch of Java arguments, turns them into native arguments
> >> (either doubles or longs), calls the native method and then turns
> >> the result back into a Java value.
> >>
> >> The mapping between Java arguments and C values is quite simple:
> >>
> >> * primitives: either long or double, depending on whether they
> >> describe an integral value or a floating point one.
> >> * pointers: they convert to a long
> >> * callbacks: they also convert to a long
> >> * structs: they are recursively decomposed into fields and each field
> >> is marshalled separately (assuming the struct is not too big, in
> >> which case it is passed in memory)
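> >>
> >> A minimal Java sketch of this lowering step might look as follows
> >> (all types and names here are illustrative, not the real API):
> >>
> >> class LoweringSketch {
> >>     interface Pointer { long addr(); }            // illustrative
> >>     interface Callback { long stubAddress(); }    // illustrative
> >>
> >>     static Object lower(Object arg) {
> >>         if (arg instanceof Long || arg instanceof Integer) {
> >>             return ((Number) arg).longValue();      // integral -> long
> >>         } else if (arg instanceof Double || arg instanceof Float) {
> >>             return ((Number) arg).doubleValue();    // FP -> double
> >>         } else if (arg instanceof Pointer) {
> >>             return ((Pointer) arg).addr();          // pointer -> address
> >>         } else if (arg instanceof Callback) {
> >>             return ((Callback) arg).stubAddress();  // callback -> stub addr
> >>         }
> >>         // structs: recursively lower each field instead (not shown)
> >>         throw new UnsupportedOperationException("struct decomposition");
> >>     }
> >> }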
> >>
> >> So, in principle, we could define a bunch of native entry points in
> >> the VM, one per shape, which take a bunch of longs and doubles and
> >> call an underlying function with those arguments. For instance, let's
> >> consider the case of a native function which is modelled in Java as:
> >>
> >> int m(Pointer<Foo>, double)
> >>
> >> To call this native function we have to first turn the Java arguments
> >> into a (long, double) pair. Then we need to call a native adapter
> >> that looks like the following:
> >>
> >> jlong NI_invokeNative_J_JD(JNIEnv *env, jobject _unused, jlong addr,
> >> jlong arg0, jdouble arg1) {
> >>      return ((jlong (*)(jlong, jdouble))addr)(arg0, arg1);
> >> }
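> >>
> >> On the Java side, a call to m(fooPointer, 0.5) would then boil down
> >> to something like the following (names are hypothetical):
> >>
> >> // Hypothetical Java-side view of the call above; 'mAddr' is the
> >> // address of the native function m, obtained via a symbol lookup.
> >> long raw = NI_invokeNative_J_JD(mAddr, fooPointer.addr(), 0.5d);
> >> int result = (int) raw;  // narrow back to the declared Java return type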
> >>
> >> And this will take care of calling the native function and returning
> >> the value back. This is, admittedly, a very simple solution; of
> >> course there are limitations: we have to define a bunch of
> >> specialized native entry points (and Java entry points, for
> >> callbacks). But here we can play a trick: most modern ABIs pass
> >> arguments in registers; for instance the System V ABI [5] uses up to
> >> 6 (!!) integer registers and 8 (!!) XMM registers for FP values -
> >> this gives us a total of 14 registers available for argument passing,
> >> which covers quite a lot of cases. Now, if we have a call where _all_
> >> arguments are passed in registers, then the order in which these
> >> arguments are declared in the adapter doesn't matter! That is, since
> >> FP values will always be passed in different registers from integral
> >> values, we can just define entry points which look like these:
> >>
> >> invokeNative_V_DDDDD
> >> invokeNative_V_JDDDD
> >> invokeNative_V_JJDDD
> >> invokeNative_V_JJJDD
> >> invokeNative_V_JJJJD
> >> invokeNative_V_JJJJJ
> >>
> >> That is, for a given arity (5 in this case), we can just put all the
> >> long arguments in front, and the double arguments after that. This
> >> means we don't need to generate all possible permutations of J/D in
> >> all positions - the adapter will always do the same thing (read: load
> >> from the same registers) for all equivalent combinations. This keeps
> >> the number of entry points in check - though it also poses some
> >> challenges for the Java logic in charge of marshalling/unmarshalling,
> >> as there's an extra permutation step involved (although, as sketched
> >> below, that is not something super-hard to address).
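> >>
> >> For instance, a minimal sketch of that permutation step, operating on
> >> already-lowered arguments (each either a Long or a Double), could
> >> look like this (illustrative, not the real code):
> >>
> >> // Moves all long arguments in front of all double arguments,
> >> // preserving their relative order, so the resulting list matches an
> >> // invokeNative_V_J...D... entry point.
> >> static Object[] permute(Object[] lowered) {
> >>     Object[] out = new Object[lowered.length];
> >>     int next = 0;
> >>     for (Object a : lowered)             // first pass: integral (J) args
> >>         if (a instanceof Long) out[next++] = a;
> >>     for (Object a : lowered)             // second pass: FP (D) args
> >>         if (a instanceof Double) out[next++] = a;
> >>     return out;
> >> }
> >>
> >> E.g. arguments lowered in source order as (J, D, J, D, D) come out as
> >> (J, J, D, D, D), matching the invokeNative_V_JJDDD entry point above.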
> >>
> >> You can see the performance numbers associated with this invocation
> >> scheme (which I've dubbed 'direct') in the 4th batch of the benchmark
> >> results. These numbers are on par with (and slightly better than)
> >> JNI in all three cases considered, which is, I think, a very positive
> >> result, given that to write these benchmarks I did not have to write
> >> a single line of JNI code. In other words, this optimization gives
> >> you the same speed as JNI, with improved ease of use (**).
> >>
> >> Now, since the 'direct' optimization builds on top of the VM native
> >> call adapters, this approach is significantly more robust than
> >> linkToNative, and I have not run into any weird VM crashes when
> >> playing with it. The downside is that, for obvious reasons, this
> >> approach cannot get much faster than JNI - that is, it cannot get
> >> close to the numbers obtained with the linkToNative backend, which
> >> features much deeper optimizations. But I think that, despite its
> >> limitations, it's still a good opportunistic improvement that is
> >> worth pursuing in the short term (while we sort out the linkToNative
> >> story). For this reason, I will soon be submitting a review which
> >> incorporates the changes for the 'direct' invocation scheme.
> >>
> >> Cheers
> >> Maurizio
> >>
> >> [1] - http://cr.openjdk.java.net/~mcimadamore/panama/foreign-jmh.txt
> >> [2] - https://github.com/jnr/jnr-ffi
> >> [3] - https://github.com/jnr/jffi
> >> [4] - https://sourceware.org/libffi/
> >> [5] -
> >>
> https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf
> >>
> >> [6] -
> >> http://cr.openjdk.java.net/~jrose/panama/native-call-primitive.html
> >> [7] - http://hg.openjdk.java.net/panama/dev/shortlog/b9ebb1bb8354
> >>
> >> (**) the benchmark also contains a 5th row in which I repeated the
> >> same tests, this time using JNR [2]. JNR is built on top of libjffi
> >> [3], a JNI library in turn built on top of the popular libffi [4]. I
> >> wanted to have some numbers for JNR because that's another solution
> >> that allows for better ease of use, taking care of marshalling Java
> >> values into C and back; since the goals of JNR are similar in spirit
> >> to some of the goals of the Panama/foreign work, I thought it would
> >> be worth having a comparison of these approaches. For the record, I
> >> think the JNR numbers are very respectable given that JNR had to do
> >> all the hard work outside of the JDK!
> >>
> >>
> >
>
>

