[foreign] some JMH benchmarks

Samuel Audet samuel.audet at gmail.com
Mon Sep 17 10:00:21 UTC 2018


Thanks for the figures Maurizio! It's good to finally be speaking in 
numbers. :)

However, you haven't provided many details about how you actually ran 
the experiments, so I've decided to run a JMH benchmark on what we get 
by default with JavaCPP and this declaration:

     package org.sample;

     import org.bytedeco.javacpp.Loader;
     import org.bytedeco.javacpp.annotation.NoException;
     import org.bytedeco.javacpp.annotation.Platform;
     import org.openjdk.jmh.annotations.*;
     import org.openjdk.jmh.infra.Blackhole;

     @Platform(include = "math.h")
     public class MyBenchmark {
         static { Loader.load(); }

         // maps directly to exp() from math.h; @NoException tells JavaCPP
         // the function won't throw, so no try/catch wrapper is generated
         @NoException
         public static native double exp(double x);

         @State(Scope.Thread)
         public static class MyState {
             double x;

             @Setup(Level.Iteration)
             public void setupMethod() {
                 x = Math.random();
             }
         }

         @Benchmark
         public void testMethod(MyState s, Blackhole bh) {
             bh.consume(exp(s.x));
         }
     }

The relevant portion of generated JNI looks like this:

     JNIEXPORT jdouble JNICALL Java_org_sample_MyBenchmark_exp(
             JNIEnv* env, jclass cls, jdouble arg0) {
         jdouble rarg = 0;
         double rval = exp(arg0);
         rarg = (jdouble)rval;
         return rarg;
     }

And with access to just 2 virtual cores of an Intel(R) Xeon(R) CPU 
E5-2673 v4 @ 2.30GHz and 8 GB of RAM on the cloud (so probably slower 
than your E5-2665 @ 2.40GHz) running Ubuntu 14.04 with GCC 4.9 and 
OpenJDK 8, I get these numbers:
Benchmark                Mode  Cnt         Score        Error  Units
MyBenchmark.testMethod  thrpt   25  37183556.094 ± 460795.746  ops/s

I'm not sure how that compares with your numbers exactly, but it does 
seem to me that what you get for JNI is a bit low. If you could provide 
more details about how to reproduce your results, that would be great.

Samuel


On 09/14/2018 10:19 PM, Maurizio Cimadamore wrote:
> Hi,
> over the last few weeks I've been busy playing with Panama and assessing 
> performance with JMH. For those just interested in raw numbers, the 
> results of my explorations can be found here [1]. But as with all 
> benchmarks, I think it's better to spend a few words to understand what 
> these numbers actually _mean_.
> 
> To evaluate the performance of Panama I first created a baseline using 
> JNI - more specifically, I wanted to assess the performance of three 
> calls (all part of the C standard library), namely `getpid`, `exp` and 
> `qsort`.
> 
> The first example is the de facto benchmark for FFIs - since it does 
> relatively little computation, it is a good test to measure the 
> 'latency' of the FFI approach (e.g. how long it takes to go to native). 
> The second example is also relatively simple, but this time the function 
> takes a double argument. The third test is akin to an FFI torture test: 
> not only does it pass substantially more arguments (4), but one of these 
> arguments is also a callback - a pointer to a function that is used to 
> sort the contents of the input array.
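> 
> For reference, the call shape being exercised looks like this in plain 
> C (the array contents here are just for illustration):
> 
>     #include <stdlib.h>
> 
>     /* comparator invoked by qsort - in the benchmark, this is where
>        each upcall into Java happens */
>     static int compare_ints(const void *a, const void *b) {
>         int x = *(const int *)a, y = *(const int *)b;
>         return (x > y) - (x < y);
>     }
> 
>     int main(void) {
>         int array[8] = { 5, 3, 7, 1, 8, 2, 6, 4 };
>         /* four arguments, the last one a function pointer */
>         qsort(array, 8, sizeof(int), compare_ints);
>         return 0;
>     }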
> 
> The first batch of JNI results confirms our expectations: `getpid` is 
> the fastest, followed by `exp`, and then `qsort`. Note that qsort is not 
> even close in terms of raw numbers to the other two tests - that's 
> because, to sort the array, we need to do (N * log N) upcalls into Java. 
> In the benchmark, N = 8 and we do the upcalls using the JNI function 
> JNIEnv::CallIntMethod.
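> 
> In JNI terms, the comparator bridge looks roughly like this (a minimal 
> sketch - not the actual benchmark code, which may cache and attach 
> things differently):
> 
>     #include <jni.h>
> 
>     static JavaVM *jvm;        /* cached when the library is loaded */
>     static jobject comparator; /* global ref to the Java comparator */
>     static jmethodID compareID;
> 
>     /* passed to qsort: every invocation is one upcall into Java */
>     static int bridge(const void *a, const void *b) {
>         JNIEnv *env;
>         (*jvm)->GetEnv(jvm, (void **)&env, JNI_VERSION_1_8);
>         return (*env)->CallIntMethod(env, comparator, compareID,
>                                      *(const jint *)a, *(const jint *)b);
>     }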
> 
> Now let's examine the second batch of results; these call `getpid`, 
> `exp` and `qsort` using Panama. The numbers here are considerably lower 
> than the JNI ones for all three benchmarks - although the first two 
> seem to be the most problematic. To explain these results we need to 
> peek under the hood. Panama implements foreign calls through a 
> so-called 'universal adapter' which, given a calling scheme and a bunch 
> of arguments (machine words), shuffles these arguments into the right 
> registers/stack slots and then jumps to the target native function - 
> after which another round of adaptation must be performed (e.g. to 
> recover the return value from the right register/memory location).
> 
> Needless to say, all this generality comes at a cost - some of the cost 
> is in Java - e.g. all arguments have to be packaged up into a long array 
> (although this component doesn't seem to show up much in the generated 
> JVM compiled code). A lot of the cost is in the adapter logic itself - 
> which has to look at the 'call recipe' and move arguments around 
> accordingly - more specifically, in order to call the native function, 
> the adapter creates a bunch of helper C++ objects and structs which 
> model the CPU state (e.g. in the ShuffleDowncallContext struct, we find 
> a field for each register to be modeled in the target architecture). The 
> adapter has to first move the values coming from the Java world (stored 
> in the aforementioned long array) into the right context fields (and it 
> needs to do so by looking at the recipe, which involves iteration over 
> the recipe elements). After that's done, we can jump into the assembly 
> stub that does the native call - this stub will take as input one of 
> those ShuffleDowncallContext structures and will load the corresponding 
> registers/create the necessary stack slots ahead of the call.
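> 
> Conceptually, the context struct models one slot per ABI register - 
> something like the following sketch (field names and layout here are 
> assumptions for illustration, not the actual hotspot code):
> 
>     /* hypothetical sketch of the downcall context on x86-64 */
>     struct ShuffleDowncallContext {
>         long   int_args[6];  /* rdi, rsi, rdx, rcx, r8, r9 */
>         double fp_args[8];   /* xmm0 .. xmm7 */
>         long  *stack_args;   /* arguments spilled to the stack */
>         void  *target;       /* native function to jump to */
>         long   int_ret;      /* rax after the call */
>         double fp_ret;       /* xmm0 after the call */
>     };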
> 
> As you can see, there's quite a lot of action going on here, and this 
> explains the benchmark numbers; of course, if you are calling a native 
> function that does a lot of computation, this adaptation cost will wash 
> out - but for relatively quick calls such as 'getpid' and 'exp' the 
> latency dominates the picture.
> 
> Digression: the callback logic suffers pretty much from the same issues, 
> albeit in a reversed order - this time it's the Java code which receives 
> a 'snapshot' of the register values from a generated assembly adapter; 
> the Java code can then read such values (using the Pointer API), turn 
> them into Java objects, call the target Java method and store the 
> results (after another conversion) in the right location of the 
> snapshot. The assembly adapter will then pick up the values set onto the 
> snapshot by the Java code, store them into the corresponding registers 
> and return control to the native callee. In the remainder of this email 
> we will not discuss callbacks in detail - we will just posit that for 
> any optimization technique that can be defined, there exists a _dual_ 
> strategy that works with callbacks.
> 
> How can we make sensible native calls go faster? Well, one obvious way 
> would be to optimize the universal adapter so that we get a specialized 
> assembly stub for each code shape. If we do that, we can move pretty 
> much all of the computation described above from execution time to the 
> stub generation time, so that, by the time we have to call the native 
> function, we just have to populate the right registers (the specialized 
> stub knows where to find them) and jump. While this sounds like a good 
> approach, it feels like there's also a move for the JIT somewhere in 
> there - after all, the JVM knows which calls are hot and in need of 
> optimization, so perhaps this specialization process (some or all of it) 
> could happen dynamically. And this is indeed an approach we'd like to 
> aim for in the long run.
> 
> Now, a few years ago, Vlad put together a patch which now lives in the 
> 'linkToNative' branch [6, 7] - the goal of this patch is to implement 
> the approach described above: generate a specialized assembly adapter 
> for a given native signature, and then leverage the JIT to optimize it 
> away, turning the adapter into a bare, direct, native method call. As 
> you can see from the third batch of benchmarks, if we tweak Panama to 
> use the linkToNative machinery, the speed up is really impressive, and 
> we end up being much faster than JNI (up to 4x for `getpid`).
> 
> Unfortunately, the technology in the linkToNative branch is not ready 
> for prime time (yet) - first, it doesn't cover some useful cases (e.g. 
> varargs, multiple returns via registers, arguments passed in memory). 
> That is, the logic assumes there's a 1-1 mapping between a Java 
> signature and the native function to be called - and that the arguments 
> passed from Java will either be longs or doubles. While we can work 
> around this limitation and define the necessary marshalling logic in 
> Java (as I have done to run this benchmark), some of the limitations 
> (multiple returns, structs passed by value which are too big) cannot 
> simply be worked around. But that's fine, we can still have a fast path 
> for those calls which have certain characteristics and a slow path 
> (through the universal adapter) for all the other calls.
> 
> But there's a second and more serious issue lurking: as you can see in 
> the benchmark, I was not able to get the qsort benchmark running when 
> using the linkToNative backend. The reason is that the linkToNative code 
> is still pretty raw, and it doesn't fully adhere to the JVM internal 
> conventions - e.g. there are missing thread state transitions which, in 
> the case of upcalls into Java, create issues when it comes to garbage 
> collection, as the GC cannot parse the native stack in the correct way.
> 
> This means that, while there's a clear shining path ahead of us, it is 
> simply too early to just use the linkToNative backend from Panama. For 
> this reason, I've been looking into some kind of stopgap solution - 
> another way of optimizing native calls (and upcalls into Java) that 
> doesn't require too much VM magic. Now, a crucial observation is that, 
> in many native calls, there is indeed a 1-1 mapping between Java 
> arguments and native arguments (and back, for return values). That is, 
> we can think of calling a native function as a process that takes a 
> bunch of Java arguments, turns them into native arguments (either longs 
> or doubles), calls the native method, and then turns the result back 
> into a Java value.
> 
> The mapping between Java arguments and C values is quite simple:
> 
> * primitives: either long or double, depending on whether they describe 
> an integral value or a floating point one.
> * pointers: they convert to a long
> * callbacks: they also convert to a long
> * structs: they are recursively decomposed into fields and each field is 
> marshalled separately (assuming the struct is not too big, in which case 
> it is passed in memory) - see the sketch below
> 
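> For example (a sketch - `Foo` is a made-up struct), a small struct 
> decomposes into one slot per field, while larger aggregates are passed 
> in memory by the ABI:
> 
>     /* 16 bytes: small enough to be decomposed on System V */
>     struct Foo {
>         long   id;      /* -> integer register (a 'J' slot) */
>         double weight;  /* -> FP register (a 'D' slot) */
>     };
> 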
> So, in principle, we could define a bunch of native entry points in the 
> VM, one per shape, which take a bunch of longs and doubles and call an 
> underlying function with those arguments. For instance, let's consider 
> the case of a native function which is modelled in Java as:
> 
> int m(Pointer<Foo>, double)
> 
> To call this native function we have to first turn the Java arguments 
> into a (long, double) pair. Then we need to call a native adapter that 
> looks like the following:
> 
> jlong NI_invokeNative_J_JD(JNIEnv *env, jobject _unused, jlong addr,
>                            jlong arg0, jdouble arg1) {
>     return ((jlong (*)(jlong, jdouble))addr)(arg0, arg1);
> }
> 
> And this will take care of calling the native function and returning the 
> value back. This is, admittedly, a very simple solution; of course there 
> are limitations: we have to define a bunch of specialized native entry 
> points (and Java entry points, for callbacks). But here we can play a 
> trick: most modern ABIs pass arguments in registers; for instance, the 
> System V ABI [5] uses up to 6 (!!) integer registers and 8 (!!) SSE 
> registers for FP values - this gives us a total of 14 registers 
> available for argument passing. Which covers quite a lot of cases. Now, 
> if we have a call where _all_ arguments are passed in registers, then 
> the order in which these arguments are declared in the adapter doesn't 
> matter! That is, since FP values will always be passed in different 
> registers from integral values, we can just define entry points which 
> look like these:
> 
> invokeNative_V_DDDDD
> invokeNative_V_JDDDD
> invokeNative_V_JJDDD
> invokeNative_V_JJJDD
> invokeNative_V_JJJJD
> invokeNative_V_JJJJJ
> 
> That is, for a given arity (5 in this case), we can just put all long 
> arguments in front, and the double arguments after that. In other words, 
> we don't need to generate all possible permutations of J/D in all 
> positions - the adapter will always do the same thing (read: load from 
> the same registers) for all equivalent combinations. This keeps the 
> number of entry points in check - though it also poses some challenges 
> to the Java logic in charge of marshalling/unmarshalling, as there's an 
> extra permutation step involved (although that is not something 
> super-hard to address).
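> 
> To see why the reordering is sound, consider this little experiment 
> (x86-64/System V only, and technically undefined behavior in portable 
> C - shown purely to illustrate the register assignment):
> 
>     #include <stdio.h>
> 
>     /* target function with interleaved argument types */
>     void target(long a, double x, long b) {
>         printf("%ld %f %ld\n", a, x, b);   /* prints: 10 3.140000 20 */
>     }
> 
>     int main(void) {
>         /* integer args go to rdi, rsi, ... and FP args to xmm0, ...
>            independently of their positions in the signature, so the
>            'longs first' view reaches exactly the same registers */
>         ((void (*)(long, long, double))target)(10, 20, 3.14);
>         return 0;
>     }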
> 
> You can see the performance numbers associated with this invocation 
> scheme (which I've dubbed 'direct') in the 4th batch of the benchmark 
> results. These numbers are on par with (and slightly better than) JNI 
> in all three cases considered, which is, I think, a very positive result, 
> given that to write these benchmarks I did not have to write a single 
> line of JNI code. In other words, this optimization gives you the same 
> speed as JNI, with improved ease of use (**).
> 
> Now, since the 'direct' optimization builds on top of the VM native call 
> adapters, this approach is significantly more robust than linkToNative 
> and I have not run into any weird VM crashes when playing with it. The 
> downside is that, for obvious reasons, this approach cannot get 
> much faster than JNI - that is, it cannot get close to the numbers 
> obtained with the linkToNative backend, which features much deeper 
> optimizations. But I think that, despite its limitations, it's still a 
> good opportunistic improvement that is worth pursuing in the short term 
> (while we sort out the linkToNative story). For this reason, I will soon 
> be submitting a review which incorporates the changes for the 'direct' 
> invocation schemes.
> 
> Cheers
> Maurizio
> 
> [1] - http://cr.openjdk.java.net/~mcimadamore/panama/foreign-jmh.txt
> [2] - https://github.com/jnr/jnr-ffi
> [3] - https://github.com/jnr/jffi
> [4] - https://sourceware.org/libffi/
> [5] - https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf
> [6] - http://cr.openjdk.java.net/~jrose/panama/native-call-primitive.html
> [7] - http://hg.openjdk.java.net/panama/dev/shortlog/b9ebb1bb8354
> 
> (**) the benchmark also contains a 5th row in which I repeated the same 
> tests, this time using JNR [2]. JNR is built on top of libjffi [3], a 
> JNI library in turn built on top of the popular libffi [4]. I wanted to 
> have some numbers for JNR because that's another solution that allows 
> for better ease of use, taking care of marshalling Java values into C 
> and back; since the goals of JNR are similar in spirit to some of the 
> goals of the Panama/foreign work, I thought it would be worth having a 
> comparison of these approaches. For the record, I think the JNR numbers 
> are very respectable given that JNR had to do all the hard work outside 
> of the JDK!
> 
> 


