[foreign] some JMH benchmarks

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Mon Sep 17 12:37:02 UTC 2018


Hi Samuel,
I was planning to upload the benchmark IDE project in the near future (I 
need to clean it up a bit, so that it can be opened easily).

My getpid example looks like this; here is the Java declaration:

public class GetPid {

     static {
         System.loadLibrary("getpid");
     }

     native long getpid();

     native double exp(double base);
}

This is the JNI code:

#include <jni.h>
#include <math.h>
#include <unistd.h>

JNIEXPORT jlong JNICALL Java_org_panama_GetPid_getpid
   (JNIEnv *env, jobject recv) {
    return getpid();
}

JNIEXPORT jdouble JNICALL Java_org_panama_GetPid_exp
   (JNIEnv *env, jobject recv, jdouble arg) {
    return exp(arg);
}

And this is the benchmark:

import org.openjdk.jmh.annotations.Benchmark;

public class PanamaBenchmark {
     static GetPid pidlib = new GetPid();

     @Benchmark
     public long testJNIPid() {
         return pidlib.getpid();
     }

     @Benchmark
     public double testJNIExp() {
         return pidlib.exp(10d);
     }
}


I think this should be rather standard?

I'm on Ubuntu 16.04.1, and using GCC 5.4.0. The command I use to compile 
the C lib is this:

gcc -I<path to jni.h> -L<path to jni lib> -shared -o libgetpid.so -fPIC 
GetPid.c

One difference I see between our two examples is the use of Blackhole. 
In my bench, I'm just returning the result of the 'exp' call - which 
should be equivalent and is, in fact, the preferred style, as described 
here:

http://hg.openjdk.java.net/code-tools/jmh/file/3769055ad883/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_09_Blackholes.java#l51

Another minor difference I see is that I pass a constant argument, while 
you generate a random number on each iteration.

I tried to cut and paste your benchmark and I got this:

Benchmark                    Mode  Cnt         Score         Error  Units
PanamaBenchmark.testMethod  thrpt    5  26362701.827 ± 1357012.981  ops/s


This looks exactly the same as what I got. So, for whatever reason, my 
machine seems to be slower than the one you are using. For what it's 
worth, this website [1] seems to confirm the difference. While the base 
clock speeds are similar, your machine has a higher Turbo Boost 
frequency and is 3-4 years newer than mine, so I'd expect that to make 
a difference in terms of internal optimizations etc. Note that I'm able 
to beat my workstation's numbers using my laptop, which sports a 
slightly higher frequency and only has 2 cores and 8 GB of RAM.

[1] - 
https://www.cpubenchmark.net/compare/Intel-Xeon-E5-2673-v4-vs-Intel-Xeon-E5-2665/2888vs1439

Maurizio



On 17/09/18 11:00, Samuel Audet wrote:
> Thanks for the figures Maurizio! It's finally good to be speaking in 
> numbers. :)
>
> However, you're not providing a lot of details about how you actually 
> ran the experiments. So I've decided to run a JMH benchmark on what we 
> get by default with JavaCPP and this declaration:
>
>     @Platform(include = "math.h")
>     public class MyBenchmark {
>         static { Loader.load(); }
>
>         @NoException
>         public static native double exp(double x);
>
>         @State(Scope.Thread)
>         public static class MyState {
>             double x;
>
>             @Setup(Level.Iteration)
>             public void setupMethod() {
>                 x = Math.random();
>             }
>         }
>
>         @Benchmark
>         public void testMethod(MyState s, Blackhole bh) {
>             bh.consume(exp(s.x));
>         }
>     }
>
> The relevant portion of generated JNI looks like this:
>
>     JNIEXPORT jdouble JNICALL Java_org_sample_MyBenchmark_exp(JNIEnv* 
> env, jclass cls, jdouble arg0) {
>         jdouble rarg = 0;
>         double rval = exp(arg0);
>         rarg = (jdouble)rval;
>         return rarg;
>     }
>
> And with access to just 2 virtual cores of an Intel(R) Xeon(R) CPU 
> E5-2673 v4 @ 2.30GHz and 8 GB of RAM on the cloud (so probably slower 
> than your E5-2665 @ 2.40GHz) running Ubuntu 14.04 with GCC 4.9 and 
> OpenJDK 8, I get these numbers:
> Benchmark                Mode  Cnt         Score        Error Units
> MyBenchmark.testMethod  thrpt   25  37183556.094 ± 460795.746 ops/s
>
> I'm not sure how that compares with your numbers exactly, but it does 
> seem to me that what you get for JNI is a bit low. If you could 
> provide more details about how to reproduce your results, that would 
> be great.
>
> Samuel
>
>
> On 09/14/2018 10:19 PM, Maurizio Cimadamore wrote:
>> Hi,
>> over the last few weeks I've been busy playing with Panama and 
>> assessing performance with JMH. For those just interested in raw 
>> numbers, the results of my explorations can be found here [1]. But as 
>> with all benchmarks, I think it's worth spending a few words to 
>> understand what these numbers actually _mean_.
>>
>> To evaluate the performance of Panama I have first created a 
>> baseline using JNI - more specifically, I wanted to assess the 
>> performance of three calls (all part of the C library), namely 
>> `getpid`, `exp` and `qsort`.
>>
>> The first example is the de facto benchmark for FFIs - since it does 
>> relatively little computation, it is a good test to measure the 
>> 'latency' of the FFI approach (i.e. how long it takes to get to 
>> native code). The second example is also relatively simple, but this 
>> time the function takes a double argument. The third test is akin to 
>> an FFI torture test, since not only does it pass substantially more 
>> arguments (4), but one of these arguments is also a callback - a 
>> pointer to a function that is used to sort the contents of the input 
>> array.
>>
>> The first batch of JNI results confirms our expectations: `getpid` 
>> is the fastest, followed by `exp`, and then by `qsort`. Note that 
>> qsort is not even close, in terms of raw numbers, to the other two 
>> tests - that's because, to sort the array, we need to do (N * log N) 
>> upcalls into Java. In the benchmark, N = 8 and we do the upcalls 
>> using the JNI function JNIEnv::CallIntMethod.
>>
>> Now let's examine the second batch of results; these call `getpid`, 
>> `exp` and `qsort` using Panama. The numbers here are considerably 
>> lower than the JNI ones for all three benchmarks - although the 
>> first two seem to be the most problematic. To explain these results 
>> we need to peek under the hood. Panama implements foreign calls 
>> through a so-called 'universal adapter' which, given a calling scheme 
>> and a bunch of arguments (machine words), shuffles these arguments 
>> into the right registers/stack slots and then jumps to the target 
>> native function - after which another round of adaptation must be 
>> performed (e.g. to recover the return value from the right 
>> register/memory location).
>>
>> Needless to say, all this generality comes at a cost. Some of the 
>> cost is in Java - e.g. all arguments have to be packaged up into a 
>> long array (although this component doesn't seem to show up much in 
>> the generated JVM compiled code). A lot of the cost is in the adapter 
>> logic itself, which has to look at the 'call recipe' and move 
>> arguments around accordingly - more specifically, in order to call 
>> the native function, the adapter creates a bunch of helper C++ 
>> objects and structs which model the CPU state (e.g. in the 
>> ShuffleDowncallContext struct, we find a field for each register to 
>> be modeled in the target architecture). The adapter first has to move 
>> the values coming from the Java world (stored in the aforementioned 
>> long array) into the right context fields (and it needs to do so by 
>> looking at the recipe, which involves iterating over the recipe 
>> elements). After that's done, we can jump into the assembly stub that 
>> does the native call - this stub takes as input one of those 
>> ShuffleDowncallContext structs and loads the corresponding 
>> registers/creates the necessary stack slots ahead of the call.
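>>
>> To make the Java-side part of that cost more concrete, here is a 
>> rough sketch (purely illustrative - not the actual Panama code) of 
>> what packaging arguments into machine words amounts to:
>>
>> // Illustrative sketch only: each Java argument is flattened into a
>> // machine word before being handed to the adapter.
>> static long[] packageArguments(Object... args) {
>>     long[] words = new long[args.length];
>>     for (int i = 0; i < args.length; i++) {
>>         Object a = args[i];
>>         if (a instanceof Long) {
>>             words[i] = (Long) a;            // integral values go through as-is
>>         } else if (a instanceof Double) {
>>             // FP values travel as their raw bit pattern
>>             words[i] = Double.doubleToRawLongBits((Double) a);
>>         } else {
>>             throw new IllegalArgumentException("unsupported carrier: " + a);
>>         }
>>     }
>>     return words;
>> }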
>>
>> As you can see, there's quite a lot of action going on here, and this 
>> explains the benchmark numbers; of course, if you are calling a 
>> native function that does a lot of computation, this adaptation cost 
>> will wash out - but for relatively quick calls such as 'getpid' and 
>> 'exp' the latency dominates the picture.
>>
>> Digression: the callback logic suffers from pretty much the same 
>> issues, albeit in reverse order - this time it's the Java code 
>> which receives a 'snapshot' of the register values from a generated 
>> assembly adapter; the Java code can then read those values (using the 
>> Pointer API), turn them into Java objects, call the target Java 
>> method and store the results (after another conversion) in the right 
>> location of the snapshot. The assembly adapter will then pick up the 
>> values set in the snapshot by the Java code, store them into the 
>> corresponding registers and return control to the native caller. In 
>> the remainder of this email we will not discuss callbacks in detail 
>> - we will just posit that for any optimization technique that can be 
>> defined, there exists a _dual_ strategy that works with callbacks.
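>>
>> As a purely illustrative sketch of the Java side of an upcall - with 
>> the register snapshot modelled here as a plain long[] rather than 
>> through the real Pointer API, and with hypothetical names:
>>
>> interface IntComparator { int compare(int a, int b); }
>>
>> static void handleUpcall(long[] registerSnapshot, IntComparator target) {
>>     int a = (int) registerSnapshot[0];  // read arguments out of the snapshot
>>     int b = (int) registerSnapshot[1];
>>     int result = target.compare(a, b);  // call the target Java method
>>     // store the result where the assembly adapter will pick it up
>>     registerSnapshot[0] = result;
>> }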
>>
>> How can we make native calls go faster? Well, one obvious way would 
>> be to optimize the universal adapter so that we get a specialized 
>> assembly stub for each code shape. If we do that, we can move pretty 
>> much all of the computation described above from execution time to 
>> stub generation time, so that, by the time we have to call the native 
>> function, we just have to populate the right registers (the 
>> specialized stub knows where to find them) and jump. While this 
>> sounds like a good approach, it feels like there's also a role for 
>> the JIT somewhere in there - after all, the JVM knows which calls are 
>> hot and in need of optimization, so perhaps this specialization 
>> process (some or all of it) could happen dynamically. And this is 
>> indeed an approach we'd like to aim for in the long run.
>>
>> Now, a few years ago, Vlad put together a patch which now lives in 
>> the 'linkToNative' branch [6, 7] - the goal of this patch is to 
>> implement the approach described above: generate a specialized 
>> assembly adapter for a given native signature, and then leverage the 
>> JIT to optimize it away, turning the adapter into a bare, direct 
>> native method call. As you can see from the third batch of 
>> benchmarks, if we tweak Panama to use the linkToNative machinery, the 
>> speed-up is really impressive, and we end up being much faster than 
>> JNI (up to 4x for getpid).
>>
>> Unfortunately, the technology in the linkToNative branch is not 
>> ready for prime time (yet) - first, it doesn't cover some useful 
>> cases (e.g. varargs, multiple returns via registers, arguments passed 
>> in memory). That is, the logic assumes there's a 1-1 mapping between 
>> a Java signature and the native function to be called - and that the 
>> arguments passed from Java will either be longs or doubles. While we 
>> can work around this limitation and define the necessary marshalling 
>> logic in Java (as I have done to run this benchmark), some of the 
>> limitations (multiple returns, structs passed by value which are too 
>> big) cannot simply be worked around. But that's fine: we can still 
>> have a fast path for those calls which have certain characteristics, 
>> and a slow path (through the universal adapter) for all the other 
>> calls.
>>
>> But there's a second, more serious issue lurking: as you can see in 
>> the benchmark, I was not able to get the qsort benchmark running when 
>> using the linkToNative backend. The reason is that the linkToNative 
>> code is still pretty raw, and it doesn't fully adhere to the JVM's 
>> internal conventions - e.g. there are missing thread state 
>> transitions which, in the case of upcalls into Java, create issues 
>> when it comes to garbage collection, as the GC cannot walk the native 
>> stack correctly.
>>
>> This means that, while there's a clear shining path ahead of us, it 
>> is simply too early to just use the linkToNative backend from Panama. 
>> For this reason, I've been looking into some kind of stopgap solution 
>> - another way of optimizing native calls (and upcalls into Java) that 
>> doesn't require too much VM magic. Now, a crucial observation is 
>> that, in many native calls, there is indeed a 1-1 mapping between 
>> Java arguments and native arguments (and back, for return values). 
>> That is, we can think of calling a native function as a process that 
>> takes a bunch of Java arguments, turns them into native arguments 
>> (either doubles or longs), calls the native function, and then turns 
>> the result back into a Java value.
>>
>> The mapping between Java arguments and C values is quite simple (a 
>> rough sketch of this lowering is shown below):
>>
>> * primitives: either long or double, depending on whether they 
>> describe an integral value or a floating-point one
>> * pointers: they convert to a long
>> * callbacks: they also convert to a long
>> * structs: they are recursively decomposed into fields and each field 
>> is marshalled separately (unless the struct is too big, in which case 
>> it is passed in memory)
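>>
>> A minimal sketch of that lowering, with illustrative types and helper 
>> names (not the real Panama API):
>>
>> // pointers and callbacks lower to the address they wrap
>> static long lowerPointer(long rawAddress)   { return rawAddress; }
>> static long lowerCallback(long stubAddress) { return stubAddress; }
>>
>> // a small struct such as { int i; double d; } decomposes into one
>> // integral word and one FP word, marshalled field by field (the FP
>> // word is shown here as its raw bit pattern)
>> static long[] lowerStruct(int i, double d) {
>>     return new long[] { i, Double.doubleToRawLongBits(d) };
>> }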
>>
>> So, in principle, we could define a bunch of native entry points in 
>> the VM, one per shape, which take a bunch of longs and doubles and 
>> call an underlying function with those arguments. For instance, let's 
>> consider the case of a native function which is modelled in Java as:
>>
>> int m(Pointer<Foo>, double)
>>
>> To call this native function we have to first turn the Java arguments 
>> into a (long, double) pair. Then we need to call a native adapter 
>> that looks like the following:
>>
>> jlong NI_invokeNative_J_JD(JNIEnv *env, jobject _unused, jlong addr, 
>> jlong arg0, jdouble arg1) {
>>      return ((jlong (*)(jlong, jdouble))addr)(arg0, arg1);
>> }
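>>
>> On the Java side, a sketch of how such an entry point could be 
>> declared and used might look like this (the names are hypothetical, 
>> and we assume the C adapter above is registered against the native 
>> declaration, e.g. via JNI RegisterNatives):
>>
>> class DirectInvoker {
>>      // one native entry point per shape; 'addr' is the address of the target C function
>>      static native long invokeNative_J_JD(long addr, long arg0, double arg1);
>>
>>      // int m(Pointer<Foo>, double) funnelled through the (long, double) -> long shape
>>      static int callM(long fnAddr, long fooAddr, double arg) {
>>          return (int) invokeNative_J_JD(fnAddr, fooAddr, arg);
>>      }
>> }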
>>
>> The NI_invokeNative_J_JD adapter takes care of calling the native 
>> function and returning the value back. This is, admittedly, a very 
>> simple solution; of course there are limitations: we have to define a 
>> bunch of specialized native entry points (and Java entry points, for 
>> callbacks). But here we can play a trick: most modern ABIs pass 
>> arguments in registers; for instance, the System V ABI [5] uses up to 
>> 6 (!!) integer registers and 8 (!!) SSE registers for FP values - 
>> this gives us a total of 14 registers available for argument passing, 
>> which covers quite a lot of cases. Now, if we have a call where _all_ 
>> arguments are passed in registers, then the order in which these 
>> arguments are declared in the adapter doesn't matter! That is, since 
>> FP values will always be passed in different registers from integral 
>> values, we can just define entry points which look like these:
>>
>> invokeNative_V_DDDDD
>> invokeNative_V_JDDDD
>> invokeNative_V_JJDDD
>> invokeNative_V_JJJDD
>> invokeNative_V_JJJJD
>> invokeNative_V_JJJJJ
>>
>> That is, for a given arity (5 in this case), we can just put all the 
>> long arguments in front, and the double arguments after that. In 
>> other words, we don't need to generate all possible permutations of 
>> J/D in all positions - the adapter will always do the same thing 
>> (read: load from the same registers) for all equivalent combinations. 
>> This keeps the number of entry points in check - but it also poses 
>> some challenges to the Java logic in charge of 
>> marshalling/unmarshalling, as there's an extra permutation step 
>> involved (although that is not something super-hard to address).
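>>
>> As a concrete (and purely illustrative) example of that permutation 
>> step, suppose the Java-level signature is void m(double a, long b, 
>> double c); all three arguments are passed in registers, so the call 
>> can be funnelled through the shared JDD entry point simply by moving 
>> the long in front and keeping the doubles in their original relative 
>> order:
>>
>> static native void invokeNative_V_JDD(long addr, long j0, double d0, double d1);
>>
>> static void callM(long addr, double a, long b, double c) {
>>      invokeNative_V_JDD(addr, b, a, c);   // permuted: the J argument first, then the Ds
>> }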
>>
>> You can see the performance numbers associated with this invocation 
>> scheme (which I've dubbed 'direct') in the 4th batch of the benchmark 
>> results. These numbers are on par with (and slightly better than) JNI 
>> in all three cases considered, which is, I think, a very positive 
>> result, given that to write these benchmarks I did not have to write 
>> a single line of JNI code. In other words, this optimization gives 
>> you the same speed as JNI, with improved ease of use (**).
>>
>> Now, since the 'direct' optimization builds on top of the VM native 
>> call adapters, this approach is significantly more robust than 
>> linkToNative, and I have not run into any weird VM crashes when 
>> playing with it. The downside is that, for obvious reasons, this 
>> approach cannot get much faster than JNI - that is, it cannot get 
>> close to the numbers obtained with the linkToNative backend, which 
>> features much deeper optimizations. But I think that, despite its 
>> limitations, it's still a good opportunistic improvement that is 
>> worth pursuing in the short term (while we sort out the linkToNative 
>> story). For this reason, I will soon be submitting a review which 
>> incorporates the changes for the 'direct' invocation scheme.
>>
>> Cheers
>> Maurizio
>>
>> [1] - http://cr.openjdk.java.net/~mcimadamore/panama/foreign-jmh.txt
>> [2] - https://github.com/jnr/jnr-ffi
>> [3] - https://github.com/jnr/jffi
>> [4] - https://sourceware.org/libffi/
>> [5] - 
>> https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf 
>>
>> [6] - 
>> http://cr.openjdk.java.net/~jrose/panama/native-call-primitive.html
>> [7] - http://hg.openjdk.java.net/panama/dev/shortlog/b9ebb1bb8354
>>
>> (**) the benchmark also contains a 5th row in which I repeated the 
>> same tests, this time using JNR [2]. JNR is built on top of libjffi 
>> [3], a JNI library in turn built on top of the popular libffi [4]. I 
>> wanted to have some numbers for JNR because it's another solution 
>> that aims for better ease of use, taking care of marshalling Java 
>> values into C and back; since the goals of JNR are similar in spirit 
>> to some of the goals of the Panama/foreign work, I thought it would 
>> be worth having a comparison of these approaches. For the record, I 
>> think the JNR numbers are very respectable given that JNR had to do 
>> all the hard work outside of the JDK!
>>
>>
>


