[foreign] some JMH benchmarks

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Fri Sep 14 13:19:26 UTC 2018


Hi,
over the last few weeks I've been busy playing with Panama and
assessing its performance with JMH. For those just interested in raw
numbers, the results of my explorations can be found here [1]. But as
with all benchmarks, I think it's worth spending a few words to
understand what these numbers actually _mean_.

To evaluate the performance of Panama, I first created a baseline using
JNI - more specifically, I wanted to assess the performance of three
calls (all part of the C standard library), namely `getpid`, `exp` and
`qsort`.

The first example is the de facto benchmark for FFIs - since it does
relatively little computation, it is a good test to measure the
'latency' of the FFI approach (e.g. how long it takes to transition to
native code). The second example is also relatively simple, but this
time the function takes a double argument. The third test is akin to an
FFI torture test, since not only does it pass substantially more
arguments (4), but one of those arguments is also a callback - a
pointer to a function that is used to sort the contents of the input
array.
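
To give a concrete idea of the setup, here is a minimal sketch of what
the JNI baseline looks like as a JMH benchmark - the class name,
library name and constant argument are illustrative assumptions, not
the actual benchmark code:

import org.openjdk.jmh.annotations.*;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class JNIBaseline {

    static {
        System.loadLibrary("bench"); // assumed JNI library with the bindings
    }

    // JNI bindings for the two 'latency' tests
    static native int getpid();
    static native double exp(double x);

    @Benchmark
    public int jni_getpid() {
        return getpid();
    }

    @Benchmark
    public double jni_exp() {
        return exp(42.0);
    }
}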

The first batch of JNI results confirms our expectations: `getpid` is
the fastest, followed by `exp`, and then by `qsort`. Note that qsort is
not even close in raw numbers to the other two tests - that's because,
to sort the array, we need to do (N * log N) upcalls into Java. In the
benchmark, N = 8, so each sort performs on the order of a couple dozen
upcalls, each going through the JNI function JNIEnv::CallIntMethod.
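
For reference, the Java method being upcalled has roughly the following
shape (an illustrative sketch; the actual benchmark code may differ):

// Java comparator the native qsort upcalls into via
// JNIEnv::CallIntMethod; each invocation is a full native-to-Java
// boundary crossing
static int compare(int left, int right) {
    return Integer.compare(left, right);
}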

Now let's examine the second batch of results; these call `getpid`,
`exp` and `qsort` using Panama. The numbers here are considerably lower
than the JNI ones for all three benchmarks - although the first two
seem to be the most problematic. To explain these results we need to
peek under the hood. Panama implements foreign calls through a
so-called 'universal adapter' which, given a calling scheme and a bunch
of arguments (machine words), shuffles these arguments into the right
registers/stack slots and then jumps to the target native function -
after which another round of adaptation must be performed (e.g. to
recover the return value from the right register/memory location).

Needless to say, all this generality comes at a cost - some of the cost
is in Java - e.g. all arguments have to be packaged up into a long
array (although this component doesn't seem to show up much in the
generated JVM compiled code). A lot of the cost is in the adapter logic
itself - which has to look at the 'call recipe' and move arguments
around accordingly - more specifically, in order to call the native
function, the adapter creates a bunch of helper C++ objects and structs
which model the CPU state (e.g. in the ShuffleDowncallContext struct,
we find a field for each register to be modeled in the target
architecture). The adapter first has to move the values coming from the
Java world (stored in the aforementioned long array) into the right
context fields (and it needs to do so by looking at the recipe, which
involves iterating over the recipe elements). After that's done, we can
jump into the assembly stub that does the native call - this stub takes
as input one of those ShuffleDowncallContext structures and loads the
corresponding registers/creates the necessary stack slots ahead of the
call.
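
In Java-like terms, the boxing step alone looks something like this
minimal, self-contained sketch (the word layout here is an assumption
for illustration, not the actual recipe format):

// Every Java argument is widened to a machine word and stored in a
// long array, which the native shuffle logic then distributes to
// registers or stack slots by walking the call recipe.
static long[] boxArguments(long pointerAddr, double value) {
    long[] words = new long[2];
    words[0] = pointerAddr;                       // pointer -> raw word
    words[1] = Double.doubleToRawLongBits(value); // double -> raw FP bits
    return words;
}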

As you can see, there's quite a lot of action going on here, and this 
explains the benchmark numbers; of course, if you are calling a native 
function that does a lot of computation, this adaptation cost will wash 
out - but for relatively quick calls such as 'getpid' and 'exp' the 
latency dominates the picture.

Digression: the callback logic suffers from pretty much the same
issues, albeit in reverse - this time it's the Java code which receives
a 'snapshot' of the register values from a generated assembly adapter;
the Java code can then read those values (using the Pointer API), turn
them into Java objects, call the target Java method and store the
results (after another conversion) in the right location of the
snapshot. The assembly adapter then picks up the values set onto the
snapshot by the Java code, stores them into the corresponding registers
and returns control to the native caller. In the remainder of this
email we will not discuss callbacks in detail - we will just posit that
for any optimization technique that can be defined, there exists a
_dual_ strategy that works for callbacks.
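
A minimal dual sketch of the upcall path, with hypothetical names (the
snapshot layout, a plain long array, is an assumption for
illustration):

// The generated assembly adapter hands Java a snapshot of the argument
// registers; Java unboxes them, calls the target method, and writes
// the result back into the snapshot for the adapter to load into the
// return register.
static void upcallHandler(long[] registerSnapshot) {
    long arg0 = registerSnapshot[0];                            // integer register
    double arg1 = Double.longBitsToDouble(registerSnapshot[1]); // FP register bits
    long result = target(arg0, arg1);
    registerSnapshot[0] = result;                               // return value slot
}

static long target(long p, double d) { return p + (long) d; }   // placeholder target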

How can we make sensible native calls go faster? Well, one obvious way
would be to optimize the universal adapter so that we get a specialized
assembly stub for each code shape. If we do that, we can move pretty
much all of the computation described above from execution time to stub
generation time, so that, by the time we have to call the native
function, we just have to populate the right registers (the specialized
stub knows where to find them) and jump. While this sounds like a good
approach, it feels like there's also a role for the JIT somewhere in
there - after all, the JVM knows which calls are hot and in need of
optimization, so perhaps this specialization process (some or all of
it) could happen dynamically. And this is indeed the approach we'd like
to aim for in the long run.

Now, a few years ago, Vlad put together a patch which now lives in the
'linkToNative' branch [6, 7] - the goal of this patch is to implement
the approach described above: generate a specialized assembly adapter
for a given native signature, and then leverage the JIT to optimize it
away, turning the adapter into a bare, direct native method call. As
you can see from the third batch of benchmarks, if we tweak Panama to
use the linkToNative machinery, the speed-up is really impressive, and
we end up being much faster than JNI (up to 4x for getPid).

Unfortunately, the technology in the linkToNative branch is not ready
for prime time (yet) - first, it doesn't cover some useful cases (e.g.
varargs, multiple returns via registers, arguments passed in memory).
That is, the logic assumes there's a 1-1 mapping between a Java
signature and the native function to be called - and that the arguments
passed from Java will either be longs or doubles. While we can work
around this limitation and define the necessary marshalling logic in
Java (as I have done to run this benchmark), some of the limitations
(multiple returns, structs passed by value which are too big) cannot
simply be worked around. But that's fine: we can still have a fast path
for those calls which have certain characteristics and a slow path
(through the universal adapter) for all the other calls.
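
As a rough illustration, the fast-path test could look something like
this (the register counts and the predicate itself are assumptions
based on the System V ABI, not the actual Panama logic):

// Decide whether a call shape can use a specialized fast path, or must
// fall back to the universal adapter.
static boolean isDirectCandidate(int intArgs, int fpArgs,
                                 boolean varargs, boolean returnsInMemory) {
    // System V AMD64 passes up to 6 integer and 8 SSE arguments in registers
    return !varargs && !returnsInMemory && intArgs <= 6 && fpArgs <= 8;
}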

But there's a second and more serious issue lurking: as you can see in 
the benchmark, I was not able to get the qsort benchmark running when 
using the linkToNative backend. The reason is that the linkToNative code 
is still pretty raw, and it doesn't fully adhere to the JVM internal 
conventions - e.g. there are missing thread state transitions which, in 
the case of upcalls into Java, create issues when it comes to garbage 
collection, as the GC cannot parse the native stack in the correct way.

This means that, while there's a clear shining path ahead of us, it is
simply too early to just use the linkToNative backend from Panama. For
this reason, I've been looking into some kind of stopgap solution -
another way of optimizing native calls (and upcalls into Java) that
doesn't require too much VM magic. Now, a crucial observation is that,
in many native calls, there is indeed a 1-1 mapping between Java
arguments and native arguments (and back, for return values). That is,
we can think of calling a native function as a process that takes a
bunch of Java arguments, turns them into native arguments (either
longs or doubles), calls the native function and then turns the result
back into a Java value.

The mapping between Java arguments and C values is quite simple (a
worked sketch follows the list):

* primitives: either long or double, depending on whether they describe
an integral value or a floating point one
* pointers: they convert to a long
* callbacks: they also convert to a long
* structs: they are recursively decomposed into fields and each field
is marshalled separately (assuming the struct is not too big; if it is,
the whole struct is passed in memory)
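
As a worked example of the struct rule, a hypothetical C function
taking a small struct Point { int x; int y; } plus a double would
decompose into three machine words, following the field-by-field rule
above (an illustrative sketch, not the actual Panama marshalling code):

// Each struct field is marshalled as a separate argument word; the
// double travels as raw FP bits.
static long[] lowerPointAndFactor(int x, int y, double factor) {
    return new long[] {
        x,                                    // Point.x -> integral word
        y,                                    // Point.y -> integral word
        Double.doubleToRawLongBits(factor)    // double -> FP word
    };
}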

So, in principle, we could define a bunch of native entry points in the
VM, one per shape, which take a bunch of longs and doubles and call an
underlying function with those arguments. For instance, let's consider
the case of a native function which is modelled in Java as:

int m(Pointer<Foo>, double)

To call this native function we have to first turn the Java arguments 
into a (long, double) pair. Then we need to call a native adapter that 
looks like the following:

jlong NI_invokeNative_J_JD(JNIEnv *env, jobject _unused, jlong addr,
                           jlong arg0, jdouble arg1) {
    return ((jlong (*)(jlong, jdouble))addr)(arg0, arg1);
}
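
On the Java side, such an adapter would surface as a native method
declaration along these lines (a hypothetical sketch; registration and
class details omitted):

// One native entry point per shape: J_JD = returns a long, takes a
// (long, double) pair; 'addr' is the address of the target function.
static native long invokeNative_J_JD(long addr, long arg0, double arg1);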

And this will take care of calling the native function and returning
the value back. This is, admittedly, a very simple solution; of course
there are limitations: we have to define a bunch of specialized native
entry points (and Java entry points, for callbacks). But here we can
play a trick: most modern ABIs pass arguments in registers; for
instance, the System V ABI [5] uses up to 6 (!!) integer registers and
8 (!!) SSE registers for FP values - this gives us a total of 14
registers available for argument passing, which covers quite a lot of
cases. Now, if we have a call where _all_ arguments are passed in
registers, then the order in which these arguments are declared in the
adapter doesn't matter! That is, since FP values will always be passed
in different registers from integral values, we can just define entry
points which look like these:

invokeNative_V_DDDDD
invokeNative_V_JDDDD
invokeNative_V_JJDDD
invokeNative_V_JJJDD
invokeNative_V_JJJJD
invokeNative_V_JJJJJ

That is, for a given arity (5 in this case), we can just put all the
long arguments in front, and the double arguments after them. We don't
need to generate all possible permutations of J/D in all positions -
the adapter will always do the same thing (read: load from the same
registers) for all equivalent combinations. This keeps the number of
entry points in check - though it also poses some challenges for the
Java logic in charge of marshalling/unmarshalling, as there's an extra
permutation step involved (although that is not something super-hard to
address; see the sketch below).
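
Here is a minimal, self-contained sketch of that permutation step
(assumed and simplified; boxed arguments stand in for whatever
intermediate representation the real marshalling code uses):

// Reorder the already-lowered argument words into longs-first,
// doubles-after order, so the call can be routed to the canonical
// entry point for its arity (e.g. (J, D, J, D, J) ->
// invokeNative_V_JJJDD). Since integer and FP arguments travel in
// disjoint register classes, the reordering does not change which
// register each value ends up in.
static Object[] permute(Object[] args) {
    Object[] permuted = new Object[args.length];
    int next = 0;
    for (Object arg : args) {                 // first pass: longs
        if (arg instanceof Long) permuted[next++] = arg;
    }
    for (Object arg : args) {                 // second pass: doubles
        if (arg instanceof Double) permuted[next++] = arg;
    }
    return permuted;
}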

You can see the performance numbers associated with this invocation
scheme (which I've dubbed 'direct') in the 4th batch of the benchmark
results. These numbers are on par with (and slightly better than) JNI
in all three cases considered, which is, I think, a very positive
result, given that to write these benchmarks I did not have to write a
single line of JNI code. In other words, this optimization gives you
the same speed as JNI, with improved ease of use (**).

Now, since the 'direct' optimization builds on top of the VM native
call adapters, this approach is significantly more robust than
linkToNative, and I have not run into any weird VM crashes when playing
with it. The downside is that, for obvious reasons, this approach
cannot get much faster than JNI - that is, it cannot get close to the
numbers obtained with the linkToNative backend, which features much
deeper optimizations. But I think that, despite its limitations, it's
still a good opportunistic improvement that is worth pursuing in the
short term (while we sort out the linkToNative story). For this reason,
I will soon be submitting a review which incorporates the changes for
the 'direct' invocation scheme.

Cheers
Maurizio

[1] - http://cr.openjdk.java.net/~mcimadamore/panama/foreign-jmh.txt
[2] - https://github.com/jnr/jnr-ffi
[3] - https://github.com/jnr/jffi
[4] - https://sourceware.org/libffi/
[5] - https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf
[6] - http://cr.openjdk.java.net/~jrose/panama/native-call-primitive.html
[7] - http://hg.openjdk.java.net/panama/dev/shortlog/b9ebb1bb8354

(**) the benchmark also contains a 5th row, in which I repeated the
same tests, this time using JNR [2]. JNR is built on top of libjffi
[3], a JNI library in turn built on top of the popular libffi [4]. I
wanted to have some numbers for JNR because it is another solution that
allows for better ease of use, taking care of marshalling Java values
into C and back; since the goals of JNR are similar in spirit to some
of the goals of the Panama/foreign work, I thought it would be worth
having a comparison of these approaches. For the record, I think the
JNR numbers are very respectable, given that JNR has to do all the hard
work outside of the JDK!



