[foreign] some JMH benchmarks
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Fri Sep 14 13:19:26 UTC 2018
Hi,
over the last few weeks I've been busy playing with Panama and assessing
its performance with JMH. For those just interested in raw numbers, the
results of my explorations can be found here [1]. But, as with all
benchmarks, I think it's better to spend a few words on what these
numbers actually _mean_.
To evaluate the performance of Panama I first created a baseline
using JNI - more specifically, I wanted to assess the performance of three
calls (all part of the C std library), namely `getpid`, `exp` and `qsort`.
The first example is the de facto benchmark for FFIs - since it does
relatively little computation, it is a good test to measure the
'latency' of the FFI approach (e.g. how long does it take to go to
native). The second example is also relatively simple, but this time
the function takes a double argument. The third test is akin to an FFI
torture test, since not only does it pass substantially more arguments (4)
but one of these arguments is also a callback - a pointer to a function
that is used to sort the contents of the input array.
The first batch of JNI results confirms our expectations:
`getpid` is the fastest, followed by `exp`, and then by
`qsort`. Note that `qsort` is not even close in terms of raw numbers to
the other two tests - that's because, to sort the array, we need to do
on the order of N * log N upcalls into Java (one per comparison). In the
benchmark, N = 8 and we do the upcalls using the JNI function
JNIEnv::CallIntMethod.
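As a rough illustration of why the upcall count matters, the sketch
below counts comparator invocations while sorting 8 elements. It uses
Java's own Arrays.sort rather than the C `qsort`, so the exact count
will differ from what a libc does; the class and method names are made
up for this example and are not part of the benchmark.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

final class UpcallCount {
    // Sorts a copy of the input and returns how many times the comparator ran.
    static int countComparisons(Integer[] input) {
        AtomicInteger calls = new AtomicInteger();
        Integer[] copy = input.clone();
        Arrays.sort(copy, (a, b) -> {
            calls.incrementAndGet();   // stands in for one native-to-Java upcall
            return Integer.compare(a, b);
        });
        return calls.get();
    }

    public static void main(String[] args) {
        Integer[] data = {5, 3, 8, 1, 7, 2, 6, 4};   // N = 8, as in the benchmark
        System.out.println("comparator calls: " + countComparisons(data));
    }
}
```

With `qsort` called through JNI, every one of those comparator calls
would cross the native/Java boundary, which is why this test is so much
slower than the other two.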
Now let's examine the second batch of results; these call `getpid`,
`exp` and `qsort` using Panama. The numbers here are considerably lower
than the JNI ones for all three benchmarks - although the first two
seem to be the most problematic. To explain these results we need to
peek under the hood. Panama implements foreign calls through a so-called
'universal adapter' which, given a calling scheme and a bunch of
arguments (machine words), shuffles these arguments into the right
registers/stack slots and then jumps to the target native function -
after which another round of adaptation must be performed (e.g. to
recover the return value from the right register/memory location).
Needless to say, all this generality comes at a cost - some of the cost
is in Java - e.g. all arguments have to be packaged up into a long array
(although this component doesn't seem to show up much in the generated
JVM compiled code). A lot of the cost is in the adapter logic itself -
which has to look at the 'call recipe' and move arguments around
accordingly - more specifically, in order to call the native function,
the adapter creates a bunch of helper C++ objects and structs which
model the CPU state (e.g. in the ShuffleDowncallContext struct, we find
a field for each register to be modeled in the target architecture). The
adapter has to first move the values coming from the Java world (stored
in the aforementioned long array) into the right context fields (and it
needs to do so by looking at the recipe, which involves iteration over
the recipe elements). After that's done, we can jump into the assembly
stub that does the native call - this stub will take as input one of
those ShuffleDowncallContext structures and will load the corresponding
registers/create the necessary stack slots ahead of the call.
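The recipe-driven shuffling described above can be sketched in plain
Java as follows. This is only a simulation under stated assumptions: the
`Slot` enum, the `Context` class and the register counts are
hypothetical stand-ins, not the actual HotSpot data structures, and no
native call is made.

```java
import java.util.ArrayList;
import java.util.List;

final class UniversalAdapterSketch {
    // Where a given argument word must end up (hypothetical recipe element).
    enum Slot { INT_REG, FP_REG, STACK }

    // Hypothetical stand-in for a context struct with one field per register.
    static final class Context {
        final long[] intRegs = new long[6];   // integer argument registers
        final long[] fpRegs = new long[8];    // FP argument registers (raw bits)
        final List<Long> stack = new ArrayList<>();
        int nextInt, nextFp;
    }

    // Walks the recipe at call time and moves each word into its slot.
    static Context shuffle(Slot[] recipe, long[] args) {
        if (recipe.length != args.length) {
            throw new IllegalArgumentException("recipe/argument length mismatch");
        }
        Context ctx = new Context();
        for (int i = 0; i < recipe.length; i++) {
            switch (recipe[i]) {
                case INT_REG: ctx.intRegs[ctx.nextInt++] = args[i]; break;
                case FP_REG:  ctx.fpRegs[ctx.nextFp++] = args[i]; break;
                case STACK:   ctx.stack.add(args[i]); break;
            }
        }
        return ctx;
    }

    public static void main(String[] unused) {
        // exp(2.0): a single double, packaged as raw bits in the long array
        long[] args = { Double.doubleToRawLongBits(2.0) };
        Context ctx = shuffle(new Slot[] { Slot.FP_REG }, args);
        System.out.println(Double.longBitsToDouble(ctx.fpRegs[0]));
    }
}
```

The point of the sketch is that the loop over the recipe runs on every
single call, which is exactly the per-call work a specialized stub would
avoid.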
As you can see, there's quite a lot of action going on here, and this
explains the benchmark numbers; of course, if you are calling a native
function that does a lot of computation, this adaptation cost will wash
out - but for relatively quick calls such as 'getpid' and 'exp' the
latency dominates the picture.
Digression: the callback logic suffers pretty much from the same issues,
albeit in a reversed order - this time it's the Java code which receives
a 'snapshot' of the register values from a generated assembly adapter;
the Java code can then read such values (using the Pointer API), turn
them into Java objects, call the target Java method and store the
results (after another conversion) in the right location of the
snapshot. The assembly adapter will then pick up the values set onto the
snapshot by the Java code, store them into the corresponding registers
and return control to the native caller. In the remainder of this email
we will not discuss callbacks in detail - we will just posit that for
any optimization technique that can be defined, there exists a _dual_
strategy that works with callbacks.
How can we make latency-sensitive native calls go faster? Well, one obvious way
would be to optimize the universal adapter so that we get a specialized
assembly stub for each code shape. If we do that, we can move pretty
much all of the computation described above from execution time to the
stub generation time, so that, by the time we have to call the native
function, we just have to populate the right registers (the specialized
stub knows where to find them) and jump. While this sounds like a good
approach, it feels like there's also a role for the JIT somewhere in
there - after all, the JVM knows which calls are hot and in need of
optimization, so perhaps this specialization process (some or all of it)
could happen dynamically. And this is indeed an approach we'd like to
aim for in the long run.
Now, a few years ago, Vlad put together a patch which now lives in the
'linkToNative' branch [6, 7] - the goal of this patch is to implement
the approach described above: generate a specialized assembly adapter
for a given native signature, and then leverage the JIT to optimize it
away, turning the adapter into a bare, direct, native method call. As
you can see from the third batch of benchmarks, if we tweak Panama to
use the linkToNative machinery, the speedup is really impressive, and
we end up being much faster than JNI (up to 4x for `getpid`).
Unfortunately, the technology in the linkToNative branch is not ready
for prime time (yet) - first, it doesn't cover some useful cases (e.g.
varargs, multiple returns via registers, arguments passed in memory).
That is, the logic assumes there's a 1-1 mapping between a Java
signature and the native function to be called - and that the arguments
passed from Java will either be longs or doubles. While we can
work around this limitation and define the necessary marshalling logic in
Java (as I have done to run this benchmark), some of the limitations
(multiple returns, structs passed by value which are too big) simply
cannot be worked around. But that's fine, we can still have a fast path
for those calls which have certain characteristics and a slow path
(through the universal adapter) for all the other calls.
But there's a second and more serious issue lurking: as you can see in
the benchmark, I was not able to get the qsort benchmark running when
using the linkToNative backend. The reason is that the linkToNative code
is still pretty raw, and it doesn't fully adhere to the JVM internal
conventions - e.g. there are missing thread state transitions which, in
the case of upcalls into Java, create issues when it comes to garbage
collection, as the GC cannot parse the native stack in the correct way.
This means that, while there's a clear shining path ahead of us, it is
simply too early to just use the linkToNative backend from Panama. For
this reason, I've been looking into some kind of stopgap solution -
another way of optimizing native calls (and upcalls into Java) that
doesn't require too much VM magic. Now, a crucial observation is that,
in many native calls, there is indeed a 1-1 mapping between Java
arguments and native arguments (and back, for return values). That is,
we can think of calling a native function as a process that takes a
bunch of Java arguments, turns them into native arguments (either doubles
or longs), calls the native function and then turns the result back into
a Java value.
The mapping between Java arguments and C values is quite simple:
* primitives: either long or double, depending on whether they describe
an integral value or a floating point one.
* pointers: they convert to a long
* callbacks: they also convert to a long
* structs: they are recursively decomposed into fields and each field is
marshalled separately (assuming the struct is not too big, in which case
it is passed in memory)
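The mapping above can be sketched as a small classifier that turns a
list of parameter descriptions into a string of J (long) and D (double)
carriers. The `Address` and `Struct` classes are hypothetical
placeholders for pointers/callbacks and struct layouts; they are not the
Panama API, and the "struct too big, pass in memory" case is
deliberately left out.

```java
final class CarrierSketch {
    // Hypothetical marker for pointers and callbacks (both become a long).
    static final class Address { }

    // Hypothetical struct description: just an ordered list of field layouts.
    static final class Struct {
        final Object[] fields;
        Struct(Object... fields) { this.fields = fields; }
    }

    // Returns one carrier letter per machine word: 'J' = long, 'D' = double.
    static String shapeOf(Object... params) {
        StringBuilder out = new StringBuilder();
        for (Object p : params) append(p, out);
        return out.toString();
    }

    private static void append(Object layout, StringBuilder out) {
        if (layout instanceof Struct) {
            // structs are flattened recursively, field by field
            for (Object f : ((Struct) layout).fields) append(f, out);
        } else if (layout == double.class || layout == float.class) {
            out.append('D');
        } else if (layout == Address.class || layout == long.class
                || layout == int.class || layout == short.class
                || layout == byte.class || layout == char.class) {
            out.append('J');
        } else {
            throw new IllegalArgumentException("unsupported layout: " + layout);
        }
    }

    public static void main(String[] args) {
        // a pointer plus a double, e.g. a function taking (Pointer<Foo>, double)
        System.out.println(shapeOf(Address.class, double.class));
    }
}
```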
So, in principle, we could define a bunch of native entry points in the
VM, one per shape, which take a bunch of longs and doubles and call an
underlying function with those arguments. For instance, let's consider
the case of a native function which is modelled in Java as:
int m(Pointer<Foo>, double)
To call this native function we have to first turn the Java arguments
into a (long, double) pair. Then we need to call a native adapter that
looks like the following:
    jlong NI_invokeNative_J_JD(JNIEnv *env, jobject _unused, jlong addr,
                               jlong arg0, jdouble arg1) {
        return ((jlong (*)(jlong, jdouble))addr)(arg0, arg1);
    }
And this will take care of calling the native function and returning the
value back. This is, admittedly, a very simple solution; of course there
are limitations: we have to define a bunch of specialized native entry
points (and Java entry points, for callbacks). But here we can play a
trick: most modern ABIs pass arguments in registers; for instance the
System V ABI [5] uses up to 6 (!!) integer registers and 8 (!!) SSE (XMM)
registers for FP values - this gives us a total of 14 registers
available for argument passing, which covers quite a lot of cases. Now,
if we have a call where _all_ arguments are passed in registers, then
the order in which these arguments are declared in the adapter doesn't
matter! That is, since FP values will always be passed in different
registers from integral values, we can just define entry points which
look like these:
invokeNative_V_DDDDD
invokeNative_V_JDDDD
invokeNative_V_JJDDD
invokeNative_V_JJJDD
invokeNative_V_JJJJD
invokeNative_V_JJJJJ
That is, for a given arity (5 in this case), we can just put all long
arguments in front, and the double arguments after that. That is, we
don't need to generate all possible permutations of J/D in all positions
- as the adapter will always do the same thing (read: load from same
registers) for all equivalent combinations. This keeps the number of
entry points in check - and it also poses some challenges to the Java
logic in charge of marshalling/unmarshalling, as there's an extra
permutation step involved (although that is not something super-hard to
address).
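The extra permutation step can be sketched as a stable partition of the
argument indices: longs first, doubles after, each group keeping its
declaration order. The class below is an illustrative assumption, not
the code used in the benchmark; only the `invokeNative_V_...` naming
pattern is taken from the list above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DirectShape {
    // Returns argument indices reordered so that all 'J' come first and all
    // 'D' come last, preserving the relative order within each group.
    static int[] permutation(String carriers) {
        List<Integer> longs = new ArrayList<>();
        List<Integer> doubles = new ArrayList<>();
        for (int i = 0; i < carriers.length(); i++) {
            char c = carriers.charAt(i);
            if (c == 'J') longs.add(i);
            else if (c == 'D') doubles.add(i);
            else throw new IllegalArgumentException("unknown carrier: " + c);
        }
        int[] order = new int[carriers.length()];
        int k = 0;
        for (int i : longs) order[k++] = i;
        for (int i : doubles) order[k++] = i;
        return order;
    }

    // Builds the canonical entry point name for a return carrier and a
    // declaration-order carrier string such as "DJDJJ".
    static String entryPoint(char ret, String carriers) {
        StringBuilder sb = new StringBuilder("invokeNative_").append(ret).append('_');
        for (int i : permutation(carriers)) sb.append(carriers.charAt(i));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(entryPoint('V', "DJDJJ"));                // invokeNative_V_JJJDD
        System.out.println(Arrays.toString(permutation("DJDJJ")));   // [1, 3, 4, 0, 2]
    }
}
```

With this canonical ordering, an arity of n needs only n + 1 entry
points per return type (one for each possible number of longs) instead
of the 2^n that all J/D permutations would require.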
You can see the performance numbers associated with this invocation
scheme (which I've dubbed 'direct') in the 4th batch of the benchmark
results. These numbers are on par with (and slightly better than) JNI in
all three cases considered which is, I think, a very positive result,
given that to write these benchmarks I did not have to write a single
line of JNI code. In other words, this optimization gives you the same
speed as JNI, with improved ease of use (**).
Now, since the 'direct' optimization builds on top of the VM native call
adapters, this approach is significantly more robust than linkToNative
and I have not run into any weird VM crashes when playing with it. The
downside is that, for obvious reasons, this approach cannot get
much faster than JNI - that is, it cannot get close to the numbers
obtained with the linkToNative backend, which features much deeper
optimizations. But I think that, despite its limitations, it's still a
good opportunistic improvement that is worth pursuing in the short term
(while we sort out the linkToNative story). For this reason, I will soon
be submitting a review which incorporates the changes for the 'direct'
invocation scheme.
Cheers
Maurizio
[1] - http://cr.openjdk.java.net/~mcimadamore/panama/foreign-jmh.txt
[2] - https://github.com/jnr/jnr-ffi
[3] - https://github.com/jnr/jffi
[4] - https://sourceware.org/libffi/
[5] - https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf
[6] - http://cr.openjdk.java.net/~jrose/panama/native-call-primitive.html
[7] - http://hg.openjdk.java.net/panama/dev/shortlog/b9ebb1bb8354
(**) the benchmark also contains a 5th row in which I repeated the same
tests, this time using JNR [2]. JNR is built on top of libjffi [3], a
JNI library in turn built on top of the popular libffi [4]. I wanted to
have some numbers about JNR because that's another solution that allows
for better ease of use, taking care of marshalling Java values into C
and back; since the goals of JNR are similar in spirit to some of the
goals of the Panama/foreign work, I thought it would be worth having a
comparison of these approaches. For the record, I think the JNR numbers
are very respectable given that JNR had to do all the hard work outside
of the JDK!