[foreign] some JMH benchmarks
Samuel Audet
samuel.audet at gmail.com
Mon Sep 17 10:00:21 UTC 2018
Thanks for the figures Maurizio! It's finally good to be speaking in
numbers. :)
However, you're not providing a lot of details about how you actually
ran the experiments. So I've decided to run a JMH benchmark on what we
get by default with JavaCPP and this declaration:
@Platform(include = "math.h")
public class MyBenchmark {
    static { Loader.load(); }

    @NoException
    public static native double exp(double x);

    @State(Scope.Thread)
    public static class MyState {
        double x;

        @Setup(Level.Iteration)
        public void setupMethod() {
            x = Math.random();
        }
    }

    @Benchmark
    public void testMethod(MyState s, Blackhole bh) {
        bh.consume(exp(s.x));
    }
}
The relevant portion of generated JNI looks like this:
JNIEXPORT jdouble JNICALL Java_org_sample_MyBenchmark_exp(JNIEnv* env, jclass cls, jdouble arg0) {
    jdouble rarg = 0;
    double rval = exp(arg0);
    rarg = (jdouble)rval;
    return rarg;
}
And with access to just 2 virtual cores of an Intel(R) Xeon(R) CPU
E5-2673 v4 @ 2.30GHz and 8 GB of RAM on the cloud (so probably slower
than your E5-2665 @ 2.40GHz) running Ubuntu 14.04 with GCC 4.9 and
OpenJDK 8, I get these numbers:
Benchmark                Mode  Cnt         Score        Error  Units
MyBenchmark.testMethod  thrpt   25  37183556.094 ± 460795.746  ops/s
I'm not sure how that compares with your numbers exactly, but it does
seem to me that what you get for JNI is a bit low. If you could provide
more details about how to reproduce your results, that would be great.
Samuel
On 09/14/2018 10:19 PM, Maurizio Cimadamore wrote:
> Hi,
> over the last few weeks I've been busy playing with Panama and assessing
> performances with JMH. For those just interested in raw numbers, the
> results of my explorations can be found here [1]. But as with all
> benchmarks, I think it's better to spend a few words understanding what
> these numbers actually _mean_.
>
> To evaluate the performances of Panama I have first created a baseline
> using JNI - more specifically I wanted to assess performances of three
> calls (all part of the C std library), namely `getpid`, `exp` and `qsort`.
>
> The first example is the de facto benchmark for FFIs - since it does
> relatively little computation, it is a good test to measure the
> 'latency' of the FFI approach (i.e. how long it takes to go to
> native). The second example is also relatively simple, but this time
> the function takes a double argument. The third test is akin to an FFI
> torture test, since not only does it pass substantially more arguments
> (4), but one of these arguments is also a callback - a pointer to a
> function that is used to sort the contents of the input array.
>
> As expected, the first batch of JNI results confirms our expectations:
> `getpid` is the fastest, followed by `exp`, and then followed by
> `qsort`. Note that qsort is not even close in terms of raw numbers to
> the other two tests - that's because, to sort the array, we need to do
> on the order of (N * log N) upcalls into Java. In the benchmark, N = 8
> and we do the upcalls using the JNI function JNIEnv::CallIntMethod.
>
> Now let's examine the second batch of results; these call `getpid`,
> `exp` and `qsort` using Panama. The numbers here are considerably lower
> than the JNI ones for all three benchmarks - although the first two
> seem to be the most problematic. To explain these results we need to
> peek under the hood. Panama implements foreign calls through a so-called
> 'universal adapter' which, given a calling scheme and a bunch of
> arguments (machine words) shuffles these arguments in the right
> registers/stack slot and then jumps to the target native function -
> after which another round of adaptation must be performed (e.g. to
> recover the return value from the right register/memory location).
>
> Needless to say, all this generality comes at a cost - some of the cost
> is in Java - e.g. all arguments have to be packaged up into a long array
> (although this component doesn't seem to show up much in the generated
> JVM compiled code). A lot of the cost is in the adapter logic itself -
> which has to look at the 'call recipe' and move arguments around
> accordingly - more specifically, in order to call the native function,
> the adapter creates a bunch of helper C++ objects and structs which
> model the CPU state (e.g. in the ShuffleDowncallContext struct, we find
> a field for each register to be modeled in the target architecture). The
> adapter has to first move the values coming from the Java world (stored
> in the aforementioned long array) into the right context fields (and it
> needs to do so by looking at the recipe, which involves iteration over
> the recipe elements). After that's done, we can jump into the assembly
> stub that does the native call - this stub takes as input one of
> those ShuffleDowncallContext structures and loads the corresponding
> registers/creates the necessary stack slots ahead of the call.
>
> As you can see, there's quite a lot of action going on here, and this
> explains the benchmark numbers; of course, if you are calling a native
> function that does a lot of computation, this adaptation cost will wash
> out - but for relatively quick calls such as 'getpid' and 'exp' the
> latency dominates the picture.
>
> Digression: the callback logic suffers pretty much from the same issues,
> albeit in a reversed order - this time it's the Java code which receives
> a 'snapshot' of the register values from a generated assembly adapter;
> the Java code can then read such values (using the Pointer API), turn
> them into Java objects, call the target Java method and store the
> results (after another conversion) in the right location of the
> snapshot. The assembly adapter will then pick up the values set onto the
> snapshot by the Java code, store them into the corresponding registers
> and return control to the native callee. In the remainder of this email
> we will not discuss callbacks in detail - we will just posit that for
> any optimization technique that can be defined, there exists a _dual_
> strategy that works with callbacks.
>
> How can we make sensible native calls go faster? Well, one obvious way
> would be to optimize the universal adapter so that we get a specialized
> assembly stub for each code shape. If we do that, we can move pretty
> much all of the computation described above from execution time to the
> stub generation time, so that, by the time we have to call the native
> function, we just have to populate the right registers (the specialized
> stub knows where to find them) and jump. While this sounds like a good
> approach, it feels like there's also a role for the JIT somewhere in
> there - after all, the JVM knows which calls are hot and in need of
> optimization, so perhaps this specialization process (some or all of it)
> could happen dynamically. And this is indeed an approach we'd like to
> aim for in the long run.
>
> Now, a few years ago, Vlad put together a patch which now lives in the
> 'linkToNative' branch [6, 7] - the goal of this patch is to implement
> the approach described above: generate a specialized assembly adapter
> for a given native signature, and then leverage the JIT to optimize it
> away, turning the adapter into a bare, direct, native method call. As
> you can see from the third batch of benchmarks, if we tweak Panama to
> use the linkToNative machinery, the speed up is really impressive, and
> we end up being much faster than JNI (up to 4x for getpid).
>
> Unfortunately, the technology in the linkToNative branch is not ready
> for prime time (yet) - first, it doesn't cover some useful cases (e.g.
> varargs, multiple returns via registers, arguments passed in memory).
> That is, the logic assumes there's a 1-1 mapping between a Java
> signature and the native function to be called - and that the arguments
> passed from Java will either be longs or doubles. While we can
> work around this limitation and define the necessary marshalling logic in
> Java (as I have done to run this benchmark), some of the limitations
> (multiple returns, structs passed by value which are too big) cannot
> simply be worked around. But that's fine, we can still have a fast path
> for those calls which have certain characteristics and a slow path
> (through the universal adapter) for all the other calls.
>
> But there's a second and more serious issue lurking: as you can see in
> the benchmark, I was not able to get the qsort benchmark running when
> using the linkToNative backend. The reason is that the linkToNative code
> is still pretty raw, and it doesn't fully adhere to the JVM internal
> conventions - e.g. there are missing thread state transitions which, in
> the case of upcalls into Java, create issues when it comes to garbage
> collection, as the GC cannot parse the native stack in the correct way.
>
> This means that, while there's a clear shining path ahead of us, it is
> simply too early to just use the linkToNative backend from Panama. For
> this reason, I've been looking into some kind of stopgap solution -
> another way of optimizing native calls (and upcalls into Java) that
> doesn't require too much VM magic. Now, a crucial observation is that,
> in many native calls, there is indeed a 1-1 mapping between Java
> arguments and native arguments (and back, for return values). That is,
> we can think of calling a native function as a process that takes a
> bunch of Java arguments, turns them into native arguments (either
> longs or doubles), calls the native function, and then turns the
> result back into a Java value.
>
> The mapping between Java arguments and C values is quite simple:
>
> * primitives: either long or double, depending on whether they describe
> an integral value or a floating point one.
> * pointers: they convert to a long
> * callbacks: they also convert to a long
> * structs: they are recursively decomposed into fields and each field
> is marshalled separately (unless the struct is too big, in which case
> it is passed in memory)
>
> So, in principle, we could define a bunch of native entry points in the
> VM, one per shape, which take a bunch of longs and doubles and call an
> underlying function with those arguments. For instance, let's consider
> the case of a native function which is modelled in Java as:
>
> int m(Pointer<Foo>, double)
>
> To call this native function we have to first turn the Java arguments
> into a (long, double) pair. Then we need to call a native adapter that
> looks like the following:
>
> jlong NI_invokeNative_J_JD(JNIEnv *env, jobject _unused, jlong addr,
>                            jlong arg0, jdouble arg1) {
>     return ((jlong (*)(jlong, jdouble))addr)(arg0, arg1);
> }
>
> And this will take care of calling the native function and returning the
> value back. This is, admittedly, a very simple solution; of course there
> are limitations: we have to define a bunch of specialized native entry
> points (and Java entry points, for callbacks). But here we can play a
> trick: most modern ABIs pass arguments in registers; for instance, the
> System V x86-64 ABI [5] uses up to 6 integer registers and 8 SSE
> registers for FP values - a total of 14 registers available for
> argument passing, which covers quite a lot of cases. Now, if we have a
> call where _all_ arguments are passed in registers, then the order in
> which these arguments are declared in the adapter doesn't matter! That
> is, since FP values will always be passed in different registers from
> integral values, we can just define entry points which look like these:
>
> invokeNative_V_DDDDD
> invokeNative_V_JDDDD
> invokeNative_V_JJDDD
> invokeNative_V_JJJDD
> invokeNative_V_JJJJD
> invokeNative_V_JJJJJ
>
> That is, for a given arity (5 in this case), we can just put all the
> long arguments in front, and the double arguments after them. We don't
> need to generate all possible permutations of J/D in all positions,
> since the adapter will always do the same thing (read: load from the
> same registers) for all equivalent combinations. This keeps the number
> of entry points in check - though it also poses some challenges for the
> Java logic in charge of marshalling/unmarshalling, as there's an extra
> permutation step involved (although that is not something super-hard to
> address).
>
> You can see the performance numbers associated with this invocation
> scheme (which I've dubbed 'direct') in the 4th batch of the benchmark
> results. These numbers are on par with (and slightly better than) JNI
> in all three cases considered, which is, I think, a very positive
> result, given that to write these benchmarks I did not have to write a
> single line of JNI code. In other words, this optimization gives you
> the same speed as JNI, with improved ease of use (**).
>
> Now, since the 'direct' optimization builds on top of the VM native call
> adapters, this approach is significantly more robust than linkToNative
> and I have not run into any weird VM crashes when playing with it. The
> downside is that, for obvious reasons, this approach cannot get
> much faster than JNI - that is, it cannot get close to the numbers
> obtained with the linkToNative backend, which features much deeper
> optimizations. But I think that, despite its limitations, it's still a
> good opportunistic improvement that is worth pursuing in the short term
> (while we sort out the linkToNative story). For this reason, I will soon
> be submitting a review which incorporates the changes for the 'direct'
> invocation schemes.
>
> Cheers
> Maurizio
>
> [1] - http://cr.openjdk.java.net/~mcimadamore/panama/foreign-jmh.txt
> [2] - https://github.com/jnr/jnr-ffi
> [3] - https://github.com/jnr/jffi
> [4] - https://sourceware.org/libffi/
> [5] -
> https://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf
>
> [6] - http://cr.openjdk.java.net/~jrose/panama/native-call-primitive.html
> [7] - http://hg.openjdk.java.net/panama/dev/shortlog/b9ebb1bb8354
>
> (**) the benchmark also contains a 5th row in which I repeated same
> tests, this time using JNR [2]. JNR is built on top of libjffi [3], a
> JNI library in turn built on top of the popular libffi [4]. I wanted to
> have some numbers about JNR because that's another solution that allows
> for better ease of use, taking care of marshalling Java values into C
> and back; since the goals of JNR are similar in spirit to some of the
> goals of the Panama/foreign work, I thought it would be worth having a
> comparison of these approaches. For the record, I think the JNR numbers
> are very respectable given that JNR had to do all the hard work outside
> of the JDK!
>
>
More information about the panama-dev
mailing list