[foreign] some JMH benchmarks

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Thu Oct 11 07:59:47 UTC 2018


Thanks for the numbers - we are aware of the performance difference
between the pure native callback case (your nativeSortBench) and the JNI
callback one (callbackSortBench). The cost you are seeing there is
multiplied, as the compare function is repeatedly called by the qsort
logic (not just once, as in getpid). E.g. if qsort calls the compare
function 5 times, the cost you see is ~5x the cost of a single
upcall.
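
To make the amplification concrete, here is a minimal stand-alone C
sketch (not taken from your benchmark - the names are made up) that just
counts how many times qsort invokes its comparator; each of those
invocations would be one Java upcall in the callbackSortBench case:

#include <stdio.h>
#include <stdlib.h>

/* Each invocation counted here corresponds to one upcall. */
static long calls = 0;

static int counting_compare(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    calls++;
    return (x > y) - (x < y);
}

int main(void) {
    int data[64];
    for (int i = 0; i < 64; i++) {
        data[i] = (i * 31) % 64;   /* deterministic pseudo-shuffle */
    }
    qsort(data, 64, sizeof(int), counting_compare);
    /* prints a few hundred invocations for just 64 elements */
    printf("comparator invocations: %ld\n", calls);
    return 0;
}

For 64 elements this already reports a few hundred invocations, so
whatever a single upcall costs gets paid a few hundred times per sort.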

On top of that, the cost of doing an upcall is higher than that of
doing a downcall; as I was playing with the hotspot code the other day,
I noted that Java calling conventions are arranged in such a way that
making a JNI call can generally be achieved with no argument shuffling
(e.g. java register arg # = c register arg # + 1). That is, the downcall
JNI bridge can simply insert the JNIEnv parameter in the first C
register and jump - which keeps the adaptation overhead very low,
even in the absence of C2 optimizations.
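
To illustrate what that mapping looks like at the source level, here is
a hedged sketch (the class and method names are hypothetical, not the
ones in the benchmark): for an instance method, the JNI entry point
receives the same Java arguments, in the same order, just preceded by
JNIEnv*, so the stub mostly has to materialize that one extra leading
argument and transfer control:

#include <jni.h>
#include <math.h>

/* Java side (hypothetical):   native double exp(double x);
 * The Java arguments (this, x) arrive here as (self, x), shifted by one
 * position to make room for JNIEnv* - the "+ 1" mapping mentioned above. */
JNIEXPORT jdouble JNICALL
Java_NativeBenchmark_exp(JNIEnv *env, jobject self, jdouble x) {
    (void)env;
    (void)self;
    return exp(x);
}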

This choice of course pays off for downcalls, but when you walk through the
details of implementing effective upcall support, it's easy to see how
that's working against you: when doing an upcall you have to shift all
registers up by one position before being able to call the Java code,
which makes for more cumbersome shuffling code.
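
For comparison, this is roughly what a hand-written JNI upcall bridge
for the qsort comparator has to do - again only a hedged sketch, with
hypothetical cached globals, and assuming the sort runs on the same
thread whose JNIEnv was cached. The native caller hands over the two
elements in the first argument positions, but re-entering Java through
JNI needs env, the callback object and the method ID in front of them,
so every incoming argument has to be moved:

#include <jni.h>
#include <stdint.h>

/* Hypothetical state a hand-written bridge might cache after setup. */
static JNIEnv   *cachedEnv;
static jobject   cachedCallback;   /* object exposing int compare(long, long) */
static jmethodID cachedCompareID;

/* Signature expected by qsort: the elements arrive as the first two
 * C arguments, but end up as the 4th and 5th arguments of the JNI call. */
static int compare_bridge(const void *a, const void *b) {
    return (*cachedEnv)->CallIntMethod(cachedEnv, cachedCallback, cachedCompareID,
                                       (jlong)(intptr_t)a, (jlong)(intptr_t)b);
}

Even ignoring the bookkeeping, none of the incoming registers can stay
where it is, which is the asymmetry described above.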

I think fixing mismatches like these would go a long way toward making
performance more symmetric between upcalls and downcalls, but of course
that's mostly speculation at this point.

Maurizio


On 11/10/18 02:18, Samuel Audet wrote:
> Hi, Maurizio,
>
> To get the ball rolling, I've updated my benchmark code with sorting 
> examples:
> https://gist.github.com/saudet/1bf14a000e64c245675cf5d4e9ad6e69
>
> With the usual 2-core VM, Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz, 
> Ubuntu 14.04, GCC 4.9, OpenJDK 8, I obtained the following:
> Benchmark                               Mode  Cnt         Score         Error  Units
> NativeBenchmark.expBenchmark           thrpt    5  37684721.600 ± 1082945.216  ops/s
> NativeBenchmark.getpidBenchmark        thrpt    5  97760579.697 ± 3559212.842  ops/s
> NativeBenchmark.callbackSortBenchmark  thrpt    5    362762.157 ±   11992.584  ops/s
> NativeBenchmark.nativeSortBenchmark    thrpt    5   7218834.171 ±  461245.346  ops/s
> NativeBenchmark.inlineSortBenchmark    thrpt    5  17211735.752 ± 1032386.799  ops/s
>
> That seems to be consistent with the results you got with JNI on JDK 8:
>     http://cr.openjdk.java.net/~mcimadamore/panama/foreign-jmh.txt
>     https://bugs.openjdk.java.net/browse/JDK-8210975
> My callbackSortBenchmark() seems a bit slower, though. JavaCPP 
> doesn't currently support static methods for callback functions, which 
> wouldn't be a problem to add, but for now that's probably where 
> the small ~10% difference comes from.
>
> Anyway, what's important is that nativeSortBenchmark() is ~20 times 
> faster than callbackSortBenchmark(), and inlineSortBenchmark() is ~47 
> times faster.
>
> In theory, how much faster can link2native make 
> callbackSortBenchmark()? Given the results you provided for getpid(), 
> I'm guessing maybe 3 or 4 times faster, which sounds good, but it's 
> still a far cry from nativeSortBenchmark() and especially 
> inlineSortBenchmark(). So, I get the impression that it is not 
> possible to make it useful for this kind of use case. I would very 
> much like to be proven wrong though.
>
> Samuel
>
> On 09/21/2018 09:51 AM, Samuel Audet wrote:
>> Sounds good, thanks for testing this and for filing the bug report!
>>
>> Samuel
>>
>> On 09/21/2018 03:14 AM, Maurizio Cimadamore wrote:
>>> Sorry for the delay in getting back to you. There's indeed something 
>>> fishy going on here, and I have spotted a regression in JNI performance 
>>> since JDK 11. This could be caused by the update in the compiler toolchain 
>>> introduced in the same version, but I have filed an issue for our 
>>> hotspot team to investigate:
>>>
>>> https://bugs.openjdk.java.net/browse/JDK-8210975
>>>
>>> In the context of this discussion, it's likely that the regression 
>>> is affecting the numbers of both Panama (which is built on top of 
>>> JNI at the moment) and the JNI benchmarks.
>>>
>>> Thanks
>>> Maurizio
>>>
>>>
>>> On 19/09/18 01:13, Samuel Audet wrote:
>>>> Thanks! You haven't mentioned the version of the JDK you're using 
>>>> though. I'm starting to get the impression that JNI in newer 
>>>> versions of OpenJDK will be slower... ?
>>>>
>>>> On 09/18/2018 07:03 PM, Maurizio Cimadamore wrote:
>>>>> These are the numbers I get
>>>>>
>>>>> Benchmark                         Mode  Cnt         Score       Error  Units
>>>>> NativeBenchmark.expBenchmark     thrpt    5  30542590.094 ± 44126.434  ops/s
>>>>> NativeBenchmark.getpidBenchmark  thrpt    5  61764677.092 ± 21102.236  ops/s
>>>>>
>>>>> They are in the same ballpark, but exp() is a bit faster; btw, I 
>>>>> tried to repeat my benchmark with JNI exp() _and_ O3 and I got 
>>>>> very similar numbers (yesterday I did a very quick test and there 
>>>>> was probably some other job running on the machine bringing 
>>>>> down the figures a bit).
>>>>>
>>>>> But overall, the results in your bench seem to match what I got: 
>>>>> exp is faster, pid is slower, and the difference is mostly caused by 
>>>>> O3. If no O3 is used, then the numbers should match the ones I 
>>>>> posted earlier (and getpid should be a bit faster).
>>>>>
>>>>> Maurizio
>>>>>
>>>>>
>>>>> On 18/09/18 05:48, Samuel Audet wrote:
>>>>>> Anyway, I've put online an updated version of my benchmark files 
>>>>>> here:
>>>>>> https://gist.github.com/saudet/1bf14a000e64c245675cf5d4e9ad6e69
>>>>>> Just run "git clone" on the URL and run "mvn package" on the 
>>>>>> pom.xml.
>>>>>>
>>>>>> With the 2 virtual cores of an Intel(R) Xeon(R) CPU E5-2673 v4 @ 
>>>>>> 2.30GHz running Ubuntu 14.04 on the cloud with GCC 4.9 and 
>>>>>> OpenJDK 8, I get these numbers:
>>>>>>
>>>>>> Benchmark                         Mode  Cnt          Score        Error  Units
>>>>>> NativeBenchmark.expBenchmark     thrpt   25   37460540.440 ±  393299.974  ops/s
>>>>>> NativeBenchmark.getpidBenchmark  thrpt   25  100323188.451 ± 1254197.449  ops/s
>>>>>>
>>>>>> While on my laptop, an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz 
>>>>>> running Fedora 27, GCC 7.3, and OpenJDK 9, I get the following:
>>>>>>
>>>>>> Benchmark                         Mode  Cnt         Score       Error  Units
>>>>>> NativeBenchmark.expBenchmark     thrpt   25  50047147.099 ± 924366.937  ops/s
>>>>>> NativeBenchmark.getpidBenchmark  thrpt   25   4825508.193 ±  21662.633  ops/s
>>>>>>
>>>>>> Now, it looks like getpid() is really slow on Fedora 27 for some 
>>>>>> reason, but as Linus puts it, we should not be using that for 
>>>>>> benchmarking:
>>>>>> https://yarchive.net/comp/linux/getpid_caching.html
>>>>>>
>>>>>> What do you get on your machines?
>>>>>>
>>>>>> Samuel
>>>>>>
>>>>>>
>>>>>> On 09/18/2018 12:58 AM, Maurizio Cimadamore wrote:
>>>>>>> For the record, here's what I get for all three benchmarks 
>>>>>>> if I compile the JNI code with -O3:
>>>>>>>
>>>>>>> Benchmark                          Mode  Cnt         Score         Error  Units
>>>>>>> PanamaBenchmark.testJNIExp        thrpt    5  28575269.294 ± 1907726.710  ops/s
>>>>>>> PanamaBenchmark.testJNIJavaQsort  thrpt    5    372148.433 ±   27178.529  ops/s
>>>>>>> PanamaBenchmark.testJNIPid        thrpt    5  59240069.011 ±  403881.697  ops/s
>>>>>>>
>>>>>>> The first and second benchmarks get faster and very close to the 
>>>>>>> 'direct' optimization numbers in [1]. Surprisingly, the last 
>>>>>>> benchmark (getpid) is quite a bit slower. I've been able to reproduce 
>>>>>>> this across multiple runs; for that benchmark, omitting O3 seems to 
>>>>>>> achieve the best results, not sure why. It starts off faster 
>>>>>>> (in the first couple of warmup iterations), but then it 
>>>>>>> goes slower in all the other runs - presumably it interacts 
>>>>>>> badly with the C2-generated code. For instance, this is a run 
>>>>>>> with O3 enabled:
>>>>>>>
>>>>>>> # Run progress: 66.67% complete, ETA 00:01:40
>>>>>>> # Fork: 1 of 1
>>>>>>> # Warmup Iteration   1: 65182202.653 ops/s
>>>>>>> # Warmup Iteration   2: 64900639.094 ops/s
>>>>>>> # Warmup Iteration   3: 59314945.437 ops/s <---------------------------------
>>>>>>> # Warmup Iteration   4: 59269007.877 ops/s
>>>>>>> # Warmup Iteration   5: 59239905.163 ops/s
>>>>>>> Iteration   1: 59300748.074 ops/s
>>>>>>> Iteration   2: 59249666.044 ops/s
>>>>>>> Iteration   3: 59268597.051 ops/s
>>>>>>> Iteration   4: 59322074.572 ops/s
>>>>>>> Iteration   5: 59059259.317 ops/s
>>>>>>>
>>>>>>> And this is a run with O3 disabled:
>>>>>>>
>>>>>>> # Run progress: 0.00% complete, ETA 00:01:40
>>>>>>> # Fork: 1 of 1
>>>>>>> # Warmup Iteration   1: 55882128.787 ops/s
>>>>>>> # Warmup Iteration   2: 53102361.751 ops/s
>>>>>>> # Warmup Iteration   3: 66964755.699 ops/s <---------------------------------
>>>>>>> # Warmup Iteration   4: 66414428.355 ops/s
>>>>>>> # Warmup Iteration   5: 65328475.276 ops/s
>>>>>>> Iteration   1: 64229192.993 ops/s
>>>>>>> Iteration   2: 65191719.319 ops/s
>>>>>>> Iteration   3: 65352022.471 ops/s
>>>>>>> Iteration   4: 65152090.426 ops/s
>>>>>>> Iteration   5: 65320545.712 ops/s
>>>>>>>
>>>>>>>
>>>>>>> In both cases, the 3rd warmup iteration sees a performance jump 
>>>>>>> - with O3, the jump is backwards, w/o O3 the jump is forward, 
>>>>>>> which is quite typical for a JMH benchmark as C2 optimizations 
>>>>>>> start to kick in.
>>>>>>>
>>>>>>> For these reasons, I'm reluctant to update my benchmark numbers 
>>>>>>> to reflect the O3 behavior (although I agree that, since the 
>>>>>>> HotSpot code is compiled with that optimization, it would make 
>>>>>>> more sense to use that as a reference).
>>>>>>>
>>>>>>> Maurizio
>>>>>>>
>>>>>>> [1] - 
>>>>>>> http://cr.openjdk.java.net/~mcimadamore/panama/foreign-jmh.txt
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 17/09/18 16:18, Maurizio Cimadamore wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 17/09/18 15:08, Samuel Audet wrote:
>>>>>>>>> Yes, neither the blackhole nor the random number makes any 
>>>>>>>>> difference, but not calling gcc with -O3 does. Running the 
>>>>>>>>> compiler with optimizations on is pretty common, but they are 
>>>>>>>>> not enabled by default.
>>>>>>>> A bit better
>>>>>>>>
>>>>>>>> PanamaBenchmark.testMethod  thrpt    5  28018170.076 ± 8491668.248  ops/s
>>>>>>>>
>>>>>>>> But not much of a difference (I did not expect much, as the 
>>>>>>>> body of the native method is extremely simple).
>>>>>>>>
>>>>>>>> Maurizio 
>>>>>>
>>>>>
>>>>
>>>
>>
>


