[foreign] some JMH benchmarks

Samuel Audet samuel.audet at gmail.com
Thu Oct 11 01:18:19 UTC 2018


Hi, Maurizio,

To get the ball rolling, I've updated my benchmark code with sorting examples:
https://gist.github.com/saudet/1bf14a000e64c245675cf5d4e9ad6e69

With the usual 2-cores VM, Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz, 
Ubuntu 14.04, GCC 4.9, OpenJDK 8, I obtained the following:
Benchmark                               Mode  Cnt         Score         Error  Units
NativeBenchmark.expBenchmark           thrpt    5  37684721.600 ± 1082945.216  ops/s
NativeBenchmark.getpidBenchmark        thrpt    5  97760579.697 ± 3559212.842  ops/s
NativeBenchmark.callbackSortBenchmark  thrpt    5    362762.157 ±   11992.584  ops/s
NativeBenchmark.nativeSortBenchmark    thrpt    5   7218834.171 ±  461245.346  ops/s
NativeBenchmark.inlineSortBenchmark    thrpt    5  17211735.752 ± 1032386.799  ops/s

That seems to be consistent with the results you got with JNI on JDK 8:
     http://cr.openjdk.java.net/~mcimadamore/panama/foreign-jmh.txt
     https://bugs.openjdk.java.net/browse/JDK-8210975
My callbackSortBenchmark() seems a bit slower, though. JavaCPP doesn't 
currently support static methods for callback functions, which wouldn't 
be a problem to support, but for now that's probably where the small 
~10% difference comes from.

Anyway, what's important is that nativeSortBenchmark() is ~20 times 
faster than callbackSortBenchmark(), and inlineSortBenchmark() is ~47 
times faster.
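
For readers who haven't opened the gist, the structural difference between the two sort variants can be sketched in pure Java (no JNI involved; the class and method names below are made up for illustration, they are not the gist's code): the callback variant pays for one comparator invocation per comparison, while the inline variant sorts without any user callback at all.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustration only -- not the gist's code. It mimics, in pure Java,
// the shape of the two sort benchmarks: a sort driven by a per-comparison
// callback vs. a sort with no user-supplied callback at all.
public class SortShape {

    // Callback-driven sort: one comparator invocation per comparison,
    // analogous to qsort() upcalling into Java for every comparison.
    static int[] callbackSort(int[] input) {
        Integer[] boxed = Arrays.stream(input).boxed().toArray(Integer[]::new);
        Arrays.sort(boxed, Comparator.naturalOrder());
        return Arrays.stream(boxed).mapToInt(Integer::intValue).toArray();
    }

    // Inline sort: no callback crosses any boundary, analogous to a
    // comparison function that lives entirely on the native side.
    static int[] inlineSort(int[] input) {
        int[] copy = input.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        int[] data = new int[64];
        for (int i = 0; i < data.length; i++) data[i] = (i * 31) % data.length;
        System.out.println(Arrays.equals(callbackSort(data), inlineSort(data))); // true
    }
}
```

Both variants produce the same result, of course; the benchmarks differ only in where the comparison logic runs and how expensive each invocation of it is.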

In theory, how much faster can link2native make callbackSortBenchmark()? 
Given the results you provided for getpid(), I'm guessing maybe 3 or 4 
times faster, which sounds good, but it's still a far cry from 
nativeSortBenchmark() and especially inlineSortBenchmark(). So, I get 
the impression that it is not possible to make it useful for this kind 
of use case. I would very much like to be proven wrong though.

Samuel

On 09/21/2018 09:51 AM, Samuel Audet wrote:
> Sounds good, thanks for testing this and for filing the bug report!
> 
> Samuel
> 
> On 09/21/2018 03:14 AM, Maurizio Cimadamore wrote:
>> Sorry for the delay in getting back at you. There's indeed something 
>> fishy going on here, and I have spotted a regression in JNI perf since 
>> JDK 11. This could be caused by update in compiler toolchain 
>> introduced in same version, but I have filed an issue for our hotspot 
>> team to investigate:
>>
>> https://bugs.openjdk.java.net/browse/JDK-8210975
>>
>> In the context of this discussion, it's likely that the regression is 
>> affecting the numbers of both Panama (which is built on top of JNI at 
>> the moment) and the JNI benchmarks.
>>
>> Thanks
>> Maurizio
>>
>>
>> On 19/09/18 01:13, Samuel Audet wrote:
>>> Thanks! You haven't mentioned the version of the JDK you're using 
>>> though. I'm starting to get the impression that JNI in newer versions 
>>> of OpenJDK will be slower... ?
>>>
>>> On 09/18/2018 07:03 PM, Maurizio Cimadamore wrote:
>>>> These are the numbers I get
>>>>
>>>> Benchmark                         Mode  Cnt         Score      Error  Units
>>>> NativeBenchmark.expBenchmark     thrpt    5  30542590.094 ± 44126.434  ops/s
>>>> NativeBenchmark.getpidBenchmark  thrpt    5  61764677.092 ± 21102.236  ops/s
>>>>
>>>> They are in the same ballpark, but exp() is a bit faster; btw, I 
>>>> tried to repeat my benchmark with JNI exp() _and_ O3 and I got 
>>>> very similar numbers (yesterday I did a very quick test and there 
>>>> was probably some other job running on the machine and bringing down 
>>>> the figures a bit).
>>>>
>>>> But overall, the results in your bench seem to match what I got: exp 
>>>> is faster, pid is slower, the difference is mostly caused by O3. If 
>>>> no O3 is used, then the numbers should match what I included in my 
>>>> numbers (and getpid should be a bit faster).
>>>>
>>>> Maurizio
>>>>
>>>>
>>>> On 18/09/18 05:48, Samuel Audet wrote:
>>>>> Anyway, I've put online an updated version of my benchmark files here:
>>>>> https://gist.github.com/saudet/1bf14a000e64c245675cf5d4e9ad6e69
>>>>> Just run "git clone" on the URL and run "mvn package" on the pom.xml.
>>>>>
>>>>> With the 2 virtual cores of an Intel(R) Xeon(R) CPU E5-2673 v4 @ 
>>>>> 2.30GHz running Ubuntu 14.04 on the cloud with GCC 4.9 and OpenJDK 
>>>>> 8, I get these numbers:
>>>>>
>>>>> Benchmark                         Mode  Cnt          Score        Error  Units
>>>>> NativeBenchmark.expBenchmark     thrpt   25   37460540.440 ±  393299.974  ops/s
>>>>> NativeBenchmark.getpidBenchmark  thrpt   25  100323188.451 ± 1254197.449  ops/s
>>>>>
>>>>> While on my laptop, an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz 
>>>>> running Fedora 27, GCC 7.3, and OpenJDK 9, I get the following:
>>>>>
>>>>> Benchmark                         Mode  Cnt         Score       Error  Units
>>>>> NativeBenchmark.expBenchmark     thrpt   25  50047147.099 ± 924366.937  ops/s
>>>>> NativeBenchmark.getpidBenchmark  thrpt   25   4825508.193 ±  21662.633  ops/s
>>>>>
>>>>> Now, it looks like getpid() is really slow on Fedora 27 for some 
>>>>> reason, but as Linus puts it, we should not be using that for 
>>>>> benchmarking:
>>>>> https://yarchive.net/comp/linux/getpid_caching.html
>>>>>
>>>>> What do you get on your machines?
>>>>>
>>>>> Samuel
>>>>>
>>>>>
>>>>> On 09/18/2018 12:58 AM, Maurizio Cimadamore wrote:
>>>>>> For the records, here's what I get for all the three benchmarks if 
>>>>>> I compile the JNI code with -O3:
>>>>>>
>>>>>> Benchmark                          Mode  Cnt         Score        Error  Units
>>>>>> PanamaBenchmark.testJNIExp        thrpt    5  28575269.294 ± 1907726.710  ops/s
>>>>>> PanamaBenchmark.testJNIJavaQsort  thrpt    5    372148.433 ±   27178.529  ops/s
>>>>>> PanamaBenchmark.testJNIPid        thrpt    5  59240069.011 ±  403881.697  ops/s
>>>>>>
>>>>>> The first and second benchmarks get faster and very close to the 
>>>>>> 'direct' optimization numbers in [1]. Surprisingly, the last 
>>>>>> benchmark (getpid) is quite a bit slower. I've been able to 
>>>>>> reproduce this across multiple runs; for that benchmark, omitting 
>>>>>> O3 seems to achieve the best results, not sure why. It starts off 
>>>>>> faster in the first couple of warmup iterations, but then it goes 
>>>>>> slower in all the other runs - presumably it interacts badly with 
>>>>>> the C2 generated code. For instance, this is a run with O3 enabled:
>>>>>>
>>>>>> # Run progress: 66.67% complete, ETA 00:01:40
>>>>>> # Fork: 1 of 1
>>>>>> # Warmup Iteration   1: 65182202.653 ops/s
>>>>>> # Warmup Iteration   2: 64900639.094 ops/s
>>>>>> # Warmup Iteration   3: 59314945.437 ops/s <---------------------------------
>>>>>> # Warmup Iteration   4: 59269007.877 ops/s
>>>>>> # Warmup Iteration   5: 59239905.163 ops/s
>>>>>> Iteration   1: 59300748.074 ops/s
>>>>>> Iteration   2: 59249666.044 ops/s
>>>>>> Iteration   3: 59268597.051 ops/s
>>>>>> Iteration   4: 59322074.572 ops/s
>>>>>> Iteration   5: 59059259.317 ops/s
>>>>>>
>>>>>> And this is a run with O3 disabled:
>>>>>>
>>>>>> # Run progress: 0.00% complete, ETA 00:01:40
>>>>>> # Fork: 1 of 1
>>>>>> # Warmup Iteration   1: 55882128.787 ops/s
>>>>>> # Warmup Iteration   2: 53102361.751 ops/s
>>>>>> # Warmup Iteration   3: 66964755.699 ops/s <---------------------------------
>>>>>> # Warmup Iteration   4: 66414428.355 ops/s
>>>>>> # Warmup Iteration   5: 65328475.276 ops/s
>>>>>> Iteration   1: 64229192.993 ops/s
>>>>>> Iteration   2: 65191719.319 ops/s
>>>>>> Iteration   3: 65352022.471 ops/s
>>>>>> Iteration   4: 65152090.426 ops/s
>>>>>> Iteration   5: 65320545.712 ops/s
>>>>>>
>>>>>>
>>>>>> In both cases, the 3rd warmup iteration sees a performance jump - 
>>>>>> with O3, the jump is backwards; without O3, the jump is forward, 
>>>>>> which is quite typical for a JMH benchmark, as C2 optimizations 
>>>>>> start to kick in.
>>>>>>
>>>>>> For these reasons, I'm reluctant to update my benchmark numbers to 
>>>>>> reflect the O3 behavior (although I agree that, since the Hotspot 
>>>>>> code is compiled with that optimization, it would make more sense 
>>>>>> to use it as a reference).
>>>>>>
>>>>>> Maurizio
>>>>>>
>>>>>> [1] - http://cr.openjdk.java.net/~mcimadamore/panama/foreign-jmh.txt
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 17/09/18 16:18, Maurizio Cimadamore wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 17/09/18 15:08, Samuel Audet wrote:
>>>>>>>> Yes, the blackhole or the random number doesn't make any 
>>>>>>>> difference, but not calling gcc with -O3 does. Running the 
>>>>>>>> compiler with optimizations on is pretty common, but they are 
>>>>>>>> not enabled by default.
>>>>>>> A bit better
>>>>>>>
>>>>>>> PanamaBenchmark.testMethod  thrpt    5  28018170.076 ± 8491668.248  ops/s
>>>>>>>
>>>>>>> But not much of a difference (I did not expect much, as the 
>>>>>>> body of the native method is extremely simple).
>>>>>>>
>>>>>>> Maurizio 
>>>>>
>>>>
>>>
>>
> 
