[foreign] some JMH benchmarks

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Thu Sep 20 18:14:25 UTC 2018


Sorry for the delay in getting back to you. There's indeed something 
fishy going on here, and I have spotted a regression in JNI performance since 
JDK 11. This could be caused by an update to the compiler toolchain introduced 
in that same version, but I have filed an issue for our hotspot team to 
investigate:

https://bugs.openjdk.java.net/browse/JDK-8210975

In the context of this discussion, it's likely that the regression is 
affecting the numbers of both Panama (which is built on top of JNI at 
the moment) and the JNI benchmarks.
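
For context, the benchmarks under discussion are plain JMH throughput tests 
over JNI downcalls; a minimal sketch of their shape (the exact signatures, 
constants and library name below are illustrative guesses on my part, not the 
code from Samuel's gist):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class NativeBenchmark {
    static {
        // Thin JNI stubs around exp(3) and getpid(2); the C side of these
        // stubs is what gets compiled with or without -O3 below.
        System.loadLibrary("native");
    }

    private static native double exp(double x);
    private static native int getpid();

    @Benchmark
    public double expBenchmark() {
        return exp(2.0);   // return the result so JMH keeps the call alive
    }

    @Benchmark
    public int getpidBenchmark() {
        return getpid();
    }
}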

Thanks
Maurizio


On 19/09/18 01:13, Samuel Audet wrote:
> Thanks! You haven't mentioned the version of the JDK you're using 
> though. I'm starting to get the impression that JNI in newer versions 
> of OpenJDK will be slower... ?
>
> On 09/18/2018 07:03 PM, Maurizio Cimadamore wrote:
>> These are the numbers I get
>>
>> Benchmark                         Mode  Cnt         Score       Error  Units
>> NativeBenchmark.expBenchmark     thrpt    5  30542590.094 ± 44126.434  ops/s
>> NativeBenchmark.getpidBenchmark  thrpt    5  61764677.092 ± 21102.236  ops/s
>>
>> They are in the same ballpark, but exp() is a bit faster; btw, I 
>> tried to repeat my benchmark with JNI exp() _and_ O3 and I got 
>> very similar numbers (yesterday I did a very quick test and there was 
>> probably some other job running on the machine, bringing the 
>> figures down a bit).
>>
>> But overall, the results in your bench seem to match what I got: exp 
>> is faster, getpid is slower, and the difference is mostly caused by O3. If 
>> no O3 is used, the numbers should match the ones I originally posted 
>> (and getpid should be a bit faster).
>>
>> Maurizio
>>
>>
>> On 18/09/18 05:48, Samuel Audet wrote:
>>> Anyway, I've put online an updated version of my benchmark files here:
>>> https://gist.github.com/saudet/1bf14a000e64c245675cf5d4e9ad6e69
>>> Just run "git clone" on the URL and run "mvn package" on the pom.xml.
>>>
>>> With the 2 virtual cores of an Intel(R) Xeon(R) CPU E5-2673 v4 @ 
>>> 2.30GHz running Ubuntu 14.04 on the cloud with GCC 4.9 and OpenJDK 
>>> 8, I get these numbers:
>>>
>>> Benchmark                         Mode  Cnt          Score        Error  Units
>>> NativeBenchmark.expBenchmark     thrpt   25   37460540.440 ±  393299.974  ops/s
>>> NativeBenchmark.getpidBenchmark  thrpt   25  100323188.451 ± 1254197.449  ops/s
>>>
>>> While on my laptop, an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz 
>>> running Fedora 27, GCC 7.3, and OpenJDK 9, I get the following:
>>>
>>> Benchmark                         Mode  Cnt         Score       Error  Units
>>> NativeBenchmark.expBenchmark     thrpt   25  50047147.099 ± 924366.937  ops/s
>>> NativeBenchmark.getpidBenchmark  thrpt   25   4825508.193 ±  21662.633  ops/s
>>>
>>> Now, it looks like getpid() is really slow on Fedora 27 for some 
>>> reason, but as Linus puts it, we should not be using that for 
>>> benchmarking:
>>> https://yarchive.net/comp/linux/getpid_caching.html
>>>
>>> What do you get on your machines?
>>>
>>> Samuel
>>>
>>>
>>> On 09/18/2018 12:58 AM, Maurizio Cimadamore wrote:
>>>> For the record, here's what I get for all three benchmarks if 
>>>> I compile the JNI code with -O3:
>>>>
>>>> Benchmark                          Mode  Cnt         Score        Error  Units
>>>> PanamaBenchmark.testJNIExp        thrpt    5  28575269.294 ± 1907726.710  ops/s
>>>> PanamaBenchmark.testJNIJavaQsort  thrpt    5    372148.433 ±   27178.529  ops/s
>>>> PanamaBenchmark.testJNIPid        thrpt    5  59240069.011 ±  403881.697  ops/s
>>>>
>>>> The first and second benchmarks get faster and very close to the 
>>>> 'direct' optimization numbers in [1]. Surprisingly, the last 
>>>> benchmark (getpid) is quite a bit slower. I've been able to reproduce 
>>>> this across multiple runs; for that benchmark, omitting O3 seems to 
>>>> achieve the best results, and I'm not sure why. It starts off faster 
>>>> in the first couple of warmup iterations, but then it goes slower 
>>>> in all the other runs - presumably it interacts badly with the C2 
>>>> generated code. For instance, this is a run with O3 enabled:
>>>>
>>>> # Run progress: 66.67% complete, ETA 00:01:40
>>>> # Fork: 1 of 1
>>>> # Warmup Iteration   1: 65182202.653 ops/s
>>>> # Warmup Iteration   2: 64900639.094 ops/s
>>>> # Warmup Iteration   3: 59314945.437 ops/s  <---------------------------------
>>>> # Warmup Iteration   4: 59269007.877 ops/s
>>>> # Warmup Iteration   5: 59239905.163 ops/s
>>>> Iteration   1: 59300748.074 ops/s
>>>> Iteration   2: 59249666.044 ops/s
>>>> Iteration   3: 59268597.051 ops/s
>>>> Iteration   4: 59322074.572 ops/s
>>>> Iteration   5: 59059259.317 ops/s
>>>>
>>>> And this is a run with O3 disabled:
>>>>
>>>> # Run progress: 0.00% complete, ETA 00:01:40
>>>> # Fork: 1 of 1
>>>> # Warmup Iteration   1: 55882128.787 ops/s
>>>> # Warmup Iteration   2: 53102361.751 ops/s
>>>> # Warmup Iteration   3: 66964755.699 ops/s  <---------------------------------
>>>> # Warmup Iteration   4: 66414428.355 ops/s
>>>> # Warmup Iteration   5: 65328475.276 ops/s
>>>> Iteration   1: 64229192.993 ops/s
>>>> Iteration   2: 65191719.319 ops/s
>>>> Iteration   3: 65352022.471 ops/s
>>>> Iteration   4: 65152090.426 ops/s
>>>> Iteration   5: 65320545.712 ops/s
>>>>
>>>>
>>>> In both cases, the 3rd warmup iteration sees a performance jump - 
>>>> with O3 the jump is backwards, while without O3 it is forward, which 
>>>> is quite typical for a JMH benchmark as C2 optimizations start 
>>>> to kick in.
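>>>>
>>>> For reference, the runs above use a single fork with five warmup and 
>>>> five measurement iterations; a JMH setup along these lines - 
>>>> illustrative, not necessarily the exact annotations or names in the 
>>>> benchmark - produces that run shape:
>>>>
>>>> import java.util.concurrent.TimeUnit;
>>>> import org.openjdk.jmh.annotations.*;
>>>>
>>>> @Fork(1)                          // "# Fork: 1 of 1" in the logs above
>>>> @Warmup(iterations = 5)           // five warmup iterations per fork
>>>> @Measurement(iterations = 5)      // five measured iterations per fork
>>>> @BenchmarkMode(Mode.Throughput)   // scores reported as ops/s
>>>> @OutputTimeUnit(TimeUnit.SECONDS)
>>>> public class PanamaBenchmark {
>>>>     static { System.loadLibrary("bench"); }  // JNI stubs for exp/qsort/getpid
>>>>
>>>>     private static native int getpid();
>>>>
>>>>     @Benchmark
>>>>     public int testJNIPid() {
>>>>         return getpid();
>>>>     }
>>>> }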
>>>>
>>>> For these reasons, I'm reluctant to update my benchmark numbers to 
>>>> reflect the O3 behavior (although I agree that, since the Hotspot 
>>>> code is compiled with that optimization, it would make more sense to 
>>>> use it as the reference).
>>>>
>>>> Maurizio
>>>>
>>>> [1] - http://cr.openjdk.java.net/~mcimadamore/panama/foreign-jmh.txt
>>>>
>>>>
>>>>
>>>> On 17/09/18 16:18, Maurizio Cimadamore wrote:
>>>>>
>>>>>
>>>>> On 17/09/18 15:08, Samuel Audet wrote:
>>>>>> Yes, neither the blackhole nor the random number makes any 
>>>>>> difference, but not calling gcc with -O3 does. Running the 
>>>>>> compiler with optimizations on is pretty common, but they are not 
>>>>>> enabled by default.
>>>>> A bit better
>>>>>
>>>>> PanamaBenchmark.testMethod  thrpt    5  28018170.076 ± 8491668.248  ops/s
>>>>>
>>>>> But not much of a difference (I did not expect much, as the body 
>>>>> of the native method is extremely simple).
>>>>>
>>>>> Maurizio 
>>>
>>
>


