[foreign] some JMH benchmarks

Samuel Audet samuel.audet at gmail.com
Fri Sep 21 00:51:13 UTC 2018


Sounds good, thanks for testing this and for filing the bug report!

Samuel

On 09/21/2018 03:14 AM, Maurizio Cimadamore wrote:
> Sorry for the delay in getting back to you. There's indeed something 
> fishy going on here: I have spotted a regression in JNI performance since 
> JDK 11. It could be caused by an update to the compiler toolchain 
> introduced in that same version; I have filed an issue for our hotspot 
> team to investigate:
> 
> https://bugs.openjdk.java.net/browse/JDK-8210975
> 
> In the context of this discussion, it's likely that the regression is 
> affecting the numbers of both Panama (which is built on top of JNI at 
> the moment) and the JNI benchmarks.
> 
> Thanks
> Maurizio
> 
> 
> On 19/09/18 01:13, Samuel Audet wrote:
>> Thanks! You haven't mentioned the version of the JDK you're using 
>> though. I'm starting to get the impression that JNI in newer versions 
>> of OpenJDK will be slower... ?
>>
>> On 09/18/2018 07:03 PM, Maurizio Cimadamore wrote:
>>> These are the numbers I get
>>>
>>> Benchmark                         Mode  Cnt         Score      Error  Units
>>> NativeBenchmark.expBenchmark     thrpt    5  30542590.094 ± 44126.434  ops/s
>>> NativeBenchmark.getpidBenchmark  thrpt    5  61764677.092 ± 21102.236  ops/s
>>>
>>> They are in the same ballpark, but exp() is a bit faster; btw, I 
>>> tried to repeat my benchmark with JNI exp() _and_ O3 and I got 
>>> very similar numbers (yesterday I did a very quick test, and there was 
>>> probably some other job running on the machine bringing down the 
>>> figures a bit).
>>>
>>> But overall, the results in your bench seem to match what I got: exp 
>>> is faster, getpid is slower, and the difference is mostly caused by O3. 
>>> If O3 is not used, the numbers should match the ones I posted (and 
>>> getpid should be a bit faster).
>>>
>>> Maurizio
>>>
>>>
>>> On 18/09/18 05:48, Samuel Audet wrote:
>>>> Anyway, I've put online an updated version of my benchmark files here:
>>>> https://gist.github.com/saudet/1bf14a000e64c245675cf5d4e9ad6e69
>>>> Just run "git clone" on the URL and run "mvn package" on the pom.xml.
>>>>
>>>> With the 2 virtual cores of an Intel(R) Xeon(R) CPU E5-2673 v4 @ 
>>>> 2.30GHz running Ubuntu 14.04 on the cloud with GCC 4.9 and OpenJDK 
>>>> 8, I get these numbers:
>>>>
>>>> Benchmark                         Mode  Cnt          Score        Error  Units
>>>> NativeBenchmark.expBenchmark     thrpt   25   37460540.440 ±  393299.974  ops/s
>>>> NativeBenchmark.getpidBenchmark  thrpt   25  100323188.451 ± 1254197.449  ops/s
>>>>
>>>> While on my laptop, an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz 
>>>> running Fedora 27, GCC 7.3, and OpenJDK 9, I get the following:
>>>>
>>>> Benchmark                         Mode  Cnt         Score       Error  Units
>>>> NativeBenchmark.expBenchmark     thrpt   25  50047147.099 ± 924366.937  ops/s
>>>> NativeBenchmark.getpidBenchmark  thrpt   25   4825508.193 ±  21662.633  ops/s
>>>>
>>>> Now, it looks like getpid() is really slow on Fedora 27 for some 
>>>> reason, but as Linus puts it, we should not be using that for 
>>>> benchmarking:
>>>> https://yarchive.net/comp/linux/getpid_caching.html
>>>>
>>>> What do you get on your machines?
>>>>
>>>> Samuel
>>>>
>>>>
>>>> On 09/18/2018 12:58 AM, Maurizio Cimadamore wrote:
>>>>> For the records, here's what I get for all the three benchmarks if 
>>>>> I compile the JNI code with -O3:
>>>>>
>>>>> Benchmark                          Mode  Cnt         Score        Error  Units
>>>>> PanamaBenchmark.testJNIExp        thrpt    5  28575269.294 ± 1907726.710  ops/s
>>>>> PanamaBenchmark.testJNIJavaQsort  thrpt    5    372148.433 ±   27178.529  ops/s
>>>>> PanamaBenchmark.testJNIPid        thrpt    5  59240069.011 ±  403881.697  ops/s
>>>>>
>>>>> The first and second benchmarks get faster and very close to the 
>>>>> 'direct' optimization numbers in [1]. Surprisingly, the last 
>>>>> benchmark (getpid) is quite a bit slower. I've been able to reproduce 
>>>>> this across multiple runs; for that benchmark, omitting O3 seems to 
>>>>> achieve the best results, and I'm not sure why. It starts off faster 
>>>>> in the first couple of warmup iterations, but then it goes slower in 
>>>>> all the other runs - presumably it interacts badly with the C2 
>>>>> generated code. For instance, this is a run with O3 enabled:
>>>>>
>>>>> # Run progress: 66.67% complete, ETA 00:01:40
>>>>> # Fork: 1 of 1
>>>>> # Warmup Iteration   1: 65182202.653 ops/s
>>>>> # Warmup Iteration   2: 64900639.094 ops/s
>>>>> # Warmup Iteration   3: 59314945.437 ops/s <---------------------------------
>>>>> # Warmup Iteration   4: 59269007.877 ops/s
>>>>> # Warmup Iteration   5: 59239905.163 ops/s
>>>>> Iteration   1: 59300748.074 ops/s
>>>>> Iteration   2: 59249666.044 ops/s
>>>>> Iteration   3: 59268597.051 ops/s
>>>>> Iteration   4: 59322074.572 ops/s
>>>>> Iteration   5: 59059259.317 ops/s
>>>>>
>>>>> And this is a run with O3 disabled:
>>>>>
>>>>> # Run progress: 0.00% complete, ETA 00:01:40
>>>>> # Fork: 1 of 1
>>>>> # Warmup Iteration   1: 55882128.787 ops/s
>>>>> # Warmup Iteration   2: 53102361.751 ops/s
>>>>> # Warmup Iteration   3: 66964755.699 ops/s <---------------------------------
>>>>> # Warmup Iteration   4: 66414428.355 ops/s
>>>>> # Warmup Iteration   5: 65328475.276 ops/s
>>>>> Iteration   1: 64229192.993 ops/s
>>>>> Iteration   2: 65191719.319 ops/s
>>>>> Iteration   3: 65352022.471 ops/s
>>>>> Iteration   4: 65152090.426 ops/s
>>>>> Iteration   5: 65320545.712 ops/s
>>>>>
>>>>>
>>>>> In both cases, the 3rd warmup iteration sees a performance jump - 
>>>>> with O3, the jump is backwards; without O3, the jump is forward, 
>>>>> which is quite typical for a JMH benchmark as the C2 optimizations 
>>>>> start to kick in.
>>>>>
>>>>> For these reasons, I'm reluctant to update my benchmark numbers to 
>>>>> reflect the O3 behavior (although I agree that, since the HotSpot 
>>>>> code is compiled with that optimization, it would make more sense to 
>>>>> use it as a reference).
>>>>>
>>>>> Maurizio
>>>>>
>>>>> [1] - http://cr.openjdk.java.net/~mcimadamore/panama/foreign-jmh.txt
>>>>>
>>>>>
>>>>>
>>>>> On 17/09/18 16:18, Maurizio Cimadamore wrote:
>>>>>>
>>>>>>
>>>>>> On 17/09/18 15:08, Samuel Audet wrote:
>>>>>>> Yes, the blackhole or the random number doesn't make any 
>>>>>>> difference, but not calling gcc with -O3 does. Running the 
>>>>>>> compiler with optimizations on is pretty common, but they are not 
>>>>>>> enabled by default.
>>>>>> A bit better
>>>>>>
>>>>>> PanamaBenchmark.testMethod  thrpt    5  28018170.076 ± 8491668.248  ops/s
>>>>>>
>>>>>> But not much of a difference (I did not expect much, as the body 
>>>>>> of the native method is extremely simple).
>>>>>>
>>>>>> Maurizio 
>>>>
>>>
>>
> 



More information about the panama-dev mailing list