[foreign] some JMH benchmarks
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Tue Oct 16 07:36:41 UTC 2018
On 16/10/18 08:24, Samuel Audet wrote:
> I see, thanks for the explanation about the internals! I was aware of
> the performance issue with callbacks and what quicksort entails, but
> not about what happens in HotSpot.
>
> Given that Panama is doing away with the mandatory arguments for JNI,
> would it make sense to make the Java ABI match the native ABI, at
> least for methods that interact with native code? I'm sure it would
> hurt JNI, but we could provide a VM option to let users decide on a
> per application basis. What do you think?
Thanks for the extra numbers.
As for changing calling conventions, I'm not sure it's feasible (but
speaking with my "I'm not a JIT/hotspot expert" hat on ;-) ). As you
say, JNI performance could suffer, but more importantly, hotspot
already has to maintain a matrix of { interpreted, compiled } x {
interpreted, compiled } transitions - e.g. the source/target method may
or may not have been compiled by the JIT; so adding an extra compiled
mode with a different calling convention would mean supporting 6
adaptation flavors (ignoring the 3 trivial identity ones) instead of
just 2 - a significant cost for the JVM to swallow.
Maurizio
>
> BTW, I've updated my benchmarks to get some numbers for std::map:
> https://gist.github.com/saudet/1bf14a000e64c245675cf5d4e9ad6e69
>
> It's not as bad as I thought, or put another way, std::map is a
> lot slower than I thought it was, even slower than
> java.util.HashMap, so even JNI is kind of usable in this case:
>
> NativeBenchmark.sumBenchmark        thrpt    5   550427.311 ±  12725.331  ops/s
> NativeBenchmark.nativeSumBenchmark  thrpt    5  1371863.410 ±  44171.140  ops/s
> NativeBenchmark.javaSumBenchmark    thrpt    5  1977540.009 ±  58851.055  ops/s
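As a rough illustration of what the Java side of such a sum benchmark might look like (the actual gist code is not reproduced here; the class and method names below are made up for the sketch):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the pure-Java case measured above: summing the
// values of a java.util.HashMap, the kind of loop that a
// javaSumBenchmark would time against a native std::map traversal
// reached through JNI.
public class MapSum {
    static long sumValues(Map<String, Long> map) {
        long sum = 0;
        for (long v : map.values()) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Long> map = new HashMap<>();
        for (long i = 1; i <= 100; i++) {
            map.put("key" + i, i);
        }
        // 1 + 2 + ... + 100 = 5050
        System.out.println(sumValues(map));
    }
}
```

The JNI variant would perform the same traversal on the native side and cross the boundary once per call, which is why it can still come out ahead of a slow native container.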
>
> For reference, that's still with 2-cores VM, Intel(R) Xeon(R) CPU
> E5-2673 v4 @ 2.30GHz, Ubuntu 14.04, GCC 4.9, OpenJDK 8.
>
> Samuel
>
> On 10/11/2018 04:59 PM, Maurizio Cimadamore wrote:
>> Thanks for the numbers - we are aware of the performance difference
>> between the pure native callback case (your nativeSortBench) and the
>> JNI callback one (callbackSortBench). The cost you are seeing there
>> is multiplied, as the compare function is repeatedly called by the
>> qsort logic (not just once as in getpid). E.g. if qsort calls the
>> compare function 5 times, the cost you see is ~5x the cost of a
>> single upcall.
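The multiplication effect is easy to see even in plain Java, without any JNI involved: a comparator runs once per comparison, not once per sort, so any fixed per-invocation overhead (an upcall, in the qsort case) is paid many times per benchmark operation. A small sketch, counting comparator invocations:

```java
import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.atomic.AtomicLong;

// Illustration (pure Java, not JNI) of why a sort benchmark multiplies
// callback cost: the comparator below stands in for the native->Java
// upcall that qsort would perform on every comparison.
public class CompareCount {
    static long countedSort(int[] values) {
        AtomicLong calls = new AtomicLong();
        Integer[] boxed = Arrays.stream(values).boxed().toArray(Integer[]::new);
        Arrays.sort(boxed, (a, b) -> {
            calls.incrementAndGet(); // one "upcall" per comparison
            return Integer.compare(a, b);
        });
        return calls.get();
    }

    public static void main(String[] args) {
        int[] data = new Random(42).ints(1000).toArray();
        long calls = countedSort(data);
        // For 1000 random elements this is on the order of n*log2(n)
        // comparisons, i.e. thousands of upcalls per single sort.
        System.out.println("comparator invocations: " + calls);
    }
}
```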
>>
>> On top of that, the cost of doing an upcall is higher than that of
>> doing a downcall; as I was playing with the hotspot code the
>> other day, I noted that Java calling conventions are arranged in such
>> a way that making a JNI call can generally be achieved with no
>> argument shuffling (e.g. Java register arg # = C register arg # + 1).
>> That is, the downcall JNI bridge can simply insert the JNIEnv
>> parameter in the first C register and jump - which makes for a very
>> low adaptation overhead, even in the absence of C2 optimizations.
>>
>> This choice of course pays off for downcalls, but when you walk
>> through the details of implementing efficient upcall support, it's
>> easy to see how it works against you: when doing an upcall you have
>> to shift all argument registers up by one position before being able
>> to call the Java code, which makes for more cumbersome shuffling code.
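The asymmetry described above can be sketched with a toy model. This is not hotspot code: it just simulates an argument-register file as an array, under the stated assumption that Java register arg # = C register arg # + 1:

```java
import java.util.Arrays;

// Toy model of the register-argument layout discussed above (an
// illustration only, not actual JVM code). Argument "registers" are
// modeled as array slots: the C convention puts arg i in regs[i], the
// Java convention puts arg i in regs[i + 1].
public class Shuffle {
    // Downcall (Java -> JNI): the Java args already sit exactly where
    // the C callee expects args 1..n, so the bridge only drops JNIEnv
    // into regs[0] and jumps - no per-argument moves.
    static long[] downcallAdapter(long[] regs, long jniEnv) {
        long[] out = regs.clone();
        out[0] = jniEnv;
        return out;
    }

    // Upcall (native -> Java): every C argument register must be
    // shifted up by one position before the Java code can run.
    static long[] upcallAdapter(long[] regs) {
        long[] out = new long[regs.length];
        System.arraycopy(regs, 0, out, 1, regs.length - 1);
        return out;
    }

    public static void main(String[] args) {
        long[] javaRegs = {0, 10, 20, 30}; // Java args 10,20,30 in regs[1..3]
        System.out.println(Arrays.toString(downcallAdapter(javaRegs, 99)));
        long[] cRegs = {10, 20, 30, 0};    // C args 10,20,30 in regs[0..2]
        System.out.println(Arrays.toString(upcallAdapter(cRegs)));
    }
}
```

The downcall adapter touches one slot regardless of arity; the upcall adapter does O(n) moves, which is the "more cumbersome shuffling code" in question.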
>>
>> I think fixing mismatches like these would go a long way towards
>> making performance more symmetric between upcalls and downcalls, but
>> of course that's mostly speculation at this point.
>>
>> Maurizio
>>
>>
>> On 11/10/18 02:18, Samuel Audet wrote:
>>> Hi, Maurizio,
>>>
>>> To get the ball going, I've updated my benchmark code with sorting
>>> examples:
>>> https://gist.github.com/saudet/1bf14a000e64c245675cf5d4e9ad6e69
>>>
>>> With the usual 2-cores VM, Intel(R) Xeon(R) CPU E5-2673 v4 @
>>> 2.30GHz, Ubuntu 14.04, GCC 4.9, OpenJDK 8, I obtained the following:
>>> Benchmark                              Mode  Cnt         Score         Error  Units
>>> NativeBenchmark.expBenchmark          thrpt    5  37684721.600 ± 1082945.216  ops/s
>>> NativeBenchmark.getpidBenchmark       thrpt    5  97760579.697 ± 3559212.842  ops/s
>>> NativeBenchmark.callbackSortBenchmark thrpt    5    362762.157 ±   11992.584  ops/s
>>> NativeBenchmark.nativeSortBenchmark   thrpt    5   7218834.171 ±  461245.346  ops/s
>>> NativeBenchmark.inlineSortBenchmark   thrpt    5  17211735.752 ± 1032386.799  ops/s
>>>
>>> That seems to be consistent with the results you got with JNI on JDK 8:
>>> http://cr.openjdk.java.net/~mcimadamore/panama/foreign-jmh.txt
>>> https://bugs.openjdk.java.net/browse/JDK-8210975
>>> Although my callbackSortBenchmark() seems a bit slower. JavaCPP
>>> doesn't currently support static methods for callback functions
>>> (that wouldn't be a problem to add), and for now that's probably
>>> where the small ~10% difference comes from.
>>>
>>> Anyway, what's important is that nativeSortBenchmark() is ~20 times
>>> faster than callbackSortBenchmark(), and inlineSortBenchmark() is
>>> ~47 times faster.
>>>
>>> In theory, how much faster can link2native make
>>> callbackSortBenchmark()? Given the results you provided for
>>> getpid(), I'm guessing maybe 3 or 4 times faster, which sounds good,
>>> but it's still a far cry from nativeSortBenchmark() and especially
>>> inlineSortBenchmark(). So, I get the impression that it is not
>>> possible to make it useful for this kind of use case. I would very
>>> much like to be proven wrong though.
>>>
>>> Samuel
>>>
>>> On 09/21/2018 09:51 AM, Samuel Audet wrote:
>>>> Sounds good, thanks for testing this and for filing the bug report!
>>>>
>>>> Samuel
>>>>
>>>> On 09/21/2018 03:14 AM, Maurizio Cimadamore wrote:
>>>>> Sorry for the delay in getting back to you. There's indeed
>>>>> something fishy going on here, and I have spotted a regression in
>>>>> JNI perf since JDK 11. This could be caused by the update to the
>>>>> compiler toolchain introduced in the same version, but I have
>>>>> filed an issue for our hotspot team to investigate:
>>>>>
>>>>> https://bugs.openjdk.java.net/browse/JDK-8210975
>>>>>
>>>>> In the context of this discussion, it's likely that the
>>>>> regression is affecting the numbers of both Panama (which is
>>>>> built on top of JNI at the moment) and the JNI benchmarks.
>>>>>
>>>>> Thanks
>>>>> Maurizio
>>>>>
>>>>>
>>>>> On 19/09/18 01:13, Samuel Audet wrote:
>>>>>> Thanks! You haven't mentioned the version of the JDK you're using
>>>>>> though. I'm starting to get the impression that JNI in newer
>>>>>> versions of OpenJDK will be slower... ?
>>>>>>
>>>>>> On 09/18/2018 07:03 PM, Maurizio Cimadamore wrote:
>>>>>>> These are the numbers I get
>>>>>>>
>>>>>>> Benchmark                        Mode  Cnt         Score      Error  Units
>>>>>>> NativeBenchmark.expBenchmark    thrpt    5  30542590.094 ± 44126.434  ops/s
>>>>>>> NativeBenchmark.getpidBenchmark thrpt    5  61764677.092 ± 21102.236  ops/s
>>>>>>>
>>>>>>> They are in the same ballpark, but exp() is a bit faster; BTW, I
>>>>>>> tried to repeat my benchmark with JNI exp() _and_ O3 and I've
>>>>>>> got very similar numbers (yesterday I did a very quick test and
>>>>>>> there was probably some other job running on the machine
>>>>>>> bringing down the figures a bit).
>>>>>>>
>>>>>>> But overall, the results in your bench seem to match what I got:
>>>>>>> exp is faster, pid is slower, and the difference is mostly
>>>>>>> caused by O3. If no O3 is used, then the numbers should match
>>>>>>> the ones I posted (and getpid should be a bit faster).
>>>>>>>
>>>>>>> Maurizio
>>>>>>>
>>>>>>>
>>>>>>> On 18/09/18 05:48, Samuel Audet wrote:
>>>>>>>> Anyway, I've put online an updated version of my benchmark
>>>>>>>> files here:
>>>>>>>> https://gist.github.com/saudet/1bf14a000e64c245675cf5d4e9ad6e69
>>>>>>>> Just run "git clone" on the URL and run "mvn package" on the
>>>>>>>> pom.xml.
>>>>>>>>
>>>>>>>> With the 2 virtual cores of an Intel(R) Xeon(R) CPU E5-2673 v4
>>>>>>>> @ 2.30GHz running Ubuntu 14.04 on the cloud with GCC 4.9 and
>>>>>>>> OpenJDK 8, I get these numbers:
>>>>>>>>
>>>>>>>> Benchmark                        Mode  Cnt          Score        Error  Units
>>>>>>>> NativeBenchmark.expBenchmark    thrpt   25   37460540.440 ±  393299.974  ops/s
>>>>>>>> NativeBenchmark.getpidBenchmark thrpt   25  100323188.451 ± 1254197.449  ops/s
>>>>>>>>
>>>>>>>> While on my laptop, an Intel(R) Core(TM) i7-7700HQ CPU @
>>>>>>>> 2.80GHz running Fedora 27, GCC 7.3, and OpenJDK 9, I get the
>>>>>>>> following:
>>>>>>>>
>>>>>>>> Benchmark                        Mode  Cnt         Score      Error  Units
>>>>>>>> NativeBenchmark.expBenchmark    thrpt   25  50047147.099 ± 924366.937  ops/s
>>>>>>>> NativeBenchmark.getpidBenchmark thrpt   25   4825508.193 ±  21662.633  ops/s
>>>>>>>>
>>>>>>>> Now, it looks like getpid() is really slow on Fedora 27 for
>>>>>>>> some reason, but as Linus puts it, we should not be using that
>>>>>>>> for benchmarking:
>>>>>>>> https://yarchive.net/comp/linux/getpid_caching.html
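The problem Linus describes is that a pid never changes for the lifetime of a process, so a libc is free to cache it and turn the "system call" into a plain memory read; at that point the benchmark no longer measures call overhead at all, and results vary wildly with the libc version. That same constancy is easy to observe from Java (using `ProcessHandle`, which is JDK 9+; this is just an illustration of the property, not part of the benchmark):

```java
// getpid() is a poor micro-benchmark target: the value it returns is
// constant for the process, so any layer (libc, vDSO, a runtime) may
// legally cache it. This demo just verifies the constancy from Java.
public class PidDemo {
    public static void main(String[] args) {
        long first = ProcessHandle.current().pid();
        for (int i = 0; i < 1_000_000; i++) {
            if (ProcessHandle.current().pid() != first) {
                throw new AssertionError("pid changed?!");
            }
        }
        System.out.println("pid is stable: " + first);
    }
}
```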
>>>>>>>>
>>>>>>>> What do you get on your machines?
>>>>>>>>
>>>>>>>> Samuel
>>>>>>>>
>>>>>>>>
>>>>>>>> On 09/18/2018 12:58 AM, Maurizio Cimadamore wrote:
>>>>>>>>> For the records, here's what I get for all the three
>>>>>>>>> benchmarks if I compile the JNI code with -O3:
>>>>>>>>>
>>>>>>>>> Benchmark                         Mode  Cnt         Score        Error  Units
>>>>>>>>> PanamaBenchmark.testJNIExp       thrpt    5  28575269.294 ± 1907726.710  ops/s
>>>>>>>>> PanamaBenchmark.testJNIJavaQsort thrpt    5    372148.433 ±   27178.529  ops/s
>>>>>>>>> PanamaBenchmark.testJNIPid       thrpt    5  59240069.011 ±  403881.697  ops/s
>>>>>>>>>
>>>>>>>>> The first and second benchmarks get faster and very close to
>>>>>>>>> the 'direct' optimization numbers in [1]. Surprisingly, the
>>>>>>>>> last benchmark (getpid) is quite a bit slower. I've been able
>>>>>>>>> to reproduce this across multiple runs; for that benchmark,
>>>>>>>>> omitting O3 seems to achieve the best results, not sure why.
>>>>>>>>> It starts off faster in the first couple of warmup iterations,
>>>>>>>>> but then goes slower in all the other runs - presumably it
>>>>>>>>> interacts badly with the C2-generated code. For instance, this
>>>>>>>>> is a run with O3 enabled:
>>>>>>>>>
>>>>>>>>> # Run progress: 66.67% complete, ETA 00:01:40
>>>>>>>>> # Fork: 1 of 1
>>>>>>>>> # Warmup Iteration 1: 65182202.653 ops/s
>>>>>>>>> # Warmup Iteration 2: 64900639.094 ops/s
>>>>>>>>> # Warmup Iteration 3: 59314945.437 ops/s  <---------------------------------
>>>>>>>>> # Warmup Iteration 4: 59269007.877 ops/s
>>>>>>>>> # Warmup Iteration 5: 59239905.163 ops/s
>>>>>>>>> Iteration 1: 59300748.074 ops/s
>>>>>>>>> Iteration 2: 59249666.044 ops/s
>>>>>>>>> Iteration 3: 59268597.051 ops/s
>>>>>>>>> Iteration 4: 59322074.572 ops/s
>>>>>>>>> Iteration 5: 59059259.317 ops/s
>>>>>>>>>
>>>>>>>>> And this is a run with O3 disabled:
>>>>>>>>>
>>>>>>>>> # Run progress: 0.00% complete, ETA 00:01:40
>>>>>>>>> # Fork: 1 of 1
>>>>>>>>> # Warmup Iteration 1: 55882128.787 ops/s
>>>>>>>>> # Warmup Iteration 2: 53102361.751 ops/s
>>>>>>>>> # Warmup Iteration 3: 66964755.699 ops/s  <---------------------------------
>>>>>>>>> # Warmup Iteration 4: 66414428.355 ops/s
>>>>>>>>> # Warmup Iteration 5: 65328475.276 ops/s
>>>>>>>>> Iteration 1: 64229192.993 ops/s
>>>>>>>>> Iteration 2: 65191719.319 ops/s
>>>>>>>>> Iteration 3: 65352022.471 ops/s
>>>>>>>>> Iteration 4: 65152090.426 ops/s
>>>>>>>>> Iteration 5: 65320545.712 ops/s
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> In both cases, the 3rd warmup iteration sees a performance
>>>>>>>>> jump - with O3, the jump is backwards; without O3, the jump is
>>>>>>>>> forward, which is quite typical for a JMH benchmark as C2
>>>>>>>>> optimizations start to kick in.
>>>>>>>>>
>>>>>>>>> For these reasons, I'm reluctant to update my benchmark
>>>>>>>>> numbers to reflect the O3 behavior (although I agree that,
>>>>>>>>> since the Hotspot code is compiled with that optimization, it
>>>>>>>>> would make more sense to use it as a reference).
>>>>>>>>>
>>>>>>>>> Maurizio
>>>>>>>>>
>>>>>>>>> [1] -
>>>>>>>>> http://cr.openjdk.java.net/~mcimadamore/panama/foreign-jmh.txt
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
More information about the panama-dev mailing list