[foreign] some JMH benchmarks

Samuel Audet samuel.audet at gmail.com
Tue Oct 16 11:12:27 UTC 2018


Hum, I think I'm beginning to understand the motivation behind GraalVM...

On 10/16/2018 04:36 PM, Maurizio Cimadamore wrote:
> On 16/10/18 08:24, Samuel Audet wrote:
>> I see, thanks for the explanation about the internals! I was aware of 
>> the performance issue with callbacks and what quicksort entails, but 
>> not about what happens in HotSpot.
>>
>> Given that Panama is doing away with the mandatory arguments for JNI, 
>> would it make sense to make the Java ABI match the native ABI, at 
>> least for methods that interact with native code? I'm sure it would 
>> hurt JNI, but we could provide a VM option to let users decide on a 
>> per-application basis. What do you think?
> Thanks for the extra numbers.
> 
> As for changing calling conventions, I'm not sure it's feasible (but 
> speaking with my "I'm not a JIT/hotspot expert" hat on ;-) ). As you 
> say, JNI performance could suffer, but most importantly, hotspot 
> already has to maintain a matrix of { interpreted, compiled } x 
> { interpreted, compiled } - e.g. the source/target method may or may 
> not have already been compiled by the JIT; so adding an extra 
> compiled mode with different calling conventions would mean 
> supporting 6 adaptation flavors (ignoring the 3 trivial identity 
> ones) instead of just 2 - a significant cost for the JVM to swallow.
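> 
> To make the counting concrete, here's a quick sketch of the argument 
> (the mode names are illustrative, not hotspot's):
> 
>     #include <cstdio>
> 
>     int main() {
>         // with a third "compiled, native ABI" mode, every ordered
>         // (caller mode, callee mode) pair needs an adapter, except
>         // the identity pairs where the conventions already match
>         const char* modes[] = { "interpreted", "compiled", "compiled-native-abi" };
>         int adapters = 0;
>         for (int from = 0; from < 3; from++)
>             for (int to = 0; to < 3; to++)
>                 if (from != to)
>                     std::printf("adapter %d: %s -> %s\n", ++adapters, modes[from], modes[to]);
>         // prints 6 adapters, versus 2*2 - 2 = 2 in today's two-mode world
>     }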
> 
> Maurizio
>>
>> BTW, I've updated my benchmarks to get some numbers for std::map:
>> https://gist.github.com/saudet/1bf14a000e64c245675cf5d4e9ad6e69
>>
>> It's not as bad as I thought, or put another way, std::map is a lot 
>> slower than I thought it was, even slower than java.util.HashMap, so 
>> even JNI is kind of usable in this case:
>> NativeBenchmark.sumBenchmark           thrpt    5    550427.311 ±  12725.331  ops/s
>> NativeBenchmark.nativeSumBenchmark     thrpt    5   1371863.410 ±  44171.140  ops/s
>> NativeBenchmark.javaSumBenchmark       thrpt    5   1977540.009 ±  58851.055  ops/s
>>
>> For reference, that's still with a 2-core VM, Intel(R) Xeon(R) CPU 
>> E5-2673 v4 @ 2.30GHz, Ubuntu 14.04, GCC 4.9, OpenJDK 8.
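>>
>> The std::map summing behind these numbers boils down to something 
>> like the following sketch (not the exact code from the gist, just 
>> the shape of it): iterating a std::map chases red-black tree nodes, 
>> one pointer hop per element, which is why it can lose even to 
>> java.util.HashMap.
>>
>>     #include <cstdio>
>>     #include <map>
>>
>>     int main() {
>>         // build a small map and sum its values; each step of the
>>         // iteration follows a pointer to the next tree node
>>         std::map<int, int> m;
>>         for (int i = 0; i < 1000; i++) m[i] = i;
>>         long sum = 0;
>>         for (const auto& kv : m) sum += kv.second;
>>         std::printf("sum = %ld\n", sum);
>>     }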
>>
>> Samuel
>>
>> On 10/11/2018 04:59 PM, Maurizio Cimadamore wrote:
>>> Thanks for the numbers - we are aware of the performance difference 
>>> between the pure native callback case (your nativeSortBench) and 
>>> the JNI callback one (callbackSortBench). The cost you are seeing 
>>> there is multiplied, as the compare function is repeatedly called 
>>> by the qsort logic (not just once as in getpid). E.g. if qsort 
>>> calls the compare function 5 times, the cost you see is ~5x the 
>>> cost of a single upcall.
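>>>
>>> A quick way to see the multiplier (a standalone sketch, not the 
>>> benchmark code itself) is to count the comparator invocations - 
>>> each one of them would be a separate native-to-Java upcall:
>>>
>>>     #include <cstdio>
>>>     #include <cstdlib>
>>>
>>>     static long calls = 0;
>>>
>>>     // stand-in for the Java compare function; in the benchmark each
>>>     // invocation pays the full upcall adaptation cost
>>>     static int cmp(const void* a, const void* b) {
>>>         calls++;
>>>         int x = *(const int*)a, y = *(const int*)b;
>>>         return (x > y) - (x < y);
>>>     }
>>>
>>>     int main() {
>>>         int data[64];
>>>         for (int i = 0; i < 64; i++) data[i] = std::rand();
>>>         std::qsort(data, 64, sizeof data[0], cmp);
>>>         std::printf("comparator called %ld times for 64 elements\n", calls);
>>>     }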
>>>
>>> On top of that, the cost of doing an upcall is higher than that of 
>>> doing a downcall; as I was playing with the hotspot code the other 
>>> day, I noted that Java calling conventions are arranged in such a 
>>> way that making a JNI call can generally be achieved with no 
>>> argument shuffling (e.g. java register arg # = c register arg # + 
>>> 1). That is, the downcall JNI bridge can simply insert the JNIEnv 
>>> parameter in the first C register, and jump - which makes for a 
>>> very low adaptation overhead, even in the absence of C2 
>>> optimizations.
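>>>
>>> Here's a source-level caricature of that bridge (the real stub 
>>> works at the register level, and JNIEnvStub below is just a 
>>> stand-in type, not the real JNIEnv): the whole adaptation is 
>>> "insert one argument in front and pass everything else through 
>>> unchanged".
>>>
>>>     #include <cmath>
>>>     #include <cstdio>
>>>
>>>     struct JNIEnvStub {};  // illustrative stand-in for JNIEnv
>>>
>>>     static double exp_impl(JNIEnvStub*, double x) { return std::exp(x); }
>>>
>>>     // prepend the env argument and forward the rest unchanged; the
>>>     // real stub just fills the first C register and jumps
>>>     template <typename R, typename... Args>
>>>     static R downcall(R (*fn)(JNIEnvStub*, Args...), Args... args) {
>>>         static JNIEnvStub env;
>>>         return fn(&env, args...);
>>>     }
>>>
>>>     int main() {
>>>         std::printf("%f\n", downcall(exp_impl, 1.0));
>>>     }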
>>>
>>> This choice of course pays off for downcalls, but when you walk 
>>> through the details of implementing effective upcall support, it's 
>>> easy to see how it works against you: when doing an upcall you have 
>>> to shift all registers up by one position before being able to call 
>>> the Java code, which makes for more cumbersome shuffling code.
>>>
>>> I think fixing mismatches like these would go a long way in making 
>>> performance more symmetric between upcalls and downcalls, but of 
>>> course that's mostly speculation at this point.
>>>
>>> Maurizio
>>>
>>>
>>> On 11/10/18 02:18, Samuel Audet wrote:
>>>> Hi, Maurizio,
>>>>
>>>> To get the ball going, I've updated my benchmark code with sorting 
>>>> examples:
>>>> https://gist.github.com/saudet/1bf14a000e64c245675cf5d4e9ad6e69
>>>>
>>>> With the usual 2-core VM, Intel(R) Xeon(R) CPU E5-2673 v4 @ 
>>>> 2.30GHz, Ubuntu 14.04, GCC 4.9, OpenJDK 8, I obtained the following:
>>>> Benchmark                               Mode  Cnt         Score        Error  Units
>>>> NativeBenchmark.expBenchmark           thrpt    5  37684721.600 ± 1082945.216  ops/s
>>>> NativeBenchmark.getpidBenchmark        thrpt    5  97760579.697 ± 3559212.842  ops/s
>>>> NativeBenchmark.callbackSortBenchmark  thrpt    5    362762.157 ±   11992.584  ops/s
>>>> NativeBenchmark.nativeSortBenchmark    thrpt    5   7218834.171 ±  461245.346  ops/s
>>>> NativeBenchmark.inlineSortBenchmark    thrpt    5  17211735.752 ± 1032386.799  ops/s
>>>>
>>>> That seems to be consistent with the results you got with JNI on JDK 8:
>>>> http://cr.openjdk.java.net/~mcimadamore/panama/foreign-jmh.txt
>>>> https://bugs.openjdk.java.net/browse/JDK-8210975
>>>> My callbackSortBenchmark() seems a bit slower, though. JavaCPP 
>>>> doesn't currently support static methods for callback functions 
>>>> (which wouldn't be a problem to add), so for now that's probably 
>>>> where the small ~10% difference comes from.
>>>>
>>>> Anyway, what's important is that nativeSortBenchmark() is ~20 times 
>>>> faster than callbackSortBenchmark(), and inlineSortBenchmark() is 
>>>> ~47 times faster.
>>>>
>>>> In theory, how much faster can link2native make 
>>>> callbackSortBenchmark()? Given the results you provided for 
>>>> getpid(), I'm guessing maybe 3 or 4 times faster, which sounds good, 
>>>> but it's still a far cry from nativeSortBenchmark() and especially 
>>>> inlineSortBenchmark(). So, I get the impression that it is not 
>>>> possible to make it useful for this kind of use case. I would very 
>>>> much like to be proven wrong though.
>>>>
>>>> Samuel
>>>>
>>>> On 09/21/2018 09:51 AM, Samuel Audet wrote:
>>>>> Sounds good, thanks for testing this and for filing the bug report!
>>>>>
>>>>> Samuel
>>>>>
>>>>> On 09/21/2018 03:14 AM, Maurizio Cimadamore wrote:
>>>>>> Sorry for the delay in getting back to you. There's indeed 
>>>>>> something fishy going on here, and I have spotted a regression 
>>>>>> in JNI perf since JDK 11. This could be caused by an update in 
>>>>>> the compiler toolchain introduced in the same version, but I 
>>>>>> have filed an issue for our hotspot team to investigate:
>>>>>>
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8210975
>>>>>>
>>>>>> In the context of this discussion, it's likely that the 
>>>>>> regression is affecting the numbers of both Panama (which is 
>>>>>> built on top of JNI at the moment) and the JNI benchmarks.
>>>>>>
>>>>>> Thanks
>>>>>> Maurizio
>>>>>>
>>>>>>
>>>>>> On 19/09/18 01:13, Samuel Audet wrote:
>>>>>>> Thanks! You haven't mentioned the version of the JDK you're 
>>>>>>> using, though. I'm starting to get the impression that JNI in 
>>>>>>> newer versions of OpenJDK will be slower...?
>>>>>>>
>>>>>>> On 09/18/2018 07:03 PM, Maurizio Cimadamore wrote:
>>>>>>>> These are the numbers I get
>>>>>>>>
>>>>>>>> Benchmark                         Mode  Cnt         Score      Error  Units
>>>>>>>> NativeBenchmark.expBenchmark     thrpt    5  30542590.094 ± 44126.434  ops/s
>>>>>>>> NativeBenchmark.getpidBenchmark  thrpt    5  61764677.092 ± 21102.236  ops/s
>>>>>>>>
>>>>>>>> They are in the same ballpark, but exp() is a bit faster; btw, 
>>>>>>>> I tried to repeat my benchmark with JNI exp() _and_ O3 and got 
>>>>>>>> very similar numbers (yesterday I did a very quick test and 
>>>>>>>> there was probably some other job running on the machine, 
>>>>>>>> bringing down the figures a bit).
>>>>>>>>
>>>>>>>> But overall, the results in your bench seem to match what I 
>>>>>>>> got: exp is faster, pid is slower, and the difference is 
>>>>>>>> mostly caused by O3. If no O3 is used, then the numbers should 
>>>>>>>> match the ones I posted (and getpid should be a bit faster).
>>>>>>>>
>>>>>>>> Maurizio
>>>>>>>>
>>>>>>>>
>>>>>>>> On 18/09/18 05:48, Samuel Audet wrote:
>>>>>>>>> Anyway, I've put online an updated version of my benchmark 
>>>>>>>>> files here:
>>>>>>>>> https://gist.github.com/saudet/1bf14a000e64c245675cf5d4e9ad6e69
>>>>>>>>> Just run "git clone" on the URL and run "mvn package" on the 
>>>>>>>>> pom.xml.
>>>>>>>>>
>>>>>>>>> With the 2 virtual cores of an Intel(R) Xeon(R) CPU E5-2673 v4 
>>>>>>>>> @ 2.30GHz running Ubuntu 14.04 on the cloud with GCC 4.9 and 
>>>>>>>>> OpenJDK 8, I get these numbers:
>>>>>>>>>
>>>>>>>>> Benchmark                         Mode  Cnt          Score        Error  Units
>>>>>>>>> NativeBenchmark.expBenchmark     thrpt   25   37460540.440 ±   393299.974  ops/s
>>>>>>>>> NativeBenchmark.getpidBenchmark  thrpt   25  100323188.451 ±  1254197.449  ops/s
>>>>>>>>>
>>>>>>>>> While on my laptop, an Intel(R) Core(TM) i7-7700HQ CPU @ 
>>>>>>>>> 2.80GHz running Fedora 27, GCC 7.3, and OpenJDK 9, I get the 
>>>>>>>>> following:
>>>>>>>>>
>>>>>>>>> Benchmark                         Mode  Cnt         Score       Error  Units
>>>>>>>>> NativeBenchmark.expBenchmark     thrpt   25  50047147.099 ±  924366.937  ops/s
>>>>>>>>> NativeBenchmark.getpidBenchmark  thrpt   25   4825508.193 ±   21662.633  ops/s
>>>>>>>>>
>>>>>>>>> Now, it looks like getpid() is really slow on Fedora 27 for 
>>>>>>>>> some reason, but as Linus puts it, we should not be using that 
>>>>>>>>> for benchmarking:
>>>>>>>>> https://yarchive.net/comp/linux/getpid_caching.html
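>>>>>>>>>
>>>>>>>>> The gist of Linus's point, as a Linux-only sketch: depending 
>>>>>>>>> on the libc version, getpid() may be answered from a 
>>>>>>>>> userspace cache without ever entering the kernel (older glibc 
>>>>>>>>> versions cached it; newer ones dropped the cache), so the two 
>>>>>>>>> calls below can have very different costs:
>>>>>>>>>
>>>>>>>>>     #include <cstdio>
>>>>>>>>>     #include <sys/syscall.h>
>>>>>>>>>     #include <unistd.h>
>>>>>>>>>
>>>>>>>>>     int main() {
>>>>>>>>>         long viaLibc = getpid();               // possibly cached in userspace
>>>>>>>>>         long viaSyscall = syscall(SYS_getpid); // always a real syscall
>>>>>>>>>         std::printf("libc: %ld, raw: %ld\n", viaLibc, viaSyscall);
>>>>>>>>>     }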
>>>>>>>>>
>>>>>>>>> What do you get on your machines?
>>>>>>>>>
>>>>>>>>> Samuel
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 09/18/2018 12:58 AM, Maurizio Cimadamore wrote:
>>>>>>>>>> For the records, here's what I get for all the three 
>>>>>>>>>> benchmarks if I compile the JNI code with -O3:
>>>>>>>>>>
>>>>>>>>>> Benchmark                          Mode  Cnt         Score        Error  Units
>>>>>>>>>> PanamaBenchmark.testJNIExp        thrpt    5  28575269.294 ± 1907726.710  ops/s
>>>>>>>>>> PanamaBenchmark.testJNIJavaQsort  thrpt    5    372148.433 ±   27178.529  ops/s
>>>>>>>>>> PanamaBenchmark.testJNIPid        thrpt    5  59240069.011 ±   403881.697  ops/s
>>>>>>>>>>
>>>>>>>>>> The first and second benchmarks get faster and very close to 
>>>>>>>>>> the 'direct' optimization numbers in [1]. Surprisingly, the 
>>>>>>>>>> last benchmark (getpid) is quite a bit slower. I've been 
>>>>>>>>>> able to reproduce this across multiple runs; for that 
>>>>>>>>>> benchmark, omitting O3 seems to achieve the best results, 
>>>>>>>>>> not sure why. It starts off faster in the first couple of 
>>>>>>>>>> warmup iterations, but then it goes slower in all the other 
>>>>>>>>>> runs - presumably it interacts badly with the C2-generated 
>>>>>>>>>> code. For instance, this is a run with O3 enabled:
>>>>>>>>>>
>>>>>>>>>> # Run progress: 66.67% complete, ETA 00:01:40
>>>>>>>>>> # Fork: 1 of 1
>>>>>>>>>> # Warmup Iteration   1: 65182202.653 ops/s
>>>>>>>>>> # Warmup Iteration   2: 64900639.094 ops/s
>>>>>>>>>> # Warmup Iteration   3: 59314945.437 ops/s <---------------------------------
>>>>>>>>>> # Warmup Iteration   4: 59269007.877 ops/s
>>>>>>>>>> # Warmup Iteration   5: 59239905.163 ops/s
>>>>>>>>>> Iteration   1: 59300748.074 ops/s
>>>>>>>>>> Iteration   2: 59249666.044 ops/s
>>>>>>>>>> Iteration   3: 59268597.051 ops/s
>>>>>>>>>> Iteration   4: 59322074.572 ops/s
>>>>>>>>>> Iteration   5: 59059259.317 ops/s
>>>>>>>>>>
>>>>>>>>>> And this is a run with O3 disabled:
>>>>>>>>>>
>>>>>>>>>> # Run progress: 0.00% complete, ETA 00:01:40
>>>>>>>>>> # Fork: 1 of 1
>>>>>>>>>> # Warmup Iteration   1: 55882128.787 ops/s
>>>>>>>>>> # Warmup Iteration   2: 53102361.751 ops/s
>>>>>>>>>> # Warmup Iteration   3: 66964755.699 ops/s <---------------------------------
>>>>>>>>>> # Warmup Iteration   4: 66414428.355 ops/s
>>>>>>>>>> # Warmup Iteration   5: 65328475.276 ops/s
>>>>>>>>>> Iteration   1: 64229192.993 ops/s
>>>>>>>>>> Iteration   2: 65191719.319 ops/s
>>>>>>>>>> Iteration   3: 65352022.471 ops/s
>>>>>>>>>> Iteration   4: 65152090.426 ops/s
>>>>>>>>>> Iteration   5: 65320545.712 ops/s
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> In both cases, the 3rd warmup execution sees a performance 
>>>>>>>>>> jump - with O3 the jump is backwards, w/o O3 the jump is 
>>>>>>>>>> forward, which is quite typical for a JMH benchmark as C2 
>>>>>>>>>> optimizations start to kick in.
>>>>>>>>>>
>>>>>>>>>> For these reasons, I'm reluctant to update my benchmark 
>>>>>>>>>> numbers to reflect the O3 behavior (although I agree that, 
>>>>>>>>>> since the Hotspot code is compiled with that optimization, 
>>>>>>>>>> it would make more sense to use that as a reference).
>>>>>>>>>>
>>>>>>>>>> Maurizio
>>>>>>>>>>
>>>>>>>>>> [1] - 
>>>>>>>>>> http://cr.openjdk.java.net/~mcimadamore/panama/foreign-jmh.txt
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
> 


