[foreign] Poor performance?

Sat May 18 16:42:53 UTC 2019

Users could bind `malloc` and `free` and use those instead. But, 
allocation really isn't the problem here... The allocation using Scope 
in the current link2native implementation is actually really fast, since 
it allocates a slab of 64KB when creating a Scope (which is reused 
between benchmark calls), and the actual allocations per benchmark call 
are just pointer bumps [1] (until a new slab needs to be allocated).
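
To make the "pointer bump" point concrete, here is a minimal sketch of a slab-based bump allocator. It is an illustration only, not the ScopeImpl code: the class name `BumpAllocator`, the use of direct ByteBuffers as slabs, and the `slabCount` counter are all assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration of slab + pointer-bump allocation (not ScopeImpl).
final class BumpAllocator {
    private static final int SLAB_SIZE = 64 * 1024; // 64KB slab, as described above

    private ByteBuffer slab = newSlab(SLAB_SIZE);
    private int offset = 0;    // next free byte in the current slab
    private int slabCount = 1; // how many slabs have been created so far

    private static ByteBuffer newSlab(int size) {
        return ByteBuffer.allocateDirect(size);
    }

    /** Returns a view of {@code size} bytes whose slab offset is a multiple of {@code alignment} (a power of two). */
    ByteBuffer allocate(int size, int alignment) {
        int start = (offset + alignment - 1) & -alignment; // round the offset up
        if (start + size > slab.capacity()) {
            // Slow path: the current slab is exhausted, so start a new one.
            slab = newSlab(Math.max(SLAB_SIZE, size));
            slabCount++;
            start = 0;
        }
        offset = start + size; // fast path: the "allocation" is just this bump
        ByteBuffer view = slab.duplicate();
        view.position(start).limit(start + size);
        return view.slice().order(ByteOrder.nativeOrder());
    }

    int slabCount() {
        return slabCount;
    }
}
```

In the common case an allocation costs one addition, one mask and one bounds check, which is why it is unlikely to dominate a benchmark that reuses the same Scope.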

I re-ran the benchmark with malloc as well, but switching to malloc 
degrades performance (on Windows). What does significantly improve 
performance is removing the call to get the field:

Benchmark                                              Mode  Cnt     Score     Error  Units
JmhCallOnly.jni_javacpp                                avgt   50    64.804 ±   2.588  ns/op
JmhCallOnly.jni_javacpp_getonly                        avgt   50    45.543 ±   1.876  ns/op
JmhCallOnly.panama                                     avgt   50    38.244 ±   1.496  ns/op
JmhCallOnly.panama_getonly                             avgt   50   530.956 ±  29.321  ns/op
JmhGetSystemTimeSeconds.jni_javacpp                    avgt   50   309.768 ±  14.556  ns/op
JmhGetSystemTimeSeconds.jni_javacpp_noget              avgt   50   243.865 ±  10.380  ns/op
JmhGetSystemTimeSeconds.panama                         avgt   50  4769.212 ± 273.042  ns/op
JmhGetSystemTimeSeconds.panama_prelayout               avgt   50   608.144 ±  26.004  ns/op
JmhGetSystemTimeSeconds.panama_prelayout_malloc        avgt   50   711.237 ±  33.311  ns/op
JmhGetSystemTimeSeconds.panama_prelayout_malloc_noget  avgt   50   104.144 ±   4.195  ns/op
JmhGetSystemTimeSeconds.panama_prelayout_noget         avgt   50    64.545 ±   3.848  ns/op

Note in particular the `JmhCallOnly.panama_getonly` result (about 531 
ns/op) compared to `JmhGetSystemTimeSeconds.panama_prelayout_noget` 
(about 65 ns/op). The relevant code for `JmhCallOnly.panama_getonly` is 
just:

     private static final Scope scope = kernel32_h.scope().fork();
     private static final LayoutType<_SYSTEMTIME> systemtimeLayout =
             LayoutType.ofStruct(_SYSTEMTIME.class);
     private Pointer<_SYSTEMTIME> preallocatedSystemTime;

     public PanamaBenchmark() {
         preallocatedSystemTime = scope.allocate(systemtimeLayout);
     }

     public short getOnly() { // JmhCallOnly.panama_getonly
         return preallocatedSystemTime.get().wSecond$get();
     }

So the real bottleneck seems to be the field get. But let's investigate 
further, since we're doing both a get of the struct object and then a 
get of the field. Let's split this into a get of the struct only, and a 
get of the field from a pre-computed _SYSTEMTIME object:

     private static final Scope scope = kernel32_h.scope().fork();
     private static final LayoutType<_SYSTEMTIME> systemtimeLayout =
             LayoutType.ofStruct(_SYSTEMTIME.class);
     private Pointer<_SYSTEMTIME> preallocatedSystemTime;
     private _SYSTEMTIME struct;

     public PanamaBenchmark() {
         preallocatedSystemTime = scope.allocate(systemtimeLayout);
         struct = preallocatedSystemTime.get();
     }

     public short getOnly() { // JmhCallOnly.panama_getonly
         return preallocatedSystemTime.get().wSecond$get();
     }

     public short getOnlyFieldDirect() { // JmhCallOnly.panama_getfield_only
         return struct.wSecond$get();
     }

     public Object getStructOnly() { // JmhCallOnly.panama_getstruct_only
         return preallocatedSystemTime.get();
     }

Benchmark                          Mode  Cnt    Score    Error  Units
JmhCallOnly.jni_javacpp_getonly    avgt   50   48.642 ±  1.917  ns/op
JmhCallOnly.panama_getfield_only   avgt   50   93.360 ± 13.364  ns/op
JmhCallOnly.panama_getonly         avgt   50  533.978 ± 24.249  ns/op
JmhCallOnly.panama_getstruct_only  avgt   50  377.114 ± 19.884  ns/op

So part of the performance loss goes to getting the field, which creates 
a bunch of intermediate Pointer objects (see RuntimeSupport::CasterImpl) 
[2]. I think the new memaccess API could really help there, since we can 
pre-compute a VarHandle for the field, and shouldn't need any of these 
intermediate pointer objects.
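
As a rough sketch of the pre-computed VarHandle idea, the example below uses the standard `MethodHandles.byteBufferViewVarHandle` (Java 9+) as a stand-in, since the memaccess API itself is not shown here. The class name `SystemTimeFields` and the use of a ByteBuffer to hold the struct are assumptions for illustration. The offset comes from the Win32 SYSTEMTIME layout: eight consecutive 16-bit WORD fields, with wSecond as the seventh.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical stand-in: field access through one pre-computed VarHandle.
final class SystemTimeFields {
    static final int SIZE = 16;            // 8 WORD fields * 2 bytes
    static final int W_SECOND_OFFSET = 12; // wSecond is the 7th WORD (index 6)

    // Created once and kept in a static final field, so no per-access
    // helper objects are needed.
    private static final VarHandle SHORT_LE =
            MethodHandles.byteBufferViewVarHandle(short[].class, ByteOrder.LITTLE_ENDIAN);

    /** Reads wSecond from a buffer holding one SYSTEMTIME at position 0. */
    static short wSecond(ByteBuffer struct) {
        return (short) SHORT_LE.get(struct, W_SECOND_OFFSET);
    }

    /** Writes wSecond into a buffer holding one SYSTEMTIME at position 0. */
    static void wSecond(ByteBuffer struct, short value) {
        SHORT_LE.set(struct, W_SECOND_OFFSET, value);
    }
}
```

The getter here is a single VarHandle call with a constant offset. Whether the memaccess API gets the field read down to JNI levels would still need to be measured.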

But, by far the largest part of the time seems to go to creating the 
_SYSTEMTIME object when calling get() on the `Pointer<_SYSTEMTIME>`, 
which corresponds to References.OfStruct::get [3]:

     static Struct<?> get(Pointer<?> pointer) {
         ((BoundedPointer<?>)pointer).checkAlive();
         Class<?> carrier = ((LayoutTypeImpl<?>)pointer.type()).carrier();
         Class<?> structClass = LibrariesHelper.getStructImplClass(carrier);
         try {
             return (Struct<?>)structClass.getConstructor(Pointer.class).newInstance(pointer);
         } catch (ReflectiveOperationException ex) {
             throw new IllegalStateException(ex);
         }
     }

I once had the idea to try and see what specialization of this code on a 
per-Struct-class basis could do for performance. Maybe now is a good 
time to try it out ;)
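
One possible shape for that specialization, sketched under assumptions: look up the constructor once per struct implementation class and cache it as a MethodHandle in a `ClassValue`, instead of calling `getConstructor(...).newInstance(...)` on every get. The types `Ptr`, `StructView` and `SystemTimeImpl` below are hypothetical stand-ins for Pointer, Struct and the generated impl class. This is not code from the Panama repository, and its effect on the numbers above is untested.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Hypothetical sketch: per-class cached constructor handles.
final class StructFactories {
    // Stand-ins for the Panama Pointer / Struct / generated impl types.
    static final class Ptr {
        final long address;
        Ptr(long address) { this.address = address; }
    }
    interface StructView { Ptr ptr(); }
    public static final class SystemTimeImpl implements StructView {
        private final Ptr ptr;
        public SystemTimeImpl(Ptr ptr) { this.ptr = ptr; }
        public Ptr ptr() { return ptr; }
    }

    // The reflective lookup runs at most once per impl class; later calls
    // only fetch the cached handle.
    private static final ClassValue<MethodHandle> CONSTRUCTORS = new ClassValue<MethodHandle>() {
        @Override
        protected MethodHandle computeValue(Class<?> implClass) {
            try {
                return MethodHandles.lookup()
                        .findConstructor(implClass, MethodType.methodType(void.class, Ptr.class))
                        // Adapt to a fixed (Ptr)StructView type so invokeExact can be used.
                        .asType(MethodType.methodType(StructView.class, Ptr.class));
            } catch (ReflectiveOperationException ex) {
                throw new IllegalStateException(ex);
            }
        }
    };

    static StructView get(Class<?> implClass, Ptr pointer) {
        try {
            return (StructView) CONSTRUCTORS.get(implClass).invokeExact(pointer);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }
}
```

A call such as `StructFactories.get(StructFactories.SystemTimeImpl.class, ptr)` then does the reflective lookup only on first use for that class.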

> If Panama doesn't allow them to use raw pointers with
> layouts without going through hoops, that's a usability problem, and
> if those issues are not ironed out eventually, people will be forced
> to keep using JNI.

I think we are far from being set in stone, at least as far as the 
high-level API goes. I agree that there should be DIY options for doing 
things. I think the current solution being investigated is to have 
multiple levels of public API (with memaccess, FFI, and then Panama), 
instead of just one high-level API that everyone uses. So users would 
have the option to use the low-level APIs to build their own solution 
from scratch.

Jorn

[1] http://hg.openjdk.java.net/panama/dev/file/cef8136ee7ee/src/java.base/share/classes/jdk/internal/foreign/ScopeImpl.java#l216
[2] http://hg.openjdk.java.net/panama/dev/file/cef8136ee7ee/src/java.base/share/classes/jdk/internal/foreign/RuntimeSupport.java#l60
[3] http://hg.openjdk.java.net/panama/dev/file/cef8136ee7ee/src/java.base/share/classes/jdk/internal/foreign/memory/References.java#l524

Samuel Audet wrote on 2019-05-18 12:04:
> If I understand correctly memory allocation is the culprit? Is there a
> way to call something like malloc() with Panama and still be able to
> map it to layouts and/or cast it to whatever we want? Calling malloc()
> with JavaCPP won't do anything w.r.t. deallocation, scopes,
> cleaners, etc, but it's available as an option, because sometimes
> users need it! If Panama doesn't allow them to use raw pointers with
> layouts without going through hoops, that's a usability problem, and
> if those issues are not ironed out eventually, people will be forced
> to keep using JNI.
> 
> Samuel
> 
> On 5/18/19 12:33 AM, Maurizio Cimadamore wrote:
>> Thanks Jorn,
>> I'd be more interested in knowing the raw native call numbers, does it 
>> get any better with linkToNative? Here I'd be expecting performances 
>> identical to JNI (since the binder should lower the Pointer to a long, 
>> which LinkToNative would then pass by register).
>> 
>> As for the fuller benchmark, note that you are also measuring the 
>> performance of Scope::allocate, which is internally using some maps. 
>> JNR/JNI does not do the same liveness checks that we do, so the full 
>> benchmark is not totally fair. But the raw performance of the downcall 
>> should be an apples-to-apples comparison, and it shouldn't be 8x slower 
>> as it is now (at least not with linkToNative).
>> 
>> Maurizio
>> 
>> 
>> On 17/05/2019 16:14, Jorn Vernee wrote:
>> 
>>> FWIW, I ran the benchmarks with the linkToNative back-end (using 
>>> -Djdk.internal.foreign.NativeInvoker.FASTPATH=direct), but it's still 
>>> 2x slower than JNI:
>>> 
>>> Benchmark                                   Mode  Cnt    Score     Error  Units
>>> JmhGetSystemTimeSeconds.jni_javacpp         avgt   50  298.046 ±  15.744  ns/op
>>> JmhGetSystemTimeSeconds.panama_prelayout    avgt   50  596.567 ±  20.570  ns/op
>>> 
>>> Of course, like Aleksey says: "The numbers [above] are just data. To 
>>> gain reusable insights, you need to follow up on why the numbers are 
>>> the way they are.". Unfortunately, I'm having some trouble getting 
>>> the project to work with the Windows profiler :/ Was currently 
>>> looking into that.
>>> 
>>> Cheers,
>>> Jorn
>>> 
>>> Maurizio Cimadamore wrote on 2019-05-17 16:51:
>>>> On 17/05/2019 11:26, Maurizio Cimadamore wrote:
>>>>> thank you for bringing this up, I saw this benchmark a few days ago 
>>>>> and I took a look at it. That benchmark is unfortunately hitting on 
>>>>> a couple of (transitory!) pain points: (1) it is running on 
>>>>> Windows, which lacks the optimizations available for MacOS and 
>>>>> Linux (directInvoker). When the linkToNative effort is 
>>>>> completed, this discrepancy between platforms will go away. The 
>>>>> second problem (2) is that the call is passing a big struct (e.g. 
>>>>> bigger than 64 bits). Even on Linux and Mac, such a call would be 
>>>>> unable to take advantage of the optimized invoker and would fall 
>>>>> back to the so-called 'universal invoker', which is slow.
>>>> 
>>>> Actually, my bad, the bench is passing pointer to structs, not
>>>> structs by value - which I think should mean the
>>>> 'foreign+linkToNative' experimental branch should be able to handle
>>>> this. Would be nice to get some confirmation that this is indeed the
>>>> case.
>>>> 
>>>> Maurizio