[foreign] Poor performance?

Mon May 20 06:13:30 UTC 2019

Hi, Jorn,

Thanks for the followup!

I see, there's still optimization to be done for layouts. Though, as I 
previously pointed out, I'm concerned that relying on APIs like 
MethodHandle and VarHandle is going to prevent these optimizations from 
working with AOT compilers. Substrate VM already has something for 
"memaccess" and "FFI", so it would make sense to have a "high-level API" 
that works with those, and I'm guessing that Panama and Graal are going 
to work together to define that high-level API eventually, but it would 
be great to have clear direction about what the plan is.
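
To make the VarHandle point concrete, here is a minimal sketch of what a precomputed field accessor could look like. It uses only the standard `MethodHandles.byteBufferViewVarHandle` from java.base as a stand-in, not the Panama memaccess API; the class name, the `readSecond` helper, and the use of a direct ByteBuffer in place of a Panama Pointer are assumptions made for illustration. The offset 12 follows from the Win32 SYSTEMTIME layout (eight 16-bit fields, wSecond being the seventh).

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class FieldVarHandleSketch {
    // Byte offset of wSecond in SYSTEMTIME: six preceding 16-bit fields, 6 * 2 = 12.
    static final int W_SECOND_OFFSET = 12;

    // Created once and held in a static final field, so no per-call lookup
    // and no intermediate pointer objects are needed to read the field.
    static final VarHandle SHORT_AT =
            MethodHandles.byteBufferViewVarHandle(short[].class, ByteOrder.LITTLE_ENDIAN);

    // Hypothetical helper: reads wSecond from a buffer holding one SYSTEMTIME.
    static short readSecond(ByteBuffer systemTime) {
        return (short) SHORT_AT.get(systemTime, W_SECOND_OFFSET);
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(16); // sizeof(SYSTEMTIME) == 16
        buf.order(ByteOrder.LITTLE_ENDIAN);
        buf.putShort(W_SECOND_OFFSET, (short) 42);
        System.out.println(readSecond(buf));
    }
}
```

Whether an AOT compiler can fold a handle like this as well as a JIT does is exactly the open question raised above; the sketch only shows the shape of the access.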

Still, neither Panama nor SVM has started looking at mapping inline 
functions or C++ templates to something like custom JVM intrinsics. 
That's the kind of level I would really like for a team like Panama to 
focus on. Then it doesn't matter what the speed is for getters, setters, 
and whatever, we would just need to create inline functions and map 
everything that way. :)

Samuel

On 5/20/19 12:05 AM, Jorn Vernee wrote:
> Some followup on this.
> 
> I've tested a patch that specializes the getter MethodHandle per struct 
> class (rough [1]), and the good news is that this pretty much completely 
> removes the overhead from the struct getter:
> 
> Benchmark                          Mode  Cnt    Score    Error  Units
> JmhCallOnly.jni_javacpp_getonly    avgt   50   45.704 ±  1.448  ns/op
> JmhCallOnly.panama_getfield_only   avgt   50   87.393 ±  7.810  ns/op
> JmhCallOnly.panama_getonly         avgt   50  101.654 ± 13.549  ns/op
> JmhCallOnly.panama_getstruct_only  avgt   50   13.036 ±  0.648  ns/op
> 
> Upon inspection, most of the time was spent on the reflective 
> constructor lookup and call. Now the field access is the largest part of 
> the time spent (which should go down with memaccess as well I think).
> 
> The bad news... since we (apparently) can't get a MethodHandle to the 
> constructor of our Struct impl class without triggering an access 
> violation (seemingly because it's a VMAC?), I had to disable the access 
> checking to get these numbers.
> 
> But, intuitively, we should be able to get that MethodHandle without 
> disabling the access checks, since it's fine to call newInstance 
> reflectively in the current implementation as well, why wouldn't we be 
> able to do the same through a MethodHandle? I'm not sure...
> 
> Jorn
> 
> [1] : 
> http://cr.openjdk.java.net/~jvernee/panama/webrevs/getstruct/webrev.00/
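
For readers following along, the idea of specializing the getter per struct class can be sketched with plain java.lang.invoke: look up the constructor once per class, cache the MethodHandle in a ClassValue, and invoke it instead of calling getConstructor(...).newInstance(...) on every get(). This is not the code from the webrev; `FakeStruct`, `CTORS`, and `newStruct` are hypothetical names, and a `long` address stands in for the Pointer argument. It also does not touch the access-check problem with the real Struct impl classes described above.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

final class StructCtorCacheSketch {
    // Hypothetical stand-in for a generated struct impl class.
    static final class FakeStruct {
        final long address;
        public FakeStruct(long address) { this.address = address; }
    }

    // One constructor handle per struct class, resolved lazily and cached.
    static final ClassValue<MethodHandle> CTORS = new ClassValue<MethodHandle>() {
        @Override
        protected MethodHandle computeValue(Class<?> type) {
            try {
                MethodHandle ctor = MethodHandles.lookup()
                        .findConstructor(type, MethodType.methodType(void.class, long.class));
                // Erase the return type so every cached handle has type (long)Object.
                return ctor.asType(MethodType.methodType(Object.class, long.class));
            } catch (ReflectiveOperationException ex) {
                throw new IllegalStateException(ex);
            }
        }
    };

    // Replaces the reflective lookup + newInstance with a cached handle call.
    static Object newStruct(Class<?> structClass, long address) {
        try {
            return (Object) CTORS.get(structClass).invokeExact(address);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        FakeStruct s = (FakeStruct) newStruct(FakeStruct.class, 0x1000L);
        System.out.println(Long.toHexString(s.address));
    }
}
```

A ClassValue is used here because it ties the cached handle to the lifetime of the struct class; a static final handle per generated class would be another way to get the same effect.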
> 
> Jorn Vernee wrote on 2019-05-18 18:42:
>> Users could bind `malloc` and `free` and use those instead. But,
>> allocation really isn't the problem here... The allocation using Scope
>> in the current link2native implementation is actually really fast,
>> since it allocates a slab of 64KB when creating a Scope (which is
>> reused between benchmark calls), and the actual allocations per
>> benchmark call are just pointer bumps [1] (until a new slab needs to
>> be allocated).
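
As an illustration of why slab allocation is cheap, here is a minimal sketch of a bump allocator over a 64KB slab. It is not the ScopeImpl code linked in [1]; the class and member names are hypothetical, a direct ByteBuffer stands in for native memory, and alignment and liveness checks are left out.

```java
import java.nio.ByteBuffer;

final class SlabAllocatorSketch {
    static final int SLAB_SIZE = 64 * 1024; // 64KB slab, as in the description above

    private ByteBuffer slab = ByteBuffer.allocateDirect(SLAB_SIZE);
    private int top = 0;        // offset of the next free byte in the current slab
    int slabsAllocated = 1;     // exposed only so the behavior can be observed

    // Common case: just advance 'top'. A new slab is allocated only when
    // the current one cannot hold the request.
    ByteBuffer allocate(int size) {
        if (size < 0 || size > SLAB_SIZE) {
            throw new IllegalArgumentException("unsupported size: " + size);
        }
        if (top + size > SLAB_SIZE) {
            slab = ByteBuffer.allocateDirect(SLAB_SIZE);
            slabsAllocated++;
            top = 0;
        }
        ByteBuffer view = slab.duplicate();
        view.position(top);
        view.limit(top + size);
        top += size;
        return view.slice(); // independent view of exactly 'size' bytes
    }

    public static void main(String[] args) {
        SlabAllocatorSketch scope = new SlabAllocatorSketch();
        ByteBuffer a = scope.allocate(16);
        ByteBuffer b = scope.allocate(16);
        System.out.println(a.capacity() + " " + b.capacity() + " " + scope.slabsAllocated);
    }
}
```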
>>
>> I re-ran the benchmark with malloc as well, but switching to malloc
>> degrades performance (on Windows). What does significantly improve
>> performance is removing the call to get the field:
>>
>> Benchmark                                              Mode  Cnt     Score     Error  Units
>> JmhCallOnly.jni_javacpp                                avgt   50    64.804 ±   2.588  ns/op
>> JmhCallOnly.jni_javacpp_getonly                        avgt   50    45.543 ±   1.876  ns/op
>> JmhCallOnly.panama                                     avgt   50    38.244 ±   1.496  ns/op
>> JmhCallOnly.panama_getonly                             avgt   50   530.956 ±  29.321  ns/op
>> JmhGetSystemTimeSeconds.jni_javacpp                    avgt   50   309.768 ±  14.556  ns/op
>> JmhGetSystemTimeSeconds.jni_javacpp_noget              avgt   50   243.865 ±  10.380  ns/op
>> JmhGetSystemTimeSeconds.panama                         avgt   50  4769.212 ± 273.042  ns/op
>> JmhGetSystemTimeSeconds.panama_prelayout               avgt   50   608.144 ±  26.004  ns/op
>> JmhGetSystemTimeSeconds.panama_prelayout_malloc        avgt   50   711.237 ±  33.311  ns/op
>> JmhGetSystemTimeSeconds.panama_prelayout_malloc_noget  avgt   50   104.144 ±   4.195  ns/op
>> JmhGetSystemTimeSeconds.panama_prelayout_noget         avgt   50    64.545 ±   3.848  ns/op
>>
>> Note in particular the `JmhCallOnly.panama_getonly` results compared
>> to `JmhGetSystemTimeSeconds.panama_prelayout_noget`. The relevant code
>> for `JmhCallOnly.panama_getonly` is just:
>>
>>     private static final Scope scope = kernel32_h.scope().fork();
>>     private static final LayoutType<_SYSTEMTIME> systemtimeLayout = LayoutType.ofStruct(_SYSTEMTIME.class);
>>     private Pointer<_SYSTEMTIME> preallocatedSystemTime;
>>
>>     public PanamaBenchmark() {
>>         preallocatedSystemTime = scope.allocate(systemtimeLayout);
>>     }
>>
>>     public short getOnly() { // JmhCallOnly.panama_getonly
>>         return preallocatedSystemTime.get().wSecond$get();
>>     }
>>
>> So the real bottleneck seems to be the field get. But, let's
>> investigate further, since we're doing both a get of the struct
>> object, and then a get of the field. Let's split this into a get of
>> the struct, and a get of a pre-computed _SYSTEMTIME object:
>>
>>     private static final Scope scope = kernel32_h.scope().fork();
>>     private static final LayoutType<_SYSTEMTIME> systemtimeLayout = LayoutType.ofStruct(_SYSTEMTIME.class);
>>     private Pointer<_SYSTEMTIME> preallocatedSystemTime;
>>     private _SYSTEMTIME struct;
>>
>>     public PanamaBenchmark() {
>>         preallocatedSystemTime = scope.allocate(systemtimeLayout);
>>         struct = preallocatedSystemTime.get();
>>     }
>>
>>     public short getOnly() { // JmhCallOnly.panama_getonly
>>         return preallocatedSystemTime.get().wSecond$get();
>>     }
>>
>>     public short getOnlyFieldDirect() { // JmhCallOnly.panama_getfield_only
>>         return struct.wSecond$get();
>>     }
>>
>>     public Object getStructOnly() { // JmhCallOnly.panama_getstruct_only
>>         return preallocatedSystemTime.get();
>>     }
>>
>> Benchmark                          Mode  Cnt    Score    Error  Units
>> JmhCallOnly.jni_javacpp_getonly    avgt   50   48.642 ±  1.917  ns/op
>> JmhCallOnly.panama_getfield_only   avgt   50   93.360 ± 13.364  ns/op
>> JmhCallOnly.panama_getonly         avgt   50  533.978 ± 24.249  ns/op
>> JmhCallOnly.panama_getstruct_only  avgt   50  377.114 ± 19.884  ns/op
>>
>> So part of the performance loss goes to getting the field, which
>> creates a bunch of intermediate Pointer objects (see
>> RuntimeSupport::CasterImpl) [2]. I think the new memaccess API could
>> really help there, since we can pre-compute a VarHandle for the field,
>> and shouldn't need any of these intermediate pointer objects.
>>
>> But, by far the largest part of the time seems to go to creating the
>> _SYSTEMTIME object when calling get() on the `Pointer<_SYSTEMTIME>`,
>> which corresponds to References.OfStruct::get [3]:
>>
>>     static Struct<?> get(Pointer<?> pointer) {
>>         ((BoundedPointer<?>)pointer).checkAlive();
>>         Class<?> carrier = ((LayoutTypeImpl<?>)pointer.type()).carrier();
>>         Class<?> structClass = LibrariesHelper.getStructImplClass(carrier);
>>         try {
>>             return (Struct<?>)structClass.getConstructor(Pointer.class).newInstance(pointer);
>>         } catch (ReflectiveOperationException ex) {
>>             throw new IllegalStateException(ex);
>>         }
>>     }
>>
>> I once had the idea to try and see what specialization of this code on
>> a per-Struct-class basis could do for performance. Maybe now is a good
>> time to try it out ;)
>>
>>> If Panama doesn't allow them to use raw pointers with
>>> layouts without going through hoops, that's a usability problem, and
>>> if those issues are not ironed out eventually, people will be forced
>>> to keep using JNI.
>>
>> I think we are far from being set-in-stone, at least as far as the
>> high-level API goes. I agree that there should be DIY options for
>> doing things. I think the current solution being investigated is to
>> have multiple levels of public API (with memaccess, FFI, and then
>> Panama), instead of just one high-level API that everyone uses. So
>> users would have the option to use the low-level APIs to build their
>> own solution from scratch.
>>
>> Jorn
>>
>> [1] : http://hg.openjdk.java.net/panama/dev/file/cef8136ee7ee/src/java.base/share/classes/jdk/internal/foreign/ScopeImpl.java#l216 
>> [2] : http://hg.openjdk.java.net/panama/dev/file/cef8136ee7ee/src/java.base/share/classes/jdk/internal/foreign/RuntimeSupport.java#l60 
>> [3] : http://hg.openjdk.java.net/panama/dev/file/cef8136ee7ee/src/java.base/share/classes/jdk/internal/foreign/memory/References.java#l524 
>>
>>
>> Samuel Audet wrote on 2019-05-18 12:04:
>>> If I understand correctly memory allocation is the culprit? Is there a
>>> way to call something like malloc() with Panama and still be able to
>>> map it to layouts and/or cast it to whatever we want? Calling malloc()
>>> with JavaCPP won't do anything w.r.t. deallocation, scopes,
>>> cleaners, etc, but it's available as an option, because sometimes
>>> users need it! If Panama doesn't allow them to use raw pointers with
>>> layouts without going through hoops, that's a usability problem, and
>>> if those issues are not ironed out eventually, people will be forced
>>> to keep using JNI.
>>>
>>> Samuel
>>>
>>> On 5/18/19 12:33 AM, Maurizio Cimadamore wrote:
>>>> Thanks Jorn,
>>>> I'd be more interested in knowing the raw native call numbers, does 
>>>> it get any better with linkToNative? Here I'd be expecting 
>>>> performance identical to JNI (since the binder should lower the 
>>>> Pointer to a long, which linkToNative would then pass by register).
>>>>
>>>> As for the fuller benchmark, note that you are also measuring the 
>>>> performance of Scope::allocate, which internally uses some 
>>>> maps. JNR/JNI does not do the same liveness checks that we do, so 
>>>> the full benchmark is not totally fair. But the raw performance of 
>>>> the downcall should be an apples-to-apples comparison, and it 
>>>> shouldn't be 8x slower as it is now (at least not with linkToNative).
>>>>
>>>> Maurizio
>>>>
>>>>
>>>> On 17/05/2019 16:14, Jorn Vernee wrote:
>>>>
>>>>> FWIW, I ran the benchmarks with the linkToNative back-end (using 
>>>>> -Djdk.internal.foreign.NativeInvoker.FASTPATH=direct), but it's 
>>>>> still 2x slower than JNI:
>>>>>
>>>>> Benchmark                                   Mode  Cnt    Score    Error  Units
>>>>> JmhGetSystemTimeSeconds.jni_javacpp         avgt   50  298.046 ± 15.744  ns/op
>>>>> JmhGetSystemTimeSeconds.panama_prelayout    avgt   50  596.567 ± 20.570  ns/op
>>>>>
>>>>> Of course, like Aleksey says: "The numbers [above] are just data. 
>>>>> To gain reusable insights, you need to follow up on why the numbers 
>>>>> are the way they are.". Unfortunately, I'm having some trouble 
>>>>> getting the project to work with the Windows profiler :/ I'm 
>>>>> currently looking into that.
>>>>>
>>>>> Cheers,
>>>>> Jorn
>>>>>
>>>>> Maurizio Cimadamore wrote on 2019-05-17 16:51:
>>>>>> On 17/05/2019 11:26, Maurizio Cimadamore wrote:
>>>>>>> thank you for bringing this up, I saw this benchmark a few days 
>>>>>>> ago and I took a look at it. That benchmark is unfortunately 
>>>>>>> hitting a couple of (transitory!) pain points: (1) it is 
>>>>>>> running on Windows, which lacks the optimizations available for 
>>>>>>> macOS and Linux (directInvoker). Once the linkToNative effort 
>>>>>>> is completed, this discrepancy between platforms will go 
>>>>>>> away. The second problem (2) is that the call is passing a big 
>>>>>>> struct (e.g. bigger than 64 bits). Even on Linux and Mac, such a 
>>>>>>> call would be unable to take advantage of the optimized invoker 
>>>>>>> and would fall back to the so-called 'universal invoker', which is 
>>>>>>> slow.
>>>>>>
>>>>>> Actually, my bad, the bench is passing pointer to structs, not structs
>>>>>> by value - which I think should mean the 'foreign+linkToNative'
>>>>>> experimental branch should be able to handle this. Would be nice to
>>>>>> get some confirmation that this is indeed the case.
>>>>>>
>>>>>> Maurizio