[foreign] Poor performance?

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Mon May 20 08:17:16 UTC 2019


On 20/05/2019 07:13, Samuel Audet wrote:
> Hi, Jorn,
>
> Thanks for the followup!
>
> I see, there's still optimization to be done for layouts. Though, as I 
> previously pointed out, I'm concerned that relying on APIs like 
> MethodHandle and VarHandle is going to prevent these optimizations 
> from working with AOT compilers. Substrate VM already has something 
> for "memaccess" and "FFI", so it would make sense to have a 
> "high-level API" that works with those, and I'm guessing that Panama 
> and Graal are going to work together to define that high-level API 
> eventually, but it would be great to have clear direction about what 
> the plan is.
>
> Still, neither Panama nor SVM have started looking at mapping inline 
> functions or C++ templates to something like custom JVM intrinsics. 
> That's the kind of level I would really like for a team like Panama to 
> focus on. Then it doesn't matter what the speed is for getters, 
> setters, and whatever; we would just need to create inline functions 
> and map everything that way. :)

Points noted - but let's keep this thread focused please.

Maurizio


>
> Samuel
>
> On 5/20/19 12:05 AM, Jorn Vernee wrote:
>> Some followup on this.
>>
>> I've tested a patch that specializes the getter MethodHandle per 
>> struct class (rough webrev: [1]), and the good news is that this pretty 
>> much completely removes the overhead from the struct getter:
>>
>> Benchmark                          Mode  Cnt    Score    Error  Units
>> JmhCallOnly.jni_javacpp_getonly    avgt   50   45.704 ±  1.448  ns/op
>> JmhCallOnly.panama_getfield_only   avgt   50   87.393 ±  7.810  ns/op
>> JmhCallOnly.panama_getonly         avgt   50  101.654 ± 13.549  ns/op
>> JmhCallOnly.panama_getstruct_only  avgt   50   13.036 ±  0.648  ns/op
>>
>> Upon inspection, most of the time was spent on the reflective 
>> constructor lookup and call. Now the field access is the largest part 
>> of the time spent (which I think should also go down with the memaccess API).
>>
>> The bad news... since we (apparently) can't get a MethodHandle to the 
>> constructor of our Struct impl class without triggering an access 
>> violation (seemingly because it's a VM anonymous class, a VMAC?), I had 
>> to disable the access checking to get these numbers.
>>
>> But, intuitively, we should be able to get that MethodHandle without 
>> disabling the access checks: calling newInstance reflectively works fine 
>> in the current implementation, so why wouldn't we be able to do the same 
>> through a MethodHandle? I'm not sure...
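>>
>> Roughly, the shape I would expect to work is something like this 
>> (hypothetical sketch, not the actual patch; `structClass` is the class 
>> returned by LibrariesHelper.getStructImplClass(carrier), and the access 
>> check is exactly what makes this fail today):
>>
>>     // imports: java.lang.invoke.MethodHandles, MethodHandle, MethodType
>>     // exception handling elided for brevity
>>     MethodHandles.Lookup lookup = MethodHandles.lookup();
>>     // a private lookup into the generated impl class would side-step the
>>     // access check - assuming we can even get one for a VMAC
>>     MethodHandles.Lookup privLookup = MethodHandles.privateLookupIn(structClass, lookup);
>>     MethodHandle ctor = privLookup.findConstructor(structClass,
>>             MethodType.methodType(void.class, Pointer.class));
>>     Struct<?> struct = (Struct<?>) ctor.invoke(pointer);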
>>
>> Jorn
>>
>> [1] : 
>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/getstruct/webrev.00/
>>
>> Jorn Vernee schreef op 2019-05-18 18:42:
>>> Users could bind `malloc` and `free` and use those instead. But,
>>> allocation really isn't the problem here... The allocation using Scope
>>> in the current link2native implementation is actually really fast,
>>> since it allocates a slab of 64KB when creating a Scope (which is
>>> reused between benchmark calls), and the actual allocations per
>>> benchmark call are just pointer bumps [1] (until a new slab needs to
>>> be allocated).
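>>>
>>> In other words, the per-allocation fast path is just a bump of an offset
>>> into the current slab, roughly like this (simplified sketch, not the
>>> actual ScopeImpl code; names are illustrative):
>>>
>>>     long allocate(long size, long align) {
>>>         long aligned = (curOffset + align - 1) & -align; // align the bump pointer
>>>         if (aligned + size > slabSize) {
>>>             allocateNewSlab();       // rare slow path: grab a fresh 64KB slab
>>>             aligned = 0;
>>>         }
>>>         curOffset = aligned + size;  // bump
>>>         return slabBase + aligned;   // raw address of the new allocation
>>>     }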
>>>
>>> I re-ran the benchmark with malloc as well, but switching to malloc
>>> degrades performance (on Windows). What does significantly improve
>>> performance is removing the call to get the field:
>>>
>>> Benchmark                                               Mode  Cnt     Score     Error  Units
>>> JmhCallOnly.jni_javacpp                                 avgt   50    64.804 ±   2.588  ns/op
>>> JmhCallOnly.jni_javacpp_getonly                         avgt   50    45.543 ±   1.876  ns/op
>>> JmhCallOnly.panama                                      avgt   50    38.244 ±   1.496  ns/op
>>> JmhCallOnly.panama_getonly                              avgt   50   530.956 ±  29.321  ns/op
>>> JmhGetSystemTimeSeconds.jni_javacpp                     avgt   50   309.768 ±  14.556  ns/op
>>> JmhGetSystemTimeSeconds.jni_javacpp_noget               avgt   50   243.865 ±  10.380  ns/op
>>> JmhGetSystemTimeSeconds.panama                          avgt   50  4769.212 ± 273.042  ns/op
>>> JmhGetSystemTimeSeconds.panama_prelayout                avgt   50   608.144 ±  26.004  ns/op
>>> JmhGetSystemTimeSeconds.panama_prelayout_malloc         avgt   50   711.237 ±  33.311  ns/op
>>> JmhGetSystemTimeSeconds.panama_prelayout_malloc_noget   avgt   50   104.144 ±   4.195  ns/op
>>> JmhGetSystemTimeSeconds.panama_prelayout_noget          avgt   50    64.545 ±   3.848  ns/op
>>>
>>> Note in particular the `JmhCallOnly.panama_getonly` results compared
>>> to `JmhGetSystemTimeSeconds.panama_prelayout_noget`. The relevant code
>>> for `JmhCallOnly.panama_getonly` is just:
>>>
>>>     private static final Scope scope = kernel32_h.scope().fork();
>>>     private static final LayoutType<_SYSTEMTIME> systemtimeLayout =
>>>         LayoutType.ofStruct(_SYSTEMTIME.class);
>>>     private Pointer<_SYSTEMTIME> preallocatedSystemTime;
>>>
>>>     public PanamaBenchmark() {
>>>         preallocatedSystemTime = scope.allocate(systemtimeLayout);
>>>     }
>>>
>>>     public short getOnly() { // JmhCallOnly.panama_getonly
>>>         return preallocatedSystemTime.get().wSecond$get();
>>>     }
>>>
>>> So the real bottleneck seems to be the field get. But, let's
>>> investigate further, since we're doing both a get of the struct
>>> object, and then a get of the field. Let's split this into a get of
>>> the struct, and a get of a pre-computed _SYSTEMTIME object:
>>>
>>>     private static final Scope scope = kernel32_h.scope().fork();
>>>     private static final LayoutType<_SYSTEMTIME> systemtimeLayout =
>>>         LayoutType.ofStruct(_SYSTEMTIME.class);
>>>     private Pointer<_SYSTEMTIME> preallocatedSystemTime;
>>>     private _SYSTEMTIME struct;
>>>
>>>     public PanamaBenchmark() {
>>>         preallocatedSystemTime = scope.allocate(systemtimeLayout);
>>>         struct = preallocatedSystemTime.get();
>>>     }
>>>
>>>     public short getOnly() { // JmhCallOnly.panama_getonly
>>>         return preallocatedSystemTime.get().wSecond$get();
>>>     }
>>>
>>>     public short getOnlyFieldDirect() { // JmhCallOnly.panama_getfield_only
>>>         return struct.wSecond$get();
>>>     }
>>>
>>>     public Object getStructOnly() { // JmhCallOnly.panama_getstruct_only
>>>         return preallocatedSystemTime.get();
>>>     }
>>>
>>> Benchmark                          Mode  Cnt    Score    Error  Units
>>> JmhCallOnly.jni_javacpp_getonly    avgt   50   48.642 ±  1.917  ns/op
>>> JmhCallOnly.panama_getfield_only   avgt   50   93.360 ± 13.364  ns/op
>>> JmhCallOnly.panama_getonly         avgt   50  533.978 ± 24.249  ns/op
>>> JmhCallOnly.panama_getstruct_only  avgt   50  377.114 ± 19.884  ns/op
>>>
>>> So part of the performance loss goes to getting the field, which
>>> creates a bunch of intermediate Pointer objects (see
>>> RuntimeSupport::CasterImpl) [2]. I think the new memaccess API could
>>> really help there, since we can pre-compute a VarHandle for the field,
>>> and shouldn't need any of these intermediate pointer objects.
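>>>
>>> As a rough analogy (using the existing ByteBuffer VarHandle support
>>> rather than the memaccess API itself), the field read would become a
>>> single pre-computed VarHandle access:
>>>
>>>     // imports: java.lang.invoke.MethodHandles, java.lang.invoke.VarHandle,
>>>     //          java.nio.ByteBuffer, java.nio.ByteOrder
>>>     static final VarHandle W_SECOND =
>>>         MethodHandles.byteBufferViewVarHandle(short[].class, ByteOrder.nativeOrder());
>>>
>>>     static short wSecond(ByteBuffer systemTime) {
>>>         // wSecond should sit at byte offset 12 in SYSTEMTIME (8th WORD field)
>>>         return (short) W_SECOND.get(systemTime, 12);
>>>     }
>>>
>>> No intermediate Pointer objects; just an offset computation and a load.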
>>>
>>> But, by far the largest part of the time seems to go to creating the
>>> _SYSTEMTIME object when calling get() on the `Pointer<_SYSTEMTIME>`,
>>> which corresponds to References.OfStruct::get [3]:
>>>
>>>     static Struct<?> get(Pointer<?> pointer) {
>>>         ((BoundedPointer<?>)pointer).checkAlive();
>>>         Class<?> carrier = ((LayoutTypeImpl<?>)pointer.type()).carrier();
>>>         Class<?> structClass = LibrariesHelper.getStructImplClass(carrier);
>>>         try {
>>>             return (Struct<?>)structClass.getConstructor(Pointer.class).newInstance(pointer);
>>>         } catch (ReflectiveOperationException ex) {
>>>             throw new IllegalStateException(ex);
>>>         }
>>>     }
>>>
>>> I once had the idea to try and see what specialization of this code on
>>> a per-Struct-class basis could do for performance. Maybe now is a good
>>> time to try it out ;)
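>>>
>>> For reference, the rough shape I have in mind is to cache a constructor
>>> handle per carrier class, e.g. with a ClassValue (hypothetical sketch,
>>> not the webrev; the access-check problem above applies to
>>> findConstructor too):
>>>
>>>     // imports: java.lang.invoke.MethodHandles, MethodHandle, MethodType
>>>     private static final ClassValue<MethodHandle> CTOR_CACHE = new ClassValue<>() {
>>>         @Override
>>>         protected MethodHandle computeValue(Class<?> carrier) {
>>>             try {
>>>                 Class<?> structClass = LibrariesHelper.getStructImplClass(carrier);
>>>                 return MethodHandles.lookup().findConstructor(structClass,
>>>                         MethodType.methodType(void.class, Pointer.class));
>>>             } catch (ReflectiveOperationException ex) {
>>>                 throw new IllegalStateException(ex);
>>>             }
>>>         }
>>>     };
>>>
>>>     static Struct<?> get(Pointer<?> pointer) {
>>>         ((BoundedPointer<?>)pointer).checkAlive();
>>>         Class<?> carrier = ((LayoutTypeImpl<?>)pointer.type()).carrier();
>>>         try {
>>>             // lookup cost is paid once per class instead of once per get()
>>>             return (Struct<?>) CTOR_CACHE.get(carrier).invoke(pointer);
>>>         } catch (Throwable t) {
>>>             throw new IllegalStateException(t);
>>>         }
>>>     }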
>>>
>>>> If Panama doesn't allow them to use raw pointers with
>>>> layouts without going through hoops, that's a usability problem, and
>>>> if those issues are not ironed out eventually, people will be forced
>>>> to keep using JNI.
>>>
>>> I think we are far from having things set in stone, at least as far as
>>> the high-level API goes. I agree that there should be DIY options for
>>> doing things. I think the current solution being investigated is to
>>> have multiple levels of public API (with memaccess, FFI, and then
>>> Panama), instead of just one high-level API that everyone uses. So
>>> users would have the option to use the low-level APIs to build their
>>> own solution from scratch.
>>>
>>> Jorn
>>>
>>> [1] : 
>>> http://hg.openjdk.java.net/panama/dev/file/cef8136ee7ee/src/java.base/share/classes/jdk/internal/foreign/ScopeImpl.java#l216 
>>> [2] : 
>>> http://hg.openjdk.java.net/panama/dev/file/cef8136ee7ee/src/java.base/share/classes/jdk/internal/foreign/RuntimeSupport.java#l60 
>>> [3] : 
>>> http://hg.openjdk.java.net/panama/dev/file/cef8136ee7ee/src/java.base/share/classes/jdk/internal/foreign/memory/References.java#l524 
>>>
>>>
>>> Samuel Audet schreef op 2019-05-18 12:04:
>>>> If I understand correctly, memory allocation is the culprit? Is there a
>>>> way to call something like malloc() with Panama and still be able to
>>>> map it to layouts and/or cast it to whatever we want? Calling malloc()
>>>> with JavaCPP won't do anything w.r.t. deallocation, scopes,
>>>> cleaners, etc., but it's available as an option, because sometimes
>>>> users need it! If Panama doesn't allow them to use raw pointers with
>>>> layouts without going through hoops, that's a usability problem, and
>>>> if those issues are not ironed out eventually, people will be forced
>>>> to keep using JNI.
>>>>
>>>> Samuel
>>>>
>>>> On 5/18/19 12:33 AM, Maurizio Cimadamore wrote:
>>>>> Thanks Jorn,
>>>>> I'd be more interested in knowing the raw native call numbers - 
>>>>> does it get any better with linkToNative? Here I'd be expecting 
>>>>> performance identical to JNI (since the binder should lower the 
>>>>> Pointer to a long, which linkToNative would then pass by register).
>>>>>
>>>>> As for the fuller benchmark, note that you are also measuring the 
>>>>> performance of Scope::allocate, which is internally using some 
>>>>> maps. JNR/JNI does not do the same liveness checks that we do, 
>>>>> so the full benchmark is not totally fair. But the raw performance 
>>>>> of the downcall should be an apples-to-apples comparison, and it 
>>>>> shouldn't be 8x slower as it is now (at least not with linkToNative).
>>>>>
>>>>> Maurizio
>>>>>
>>>>>
>>>>> On 17/05/2019 16:14, Jorn Vernee wrote:
>>>>>
>>>>>> FWIW, I ran the benchmarks with the linkToNative back-end (using 
>>>>>> -Djdk.internal.foreign.NativeInvoker.FASTPATH=direct), but it's 
>>>>>> still 2x slower than JNI:
>>>>>>
>>>>>> Benchmark                                 Mode  Cnt    Score    Error  Units
>>>>>> JmhGetSystemTimeSeconds.jni_javacpp       avgt   50  298.046 ± 15.744  ns/op
>>>>>> JmhGetSystemTimeSeconds.panama_prelayout  avgt   50  596.567 ± 20.570  ns/op
>>>>>>
>>>>>> Of course, like Aleksey says: "The numbers [above] are just data. 
>>>>>> To gain reusable insights, you need to follow up on why the 
>>>>>> numbers are the way they are." Unfortunately, I'm having some 
>>>>>> trouble getting the project to work with the Windows profiler :/ 
>>>>>> I'm currently looking into that.
>>>>>>
>>>>>> Cheers,
>>>>>> Jorn
>>>>>>
>>>>>> Maurizio Cimadamore schreef op 2019-05-17 16:51:
>>>>>>> On 17/05/2019 11:26, Maurizio Cimadamore wrote:
>>>>>>>> thank you for bringing this up, I saw this benchmark a few days 
>>>>>>>> ago and I took a look at it. That benchmark is unfortunately 
>>>>>>>> hitting a couple of (transitory!) pain points: (1) it is 
>>>>>>>> running on Windows, which lacks the optimizations available for 
>>>>>>>> MacOS and Linux (directInvoker). When the linkToNative effort 
>>>>>>>> is completed, this discrepancy between platforms will go 
>>>>>>>> away. The second problem (2) is that the call is passing a big 
>>>>>>>> struct (i.e. bigger than 64 bits). Even on Linux and Mac, such 
>>>>>>>> a call would be unable to take advantage of the optimized 
>>>>>>>> invoker and would fall back to the so-called 'universal 
>>>>>>>> invoker', which is slow.
>>>>>>>
>>>>>>> Actually, my bad, the bench is passing pointers to structs, not structs
>>>>>>> by value - which I think means the 'foreign+linkToNative'
>>>>>>> experimental branch should be able to handle this. Would be nice to
>>>>>>> get some confirmation that this is indeed the case.
>>>>>>>
>>>>>>> Maurizio
>

