[foreign] Poor performance?
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Mon May 20 08:17:16 UTC 2019
On 20/05/2019 07:13, Samuel Audet wrote:
> Hi, Jorn,
>
> Thanks for the followup!
>
> I see, there's still optimization to be done for layouts. Though, as I
> previously pointed out, I'm concerned that relying on APIs like
> MethodHandle and VarHandle is going to prevent these optimizations
> from working with AOT compilers. Substrate VM already has something
> for "memaccess" and "FFI", so it would make sense to have a
> "high-level API" that works with those, and I'm guessing that Panama
> and Graal are going to work together to define that high-level API
> eventually, but it would be great to have clear direction about what
> the plan is.
>
> Still, neither Panama nor SVM has started looking at mapping inline
> functions or C++ templates to something like custom JVM intrinsics.
> That's the kind of level I would really like a team like Panama to
> focus on. Then it doesn't matter what the speed is for getters,
> setters, and so on; we would just need to create inline functions
> and map everything that way. :)
Points noted - but let's keep this thread focused please.
Maurizio
>
> Samuel
>
> On 5/20/19 12:05 AM, Jorn Vernee wrote:
>> Some followup on this.
>>
>> I've tested a patch that specializes the getter MethodHandle per
>> struct class (rough version at [1]), and the good news is that this
>> pretty much completely removes the overhead from the struct getter:
>>
>> Benchmark                          Mode  Cnt    Score    Error  Units
>> JmhCallOnly.jni_javacpp_getonly    avgt   50   45.704 ±  1.448  ns/op
>> JmhCallOnly.panama_getfield_only   avgt   50   87.393 ±  7.810  ns/op
>> JmhCallOnly.panama_getonly         avgt   50  101.654 ± 13.549  ns/op
>> JmhCallOnly.panama_getstruct_only  avgt   50   13.036 ±  0.648  ns/op
>>
>> Upon inspection, most of the time was spent on the reflective
>> constructor lookup and call. Now the field access is the largest part
>> of the time spent (which should go down with memaccess as well I think).
>>
>> The bad news... since we (apparently) can't get a MethodHandle to the
>> constructor of our Struct impl class without triggering an access
>> violation (seemingly because it's a VMAC?), I had to disable the
>> access checking to get these numbers.
>>
>> But, intuitively, we should be able to get that MethodHandle without
>> disabling the access checks: calling newInstance reflectively works
>> fine in the current implementation, so why wouldn't we be able to do
>> the same through a MethodHandle? I'm not sure...
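>>
>> To make the idea concrete, here is a minimal sketch of the
>> specialization (illustrative only, not the code in the webrev at [1];
>> the names are made up, and it assumes a Lookup with enough access,
>> which is exactly the problem described above):
>>
>> import java.lang.invoke.MethodHandle;
>> import java.lang.invoke.MethodHandles;
>> import java.lang.invoke.MethodType;
>>
>> // Cache one constructor MethodHandle per struct impl class, so that
>> // Pointer::get no longer pays for getConstructor + newInstance on
>> // every call. 'STRUCT_CTORS' is a hypothetical name.
>> static final ClassValue<MethodHandle> STRUCT_CTORS = new ClassValue<>() {
>>     @Override
>>     protected MethodHandle computeValue(Class<?> structClass) {
>>         try {
>>             return MethodHandles.lookup().findConstructor(structClass,
>>                     MethodType.methodType(void.class, Pointer.class));
>>         } catch (ReflectiveOperationException ex) {
>>             throw new IllegalStateException(ex);
>>         }
>>     }
>> };
>>
>> static Struct<?> get(Pointer<?> pointer) throws Throwable {
>>     Class<?> carrier = ((LayoutTypeImpl<?>) pointer.type()).carrier();
>>     Class<?> structClass = LibrariesHelper.getStructImplClass(carrier);
>>     return (Struct<?>) STRUCT_CTORS.get(structClass).invoke(pointer);
>> }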
>>
>> Jorn
>>
>> [1] :
>> http://cr.openjdk.java.net/~jvernee/panama/webrevs/getstruct/webrev.00/
>>
>> Jorn Vernee schreef op 2019-05-18 18:42:
>>> Users could bind `malloc` and `free` and use those instead. But,
>>> allocation really isn't the problem here... The allocation using Scope
>>> in the current link2native implementation is actually really fast,
>>> since it allocates a slab of 64KB when creating a Scope (which is
>>> reused between benchmark calls), and the actual allocations per
>>> benchmark call are just pointer bumps [1] (until a new slab needs to
>>> be allocated).
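>>>
>>> For context, the pointer-bump idea boils down to something like the
>>> sketch below (illustrative only, not the actual ScopeImpl code at [1];
>>> 'BumpAllocator' is a made-up name and it works over a plain byte[]
>>> just to keep the example self-contained):
>>>
>>> // Slab + pointer-bump pattern: each allocation is just an align-up,
>>> // an add and a bounds check, until the slab runs out and a new one
>>> // has to be created.
>>> final class BumpAllocator {
>>>     private static final int SLAB_SIZE = 64 * 1024;
>>>     private byte[] slab = new byte[SLAB_SIZE];
>>>     private int offset = 0;                     // next free byte in 'slab'
>>>
>>>     /** Returns the start offset of a new 'size'-byte block in the current slab. */
>>>     int allocate(int size, int align) {         // align must be a power of two
>>>         int start = (offset + align - 1) & -align;
>>>         if (start + size > SLAB_SIZE) {         // slow path: start a new slab
>>>             slab = new byte[SLAB_SIZE];
>>>             start = 0;
>>>         }
>>>         offset = start + size;
>>>         return start;
>>>     }
>>> }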
>>>
>>> I re-ran the benchmark with malloc as well, but switching to malloc
>>> degrades performance (on Windows). What does significantly improve
>>> performance is removing the call to get the field:
>>>
>>> Benchmark                                              Mode  Cnt     Score     Error  Units
>>> JmhCallOnly.jni_javacpp                                avgt   50    64.804 ±   2.588  ns/op
>>> JmhCallOnly.jni_javacpp_getonly                        avgt   50    45.543 ±   1.876  ns/op
>>> JmhCallOnly.panama                                     avgt   50    38.244 ±   1.496  ns/op
>>> JmhCallOnly.panama_getonly                             avgt   50   530.956 ±  29.321  ns/op
>>> JmhGetSystemTimeSeconds.jni_javacpp                    avgt   50   309.768 ±  14.556  ns/op
>>> JmhGetSystemTimeSeconds.jni_javacpp_noget              avgt   50   243.865 ±  10.380  ns/op
>>> JmhGetSystemTimeSeconds.panama                         avgt   50  4769.212 ± 273.042  ns/op
>>> JmhGetSystemTimeSeconds.panama_prelayout               avgt   50   608.144 ±  26.004  ns/op
>>> JmhGetSystemTimeSeconds.panama_prelayout_malloc        avgt   50   711.237 ±  33.311  ns/op
>>> JmhGetSystemTimeSeconds.panama_prelayout_malloc_noget  avgt   50   104.144 ±   4.195  ns/op
>>> JmhGetSystemTimeSeconds.panama_prelayout_noget         avgt   50    64.545 ±   3.848  ns/op
>>>
>>> Note in particular the `JmhCallOnly.panama_getonly` results compared
>>> to `JmhGetSystemTimeSeconds.panama_prelayout_noget`. The relevant code
>>> for `JmhCallOnly.panama_getonly` is just:
>>>
>>> private static final Scope scope = kernel32_h.scope().fork();
>>> private static final LayoutType<_SYSTEMTIME> systemtimeLayout =
>>>         LayoutType.ofStruct(_SYSTEMTIME.class);
>>> private Pointer<_SYSTEMTIME> preallocatedSystemTime;
>>>
>>> public PanamaBenchmark() {
>>>     preallocatedSystemTime = scope.allocate(systemtimeLayout);
>>> }
>>>
>>> public short getOnly() { // JmhCallOnly.panama_getonly
>>>     return preallocatedSystemTime.get().wSecond$get();
>>> }
>>>
>>> So the real bottleneck seems to be the field get. But, let's
>>> investigate further, since we're doing both a get of the struct
>>> object, and then a get of the field. Let's split this into a get of
>>> the struct, and a get of a pre-computed _SYSTEMTIME object:
>>>
>>> private static final Scope scope = kernel32_h.scope().fork();
>>> private static final LayoutType<_SYSTEMTIME> systemtimeLayout =
>>>         LayoutType.ofStruct(_SYSTEMTIME.class);
>>> private Pointer<_SYSTEMTIME> preallocatedSystemTime;
>>> private _SYSTEMTIME struct;
>>>
>>> public PanamaBenchmark() {
>>>     preallocatedSystemTime = scope.allocate(systemtimeLayout);
>>>     struct = preallocatedSystemTime.get();
>>> }
>>>
>>> public short getOnly() { // JmhCallOnly.panama_getonly
>>>     return preallocatedSystemTime.get().wSecond$get();
>>> }
>>>
>>> public short getOnlyFieldDirect() { // JmhCallOnly.panama_getfield_only
>>>     return struct.wSecond$get();
>>> }
>>>
>>> public Object getStructOnly() { // JmhCallOnly.panama_getstruct_only
>>>     return preallocatedSystemTime.get();
>>> }
>>>
>>> Benchmark                          Mode  Cnt    Score    Error  Units
>>> JmhCallOnly.jni_javacpp_getonly    avgt   50   48.642 ±  1.917  ns/op
>>> JmhCallOnly.panama_getfield_only   avgt   50   93.360 ± 13.364  ns/op
>>> JmhCallOnly.panama_getonly         avgt   50  533.978 ± 24.249  ns/op
>>> JmhCallOnly.panama_getstruct_only  avgt   50  377.114 ± 19.884  ns/op
>>>
>>> So part of the performance loss goes to getting the field, which
>>> creates a bunch of intermediate Pointer objects (see
>>> RuntimeSupport::CasterImpl) [2]. I think the new memaccess API could
>>> really help there, since we can pre-compute a VarHandle for the field,
>>> and shouldn't need any of these intermediate pointer objects.
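>>>
>>> Concretely, I'd expect something along these lines (the API names
>>> below follow the jdk.incubator.foreign memory-access prototype, so
>>> treat the exact shape as an assumption rather than something
>>> available in this branch):
>>>
>>> import jdk.incubator.foreign.MemoryLayout;
>>> import jdk.incubator.foreign.MemoryLayout.PathElement;
>>> import jdk.incubator.foreign.MemoryLayouts;
>>> import jdk.incubator.foreign.MemorySegment;
>>> import java.lang.invoke.VarHandle;
>>>
>>> // _SYSTEMTIME is eight consecutive WORD (16-bit) fields.
>>> static final MemoryLayout SYSTEMTIME = MemoryLayout.ofStruct(
>>>         MemoryLayouts.JAVA_SHORT.withName("wYear"),
>>>         MemoryLayouts.JAVA_SHORT.withName("wMonth"),
>>>         MemoryLayouts.JAVA_SHORT.withName("wDayOfWeek"),
>>>         MemoryLayouts.JAVA_SHORT.withName("wDay"),
>>>         MemoryLayouts.JAVA_SHORT.withName("wHour"),
>>>         MemoryLayouts.JAVA_SHORT.withName("wMinute"),
>>>         MemoryLayouts.JAVA_SHORT.withName("wSecond"),
>>>         MemoryLayouts.JAVA_SHORT.withName("wMilliseconds"));
>>>
>>> // Pre-computed once; no intermediate Pointer objects per access.
>>> static final VarHandle W_SECOND = SYSTEMTIME.varHandle(short.class,
>>>         PathElement.groupElement("wSecond"));
>>>
>>> static short readSecond(MemorySegment systemTime) {
>>>     return (short) W_SECOND.get(systemTime.baseAddress());
>>> }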
>>>
>>> But, by far the largest part of the time seems to go to creating the
>>> _SYSTEMTIME object when calling get() on the `Pointer<_SYSTEMTIME>`,
>>> which corresponds to References.OfStruct::get [3]:
>>>
>>> static Struct<?> get(Pointer<?> pointer) {
>>>     ((BoundedPointer<?>) pointer).checkAlive();
>>>     Class<?> carrier = ((LayoutTypeImpl<?>) pointer.type()).carrier();
>>>     Class<?> structClass = LibrariesHelper.getStructImplClass(carrier);
>>>     try {
>>>         return (Struct<?>) structClass.getConstructor(Pointer.class).newInstance(pointer);
>>>     } catch (ReflectiveOperationException ex) {
>>>         throw new IllegalStateException(ex);
>>>     }
>>> }
>>>
>>> I once had the idea to try and see what specializing this code on a
>>> per-Struct-class basis could do for performance. Maybe now is a good
>>> time to try it out ;)
>>>
>>>> If Panama doesn't allow them to use raw pointers with
>>>> layouts without going through hoops, that's a usability problem, and
>>>> if those issues are not ironed out eventually, people will be forced
>>>> to keep using JNI.
>>>
>>> I think we are far from being set-in-stone, at least as far as the
>>> high-level API goes. I agree that there should be DIY options for
>>> doing things. I think the current solution being investigated is to
>>> have multiple levels of public API (with memaccess, FFI, and then
>>> Panama), instead of just one high-level API that everyone uses. So
>>> users would have the option to use the low-level APIs to build their
>>> own solution from scratch.
>>>
>>> Jorn
>>>
>>> [1] :
>>> http://hg.openjdk.java.net/panama/dev/file/cef8136ee7ee/src/java.base/share/classes/jdk/internal/foreign/ScopeImpl.java#l216
>>> [2] :
>>> http://hg.openjdk.java.net/panama/dev/file/cef8136ee7ee/src/java.base/share/classes/jdk/internal/foreign/RuntimeSupport.java#l60
>>> [3] :
>>> http://hg.openjdk.java.net/panama/dev/file/cef8136ee7ee/src/java.base/share/classes/jdk/internal/foreign/memory/References.java#l524
>>>
>>>
>>> Samuel Audet schreef op 2019-05-18 12:04:
>>>> If I understand correctly memory allocation is the culprit? Is there a
>>>> way to call something like malloc() with Panama and still be able to
>>>> map it to layouts and/or cast it to whatever we want? Calling malloc()
>>>> with JavaCPP won't do anything w.r.t. deallocation, scopes,
>>>> cleaners, etc., but it's available as an option, because sometimes
>>>> users need it! If Panama doesn't allow them to use raw pointers with
>>>> layouts without going through hoops, that's a usability problem, and
>>>> if those issues are not ironed out eventually, people will be forced
>>>> to keep using JNI.
>>>>
>>>> Samuel
>>>>
>>>> On 5/18/19 12:33 AM, Maurizio Cimadamore wrote:
>>>>> Thanks Jorn,
>>>>> I'd be more interested in knowing the raw native call numbers:
>>>>> does it get any better with linkToNative? Here I'd expect
>>>>> performance identical to JNI (since the binder should lower the
>>>>> Pointer to a long, which linkToNative would then pass by register).
>>>>>
>>>>> As for the fuller benchmark, note that you are also measuring the
>>>>> performance of Scope::allocate, which internally uses some maps.
>>>>> JNR/JNI does not do the same liveness checks that we do, so the
>>>>> full benchmark is not totally fair. But the raw performance of the
>>>>> downcall should be an apples-to-apples comparison, and it
>>>>> shouldn't be 8x slower as it is now (at least not with linkToNative).
>>>>>
>>>>> Maurizio
>>>>>
>>>>>
>>>>> On 17/05/2019 16:14, Jorn Vernee wrote:
>>>>>
>>>>>> FWIW, I ran the benchmarks with the linkToNative back-end (using
>>>>>> -Djdk.internal.foreign.NativeInvoker.FASTPATH=direct), but it's
>>>>>> still 2x slower than JNI:
>>>>>>
>>>>>> Benchmark                                 Mode  Cnt    Score   Error  Units
>>>>>> JmhGetSystemTimeSeconds.jni_javacpp       avgt   50  298.046 ± 15.744  ns/op
>>>>>> JmhGetSystemTimeSeconds.panama_prelayout  avgt   50  596.567 ± 20.570  ns/op
>>>>>>
>>>>>> Of course, like Aleksey says: "The numbers [above] are just data.
>>>>>> To gain reusable insights, you need to follow up on why the
>>>>>> numbers are the way they are." Unfortunately, I'm having some
>>>>>> trouble getting the project to work with the Windows profiler :/
>>>>>> I'm currently looking into that.
>>>>>>
>>>>>> Cheers,
>>>>>> Jorn
>>>>>>
>>>>>> Maurizio Cimadamore schreef op 2019-05-17 16:51:
>>>>>>> On 17/05/2019 11:26, Maurizio Cimadamore wrote:
>>>>>>> thank you for bringing this up; I saw this benchmark a few days
>>>>>>> ago and took a look at it. That benchmark is unfortunately
>>>>>>> hitting a couple of (transitory!) pain points: (1) it is
>>>>>>> running on Windows, which lacks the optimizations available for
>>>>>>> macOS and Linux (directInvoker). When the linkToNative effort
>>>>>>> is completed, this discrepancy between platforms will go
>>>>>>> away. The second problem (2) is that the call is passing a big
>>>>>>> struct (e.g. bigger than 64 bits). Even on Linux and Mac, such
>>>>>>> a call would be unable to take advantage of the optimized
>>>>>>> invoker and would fall back to the so-called 'universal
>>>>>>> invoker', which is slow.
>>>>>>>
>>>>>>> Actually, my bad, the bench is passing pointers to structs, not
>>>>>>> structs by value - which I think means the 'foreign+linkToNative'
>>>>>>> experimental branch should be able to handle this. It would be
>>>>>>> nice to get some confirmation that this is indeed the case.
>>>>>>>
>>>>>>> Maurizio
>