[foreign] Poor performance?
Jorn Vernee
jbvernee at xs4all.nl
Sat May 18 16:42:53 UTC 2019
Users could bind `malloc` and `free` and use those instead. But,
allocation really isn't the problem here... The allocation using Scope
in the current link2native implementation is actually really fast, since
it allocates a slab of 64KB when creating a Scope (which is reused
between benchmark calls), and the actual allocations per benchmark call
are just pointer bumps [1] (until a new slab needs to be allocated).
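
To make the "pointer bump" point concrete, here is a minimal sketch of slab allocation. It is not the ScopeImpl code from [1]: the class `BumpSlab`, its fields, and the use of a direct `ByteBuffer` as the slab are all assumptions made for illustration. The fast path only advances an offset; a new 64KB slab is allocated only when the current one is exhausted.

```java
import java.nio.ByteBuffer;

/** Hypothetical slab allocator; illustrates the bump-pointer idea only. */
final class BumpSlab {
    private static final int SLAB_SIZE = 64 * 1024; // 64KB, as described above

    private ByteBuffer slab = ByteBuffer.allocateDirect(SLAB_SIZE);
    private int top = 0;            // offset of the next free byte in the current slab
    private int slabsAllocated = 1; // counts slow-path slab allocations

    /** Returns a view of {@code size} bytes whose slab offset is a multiple of {@code align}. */
    ByteBuffer allocate(int size, int align) {
        if (size <= 0 || size > SLAB_SIZE || Integer.bitCount(align) != 1) {
            throw new IllegalArgumentException("bad size or alignment");
        }
        int start = (top + align - 1) & -align; // round up to the alignment
        if (start + size > SLAB_SIZE) {         // slow path: current slab is full
            slab = ByteBuffer.allocateDirect(SLAB_SIZE);
            slabsAllocated++;
            start = 0;
        }
        top = start + size;                     // fast path: just bump the offset
        ByteBuffer view = slab.duplicate();
        view.limit(start + size);
        view.position(start);
        return view.slice();
    }

    int slabsAllocated() {
        return slabsAllocated;
    }

    public static void main(String[] args) {
        BumpSlab scope = new BumpSlab();
        scope.allocate(16, 8); // e.g. room for one 16-byte struct
        scope.allocate(16, 8);
        System.out.println("slabs used: " + scope.slabsAllocated());
    }
}
```

The alignment here is relative to the start of the slab, and the sketch never frees anything; a real scope would also track liveness.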
I re-ran the benchmark with malloc as well, but switching to malloc
degrades performance (on Windows). What does significantly improve
performance is removing the call to get the field:
Benchmark                                              Mode  Cnt     Score     Error  Units
JmhCallOnly.jni_javacpp                                avgt   50    64.804 ±   2.588  ns/op
JmhCallOnly.jni_javacpp_getonly                        avgt   50    45.543 ±   1.876  ns/op
JmhCallOnly.panama                                     avgt   50    38.244 ±   1.496  ns/op
JmhCallOnly.panama_getonly                             avgt   50   530.956 ±  29.321  ns/op
JmhGetSystemTimeSeconds.jni_javacpp                    avgt   50   309.768 ±  14.556  ns/op
JmhGetSystemTimeSeconds.jni_javacpp_noget              avgt   50   243.865 ±  10.380  ns/op
JmhGetSystemTimeSeconds.panama                         avgt   50  4769.212 ± 273.042  ns/op
JmhGetSystemTimeSeconds.panama_prelayout               avgt   50   608.144 ±  26.004  ns/op
JmhGetSystemTimeSeconds.panama_prelayout_malloc        avgt   50   711.237 ±  33.311  ns/op
JmhGetSystemTimeSeconds.panama_prelayout_malloc_noget  avgt   50   104.144 ±   4.195  ns/op
JmhGetSystemTimeSeconds.panama_prelayout_noget         avgt   50    64.545 ±   3.848  ns/op
Note in particular the `JmhCallOnly.panama_getonly` results compared to
`JmhGetSystemTimeSeconds.panama_prelayout_noget`. The relevant code for
`JmhCallOnly.panama_getonly` is just:
    private static final Scope scope = kernel32_h.scope().fork();
    private static final LayoutType<_SYSTEMTIME> systemtimeLayout =
            LayoutType.ofStruct(_SYSTEMTIME.class);
    private Pointer<_SYSTEMTIME> preallocatedSystemTime;

    public PanamaBenchmark() {
        preallocatedSystemTime = scope.allocate(systemtimeLayout);
    }

    public short getOnly() { // JmhCallOnly.panama_getonly
        return preallocatedSystemTime.get().wSecond$get();
    }
So the real bottleneck seems to be the field get. But, let's investigate
further, since we're doing both a get of the struct object, and then a
get of the field. Let's split this into a get of the struct, and a get
of the field on a pre-computed _SYSTEMTIME object:
    private static final Scope scope = kernel32_h.scope().fork();
    private static final LayoutType<_SYSTEMTIME> systemtimeLayout =
            LayoutType.ofStruct(_SYSTEMTIME.class);
    private Pointer<_SYSTEMTIME> preallocatedSystemTime;
    private _SYSTEMTIME struct;

    public PanamaBenchmark() {
        preallocatedSystemTime = scope.allocate(systemtimeLayout);
        struct = preallocatedSystemTime.get();
    }

    public short getOnly() { // JmhCallOnly.panama_getonly
        return preallocatedSystemTime.get().wSecond$get();
    }

    public short getOnlyFieldDirect() { // JmhCallOnly.panama_getfield_only
        return struct.wSecond$get();
    }

    public Object getStructOnly() { // JmhCallOnly.panama_getstruct_only
        return preallocatedSystemTime.get();
    }
Benchmark                          Mode  Cnt    Score    Error  Units
JmhCallOnly.jni_javacpp_getonly    avgt   50   48.642 ±  1.917  ns/op
JmhCallOnly.panama_getfield_only   avgt   50   93.360 ± 13.364  ns/op
JmhCallOnly.panama_getonly         avgt   50  533.978 ± 24.249  ns/op
JmhCallOnly.panama_getstruct_only  avgt   50  377.114 ± 19.884  ns/op
So part of the performance loss goes to getting the field, which creates
a bunch of intermediate Pointer objects (see RuntimeSupport::CasterImpl)
[2]. I think the new memaccess API could really help there, since we can
pre-compute a VarHandle for the field, and shouldn't need any of these
intermediate pointer objects.
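
As a rough illustration of the pre-computed VarHandle idea, the sketch below reads `wSecond` straight out of a 16-byte buffer. It does not use the Panama memaccess API; it swaps in `MethodHandles.byteBufferViewVarHandle` from Java 9+ on a plain `ByteBuffer`. The class `SystemTimeView` and its constants are hypothetical names. The offset relies on SYSTEMTIME being eight consecutive 16-bit WORDs, with `wSecond` as the seventh.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical accessor; reads a SYSTEMTIME field without intermediate objects. */
final class SystemTimeView {
    static final int SIZE = 16;           // 8 fields x 2 bytes
    static final int WSECOND_OFFSET = 12; // wSecond is the 7th WORD

    // Computed once and kept in a static final field, so the JIT can treat it as a constant.
    private static final VarHandle SHORT_LE =
            MethodHandles.byteBufferViewVarHandle(short[].class, ByteOrder.LITTLE_ENDIAN);

    /** Reads wSecond from a buffer holding one SYSTEMTIME at position 0. */
    static short wSecond(ByteBuffer struct) {
        return (short) SHORT_LE.get(struct, WSECOND_OFFSET);
    }

    public static void main(String[] args) {
        // Stand-in for memory that a native call would have filled in.
        ByteBuffer st = ByteBuffer.allocateDirect(SIZE).order(ByteOrder.LITTLE_ENDIAN);
        st.putShort(WSECOND_OFFSET, (short) 42);
        System.out.println(wSecond(st));
    }
}
```

Whether the memaccess API ends up this cheap for struct fields would have to be measured; the sketch only shows that a field read can be reduced to one handle call at a fixed offset.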
But, by far the largest part of the time seems to go to creating the
_SYSTEMTIME object when calling get() on the `Pointer<_SYSTEMTIME>`,
which corresponds to References.OfStruct::get [3]:
    static Struct<?> get(Pointer<?> pointer) {
        ((BoundedPointer<?>)pointer).checkAlive();
        Class<?> carrier = ((LayoutTypeImpl<?>)pointer.type()).carrier();
        Class<?> structClass = LibrariesHelper.getStructImplClass(carrier);
        try {
            return (Struct<?>)structClass.getConstructor(Pointer.class).newInstance(pointer);
        } catch (ReflectiveOperationException ex) {
            throw new IllegalStateException(ex);
        }
    }
I once had the idea to try and see what specialization of this code on a
per-Struct-class basis could do for performance. Maybe now is a good
time to try it out ;)
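
One possible first step toward that, sketched under heavy assumptions: look up the constructor once per struct class and cache it as a MethodHandle in a `ClassValue`, instead of calling `getConstructor` reflectively on every `get()`. `StructFactories`, `FakePointer` and `FakeSystemTime` are made-up stand-ins so the example compiles on its own; none of them are Panama classes, and this is not a measured fix.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

/** Hypothetical per-class constructor cache; not part of the Panama code base. */
final class StructFactories {
    /** Stand-in for Pointer<?>. */
    static final class FakePointer {
        final long address;
        FakePointer(long address) { this.address = address; }
    }

    /** Stand-in for a generated struct implementation class. */
    public static final class FakeSystemTime {
        final FakePointer ptr;
        public FakeSystemTime(FakePointer ptr) { this.ptr = ptr; }
    }

    // One constructor handle per struct class, resolved lazily on first use.
    private static final ClassValue<MethodHandle> CONSTRUCTORS = new ClassValue<MethodHandle>() {
        @Override
        protected MethodHandle computeValue(Class<?> structClass) {
            try {
                return MethodHandles.lookup()
                        .findConstructor(structClass,
                                MethodType.methodType(void.class, FakePointer.class))
                        // Erase the return type so one call site serves every struct class.
                        .asType(MethodType.methodType(Object.class, FakePointer.class));
            } catch (ReflectiveOperationException ex) {
                throw new IllegalStateException(ex);
            }
        }
    };

    /** Creates a struct view over {@code ptr} without repeating the reflective lookup. */
    static Object newStruct(Class<?> structClass, FakePointer ptr) {
        try {
            return CONSTRUCTORS.get(structClass).invokeExact(ptr);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        Object s = newStruct(FakeSystemTime.class, new FakePointer(0x1000L));
        System.out.println(s.getClass().getSimpleName());
    }
}
```

The cached handle is not a compile-time constant, so this removes the repeated lookup but not the allocation of the struct object itself; a fully specialized per-class factory would likely go further.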
> If Panama doesn't allow them to use raw pointers with
> layouts without going through hoops, that's a usability problem, and
> if those issues are not ironed out eventually, people will be forced
> to keep using JNI.
I think we are far from being set-in-stone, at least as far as the
high-level API goes. I agree that there should be DIY options for doing
things. I think the current solution being investigated is to have
multiple levels of public API (with memaccess, FFI, and then Panama),
instead of just one high-level API that everyone uses. So users would
have the option to use the low-level APIs to build their own solution
from scratch.
Jorn
[1]: http://hg.openjdk.java.net/panama/dev/file/cef8136ee7ee/src/java.base/share/classes/jdk/internal/foreign/ScopeImpl.java#l216
[2]: http://hg.openjdk.java.net/panama/dev/file/cef8136ee7ee/src/java.base/share/classes/jdk/internal/foreign/RuntimeSupport.java#l60
[3]: http://hg.openjdk.java.net/panama/dev/file/cef8136ee7ee/src/java.base/share/classes/jdk/internal/foreign/memory/References.java#l524
Samuel Audet wrote on 2019-05-18 12:04:
> If I understand correctly memory allocation is the culprit? Is there a
> way to call something like malloc() with Panama and still be able to
> map it to layouts and/or cast it to whatever we want? Calling malloc()
> with JavaCPP won't do anything w.r.t. deallocation, scopes,
> cleaners, etc, but it's available as an option, because sometimes
> users need it! If Panama doesn't allow them to use raw pointers with
> layouts without going through hoops, that's a usability problem, and
> if those issues are not ironed out eventually, people will be forced
> to keep using JNI.
>
> Samuel
>
> On 5/18/19 12:33 AM, Maurizio Cimadamore wrote:
>> Thanks Jorn,
>> I'd be more interested in knowing the raw native call numbers, does it
>> get any better with linkToNative? Here I'd be expecting performances
>> identical to JNI (since the binder should lower the Pointer to a long,
>> which LinkToNative would then pass by register).
>>
>> As for the fuller benchmark, note that you are also measuring the
>> performances of Scope::allocate, which is internally using some maps.
>> JNR/JNI does not do the same liveness checks that we do, so the full
>> benchmark is not totally fair. But the raw performance of the downcall
>> should be an apple-to-apple comparison, and it shouldn't be 8x slower
>> as it is now (at least not with linkToNative).
>>
>> Maurizio
>>
>>
>> On 17/05/2019 16:14, Jorn Vernee wrote:
>>
>>> FWIW, I ran the benchmarks with the linkToNative back-end (using
>>> -Djdk.internal.foreign.NativeInvoker.FASTPATH=direct), but it's still
>>> 2x slower than JNI:
>>>
>>> Benchmark                                 Mode  Cnt    Score    Error  Units
>>> JmhGetSystemTimeSeconds.jni_javacpp       avgt   50  298.046 ± 15.744  ns/op
>>> JmhGetSystemTimeSeconds.panama_prelayout  avgt   50  596.567 ± 20.570  ns/op
>>>
>>> Of course, like Aleksey says: "The numbers [above] are just data. To
>>> gain reusable insights, you need to follow up on why the numbers are
>>> the way they are.". Unfortunately, I'm having some trouble getting
>>> the project to work with the Windows profiler :/ Was currently
>>> looking into that.
>>>
>>> Cheers,
>>> Jorn
>>>
>>> Maurizio Cimadamore wrote on 2019-05-17 16:51:
>>>> On 17/05/2019 11:26, Maurizio Cimadamore wrote:
>>>>> thank you for bringing this up, I saw this benchmark a few days ago
>>>>> and I took a look at it. That benchmark is unfortunately hitting on
>>>>> a couple of (transitory!) pain points: (1) it is running on
>>>>> Windows, which lacks the optimizations available for MacOS and
>>>>> Linux (directInvoker). When the linkToNative effort will be
>>>>> completed, this discrepancy between platforms will go away. The
>>>>> second problem (2) is that the call is passing a big struct (e.g.
>>>>> bigger than 64 bits). Even on Linux and Mac, such a call would be
>>>>> unable to take advantage of the optimized invoker and would fall
>>>>> back to the so called 'universal invoker' which is slow.
>>>>
>>>> Actually, my bad, the bench is passing pointer to structs, not
>>>> structs
>>>> by value - which I think should mean the 'foreign+linkToNative'
>>>> experimental branch should be able to handle this. Would be nice to
>>>> get some confirmation that this is indeed the case.
>>>>
>>>> Maurizio