Understanding the performance of my FFI-based API

Fri Mar 10 18:05:39 UTC 2023

Hi Alan,
I did some more experiments on your repository.

First of all, I fixed the keySegment benchmark utility function to do this:

    private MemorySegment getKeySegment() {
        final int MAX_LEN = 9; // key100000
        final int keyIdx = next();
        final String keyStr = "key" + keyIdx;
        for (int i = 0; i < keyStr.length(); ++i) {
            keySegment.set(ValueLayout.JAVA_BYTE, i, (byte) keyStr.charAt(i));
        }
        for (int i = keyStr.length(); i < MAX_LEN; ++i) {
            keySegment.set(ValueLayout.JAVA_BYTE, i, (byte) 0x30);
        }
        return keySegment;
    }

I.e. bring it in sync with the buffer version.

Then I made, as suggested yesterday, all MethodHandles in FFIMethod 
static AND final.
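
As a rough illustration of the "static AND final" pattern, here is a minimal sketch. It uses an ordinary Java method handle (String.length) as a stand-in for a downcall handle, and the class and method names are hypothetical, not taken from FFIMethod. The point is only that a handle held in a static final field is one the JIT can treat as a constant.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Hypothetical stand-in for a class holding downcall handles.
final class StaticFinalHandleDemo {
    // static AND final: the field value never changes after class initialization.
    private static final MethodHandle LENGTH;

    static {
        try {
            LENGTH = MethodHandles.lookup()
                    .findVirtual(String.class, "length", MethodType.methodType(int.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static int length(String s) {
        try {
            // invokeExact requires the call site to match (String)int exactly,
            // so the result is cast to the primitive int, not Integer.
            return (int) LENGTH.invokeExact(s);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }
}
```

Calling StaticFinalHandleDemo.length("key100000") goes through the handle and returns the string length.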

Also, in FFILayout, I added a call to |.withInvokeExactBehavior()| to 
each var handle creation. This helps to detect inexact calls. I 
found a few inexact calls:

  * one in FFIDB.java - the result of the foreign call inside
    getPinnableSlice() is cast to |Long| instead of |long|
  * one in FFIPinnableSlice.java - the |isPinned()| method also casts to
    |Boolean|, not |boolean|
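
To show how this kind of mismatch surfaces, here is a small sketch. It uses an array-element var handle rather than the layout-derived handles in FFILayout, and the names are hypothetical. With invoke-exact behavior, a boxed cast at the call site is rejected instead of being silently adapted.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.lang.invoke.WrongMethodTypeException;

// Hypothetical demo; an array-element VarHandle stands in for a layout VarHandle.
final class ExactVarHandleDemo {
    private static final VarHandle LONGS =
            MethodHandles.arrayElementVarHandle(long[].class).withInvokeExactBehavior();

    // Exact call: the call-site type (long[],int)long matches the handle's type.
    static long readExact(long[] arr, int i) {
        return (long) LONGS.get(arr, i);
    }

    // Inexact call: casting to Long makes the call-site type (long[],int)Long,
    // which an invoke-exact handle refuses with WrongMethodTypeException.
    static boolean boxedReadIsRejected(long[] arr, int i) {
        try {
            Long boxed = (Long) LONGS.get(arr, i);
            return false;
        } catch (WrongMethodTypeException e) {
            return true;
        }
    }
}
```

Without withInvokeExactBehavior() the boxed read would succeed, just through a slower adapting path, which is why such calls are easy to miss.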

Then, the |fromPinnable| factory contains some dubious code which 
creates a buffer from a segment, just to do a copy. I replaced it with this:

    MemorySegment.copy(pinnableSlice.data(), ValueLayout.JAVA_BYTE, 0, value, 0, (int) size);
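
For reference, a self-contained sketch of that direct segment-to-array copy. It assumes the finalized java.lang.foreign API (Java 22 or later; on Java 20 the API is in preview and the arena factory is named differently). The class name and the 4-byte test data are made up for illustration.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical demo of copying native bytes straight into a byte[].
final class SegmentCopyDemo {
    // Copies the first `size` bytes of `data` into a fresh array,
    // with no intermediate ByteBuffer.
    static byte[] toArray(MemorySegment data, long size) {
        byte[] value = new byte[(int) size];
        MemorySegment.copy(data, ValueLayout.JAVA_BYTE, 0, value, 0, (int) size);
        return value;
    }

    static byte[] demo() {
        try (Arena arena = Arena.ofConfined()) {
            // Native memory standing in for the data of a pinnable slice.
            MemorySegment data = arena.allocate(4);
            for (int i = 0; i < 4; i++) {
                data.set(ValueLayout.JAVA_BYTE, i, (byte) (i + 1));
            }
            return toArray(data, 4);
        }
    }
}
```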

I’ve also updated the code to use the Java 20 API, to make sure I ran 
with a reasonably up-to-date JVM.

Before these changes, I could see a difference between FFI and JNI, 
especially in the preallocated benchmark variants. With the changes 
above, it looks like this here:

Benchmark                                      (columnFamilyTestType)  (keyCount)  (keySize)  (valueSize)   Mode  Cnt      Score     Error  Units
GetBenchmarks.ffiGet                                 no_column_family        1000        128         4096  thrpt   30    596.591 ±   6.448  ops/ms
GetBenchmarks.ffiGet                                 no_column_family        1000        128        65536  thrpt   30     60.277 ±   0.547  ops/ms
GetBenchmarks.ffiGetPinnableSlice                    no_column_family        1000        128         4096  thrpt   30    771.631 ±  13.835  ops/ms
GetBenchmarks.ffiGetPinnableSlice                    no_column_family        1000        128        65536  thrpt   30    111.709 ±   1.306  ops/ms
GetBenchmarks.ffiGetRandom                           no_column_family        1000        128         4096  thrpt   30    591.891 ±   7.353  ops/ms
GetBenchmarks.ffiGetRandom                           no_column_family        1000        128        65536  thrpt   30     68.197 ±   0.600  ops/ms
GetBenchmarks.ffiIdentity                            no_column_family        1000        128         4096  thrpt   30  58709.753 ± 712.660  ops/ms
GetBenchmarks.ffiIdentity                            no_column_family        1000        128        65536  thrpt   30  59265.794 ± 834.989  ops/ms
GetBenchmarks.ffiPreallocatedGet                     no_column_family        1000        128         4096  thrpt   30    736.686 ±   8.370  ops/ms
GetBenchmarks.ffiPreallocatedGet                     no_column_family        1000        128        65536  thrpt   30    101.211 ±   0.347  ops/ms
GetBenchmarks.ffiPreallocatedGetRandom               no_column_family        1000        128         4096  thrpt   30    598.381 ±   6.252  ops/ms
GetBenchmarks.ffiPreallocatedGetRandom               no_column_family        1000        128        65536  thrpt   30     68.037 ±   0.632  ops/ms
GetBenchmarks.get                                    no_column_family        1000        128         4096  thrpt   30    559.800 ±   3.369  ops/ms
GetBenchmarks.get                                    no_column_family        1000        128        65536  thrpt   30     60.567 ±   0.380  ops/ms
GetBenchmarks.preallocatedByteBufferGet              no_column_family        1000        128         4096  thrpt   30    758.639 ±  13.025  ops/ms
GetBenchmarks.preallocatedByteBufferGet              no_column_family        1000        128        65536  thrpt   30    103.190 ±   1.219  ops/ms
GetBenchmarks.preallocatedByteBufferGetRandom        no_column_family        1000        128         4096  thrpt   30    753.189 ±  12.498  ops/ms
GetBenchmarks.preallocatedByteBufferGetRandom        no_column_family        1000        128        65536  thrpt   30    103.644 ±   3.625  ops/ms
GetBenchmarks.preallocatedGet                        no_column_family        1000        128         4096  thrpt   30    707.330 ±  10.811  ops/ms
GetBenchmarks.preallocatedGet                        no_column_family        1000        128        65536  thrpt   30     96.743 ±   1.609  ops/ms

It seems most of the FFI and JNI numbers are now roughly the same.

I’m not sure how much the update to Java 20 matters - maybe try to fix all 
of the other stuff first, and see what happens (inexact var handle calls 
can be quite slow compared to Unsafe memory access).

Cheers
Maurizio

On 09/03/2023 18:13, Alan Paxton wrote:

> Hi Maurizio,
>
> Thanks for the quick and detailed response. I think our goals coincide 
> as it would make life easier for rocksjava to successfully implement 
> an FFI API.
>
> A couple of quick initial reruns show me your suggestions both 
> contribute a small amount of improvement, but probably do not account 
> for all the performance I am missing. I shall rerun the full benchmark 
> for confirmation.
>
> And since both suggestions give me a clearer idea what might be 
> performance issues, I will take another pass over my code and see if I 
> can spot any other potential problems in how it's implemented, or 
> anything else that isn't truly like-for-like with the JNI version.
>
> --Alan
>
> On Thu, Mar 9, 2023 at 12:08 PM Maurizio Cimadamore 
> <maurizio.cimadamore at oracle.com> wrote:
>
>     Also, zooming into the benchmark, something funny seems to be
>     going on with "getKeySegment". This seems different from the
>     "getKeyArr" counterpart, but also has a new issue: I believe that,
>     in JNI, you just passed the Java array "as is" - but in Panama you
>     can't (as the array is on-heap), so there is some double-copying
>     involved there (e.g. you create an on-heap array, which then is
>     moved off-heap).
>
>     If I'm not mistaken, this method is executed on every benchmark
>     iteration, so the comparison doesn't just measure the cost of the
>     native call, but also the cost it takes to marshal data from Java
>     heap to native.
>
>     For instance, the byte buffer versions ("keyBuf") seem to avoid
>     this problem by copying the data directly off-heap (by using a
>     direct buffer). I think the benchmark should use a native segment,
>     and avoid the copy so that at least we avoid that source of noise
>     in the numbers.
>
>     Cheers
>     Maurizio
>
>     On 09/03/2023 11:29, Maurizio Cimadamore wrote:
>>     Hi Alan,
>>     first of all, I'd like to thank you for taking the time to share
>>     your experience and to write it all up in a document. Stuff like
>>     that is very valuable to us, especially at this stage in the
>>     project.
>>
>>     One quick suggestion when eyeballing your code: your method
>>     handles are "static", but not "final". I suggest you try to
>>     sprinkle "final" in, and see whether that does the trick. If not,
>>     we'd have to look deeper.
>>
>>     Cheers
>>     Maurizio
>>
>>     On 09/03/2023 11:15, Alan Paxton wrote:
>>>     Hi,
>>>
>>>     I hope this is an appropriate list for this question.
>>>
>>>     I have been prototyping an FFI-based version of the RocksDB Java
>>>     API, which is currently implemented in JNI. RocksDB is a C++
>>>     based key-value store with a Java API layered on top. I have
>>>     done some benchmarking of the FFI implementation, versus the JNI
>>>     version and I find it performs consistently slightly slower than
>>>     the current API.
>>>
>>>     I would like to understand if this is to be expected, e.g. does
>>>     FFI do more safety checking under the covers when calling a
>>>     native method ?
>>>     Or is the performance likely to improve between the preview in
>>>     Java 19 and release in Java 21 ?
>>>     If there are resources or suggestions that would help me dig
>>>     into the performance I'd be very grateful to be pointed to them.
>>>
>>>     For the use case I'm measuring, data is transferred in native
>>>     memory originally allocated by RocksDB in C++ which I wrap as a
>>>     MemorySegment; I do allocate native memory for the request
>>>     structure.
>>>
>>>     These are links to the PR and some documentation of the work:
>>>
>>>     https://github.com/facebook/rocksdb/pull/11095
>>>
>>>     https://github.com/alanpaxton/rocksdb/blob/eb-1680-panama-ffi/java/JavaFFI.md
>>>
>>>
>>>     Many thanks,
>>>     Alan Paxton
>>>
