Real-Life Benchmark for FUSE's readdir()
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Thu Jul 15 21:05:38 UTC 2021
I managed to do an initial benchmark pass with JMH. The numbers don't
look great:
```
Benchmark Mode Cnt Score Error Units
BenchmarkTest.testListDirJnr avgt 5 9.858 ± 0.702 us/op
BenchmarkTest.testListDirPanama avgt 5 38.008 ± 17.573 us/op
```
On my machine the Panama FUSE binding seems roughly 4x slower than the JNR one
(assuming I got the implementation right, that is :-)).
A quick look at GC reveals that Panama's allocation rate is about 4x _lower_ than JNR's, while the normalized allocation per operation is essentially the same:
```
Benchmark                                            Mode  Cnt    Score   Error   Units
BenchmarkTest.testListDirJnr:·gc.alloc.rate          avgt    5   33.255 ± 1.584  MB/sec
BenchmarkTest.testListDirJnr:·gc.alloc.rate.norm     avgt    5  368.033 ± 0.047    B/op
BenchmarkTest.testListDirJnr:·gc.count               avgt    5    6.000          counts
BenchmarkTest.testListDirJnr:·gc.time                avgt    5    9.000              ms
BenchmarkTest.testListDirPanama:·gc.alloc.rate       avgt    5    8.709 ± 3.887  MB/sec
BenchmarkTest.testListDirPanama:·gc.alloc.rate.norm  avgt    5  368.046 ± 0.236    B/op
BenchmarkTest.testListDirPanama:·gc.count            avgt    5    2.000          counts
BenchmarkTest.testListDirPanama:·gc.time             avgt    5    3.000              ms
```
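As a rough sanity check on the GC numbers: since both variants allocate about 368 B/op, the lower allocation rate on the Panama side is what you would expect from the slower operation alone. The sketch below recomputes the implied rate from the scores above. The class and method names are made up for illustration, and it assumes that JMH's "MB" means 1024 * 1024 bytes; the implied figures come out somewhat higher than the measured ones, presumably because not all wall-clock time is spent inside the benchmark method.

```java
final class AllocRateCheck {
    // Assumption: the gc profiler's "MB" is 1024 * 1024 bytes.
    static final double BYTES_PER_MB = 1024.0 * 1024.0;

    /** Allocation rate (MB/sec) implied by bytes allocated per op and time per op. */
    static double impliedMbPerSec(double bytesPerOp, double microsPerOp) {
        double opsPerSec = 1_000_000.0 / microsPerOp;
        return bytesPerOp * opsPerSec / BYTES_PER_MB;
    }

    public static void main(String[] args) {
        // Scores copied from the two JMH tables above.
        double jnr = impliedMbPerSec(368.033, 9.858);
        double panama = impliedMbPerSec(368.046, 38.008);
        System.out.printf("implied rate: JNR %.1f MB/sec, Panama %.1f MB/sec%n", jnr, panama);
        System.out.printf("latency ratio %.2f, implied rate ratio %.2f%n",
                38.008 / 9.858, jnr / panama);
    }
}
```

The two ratios printed at the end should be nearly identical, which is just another way of saying that per-operation allocation does not differ between the two bindings.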
And, looking with perfasm, the distribution of the various methods looks
similar, and actually not a lot of time is spent in Java at all:
JNR:
```
....[Hottest Methods (after inlining)]..............................................................
 90.86%  kernel            [unknown]
  2.13%  c2, level 4       java.nio.file.Files::list, version 913
  1.44%  c2, level 4       de.skymatic.fusepanama.jmh_generated.BenchmarkTest_testListDirJnr_jmhTest::testListDirJnr_avgt_jmhStub, version 934
  0.81%  Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::open0, version 857
  0.70%  libc-2.31.so      __close_nocancel
  0.48%  libc-2.31.so      malloc
  0.39%  libc-2.31.so      _int_free
  0.32%  libc-2.31.so      _int_malloc
  0.29%  Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::close0, version 879
  0.22%  libc-2.31.so      __close
  0.21%  libc-2.31.so      __GI___libc_open
  0.19%  Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::fdopendir, version 859
  0.16%  libc-2.31.so      __libc_enable_asynccancel
  0.16%  libc-2.31.so      __GI___dup
  0.15%  libc-2.31.so      __fxstat64
  0.14%  Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::closedir, version 878
  0.14%  libnio.so         Java_sun_nio_fs_UnixNativeDispatcher_open0
  0.13%  libc-2.31.so      __libc_disable_asynccancel
  0.13%  libnio.so         Java_sun_nio_fs_UnixNativeDispatcher_fdopendir
  0.13%  libnio.so         Java_sun_nio_fs_UnixNativeDispatcher_close0
  0.84%  <...other 69 warm methods...>
```
Panama:
```
....[Hottest Methods (after inlining)]..............................................................
 89.07%  kernel            [unknown]
  2.35%  c2, level 4       java.nio.file.Files::list, version 890
  1.45%  c2, level 4       de.skymatic.fusepanama.jmh_generated.BenchmarkTest_testListDirPanama_jmhTest::testListDirPanama_avgt_jmhStub, version 917
  1.21%  Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::open0, version 849
  0.62%  libc-2.31.so      malloc
  0.60%  libc-2.31.so      __close_nocancel
  0.47%  libc-2.31.so      _int_free
  0.46%  libc-2.31.so      _int_malloc
  0.41%  Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::close0, version 869
  0.32%  libc-2.31.so      __close
  0.32%  libc-2.31.so      __GI___libc_open
  0.25%  Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::fdopendir, version 851
  0.23%  libnio.so         Java_sun_nio_fs_UnixNativeDispatcher_open0
  0.20%  Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::closedir, version 868
  0.17%  libc-2.31.so      __GI___dup
  0.15%  libc-2.31.so      __alloc_dir
  0.15%  libnio.so         Java_sun_nio_fs_UnixNativeDispatcher_fdopendir
  0.15%  libc-2.31.so      __libc_disable_asynccancel
  0.13%  libc-2.31.so      __libc_enable_asynccancel
  0.12%  libnio.so         Java_sun_nio_fs_UnixNativeDispatcher_close0
  1.17%  <...other 81 warm methods...>
```
So, no smoking gun so far. I'll keep looking.
Maurizio
On 15/07/2021 19:14, Maurizio Cimadamore wrote:
> Ok, I got it working - what fixed it for me was that the offsets in
> the "filler" calls have to be set to zero, which, looking around, seems
> to be a default value that leaves the kernel to take care of it.
>
> ```
> $ ls Volumes/foo
>
> total 0
> -r--r--r-- 1 root root 0 Jan 1 1970 aaa
> -r--r--r-- 1 root root 0 Jan 1 1970 bbb
> -r--r--r-- 1 root root 0 Jan 1 1970 ccc
> -r--r--r-- 1 root root 0 Jan 1 1970 ddd
> -r--r--r-- 1 root root 13 Jan 1 1970 hello.txt
> -r--r--r-- 1 root root 0 Jan 1 1970 xxx
> -r--r--r-- 1 root root 0 Jan 1 1970 yyy
> -r--r--r-- 1 root root 0 Jan 1 1970 zzz
> ```
>
> Something is probably not right (see "total 0" on top), but perhaps it
> is good enough for the benchmark. Applying the same fix to the JNR version
> also fixes that (and results in the same output). Hopefully I'm good to go now :-)
>
> Maurizio
>
> On 15/07/2021 18:15, Maurizio Cimadamore wrote:
>> I've re-extracted on Linux, and it seems more lively now. I had to
>> fix a couple of type mismatches (e.g. long vs. int and int vs. short)
>> in places, and also some of the fields in the "stat" structure are
>> different, so the code won't compile as is.
>>
>> After fixing these minor issues, I see a lot more output printed when
>> I mount, and when I do an ls, I see the following lines reported:
>>
>> [Thread-110] DEBUG de.skymatic.fusepanama.examples.HelloPanamaFileSystem - getattr() /
>> [Thread-111] DEBUG de.skymatic.fusepanama.examples.HelloPanamaFileSystem - opendir() /
>> [Thread-112] DEBUG de.skymatic.fusepanama.examples.HelloPanamaFileSystem - readdir() /
>> [Thread-113] DEBUG de.skymatic.fusepanama.examples.HelloPanamaFileSystem - releasedir() /
>>
>> But that's pretty much it, and in the terminal I still get the
>> input/output error. I can even debug, but I can't see much of what's
>> going wrong (and I'm not familiar with this API) - the Java code
>> executes fine, for what it's worth.
>>
>> Maurizio
>>
>> On 15/07/2021 17:51, Sebastian Stenzel wrote:
>>> I'll fix it for Linux and let you know!
>>>
>>>> On 15. Jul 2021, at 18:04, Maurizio Cimadamore
>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>
>>>>
>>>> On 15/07/2021 17:01, Sebastian Stenzel wrote:
>>>>> Yeah well they're pretty much mac-specific. On macOS, FUSE has
>>>>> this magic behaviour where you can tell it to mount to a
>>>>> non-existing mountpoint inside of `/Volumes/...` and it'll just
>>>>> create these (and destroy them on unmount). I believe on Linux you
>>>>> need to define an _existing_ mount point. But it is surely
>>>>> possible that the volume isn't working yet on Linux. I'll give it
>>>>> a try myself and fix it, if required.
>>>>
>>>> I created a folder Volumes under my home folder and I'm trying to use
>>>> that as a mount point, which seems to work ok with JNR.
>>>>>
>>>>> The benchmark then needs to be adjusted for the two mountpoints
>>>>> respectively. But before the benchmark can actually do anything, a
>>>>> plain `ls` on the terminal needs to work.
>>>>
>>>> Ok, it seems even JNR fails:
>>>>
>>>> ```
>>>> $ ls Volumes/bar/
>>>> ls: reading directory 'Volumes/bar/': Input/output error
>>>> ```
>>>>
>>>>>
>>>>>> On 15. Jul 2021, at 17:55, Maurizio Cimadamore
>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>
>>>>>> I tried to reproduce here on Linux, but with no luck - in the
>>>>>> sense that I'm not super sure on how to run the benchmark.
>>>>>>
>>>>>> I'm able somehow to run the two examples - and I noted that here
>>>>>> JNR works ok, while the Panama one doesn't seem to mount things
>>>>>> correctly - a new mount appears on my file explorer, but I'm
>>>>>> unable to do anything with it (even unmount - which can only be
>>>>>> done at sudo level).
>>>>>>
>>>>>> When working with the JNR support, the mount works fine, it shows
>>>>>> in the file explorer, I can click on that location and browse,
>>>>>> and then unmount from there. Everything works.
>>>>>>
>>>>>> That said, the benchmarks require the mount points to be up and
>>>>>> running - so I've tried first to execute the example (e.g. JNR)
>>>>>> and then run the benchmark in two separate terminal windows, all
>>>>>> via Maven - but the benchmark doesn't seem to do anything (I've
>>>>>> uncommented the benchmarks of course).
>>>>>>
>>>>>> How do you run them?
>>>>>>
>>>>>> Thanks
>>>>>> Maurizio
>>>>>>
>>>>>> On 15/07/2021 13:40, Maurizio Cimadamore wrote:
>>>>>>> I believe that it would be more useful to try to run the perfasm
>>>>>>> profiler with JMH.
>>>>>>>
>>>>>>> This can be done relatively easily, at least on Linux, if you
>>>>>>> pass the argument `-prof perfasm` to JMH (this would need
>>>>>>> hsdis-amd64.so on Linux to print readable assembly).
>>>>>>>
>>>>>>> Another thing worth checking is allocation rate: `-prof gc`.
>>>>>>>
>>>>>>> Maurizio
>>>>>>>
>>>>>>> On 15/07/2021 12:30, Sebastian Stenzel wrote:
>>>>>>>> Ok it really seems like VisualVM can't deal with these kinds of
>>>>>>>> tasks yet. Now it reports the String constructor being the
>>>>>>>> culprit [1], however I strongly doubt that, since this is
>>>>>>>> probably one of the most heavily optimized parts of the JDK.
>>>>>>>>
>>>>>>>> [1]: Screenshot on https://imgur.com/a/SHG8RSQ
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 15. Jul 2021, at 13:11, Sebastian Stenzel
>>>>>>>>> <sebastian.stenzel at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Yes, must be a sampling error. Do you know of a (publicly
>>>>>>>>> available) _profiler_ that is compatible with JDK 17 / 18
>>>>>>>>> already?
>>>>>>>>>
>>>>>>>>>> On 15. Jul 2021, at 13:08, Maurizio Cimadamore
>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Aha - it seems like you are seeing what I was seeing:
>>>>>>>>>> unrolling now seems to happen more reliably, which positively
>>>>>>>>>> affects code like strlen.
>>>>>>>>>>
>>>>>>>>>> As for FUSE, I think the reason for the difference has
>>>>>>>>>> probably nothing to do with string conversion - the sampler
>>>>>>>>>> profiler just happens to hit that code a lot. I checked JNR
>>>>>>>>>> code for string conversion and I couldn't really find
>>>>>>>>>> anything uber optimized in that regard that could explain the
>>>>>>>>>> gap.
>>>>>>>>>>
>>>>>>>>>> Probably something is not getting optimized as it should -
>>>>>>>>>> likely a downcall/upcall intrinsification is failing - maybe
>>>>>>>>>> due to a subtle issue with your code, or, possibly because
>>>>>>>>>> you are hitting a non-implemented case (e.g. we do not
>>>>>>>>>> intrinsify calls which pass arguments on the stack, yet), or
>>>>>>>>>> because of some other bug.
>>>>>>>>>>
>>>>>>>>>> Maurizio
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 15/07/2021 12:03, Sebastian Stenzel wrote:
>>>>>>>>>>> Wow, I stand corrected. I just re-ran the benchmark and
>>>>>>>>>>> `benchmarkStrlenBase` just got a lot faster!! Your change in
>>>>>>>>>>> https://github.com/openjdk/jdk17/commit/2db9005c07585b580b3ec0889b8b5e3ed0d0ca6a
>>>>>>>>>>> DID have an effect after all.
>>>>>>>>>>>
>>>>>>>>>>> Just doesn't impress FUSE very much...
>>>>>>>>>>>
>>>>>>>>>>>> On 15. Jul 2021, at 13:00, Sebastian Stenzel
>>>>>>>>>>>> <sebastian.stenzel at gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Yup, I tried the int-approach as well, but with worse
>>>>>>>>>>>> results... Here is the full test:
>>>>>>>>>>>> https://gist.github.com/overheadhunter/86e7baae7dfe47c49ff364590a4f3ea6
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On 15. Jul 2021, at 12:51, Maurizio Cimadamore
>>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ok. Thanks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I tried similar experiments where instead of reading 4
>>>>>>>>>>>>> bytes separately I'd read a single int value, and then use
>>>>>>>>>>>>> shifts and bitmasking to check for terminators. On paper
>>>>>>>>>>>>> good, but benchmark results were always worse than the
>>>>>>>>>>>>> version we have now (at least on Linux).
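
For readers unfamiliar with the "read an int, then mask" idea mentioned above, here is a minimal sketch of one well-known way to do it. It is not the code from the experiments in this thread: it uses a plain `ByteBuffer` instead of a `MemorySegment` so it runs on any JDK, and the class and method names are made up for illustration. The bit trick reports whether any of the four packed bytes is zero; the exact position is then found by checking the four bytes in address order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class WordStrlen {
    /** True if at least one of the four bytes packed in v is zero. */
    static boolean hasZeroByte(int v) {
        return ((v - 0x01010101) & ~v & 0x80808080) != 0;
    }

    /** Length of the NUL-terminated string starting at 'start', scanning 4 bytes at a time. */
    static int strlen(ByteBuffer buf, int start) {
        // Little-endian view: byte k of the word is the byte at address i + k.
        ByteBuffer le = buf.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        int limit = le.limit();
        int i = start;
        while (i + 4 <= limit) {
            int word = le.getInt(i);
            if (hasZeroByte(word)) {
                for (int k = 0; k < 4; k++) {
                    if (((word >>> (8 * k)) & 0xFF) == 0) {
                        return i + k - start;
                    }
                }
            }
            i += 4;
        }
        // Fewer than four bytes left: fall back to a byte-wise scan.
        for (; i < limit; i++) {
            if (le.get(i) == 0) {
                return i - start;
            }
        }
        throw new IllegalArgumentException("String too large");
    }
}
```

Whether this beats a byte-wise loop depends on what C2 makes of each variant, which is exactly what the measurements in this thread are about; the sketch only shows the shape of the technique.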
>>>>>>>>>>>>>
>>>>>>>>>>>>> That said, if you could please share the full string
>>>>>>>>>>>>> benchmark you have, that'd be helpful, so we can take a
>>>>>>>>>>>>> look at that, and see what's going wrong (ideally, C2
>>>>>>>>>>>>> should be the one doing unrolling).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 15/07/2021 11:28, Sebastian Stenzel wrote:
>>>>>>>>>>>>>> I just did a quick synthetic test on a "manually
>>>>>>>>>>>>>> unrolled" strlen() without any FUSE context.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I experimented with an implementation that looked like
>>>>>>>>>>>>>> the following and benchmarked it using a 259 byte memory
>>>>>>>>>>>>>> segment containing a 239 byte string (null byte at index
>>>>>>>>>>>>>> 240):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>> private static int strlenUnroll4(MemorySegment segment, long start) {
>>>>>>>>>>>>>>     // Bytes available from 'start'; bounding by byteSize() alone overruns when start > 0.
>>>>>>>>>>>>>>     long limit = segment.byteSize() - start;
>>>>>>>>>>>>>>     int offset;
>>>>>>>>>>>>>>     for (offset = 0; offset < limit - 3; offset += 4) {
>>>>>>>>>>>>>>         byte b0 = MemoryAccess.getByteAtOffset(segment, start + offset + 0);
>>>>>>>>>>>>>>         byte b1 = MemoryAccess.getByteAtOffset(segment, start + offset + 1);
>>>>>>>>>>>>>>         byte b2 = MemoryAccess.getByteAtOffset(segment, start + offset + 2);
>>>>>>>>>>>>>>         byte b3 = MemoryAccess.getByteAtOffset(segment, start + offset + 3);
>>>>>>>>>>>>>>         if (b0 == 0 || b1 == 0 || b2 == 0 || b3 == 0) { // is this even faster than directly having 4 different branches?
>>>>>>>>>>>>>>             if (b0 == 0) {
>>>>>>>>>>>>>>                 return offset;
>>>>>>>>>>>>>>             } else if (b1 == 0) {
>>>>>>>>>>>>>>                 return offset + 1;
>>>>>>>>>>>>>>             } else if (b2 == 0) {
>>>>>>>>>>>>>>                 return offset + 2;
>>>>>>>>>>>>>>             } else if (b3 == 0) {
>>>>>>>>>>>>>>                 return offset + 3;
>>>>>>>>>>>>>>             }
>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>     while (offset < limit) { // TODO: maybe no loop required for the remaining <4 bytes?
>>>>>>>>>>>>>>         byte b = MemoryAccess.getByteAtOffset(segment, start + offset);
>>>>>>>>>>>>>>         if (b == 0) {
>>>>>>>>>>>>>>             return offset;
>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>         offset++; // advance; without this the tail loop never terminates
>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>     throw new IllegalArgumentException("String too large");
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm not even sure how reliable my results are, since I
>>>>>>>>>>>>>> have no clue about how branch prediction works here...
>>>>>>>>>>>>>> Neither have I tested the correctness of this
>>>>>>>>>>>>>> implementation.
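
On the correctness question: one cheap way to gain some confidence in an unrolled scan is to compare it against a plain byte-by-byte scan for every possible terminator position. The sketch below does that on a `ByteBuffer` rather than a `MemorySegment`, so it is a stand-in for the quoted method, not the same code; all names are hypothetical. It also advances the offset in the tail loop and bounds both loops by the bytes remaining after `start`.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class UnrolledStrlenCheck {
    /** Reference implementation: one byte per iteration. */
    static int strlenNaive(ByteBuffer buf, int start) {
        for (int i = start; i < buf.limit(); i++) {
            if (buf.get(i) == 0) return i - start;
        }
        throw new IllegalArgumentException("String too large");
    }

    /** Four bytes per iteration, then a byte-wise tail for the last 0..3 bytes. */
    static int strlenUnroll4(ByteBuffer buf, int start) {
        int limit = buf.limit() - start; // bytes available from 'start'
        int offset;
        for (offset = 0; offset < limit - 3; offset += 4) {
            byte b0 = buf.get(start + offset);
            byte b1 = buf.get(start + offset + 1);
            byte b2 = buf.get(start + offset + 2);
            byte b3 = buf.get(start + offset + 3);
            if (b0 == 0) return offset;
            if (b1 == 0) return offset + 1;
            if (b2 == 0) return offset + 2;
            if (b3 == 0) return offset + 3;
        }
        for (; offset < limit; offset++) {
            if (buf.get(start + offset) == 0) return offset;
        }
        throw new IllegalArgumentException("String too large");
    }

    /** Compares both scans for every terminator position and a few start offsets. */
    static boolean agreesEverywhere(int size) {
        for (int start = 0; start < 4; start++) {
            for (int nul = start; nul < size; nul++) {
                byte[] data = new byte[size];
                Arrays.fill(data, (byte) 'a');
                data[nul] = 0;
                ByteBuffer buf = ByteBuffer.wrap(data);
                if (strlenNaive(buf, start) != strlenUnroll4(buf, start)) return false;
            }
        }
        return true;
    }
}
```

Calling `agreesEverywhere(259)` exercises a buffer of the same size as the one used in the synthetic test, including terminators that land in the byte-wise tail.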
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 15. Jul 2021, at 12:18, Maurizio Cimadamore
>>>>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for reporting back.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We probably need to investigate this a bit more deeply
>>>>>>>>>>>>>>> and try and reproduce on our side.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> One last question: you said that with manual unrolling
>>>>>>>>>>>>>>> you managed to get 2x faster: did you mean that string
>>>>>>>>>>>>>>> conversion got 2x faster or that you actually saw your
>>>>>>>>>>>>>>> FUSE benchmark going 2x faster because of the manual
>>>>>>>>>>>>>>> unrolling with strings?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 15/07/2021 11:03, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>> That, surprisingly, didn't change anything either. But
>>>>>>>>>>>>>>>> don't worry too much, the performance isn't bad (in
>>>>>>>>>>>>>>>> absolute figures) and it is by far not the only reason
>>>>>>>>>>>>>>>> why I consider panama the best solution to create java
>>>>>>>>>>>>>>>> bindings for c libs.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 12. Jul 2021, at 15:33, Maurizio Cimadamore
>>>>>>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Actually, after some bisecting, I found out that
>>>>>>>>>>>>>>>>> converting a memory segment into a string became
>>>>>>>>>>>>>>>>> 2x faster with this fix:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://github.com/openjdk/jdk17/commit/2db9005c07585b580b3ec0889b8b5e3ed0d0ca6a
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Which was integrated after the one I originally
>>>>>>>>>>>>>>>>> pointed at. They both seem to touch loop optimization
>>>>>>>>>>>>>>>>> in case of overflows, which the strlen code is
>>>>>>>>>>>>>>>>> triggering (since the loop limit checks for loop
>>>>>>>>>>>>>>>>> variable being positive).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This is a simple patch which adds a string conversion
>>>>>>>>>>>>>>>>> test:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>> diff --git a/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java b/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>>>>>>>> index ec4da5ffc88..5b3fb1a2b2a 100644
>>>>>>>>>>>>>>>>> --- a/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>>>>>>>> +++ b/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>>>>>>>> @@ -93,10 +93,13 @@ public class StrLenTest {
>>>>>>>>>>>>>>>>> FunctionDescriptor.ofVoid(C_POINTER).withAttribute(FunctionDescriptor.TRIVIAL_ATTRIBUTE_NAME, true));
>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> + MemorySegment segment;
>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>> @Setup
>>>>>>>>>>>>>>>>> public void setup() {
>>>>>>>>>>>>>>>>> str = makeString(size);
>>>>>>>>>>>>>>>>> segmentAllocator = SegmentAllocator.ofSegment(MemorySegment.allocateNative(size + 1, ResourceScope.newImplicitScope()));
>>>>>>>>>>>>>>>>> + segment = toCString(str, segmentAllocator);
>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> @TearDown
>>>>>>>>>>>>>>>>> @@ -104,6 +107,11 @@ public class StrLenTest {
>>>>>>>>>>>>>>>>> scope.close();
>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> + @Benchmark
>>>>>>>>>>>>>>>>> + public String panama_str_conv() throws Throwable {
>>>>>>>>>>>>>>>>> + return CLinker.toJavaString(segment);
>>>>>>>>>>>>>>>>> + }
>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>> @Benchmark
>>>>>>>>>>>>>>>>> public int jni_strlen() throws Throwable {
>>>>>>>>>>>>>>>>> return strlen(str);
>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Before the above fix, the numbers are as follows:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>> Benchmark                   (size)  Mode  Cnt    Score   Error  Units
>>>>>>>>>>>>>>>>> StrLenTest.panama_str_conv     100  avgt   30  106.613 ± 7.060  ns/op
>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> While after the fix I get this:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>> Benchmark                   (size)  Mode  Cnt   Score   Error  Units
>>>>>>>>>>>>>>>>> StrLenTest.panama_str_conv     100  avgt   30  48.120 ± 0.557  ns/op
>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So, as you can see, a pretty sizeable jump.
>>>>>>>>>>>>>>>>> Eyeballing, the shape of generated code doesn't look
>>>>>>>>>>>>>>>>> too different, which makes me think of another case
>>>>>>>>>>>>>>>>> where loop is unrolled, but main loop never executed
>>>>>>>>>>>>>>>>> (similar to JDK-8269230), but we'll need to look deeper.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 12/07/2021 14:12, Maurizio Cimadamore wrote:
>>>>>>>>>>>>>>>>>> On 12/07/2021 13:18, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>>>> Hey Maurizio,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> All tests have been done on commit 42e03fd7c6a (for
>>>>>>>>>>>>>>>>>>> details how I built the JDK, see my initial email).
>>>>>>>>>>>>>>>>>>> Maybe I'm missing some compiler flags to enable all
>>>>>>>>>>>>>>>>>>> optimizations?
>>>>>>>>>>>>>>>>>> I see - you do have the latest panama changes, but
>>>>>>>>>>>>>>>>>> there has been a sync with upstream after that
>>>>>>>>>>>>>>>>>> changeset, I believe - can you please try to resync
>>>>>>>>>>>>>>>>>> with the latest foreign-jextract commit - which
>>>>>>>>>>>>>>>>>> should be:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> https://github.com/openjdk/panama-foreign/commit/b2a284f6678c6a0d78fbdc8655695119ccb0dadb
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Are you sure about loop vectorization being applied
>>>>>>>>>>>>>>>>>>> to strlen? I'm not an expert on this field, but I
>>>>>>>>>>>>>>>>>>> had the impression this wasn't possible when the
>>>>>>>>>>>>>>>>>>> loop terminates "from within".
>>>>>>>>>>>>>>>>>> Vlad is the expert here - when chatting offline he
>>>>>>>>>>>>>>>>>> did mention that loop should have single exit - which
>>>>>>>>>>>>>>>>>> I guess also takes into account the "normal" exit -
>>>>>>>>>>>>>>>>>> so the strlen routine would seem to have two exits...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 12. Jul 2021, at 13:50, Maurizio Cimadamore
>>>>>>>>>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Sebastian,
>>>>>>>>>>>>>>>>>>>> thanks for sharing your findings - I've done some
>>>>>>>>>>>>>>>>>>>> attempts here with a targeted microbenchmark which
>>>>>>>>>>>>>>>>>>>> measures the performance of string conversion and
>>>>>>>>>>>>>>>>>>>> I'm seeing unrolling and vectorization being
>>>>>>>>>>>>>>>>>>>> applied on the strlen computation.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> May I ask if, by any chance, your HEAD has not been
>>>>>>>>>>>>>>>>>>>> updated in the last few weeks? There has been a C2
>>>>>>>>>>>>>>>>>>>> optimization fix which has been added recently,
>>>>>>>>>>>>>>>>>>>> which I think might be related to this:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8269230
>>>>>>>>>>>>>>>>>>>> <https://bugs.openjdk.java.net/browse/JDK-8269230>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Do you have this fix in the JDK you are using?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 10/07/2021 15:58, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> good idea, but it makes no difference beyond
>>>>>>>>>>>>>>>>>>>>> statistical error.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I started sampling the application with VisualVM
>>>>>>>>>>>>>>>>>>>>> (which is quite hard, since native threads are
>>>>>>>>>>>>>>>>>>>>> extremely short-lived). What I noticed is that,
>>>>>>>>>>>>>>>>>>>>> regardless of where the sampler interrupts a
>>>>>>>>>>>>>>>>>>>>> thread, in nearly all cases 100% of CPU time is
>>>>>>>>>>>>>>>>>>>>> attributed to
>>>>>>>>>>>>>>>>>>>>> jdk.internal.foreign.abi.SharedUtils.toJavaStringInternal()
>>>>>>>>>>>>>>>>>>>>> → jdk.internal.foreign.abi.SharedUtils.strlen().
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I know that strlen can hardly be optimized due to
>>>>>>>>>>>>>>>>>>>>> the nature of null termination, but maybe we can
>>>>>>>>>>>>>>>>>>>>> make use of the fact that we're dealing with
>>>>>>>>>>>>>>>>>>>>> MemorySegments here: Since they protect us from
>>>>>>>>>>>>>>>>>>>>> overflows, maybe there is no need to look at only
>>>>>>>>>>>>>>>>>>>>> a single byte at a time. Maybe the strlen()-loop
>>>>>>>>>>>>>>>>>>>>> can be unrolled or even be vectorized.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I just did a quick test and observed a x2 speedup
>>>>>>>>>>>>>>>>>>>>> when doing a x4 loop unroll.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On 9. Jul 2021, at 20:30, Jorn Vernee
>>>>>>>>>>>>>>>>>>>>>> <jorn.vernee at oracle.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Sebastian,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks for testing this. Looking at your code,
>>>>>>>>>>>>>>>>>>>>>> one possible explanation for the discrepancy I
>>>>>>>>>>>>>>>>>>>>>> can think of is that the DirFiller ends up using
>>>>>>>>>>>>>>>>>>>>>> virtual downcalls to do its work, which are
>>>>>>>>>>>>>>>>>>>>>> currently not intrinsified. This is mostly a case
>>>>>>>>>>>>>>>>>>>>>> of 'not implemented yet', i.e. it is a known issue.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>> static fuse_fill_dir_t ofAddress(MemoryAddress addr) {
>>>>>>>>>>>>>>>>>>>>>>     return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>>>>>>>>         try {
>>>>>>>>>>>>>>>>>>>>>>             // <--------- 'addr' here is not a constant, so the call is virtual
>>>>>>>>>>>>>>>>>>>>>>             return (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, x0, x1, x2, x3);
>>>>>>>>>>>>>>>>>>>>>>         } catch (Throwable ex$) {
>>>>>>>>>>>>>>>>>>>>>>             throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>>>>>>>>     };
>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> For testing purposes, a possible workaround could
>>>>>>>>>>>>>>>>>>>>>> be to have a cache that maps the callback address
>>>>>>>>>>>>>>>>>>>>>> to a method handle that has the address bound to
>>>>>>>>>>>>>>>>>>>>>> the first parameter. Assuming readdir always gets
>>>>>>>>>>>>>>>>>>>>>> the same filler callback address, the same
>>>>>>>>>>>>>>>>>>>>>> MethodHandle will be reused and eventually
>>>>>>>>>>>>>>>>>>>>>> customized which means the callback address will
>>>>>>>>>>>>>>>>>>>>>> become constant, and the downcall should then be
>>>>>>>>>>>>>>>>>>>>>> intrinsified.
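
The caching idea in the paragraph above can be tried out with plain `java.lang.invoke`, without any native code. In the sketch below, `fill` is a hypothetical Java stand-in for the native filler and a `long` plays the role of the callback address; neither is part of fuse-panama or the Panama API. What it shows is only the mechanism: `MethodHandles.insertArguments` binds the address as the leading argument, and a `ConcurrentHashMap` makes sure the same bound handle is reused for the same address. Whether the JIT then treats the bound address as a constant is up to the runtime and is not demonstrated here.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class BoundHandleCache {
    /** Hypothetical stand-in for the native filler: (address, offset) -> int. */
    static int fill(long address, long offset) {
        return (int) (address + offset);
    }

    static final MethodHandle FILL;
    static {
        try {
            FILL = MethodHandles.lookup().findStatic(BoundHandleCache.class, "fill",
                    MethodType.methodType(int.class, long.class, long.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private static final Map<Long, MethodHandle> CACHE = new ConcurrentHashMap<>();

    /** Returns a handle of type (long)int with 'address' bound as the first argument. */
    static MethodHandle forAddress(long address) {
        return CACHE.computeIfAbsent(address,
                a -> MethodHandles.insertArguments(FILL, 0, a));
    }

    /** Convenience wrapper that hides the checked Throwable of invokeExact. */
    static int call(long address, long offset) {
        try {
            return (int) forAddress(address).invokeExact(offset);
        } catch (Throwable t) {
            throw new AssertionError("should not reach here", t);
        }
    }
}
```

Repeated calls such as `BoundHandleCache.call(100L, 5L)` go through the same cached handle, which is the property the workaround relies on when readdir keeps receiving the same filler address.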
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I don't currently have access to a Mac machine to
>>>>>>>>>>>>>>>>>>>>>> test this, but if you want to try it out, the
>>>>>>>>>>>>>>>>>>>>>> patch should be this:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>> diff --git a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>>>>>>>> index bfd4655..4c68d4c 100644
>>>>>>>>>>>>>>>>>>>>>> --- a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>>>>>>>> +++ b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>>>>>>>> @@ -3,8 +3,12 @@
>>>>>>>>>>>>>>>>>>>>>> package de.skymatic.fusepanama.lowlevel;
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> import java.lang.invoke.MethodHandle;
>>>>>>>>>>>>>>>>>>>>>> +import java.lang.invoke.MethodHandles;
>>>>>>>>>>>>>>>>>>>>>> import java.lang.invoke.VarHandle;
>>>>>>>>>>>>>>>>>>>>>> import java.nio.ByteOrder;
>>>>>>>>>>>>>>>>>>>>>> +import java.util.Map;
>>>>>>>>>>>>>>>>>>>>>> +import java.util.concurrent.ConcurrentHashMap;
>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>> import jdk.incubator.foreign.*;
>>>>>>>>>>>>>>>>>>>>>> import static jdk.incubator.foreign.CLinker.*;
>>>>>>>>>>>>>>>>>>>>>> public interface fuse_fill_dir_t {
>>>>>>>>>>>>>>>>>>>>>> @@ -17,13 +21,19 @@ public interface fuse_fill_dir_t {
>>>>>>>>>>>>>>>>>>>>>> return RuntimeHelper.upcallStub(fuse_fill_dir_t.class, fi, constants$0.fuse_fill_dir_t$FUNC, "(Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;J)I", scope);
>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>> static fuse_fill_dir_t ofAddress(MemoryAddress addr) {
>>>>>>>>>>>>>>>>>>>>>> - return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>>>>>>>> - try {
>>>>>>>>>>>>>>>>>>>>>> - return (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, x0, x1, x2, x3);
>>>>>>>>>>>>>>>>>>>>>> - } catch (Throwable ex$) {
>>>>>>>>>>>>>>>>>>>>>> - throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>>>>>>>> - }
>>>>>>>>>>>>>>>>>>>>>> - };
>>>>>>>>>>>>>>>>>>>>>> + class CacheHolder {
>>>>>>>>>>>>>>>>>>>>>> + static final Map<MemoryAddress, fuse_fill_dir_t> CACHE = new ConcurrentHashMap<>();
>>>>>>>>>>>>>>>>>>>>>> + }
>>>>>>>>>>>>>>>>>>>>>> + return CacheHolder.CACHE.computeIfAbsent(addr, addrK -> {
>>>>>>>>>>>>>>>>>>>>>> + final MethodHandle target = MethodHandles.insertArguments(constants$0.fuse_fill_dir_t$MH, 0, addrK);
>>>>>>>>>>>>>>>>>>>>>> + return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>>>>>>>> + try {
>>>>>>>>>>>>>>>>>>>>>> + return (int)target.invokeExact(x0, x1, x2, x3);
>>>>>>>>>>>>>>>>>>>>>> + } catch (Throwable ex$) {
>>>>>>>>>>>>>>>>>>>>>> + throw new
>>>>>>>>>>>>>>>>>>>>>> AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>>>>>>>> + }
>>>>>>>>>>>>>>>>>>>>>> + };
>>>>>>>>>>>>>>>>>>>>>> + });
>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>> (I hope these code blocks don't get mangled too
>>>>>>>>>>>>>>>>>>>>>> much by line wrapping)
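
The caching idea in the patch can be tried without the incubating foreign API. The following is only a minimal sketch under stated assumptions: `BoundHandleCache`, `Filler`, and `fill` are hypothetical stand-ins (a plain static Java method plays the role of the native downcall handle, and a `long` plays the role of the `MemoryAddress`). It shows `MethodHandles.insertArguments` binding the leading argument once per distinct address and `ConcurrentHashMap.computeIfAbsent` reusing the resulting wrapper on later calls.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class BoundHandleCache {
    /** Hypothetical stand-in for the generated callback interface. */
    public interface Filler {
        int apply(String name, long offset);
    }

    // Hypothetical stand-in for the native function: the first parameter
    // plays the role of the callback address passed to the downcall handle.
    public static int fill(long address, String name, long offset) {
        return (int) (address + name.length() + offset);
    }

    // Unbound handle of type (long, String, long) -> int.
    static final MethodHandle FILL_MH;
    static {
        try {
            FILL_MH = MethodHandles.lookup().findStatic(
                    BoundHandleCache.class, "fill",
                    MethodType.methodType(int.class, long.class, String.class, long.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // One wrapper per distinct address, created on first use.
    static final Map<Long, Filler> CACHE = new ConcurrentHashMap<>();

    public static Filler ofAddress(long address) {
        return CACHE.computeIfAbsent(address, key -> {
            // Bind argument 0; the resulting handle has type (String, long) -> int.
            final MethodHandle target = MethodHandles.insertArguments(FILL_MH, 0, key);
            return (name, offset) -> {
                try {
                    return (int) target.invokeExact(name, offset);
                } catch (Throwable t) {
                    throw new AssertionError("should not reach here", t);
                }
            };
        });
    }

    public static void main(String[] args) {
        Filler first = ofAddress(100L);
        Filler second = ofAddress(100L);
        System.out.println(first.apply("abc", 7L)); // 100 + 3 + 7
        System.out.println(first == second);        // same cached wrapper
    }
}
```

Whether the JIT then treats the bound address as a constant depends on the JVM; the sketch only demonstrates the binding and reuse, not the resulting speedup.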
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> HTH,
>>>>>>>>>>>>>>>>>>>>>> Jorn
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On 09/07/2021 10:58, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I wanted to share the results of a benchmark
>>>>>>>>>>>>>>>>>>>>>>> test that includes several down- and upcalls.
>>>>>>>>>>>>>>>>>>>>>>> First, let me explain what I'm testing here:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I'm working on a panama-based FUSE binding,
>>>>>>>>>>>>>>>>>>>>>>> mostly for experimental purposes right now, and
>>>>>>>>>>>>>>>>>>>>>>> I'm trying to beat fuse-jnr [1].
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> While there are other interesting metrics,
>>>>>>>>>>>>>>>>>>>>>>> such as read/write performance (both
>>>>>>>>>>>>>>>>>>>>>>> sequential and random access), I focused on
>>>>>>>>>>>>>>>>>>>>>>> directory listings for now. Directory listings
>>>>>>>>>>>>>>>>>>>>>>> are the most complex operation with regard to the
>>>>>>>>>>>>>>>>>>>>>>> number of down- and upcalls:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> 1. FUSE upcalls readdir and provides a callback
>>>>>>>>>>>>>>>>>>>>>>> function
>>>>>>>>>>>>>>>>>>>>>>> 2. Java downcalls the callback for each item in
>>>>>>>>>>>>>>>>>>>>>>> the directory
>>>>>>>>>>>>>>>>>>>>>>> 3. FUSE upcalls getattr for each item (no longer
>>>>>>>>>>>>>>>>>>>>>>> required with "readdirplus" in FUSE 3.x)
>>>>>>>>>>>>>>>>>>>>>>> (4. I'm testing on macOS, which introduces
>>>>>>>>>>>>>>>>>>>>>>> additional noise (such as readxattr and trying
>>>>>>>>>>>>>>>>>>>>>>> to access files that I didn't report in readdir))
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> So, what I'm testing is essentially this:
>>>>>>>>>>>>>>>>>>>>>>> `Files.list(Path.of("/Volumes/foo")).close();`
>>>>>>>>>>>>>>>>>>>>>>> with the volume reporting eight files [2]. When
>>>>>>>>>>>>>>>>>>>>>>> mounting with debug logs enabled, I can see that
>>>>>>>>>>>>>>>>>>>>>>> the exact same operations in the same order are
>>>>>>>>>>>>>>>>>>>>>>> invoked on both fuse-jnr and fuse-panama. A
>>>>>>>>>>>>>>>>>>>>>>> single dir listing results in 2 readdir upcalls,
>>>>>>>>>>>>>>>>>>>>>>> 10 callback downcalls, and 16 getattr upcalls.
>>>>>>>>>>>>>>>>>>>>>>> There are also 8 getxattr calls and 16 lookup
>>>>>>>>>>>>>>>>>>>>>>> calls; however, they don't reach Java, as the
>>>>>>>>>>>>>>>>>>>>>>> FUSE kernel knows they are not implemented.
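
For readers without a FUSE mount at hand, the measured operation can be approximated as follows. This is a rough sketch under stated assumptions, not the JMH benchmark from the thread: `ListDirSketch` and its methods are hypothetical, a temporary directory with eight files stands in for `/Volumes/foo`, and the simple timing loop lacks JMH's warmup and statistics, so its numbers are not comparable to the results reported here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class ListDirSketch {
    // Creates a temporary directory containing `count` empty files.
    // Assumption: a local temp dir replaces the mounted FUSE volume.
    public static Path createDirWithFiles(int count) {
        try {
            Path dir = Files.createTempDirectory("listdir-sketch");
            for (int i = 0; i < count; i++) {
                Files.createFile(dir.resolve("file" + i + ".txt"));
            }
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Counts the entries that Files.list reports for the directory.
    public static long countEntries(Path dir) {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Repeats the operation from the thread, Files.list(dir).close(),
    // and returns the mean wall-clock time per call in microseconds.
    public static double averageMicrosPerListing(Path dir, int iterations) {
        try {
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                Files.list(dir).close();
            }
            return (System.nanoTime() - start) / 1_000.0 / iterations;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = createDirWithFiles(8);
        System.out.println("entries: " + countEntries(dir));
        System.out.println("avg us/op: " + averageMicrosPerListing(dir, 10_000));
    }
}
```

Pointing `dir` at a mounted volume instead of the temporary directory would exercise the readdir and getattr paths described above.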
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Long story short, here are the results:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>> Benchmark                        Mode  Cnt    Score   Error  Units
>>>>>>>>>>>>>>>>>>>>>>> BenchmarkTest.testListDirJnr     avgt    5   66,569 ± 3,128  us/op
>>>>>>>>>>>>>>>>>>>>>>> BenchmarkTest.testListDirPanama  avgt    5  189,340 ± 4,275  us/op
>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I've been using panama snapshot at commit
>>>>>>>>>>>>>>>>>>>>>>> 42e03fd7c6a built with: `configure
>>>>>>>>>>>>>>>>>>>>>>> --with-boot-jdk=/Library/Java/JavaVirtualMachines/adoptopenjdk-16.jdk/Contents/Home/
>>>>>>>>>>>>>>>>>>>>>>> --with-native-debug-symbols=none
>>>>>>>>>>>>>>>>>>>>>>> --with-debug-level=release
>>>>>>>>>>>>>>>>>>>>>>> --with-libclang=/usr/local/opt/llvm
>>>>>>>>>>>>>>>>>>>>>>> --with-libclang-version=12`
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I can't tell where this overhead comes from.
>>>>>>>>>>>>>>>>>>>>>>> Maybe creating a newConfinedScope() during each
>>>>>>>>>>>>>>>>>>>>>>> upcall [3] is "too much"? Maybe JNR is just
>>>>>>>>>>>>>>>>>>>>>>> negligently skipping some memory boundary checks
>>>>>>>>>>>>>>>>>>>>>>> to be faster. The results are not terrible, but
>>>>>>>>>>>>>>>>>>>>>>> I'd hoped for something better.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> [1] https://github.com/SerCeMan/jnr-fuse
>>>>>>>>>>>>>>>>>>>>>>> [2] https://github.com/skymatic/fuse-panama/blob/develop/src/test/java/de/skymatic/fusepanama/examples/HelloPanamaFileSystem.java#L139-L146
>>>>>>>>>>>>>>>>>>>>>>> [3] https://github.com/skymatic/fuse-panama/blob/769347575863861063a2347a42b2cbaadb5eacef/src/main/java/de/skymatic/fusepanama/FuseOperations.java#L67-L71
>>>
More information about the panama-dev
mailing list