Real-Life Benchmark for FUSE's readdir()
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Thu Jul 15 21:09:16 UTC 2021
Hah - hit "Send" too soon - of course the numbers below are almost
useless because the file system is set up on a separate JVM which the
benchmark cannot see.
I'll try to change the benchmark to run the file system in the same process.
Maurizio
On 15/07/2021 22:05, Maurizio Cimadamore wrote:
> I managed to do an initial benchmark pass with JMH. The numbers don't
> look great:
>
> ```
> Benchmark Mode Cnt Score Error Units
> BenchmarkTest.testListDirJnr avgt 5 9.858 ± 0.702 us/op
> BenchmarkTest.testListDirPanama avgt 5 38.008 ± 17.573 us/op
> ```
>
> On my machine the Panama fuse seems roughly 4x slower than the JNR one
> (assuming I got the implementation right, that is :-)).
>
> A quick look at GC reveals that the Panama allocation rate is about 4x
> _lower_ than JNR's, while the normalized allocation per operation is
> the same (about 368 B/op):
>
> ```
> Benchmark                                            Mode  Cnt    Score   Error   Units
> BenchmarkTest.testListDirJnr:·gc.alloc.rate          avgt    5   33.255 ± 1.584  MB/sec
> BenchmarkTest.testListDirJnr:·gc.alloc.rate.norm     avgt    5  368.033 ± 0.047    B/op
> BenchmarkTest.testListDirJnr:·gc.count               avgt    5    6.000          counts
> BenchmarkTest.testListDirJnr:·gc.time                avgt    5    9.000              ms
> BenchmarkTest.testListDirPanama:·gc.alloc.rate       avgt    5    8.709 ± 3.887  MB/sec
> BenchmarkTest.testListDirPanama:·gc.alloc.rate.norm  avgt    5  368.046 ± 0.236    B/op
> BenchmarkTest.testListDirPanama:·gc.count            avgt    5    2.000          counts
> BenchmarkTest.testListDirPanama:·gc.time             avgt    5    3.000              ms
> ```
>
> And, looking with perfasm, the distribution of the various methods
> looks similar, and actually not a lot of time is spent in Java at all:
>
> JNR:
>
> ```
> ...[Hottest Methods (after inlining)]..............................................................
> 90.86% kernel [unknown]
> 2.13% c2, level 4 java.nio.file.Files::list, version 913
> 1.44% c2, level 4 de.skymatic.fusepanama.jmh_generated.BenchmarkTest_testListDirJnr_jmhTest::testListDirJnr_avgt_jmhStub, version 934
> 0.81% Unknown, level 0 sun.nio.fs.UnixNativeDispatcher::open0, version 857
> 0.70% libc-2.31.so __close_nocancel
> 0.48% libc-2.31.so malloc
> 0.39% libc-2.31.so _int_free
> 0.32% libc-2.31.so _int_malloc
> 0.29% Unknown, level 0 sun.nio.fs.UnixNativeDispatcher::close0, version 879
> 0.22% libc-2.31.so __close
> 0.21% libc-2.31.so __GI___libc_open
> 0.19% Unknown, level 0 sun.nio.fs.UnixNativeDispatcher::fdopendir, version 859
> 0.16% libc-2.31.so __libc_enable_asynccancel
> 0.16% libc-2.31.so __GI___dup
> 0.15% libc-2.31.so __fxstat64
> 0.14% Unknown, level 0 sun.nio.fs.UnixNativeDispatcher::closedir, version 878
> 0.14% libnio.so Java_sun_nio_fs_UnixNativeDispatcher_open0
> 0.13% libc-2.31.so __libc_disable_asynccancel
> 0.13% libnio.so Java_sun_nio_fs_UnixNativeDispatcher_fdopendir
> 0.13% libnio.so Java_sun_nio_fs_UnixNativeDispatcher_close0
> 0.84% <...other 69 warm methods...>
> ```
>
> Panama:
>
> ```
> ....[Hottest Methods (after inlining)]..............................................................
> 89.07% kernel [unknown]
> 2.35% c2, level 4 java.nio.file.Files::list, version 890
> 1.45% c2, level 4 de.skymatic.fusepanama.jmh_generated.BenchmarkTest_testListDirPanama_jmhTest::testListDirPanama_avgt_jmhStub, version 917
> 1.21% Unknown, level 0 sun.nio.fs.UnixNativeDispatcher::open0, version 849
> 0.62% libc-2.31.so malloc
> 0.60% libc-2.31.so __close_nocancel
> 0.47% libc-2.31.so _int_free
> 0.46% libc-2.31.so _int_malloc
> 0.41% Unknown, level 0 sun.nio.fs.UnixNativeDispatcher::close0, version 869
> 0.32% libc-2.31.so __close
> 0.32% libc-2.31.so __GI___libc_open
> 0.25% Unknown, level 0 sun.nio.fs.UnixNativeDispatcher::fdopendir, version 851
> 0.23% libnio.so Java_sun_nio_fs_UnixNativeDispatcher_open0
> 0.20% Unknown, level 0 sun.nio.fs.UnixNativeDispatcher::closedir, version 868
> 0.17% libc-2.31.so __GI___dup
> 0.15% libc-2.31.so __alloc_dir
> 0.15% libnio.so Java_sun_nio_fs_UnixNativeDispatcher_fdopendir
> 0.15% libc-2.31.so __libc_disable_asynccancel
> 0.13% libc-2.31.so __libc_enable_asynccancel
> 0.12% libnio.so Java_sun_nio_fs_UnixNativeDispatcher_close0
> 1.17% <...other 81 warm methods...>
> ```
>
> So, no smoking gun so far. I'll keep looking.
>
> Maurizio
>
> On 15/07/2021 19:14, Maurizio Cimadamore wrote:
>> Ok, I got it working - what fixed it for me was that the offsets in
>> the "filler" calls have to be set to zero, which, looking around,
>> seems to be a default value that leaves the kernel to take care of it.
>>
>> ```
>> $ ls Volumes/foo
>>
>> total 0
>> -r--r--r-- 1 root root 0 Jan 1 1970 aaa
>> -r--r--r-- 1 root root 0 Jan 1 1970 bbb
>> -r--r--r-- 1 root root 0 Jan 1 1970 ccc
>> -r--r--r-- 1 root root 0 Jan 1 1970 ddd
>> -r--r--r-- 1 root root 13 Jan 1 1970 hello.txt
>> -r--r--r-- 1 root root 0 Jan 1 1970 xxx
>> -r--r--r-- 1 root root 0 Jan 1 1970 yyy
>> -r--r--r-- 1 root root 0 Jan 1 1970 zzz
>> ```
>>
>> Something is probably not right (see "total 0" on top), but it
>> should be good enough for the benchmark. Applying the same fix to the
>> JNR version also fixes that (and results in the same output).
>> Hopefully I'm good to go now :-)
>>
>> Maurizio
>>
>> On 15/07/2021 18:15, Maurizio Cimadamore wrote:
>>> I've re-extracted on Linux, and it seems more lively now. I had to
>>> fix a couple of type mismatches (e.g. long vs. int and int vs.
>>> short) in places, and also some of the fields in the "stat"
>>> structure are different, so the code won't compile as is.
>>>
>>> After fixing these minor issues, I see a lot more output printed when
>>> I mount, and when I do an ls, I see the following lines reported:
>>>
>>> [Thread-110] DEBUG de.skymatic.fusepanama.examples.HelloPanamaFileSystem - getattr() /
>>> [Thread-111] DEBUG de.skymatic.fusepanama.examples.HelloPanamaFileSystem - opendir() /
>>> [Thread-112] DEBUG de.skymatic.fusepanama.examples.HelloPanamaFileSystem - readdir() /
>>> [Thread-113] DEBUG de.skymatic.fusepanama.examples.HelloPanamaFileSystem - releasedir() /
>>>
>>> But that's pretty much it, and in the terminal I still get the
>>> input/output error. I can even debug, but I can't see much of what's
>>> going wrong (and I'm not familiar with this API) - the Java code
>>> executes fine, for what it's worth.
>>>
>>> Maurizio
>>>
>>> On 15/07/2021 17:51, Sebastian Stenzel wrote:
>>>> I'll fix it for Linux and let you know!
>>>>
>>>>> On 15. Jul 2021, at 18:04, Maurizio Cimadamore
>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>
>>>>>
>>>>> On 15/07/2021 17:01, Sebastian Stenzel wrote:
>>>>>> Yeah well they're pretty much mac-specific. On macOS, FUSE has
>>>>>> this magic behaviour where you can tell it to mount to a
>>>>>> non-existing mountpoint inside of `/Volumes/...` and it'll just
>>>>>> create these (and destroy them on unmount). I believe on Linux
>>>>>> you need to define an _existing_ mount point. But it is surely
>>>>>> possible that the volume isn't working yet on Linux. I'll give it
>>>>>> a try myself and fix it, if required.
>>>>>
>>>>> I created a folder Volumes under my home folder and am trying to use
>>>>> that as a mount point, which seems to work ok with JNR.
>>>>>>
>>>>>> The benchmark then needs to be adjusted for the two mountpoints
>>>>>> respectively. But before the benchmark can actually do anything,
>>>>>> a plain `ls` on the terminal needs to work.
>>>>>
>>>>> Ok, it seems even JNR fails:
>>>>>
>>>>> ```
>>>>> $ ls Volumes/bar/
>>>>> ls: reading directory 'Volumes/bar/': Input/output error
>>>>> ```
>>>>>
>>>>>>
>>>>>>> On 15. Jul 2021, at 17:55, Maurizio Cimadamore
>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>
>>>>>>> I tried to reproduce here on Linux, but with no luck - in the
>>>>>>> sense that I'm not super sure on how to run the benchmark.
>>>>>>>
>>>>>>> I'm able somehow to run the two examples - and I noted that here
>>>>>>> JNR works ok, while the Panama one doesn't seem to mount things
>>>>>>> correctly - a new mount appears on my file explorer, but I'm
>>>>>>> unable to do anything with it (even unmount - which can only be
>>>>>>> done at sudo level).
>>>>>>>
>>>>>>> When working with the JNR support, the mount works fine, it
>>>>>>> shows in the file explorer, I can click on that location and
>>>>>>> browse, and then unmount from there. Everything works.
>>>>>>>
>>>>>>> That said, the benchmarks require the mount points to be up and
>>>>>>> running - so I've tried first to execute the example (e.g. JNR)
>>>>>>> and then run the benchmark in two separate terminal windows, all
>>>>>>> via Maven - but the benchmark doesn't seem to do anything (I've
>>>>>>> uncommented the benchmarks of course).
>>>>>>>
>>>>>>> How do you run them?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Maurizio
>>>>>>>
>>>>>>> On 15/07/2021 13:40, Maurizio Cimadamore wrote:
>>>>>>>> I believe that it would be more useful to try to run the
>>>>>>>> perfasm profiler with JMH.
>>>>>>>>
>>>>>>>> This can be done relatively easily, at least on Linux, if you
>>>>>>>> pass the argument `-prof perfasm` to JMH (this needs
>>>>>>>> hsdis-amd64.so on Linux to print readable assembly).
>>>>>>>>
>>>>>>>> Another thing worth checking is allocation rate: `-prof gc`.
>>>>>>>>
>>>>>>>> Maurizio
>>>>>>>>
>>>>>>>> On 15/07/2021 12:30, Sebastian Stenzel wrote:
>>>>>>>>> Ok it really seems like VisualVM can't deal with these kinds
>>>>>>>>> of tasks yet. Now it reports the String constructor being the
>>>>>>>>> culprit [1], however I strongly doubt that, since this is
>>>>>>>>> probably one of the most heavily optimized parts of the JDK.
>>>>>>>>>
>>>>>>>>> [1]: Screenshot on https://imgur.com/a/SHG8RSQ
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On 15. Jul 2021, at 13:11, Sebastian Stenzel
>>>>>>>>>> <sebastian.stenzel at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Yes, must be a sampling error. Do you know of a (publicly
>>>>>>>>>> available) _profiler_ that is compatible with JDK 17 / 18
>>>>>>>>>> already?
>>>>>>>>>>
>>>>>>>>>>> On 15. Jul 2021, at 13:08, Maurizio Cimadamore
>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Aha - it seems like you are seeing what I was seeing:
>>>>>>>>>>> unrolling now seems to happen more reliably, which
>>>>>>>>>>> positively affects code like strlen.
>>>>>>>>>>>
>>>>>>>>>>> As for FUSE, I think the reason for the difference has
>>>>>>>>>>> probably nothing to do with string conversion - the sampling
>>>>>>>>>>> profiler just happens to hit that code a lot. I checked JNR
>>>>>>>>>>> code for string conversion and I couldn't really find
>>>>>>>>>>> anything uber optimized in that regard that could explain
>>>>>>>>>>> the gap.
>>>>>>>>>>>
>>>>>>>>>>> Probably something is not getting optimized as it should -
>>>>>>>>>>> likely a downcall/upcall intrinsification is failing - maybe
>>>>>>>>>>> due to a subtle issue with your code, or, possibly because
>>>>>>>>>>> you are hitting a non-implemented case (e.g. we do not
>>>>>>>>>>> intrinsify calls which pass arguments on the stack, yet), or
>>>>>>>>>>> because of some other bug.
>>>>>>>>>>>
>>>>>>>>>>> Maurizio
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 15/07/2021 12:03, Sebastian Stenzel wrote:
>>>>>>>>>>>> Wow, I stand corrected. I just re-ran the benchmark and
>>>>>>>>>>>> `benchmarkStrlenBase` just got a lot faster!! Your change in
>>>>>>>>>>>> https://github.com/openjdk/jdk17/commit/2db9005c07585b580b3ec0889b8b5e3ed0d0ca6a
>>>>>>>>>>>> DID have an effect after all.
>>>>>>>>>>>>
>>>>>>>>>>>> Just doesn't impress FUSE very much...
>>>>>>>>>>>>
>>>>>>>>>>>>> On 15. Jul 2021, at 13:00, Sebastian Stenzel
>>>>>>>>>>>>> <sebastian.stenzel at gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yup, I tried the int-approach as well, but with worse
>>>>>>>>>>>>> results... Here is the full test:
>>>>>>>>>>>>> https://gist.github.com/overheadhunter/86e7baae7dfe47c49ff364590a4f3ea6
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 15. Jul 2021, at 12:51, Maurizio Cimadamore
>>>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ok. Thanks.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I tried similar experiments where instead of reading 4
>>>>>>>>>>>>>> bytes separately I'd read a single int value, and then
>>>>>>>>>>>>>> use shifts and bitmasking to check for terminators. Good
>>>>>>>>>>>>>> on paper, but benchmark results were always worse than
>>>>>>>>>>>>>> the version we have now (at least on Linux).
>>>>>>>>>>>>>>
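The int-based variant described above can be sketched as follows. This is only an illustration on a plain `byte[]` read through a `ByteBuffer`, not the code that was benchmarked: the class name `SwarStrlen` and the little-endian read order are assumptions made for the example. It relies on the well-known bit trick `(v - 0x01010101) & ~v & 0x80808080`, which is non-zero exactly when at least one of the four bytes of `v` is zero.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class SwarStrlen {
    // Returns the index of the first zero byte, checking 4 bytes per step.
    static int strlen(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        int offset = 0;
        for (; offset + 4 <= data.length; offset += 4) {
            int v = buf.getInt(offset);
            // Non-zero iff one of the four bytes in v is zero.
            int zeros = (v - 0x01010101) & ~v & 0x80808080;
            if (zeros != 0) {
                // Little-endian read: the lowest marker bit belongs to the
                // first zero byte in memory order.
                return offset + (Integer.numberOfTrailingZeros(zeros) >>> 3);
            }
        }
        // Fewer than 4 bytes left: fall back to a byte-wise scan.
        for (; offset < data.length; offset++) {
            if (data[offset] == 0) {
                return offset;
            }
        }
        throw new IllegalArgumentException("No terminator found");
    }
}
```

Whether this beats a byte-wise loop depends on what C2 makes of each version, which is what the measurements in this thread are about.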
>>>>>>>>>>>>>> That said, if you could please share the full string
>>>>>>>>>>>>>> benchmark you have, that'd be helpful, so we can take a
>>>>>>>>>>>>>> look at that, and see what's going wrong (ideally, C2
>>>>>>>>>>>>>> should be the one doing unrolling).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 15/07/2021 11:28, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>> I just did a quick synthetic test on a "manually
>>>>>>>>>>>>>>> unrolled" strlen() without any FUSE context.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I experimented with an implementation that looked like
>>>>>>>>>>>>>>> the following and benchmarked it using a 259 byte memory
>>>>>>>>>>>>>>> segment containing a 239 byte string (null byte at index
>>>>>>>>>>>>>>> 240):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>> private static int strlenUnroll4(MemorySegment segment, long start) {
>>>>>>>>>>>>>>>     // number of bytes available from 'start' to the end of the segment
>>>>>>>>>>>>>>>     long limit = segment.byteSize() - start;
>>>>>>>>>>>>>>>     int offset;
>>>>>>>>>>>>>>>     for (offset = 0; offset < limit - 3; offset += 4) {
>>>>>>>>>>>>>>>         byte b0 = MemoryAccess.getByteAtOffset(segment, start + offset + 0);
>>>>>>>>>>>>>>>         byte b1 = MemoryAccess.getByteAtOffset(segment, start + offset + 1);
>>>>>>>>>>>>>>>         byte b2 = MemoryAccess.getByteAtOffset(segment, start + offset + 2);
>>>>>>>>>>>>>>>         byte b3 = MemoryAccess.getByteAtOffset(segment, start + offset + 3);
>>>>>>>>>>>>>>>         if (b0 == 0 || b1 == 0 || b2 == 0 || b3 == 0) { // is this even faster than directly having 4 different branches?
>>>>>>>>>>>>>>>             if (b0 == 0) {
>>>>>>>>>>>>>>>                 return offset;
>>>>>>>>>>>>>>>             } else if (b1 == 0) {
>>>>>>>>>>>>>>>                 return offset + 1;
>>>>>>>>>>>>>>>             } else if (b2 == 0) {
>>>>>>>>>>>>>>>                 return offset + 2;
>>>>>>>>>>>>>>>             } else if (b3 == 0) {
>>>>>>>>>>>>>>>                 return offset + 3;
>>>>>>>>>>>>>>>             }
>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>     while (offset < limit) { // TODO: maybe no loop required for the remaining <4 bytes?
>>>>>>>>>>>>>>>         byte b = MemoryAccess.getByteAtOffset(segment, start + offset);
>>>>>>>>>>>>>>>         if (b == 0) {
>>>>>>>>>>>>>>>             return offset;
>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>         offset++; // advance; without this the loop never ends on a non-zero byte
>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>     throw new IllegalArgumentException("String too large");
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm not even sure how reliable my results are, since I
>>>>>>>>>>>>>>> have no clue about how branch prediction works here...
>>>>>>>>>>>>>>> Neither have I tested the correctness of this
>>>>>>>>>>>>>>> implementation.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 15. Jul 2021, at 12:18, Maurizio Cimadamore
>>>>>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for reporting back.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We probably need to investigate this a bit more deeply
>>>>>>>>>>>>>>>> and try and reproduce on our side.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> One last question: you said that with manual unrolling
>>>>>>>>>>>>>>>> you managed to get 2x faster: did you mean that string
>>>>>>>>>>>>>>>> conversion got 2x faster or that you actually saw your
>>>>>>>>>>>>>>>> FUSE benchmark going 2x faster because of the manual
>>>>>>>>>>>>>>>> unrolling with strings?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 15/07/2021 11:03, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>> That, surprisingly, didn't change anything either. But
>>>>>>>>>>>>>>>>> don't worry too much, the performance isn't bad (in
>>>>>>>>>>>>>>>>> absolute figures) and it is by far not the only reason
>>>>>>>>>>>>>>>>> why I consider panama the best solution to create java
>>>>>>>>>>>>>>>>> bindings for c libs.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 12. Jul 2021, at 15:33, Maurizio Cimadamore
>>>>>>>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Actually, after some bisecting, I found out that the
>>>>>>>>>>>>>>>>>> performance of converting a memory segment into a
>>>>>>>>>>>>>>>>>> string jumped 2x faster with this fix:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> https://github.com/openjdk/jdk17/commit/2db9005c07585b580b3ec0889b8b5e3ed0d0ca6a
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Which was integrated after the one I originally
>>>>>>>>>>>>>>>>>> pointed at. They both seem to touch loop optimization
>>>>>>>>>>>>>>>>>> in case of overflows, which the strlen code is
>>>>>>>>>>>>>>>>>> triggering (since the loop limit checks for loop
>>>>>>>>>>>>>>>>>> variable being positive).
>>>>>>>>>>>>>>>>>>
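For readers following along, a loop of the kind described here, whose limit check is "the loop variable is still non-negative", could look roughly like the sketch below. It is an assumed shape on a plain `byte[]` with a hypothetical class name, not the actual JDK source.

```java
final class OverflowBoundedStrlen {
    // Scans from 'start' until a zero byte is found.
    static int strlen(byte[] data, int start) {
        // The loop condition only becomes false when 'offset' overflows to a
        // negative value, so the normal exit is the 'return' inside the body.
        for (int offset = 0; offset >= 0; offset++) {
            if (data[start + offset] == 0) {
                return offset;
            }
        }
        throw new IllegalArgumentException("String too large");
    }
}
```

On an array without a terminator this sketch fails with an `ArrayIndexOutOfBoundsException` long before `offset` can overflow; the point is only the shape of the loop condition that the optimizer has to reason about.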
>>>>>>>>>>>>>>>>>> This is a simple patch which adds a string conversion
>>>>>>>>>>>>>>>>>> test:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>> diff --git a/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java b/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>>>>>>>>> index ec4da5ffc88..5b3fb1a2b2a 100644
>>>>>>>>>>>>>>>>>> --- a/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>>>>>>>>> +++ b/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>>>>>>>>> @@ -93,10 +93,13 @@ public class StrLenTest {
>>>>>>>>>>>>>>>>>>  FunctionDescriptor.ofVoid(C_POINTER).withAttribute(FunctionDescriptor.TRIVIAL_ATTRIBUTE_NAME, true));
>>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> +    MemorySegment segment;
>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>      @Setup
>>>>>>>>>>>>>>>>>>      public void setup() {
>>>>>>>>>>>>>>>>>>          str = makeString(size);
>>>>>>>>>>>>>>>>>>          segmentAllocator = SegmentAllocator.ofSegment(MemorySegment.allocateNative(size + 1, ResourceScope.newImplicitScope()));
>>>>>>>>>>>>>>>>>> +        segment = toCString(str, segmentAllocator);
>>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>      @TearDown
>>>>>>>>>>>>>>>>>> @@ -104,6 +107,11 @@ public class StrLenTest {
>>>>>>>>>>>>>>>>>>          scope.close();
>>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> +    @Benchmark
>>>>>>>>>>>>>>>>>> +    public String panama_str_conv() throws Throwable {
>>>>>>>>>>>>>>>>>> +        return CLinker.toJavaString(segment);
>>>>>>>>>>>>>>>>>> +    }
>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>      @Benchmark
>>>>>>>>>>>>>>>>>>      public int jni_strlen() throws Throwable {
>>>>>>>>>>>>>>>>>>          return strlen(str);
>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Before the above fix, the numbers are as follows:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>> Benchmark                   (size)  Mode  Cnt    Score   Error  Units
>>>>>>>>>>>>>>>>>> StrLenTest.panama_str_conv     100  avgt   30  106.613 ± 7.060  ns/op
>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> While after the fix I get this:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>> Benchmark                   (size)  Mode  Cnt   Score   Error  Units
>>>>>>>>>>>>>>>>>> StrLenTest.panama_str_conv     100  avgt   30  48.120 ± 0.557  ns/op
>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> So, as you can see, a pretty sizeable jump.
>>>>>>>>>>>>>>>>>> Eyeballing, the shape of generated code doesn't look
>>>>>>>>>>>>>>>>>> too different, which makes me think of another case
>>>>>>>>>>>>>>>>>> where loop is unrolled, but main loop never executed
>>>>>>>>>>>>>>>>>> (similar to JDK-8269230), but we'll need to look deeper.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 12/07/2021 14:12, Maurizio Cimadamore wrote:
>>>>>>>>>>>>>>>>>>> On 12/07/2021 13:18, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>>>>> Hey Maurizio,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> All tests have been done on commit 42e03fd7c6a (for
>>>>>>>>>>>>>>>>>>>> details how I built the JDK, see my initial email).
>>>>>>>>>>>>>>>>>>>> Maybe I'm missing some compiler flags to enable all
>>>>>>>>>>>>>>>>>>>> optimizations?
>>>>>>>>>>>>>>>>>>> I see - you do have the latest panama changes, but
>>>>>>>>>>>>>>>>>>> there has been a sync with upstream after that
>>>>>>>>>>>>>>>>>>> changeset, I believe - can you please try to resync
>>>>>>>>>>>>>>>>>>> with the latest foreign-jextract commit - which
>>>>>>>>>>>>>>>>>>> should be:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> https://github.com/openjdk/panama-foreign/commit/b2a284f6678c6a0d78fbdc8655695119ccb0dadb
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Are you sure about loop vectorization being applied
>>>>>>>>>>>>>>>>>>>> to strlen? I'm not an expert on this field, but I
>>>>>>>>>>>>>>>>>>>> had the impression this wasn't possible when the
>>>>>>>>>>>>>>>>>>>> loop terminates "from within".
>>>>>>>>>>>>>>>>>>> Vlad is the expert here - when chatting offline he
>>>>>>>>>>>>>>>>>>> did mention that loop should have single exit -
>>>>>>>>>>>>>>>>>>> which I guess also takes into account the "normal"
>>>>>>>>>>>>>>>>>>> exit - so the strlen routine would seem to have two
>>>>>>>>>>>>>>>>>>> exits...
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On 12. Jul 2021, at 13:50, Maurizio Cimadamore
>>>>>>>>>>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Sebastian,
>>>>>>>>>>>>>>>>>>>>> thanks for sharing your findings - I've done some
>>>>>>>>>>>>>>>>>>>>> attempts here with a targeted microbenchmark which
>>>>>>>>>>>>>>>>>>>>> measures the performance of string conversion and
>>>>>>>>>>>>>>>>>>>>> I'm seeing unrolling and vectorization being
>>>>>>>>>>>>>>>>>>>>> applied on the strlen computation.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> May I ask if, by any chance, your HEAD has not
>>>>>>>>>>>>>>>>>>>>> been updated in the last few weeks? There has been
>>>>>>>>>>>>>>>>>>>>> a C2 optimization fix which has been added
>>>>>>>>>>>>>>>>>>>>> recently, which I think might be related to this:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8269230
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Do you have this fix in the JDK you are using?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On 10/07/2021 15:58, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> good idea, but it makes no difference beyond
>>>>>>>>>>>>>>>>>>>>>> statistical error.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I started sampling the application with VisualVM
>>>>>>>>>>>>>>>>>>>>>> (which is quite hard, since native threads are
>>>>>>>>>>>>>>>>>>>>>> extremely short-lived). What I noticed is that,
>>>>>>>>>>>>>>>>>>>>>> regardless of where the sampler interrupts a
>>>>>>>>>>>>>>>>>>>>>> thread, in nearly all cases 100% of CPU time is
>>>>>>>>>>>>>>>>>>>>>> attributed to
>>>>>>>>>>>>>>>>>>>>>> jdk.internal.foreign.abi.SharedUtils.toJavaStringInternal()
>>>>>>>>>>>>>>>>>>>>>> → jdk.internal.foreign.abi.SharedUtils.strlen().
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I know that strlen can hardly be optimized due to
>>>>>>>>>>>>>>>>>>>>>> the nature of null termination, but maybe we can
>>>>>>>>>>>>>>>>>>>>>> make use of the fact that we're dealing with
>>>>>>>>>>>>>>>>>>>>>> MemorySegments here: Since they protect us from
>>>>>>>>>>>>>>>>>>>>>> overflows, maybe there is no need to look at only
>>>>>>>>>>>>>>>>>>>>>> a single byte at a time. Maybe the strlen()-loop
>>>>>>>>>>>>>>>>>>>>>> can be unrolled or even be vectorized.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I just did a quick test and observed a x2 speedup
>>>>>>>>>>>>>>>>>>>>>> when doing a x4 loop unroll.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On 9. Jul 2021, at 20:30, Jorn Vernee
>>>>>>>>>>>>>>>>>>>>>>> <jorn.vernee at oracle.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi Sebastian,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks for testing this. Looking at your code,
>>>>>>>>>>>>>>>>>>>>>>> one possible explanation for the discrepancy I
>>>>>>>>>>>>>>>>>>>>>>> can think of is that the DirFiller ends up using
>>>>>>>>>>>>>>>>>>>>>>> virtual downcalls to do its work, which are
>>>>>>>>>>>>>>>>>>>>>>> currently not intrinsified. This is mostly a case
>>>>>>>>>>>>>>>>>>>>>>> of 'not implemented yet', i.e. it is a known issue.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>> static fuse_fill_dir_t ofAddress(MemoryAddress addr) {
>>>>>>>>>>>>>>>>>>>>>>>     return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>>>>>>>>>         try {
>>>>>>>>>>>>>>>>>>>>>>>             // <--------- 'addr' here is not a constant, so the call is virtual
>>>>>>>>>>>>>>>>>>>>>>>             return (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, x0, x1, x2, x3);
>>>>>>>>>>>>>>>>>>>>>>>         } catch (Throwable ex$) {
>>>>>>>>>>>>>>>>>>>>>>>             throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>>>>>>>>>     };
>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> For testing purposes, a possible workaround
>>>>>>>>>>>>>>>>>>>>>>> could be to have a cache that maps the callback
>>>>>>>>>>>>>>>>>>>>>>> address to a method handle that has the address
>>>>>>>>>>>>>>>>>>>>>>> bound to the first parameter. Assuming readdir
>>>>>>>>>>>>>>>>>>>>>>> always gets the same filler callback address,
>>>>>>>>>>>>>>>>>>>>>>> the same MethodHandle will be reused and
>>>>>>>>>>>>>>>>>>>>>>> eventually customized which means the callback
>>>>>>>>>>>>>>>>>>>>>>> address will become constant, and the downcall
>>>>>>>>>>>>>>>>>>>>>>> should then be intrinsified.
>>>>>>>>>>>>>>>>>>>>>>>
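The caching idea in the paragraph above can be illustrated with plain `java.lang.invoke`, independent of the Panama API. This is a sketch under stated assumptions: `BoundHandleCache`, `filler` and `fill` are hypothetical names, and `filler` is an ordinary static method standing in for the native downcall handle. The only point is that each address maps to one reusable `MethodHandle` with the address already bound as its first argument.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class BoundHandleCache {
    // Hypothetical stand-in for the native filler downcall: the callback
    // address is the leading parameter, like 'addr' in the generated code.
    static int filler(long addr, String name, long offset) {
        return name.isEmpty() ? 1 : 0;
    }

    private static final MethodHandle FILLER;
    static {
        try {
            FILLER = MethodHandles.lookup().findStatic(
                    BoundHandleCache.class, "filler",
                    MethodType.methodType(int.class, long.class, String.class, long.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // One bound handle per callback address; lookups return the same instance.
    private static final Map<Long, MethodHandle> CACHE = new ConcurrentHashMap<>();

    static MethodHandle forAddress(long addr) {
        // insertArguments binds 'addr' as the first argument of FILLER.
        return CACHE.computeIfAbsent(addr, a -> MethodHandles.insertArguments(FILLER, 0, a));
    }

    static int fill(long addr, String name, long offset) {
        try {
            return (int) forAddress(addr).invokeExact(name, offset);
        } catch (Throwable t) {
            throw new AssertionError("should not reach here", t);
        }
    }
}
```

Calling `BoundHandleCache.fill(addr, name, 0L)` repeatedly with the same `addr` reuses one handle, which is the precondition the paragraph names for the handle being customized over time.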
>>>>>>>>>>>>>>>>>>>>>>> I don't currently have access to a Mac machine
>>>>>>>>>>>>>>>>>>>>>>> to test this, but if you want to try it out, the
>>>>>>>>>>>>>>>>>>>>>>> patch should be this:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>> diff --git a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>>>>>>>>> index bfd4655..4c68d4c 100644
>>>>>>>>>>>>>>>>>>>>>>> --- a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>>>>>>>>> +++ b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>>>>>>>>> @@ -3,8 +3,12 @@
>>>>>>>>>>>>>>>>>>>>>>>  package de.skymatic.fusepanama.lowlevel;
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>  import java.lang.invoke.MethodHandle;
>>>>>>>>>>>>>>>>>>>>>>> +import java.lang.invoke.MethodHandles;
>>>>>>>>>>>>>>>>>>>>>>>  import java.lang.invoke.VarHandle;
>>>>>>>>>>>>>>>>>>>>>>>  import java.nio.ByteOrder;
>>>>>>>>>>>>>>>>>>>>>>> +import java.util.Map;
>>>>>>>>>>>>>>>>>>>>>>> +import java.util.concurrent.ConcurrentHashMap;
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>>  import jdk.incubator.foreign.*;
>>>>>>>>>>>>>>>>>>>>>>>  import static jdk.incubator.foreign.CLinker.*;
>>>>>>>>>>>>>>>>>>>>>>>  public interface fuse_fill_dir_t {
>>>>>>>>>>>>>>>>>>>>>>> @@ -17,13 +21,19 @@ public interface fuse_fill_dir_t {
>>>>>>>>>>>>>>>>>>>>>>>          return RuntimeHelper.upcallStub(fuse_fill_dir_t.class, fi, constants$0.fuse_fill_dir_t$FUNC, "(Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;J)I", scope);
>>>>>>>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>>>>>>>      static fuse_fill_dir_t ofAddress(MemoryAddress addr) {
>>>>>>>>>>>>>>>>>>>>>>> -        return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>>>>>>>>> -            try {
>>>>>>>>>>>>>>>>>>>>>>> -                return (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, x0, x1, x2, x3);
>>>>>>>>>>>>>>>>>>>>>>> -            } catch (Throwable ex$) {
>>>>>>>>>>>>>>>>>>>>>>> -                throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>>>>>>>>> -            }
>>>>>>>>>>>>>>>>>>>>>>> -        };
>>>>>>>>>>>>>>>>>>>>>>> +        class CacheHolder {
>>>>>>>>>>>>>>>>>>>>>>> +            static final Map<MemoryAddress, fuse_fill_dir_t> CACHE = new ConcurrentHashMap<>();
>>>>>>>>>>>>>>>>>>>>>>> +        }
>>>>>>>>>>>>>>>>>>>>>>> +        return CacheHolder.CACHE.computeIfAbsent(addr, addrK -> {
>>>>>>>>>>>>>>>>>>>>>>> +            final MethodHandle target = MethodHandles.insertArguments(constants$0.fuse_fill_dir_t$MH, 0, addrK);
>>>>>>>>>>>>>>>>>>>>>>> +            return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>>>>>>>>> +                try {
>>>>>>>>>>>>>>>>>>>>>>> +                    return (int)target.invokeExact(x0, x1, x2, x3);
>>>>>>>>>>>>>>>>>>>>>>> +                } catch (Throwable ex$) {
>>>>>>>>>>>>>>>>>>>>>>> +                    throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>>>>>>>>> +                }
>>>>>>>>>>>>>>>>>>>>>>> +            };
>>>>>>>>>>>>>>>>>>>>>>> +        });
>>>>>>>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>> (I hope these code blocks don't get mangled too
>>>>>>>>>>>>>>>>>>>>>>> much by line wrapping)
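To illustrate the caching idea outside of Panama, here is a minimal sketch that uses only java.lang.invoke. Everything named in it (BoundHandleCacheSketch, Filler, fill, and the use of a plain long as the "address") is a hypothetical stand-in, not the jextract-generated code. It only shows binding the first argument with MethodHandles.insertArguments and caching the resulting wrapper per address. Whether the JIT then treats the bound address as a constant is not something this sketch demonstrates.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class BoundHandleCacheSketch {

    /** Hypothetical stand-in for the generated fuse_fill_dir_t interface. */
    interface Filler {
        int apply(long buf, long name, long stat, long off);
    }

    // Hypothetical stand-in for the native downcall target: the first
    // parameter plays the role of the callback address.
    static int fill(long addr, long buf, long name, long stat, long off) {
        return (int) (addr + off);
    }

    // Unbound handle of type (long,long,long,long,long)int.
    static final MethodHandle FILL_MH;
    static {
        try {
            FILL_MH = MethodHandles.lookup().findStatic(
                    BoundHandleCacheSketch.class, "fill",
                    MethodType.methodType(int.class,
                            long.class, long.class, long.class, long.class, long.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // One wrapper per address, so repeated lookups return the same instance.
    static final Map<Long, Filler> CACHE = new ConcurrentHashMap<>();

    static Filler ofAddress(long addr) {
        return CACHE.computeIfAbsent(addr, key -> {
            // Bind the address as argument 0; the result has type (long,long,long,long)int.
            MethodHandle target = MethodHandles.insertArguments(FILL_MH, 0, key);
            return (buf, name, stat, off) -> {
                try {
                    return (int) target.invokeExact(buf, name, stat, off);
                } catch (Throwable t) {
                    throw new AssertionError("should not reach here", t);
                }
            };
        });
    }
}
```

Calling ofAddress twice with the same value returns the same Filler instance, which is the property the workaround relies on when readdir always receives the same filler address.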
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> HTH,
>>>>>>>>>>>>>>>>>>>>>>> Jorn
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On 09/07/2021 10:58, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I wanted to share the results of a benchmark
>>>>>>>>>>>>>>>>>>>>>>>> test that includes several down- and upcalls.
>>>>>>>>>>>>>>>>>>>>>>>> First, let me explain what I'm testing here:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I'm working on a panama-based FUSE binding,
>>>>>>>>>>>>>>>>>>>>>>>> mostly for experimental purposes right now, and
>>>>>>>>>>>>>>>>>>>>>>>> I'm trying to beat fuse-jnr [1].
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> While there are some other interesting metrics,
>>>>>>>>>>>>>>>>>>>>>>>> such as read/write performance (both
>>>>>>>>>>>>>>>>>>>>>>>> sequential and random access), I focused on
>>>>>>>>>>>>>>>>>>>>>>>> directory listings for now. Directory listings
>>>>>>>>>>>>>>>>>>>>>>>> are the most complex operation with regard to
>>>>>>>>>>>>>>>>>>>>>>>> the number of down- and upcalls:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> 1. FUSE upcalls readdir and provides a callback
>>>>>>>>>>>>>>>>>>>>>>>> function
>>>>>>>>>>>>>>>>>>>>>>>> 2. Java downcalls the callback for each item in
>>>>>>>>>>>>>>>>>>>>>>>> the directory
>>>>>>>>>>>>>>>>>>>>>>>> 3. FUSE upcalls getattr for each item (no
>>>>>>>>>>>>>>>>>>>>>>>> longer required with "readdirplus" in FUSE 3.x)
>>>>>>>>>>>>>>>>>>>>>>>> (4. I'm testing on macOS, which introduces
>>>>>>>>>>>>>>>>>>>>>>>> additional noise (such as readxattr and trying
>>>>>>>>>>>>>>>>>>>>>>>> to access files that I didn't report in readdir))
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> So, what I'm testing is essentially this:
>>>>>>>>>>>>>>>>>>>>>>>> `Files.list(Path.of("/Volumes/foo")).close();`
>>>>>>>>>>>>>>>>>>>>>>>> with the volume reporting eight files [2]. When
>>>>>>>>>>>>>>>>>>>>>>>> mounting with debug logs enabled, I can see
>>>>>>>>>>>>>>>>>>>>>>>> that the exact same operations in the same
>>>>>>>>>>>>>>>>>>>>>>>> order are invoked on both fuse-jnr and
>>>>>>>>>>>>>>>>>>>>>>>> fuse-panama. A single directory listing results in
>>>>>>>>>>>>>>>>>>>>>>>> 2 readdir upcalls, 10 callback downcalls, and 16
>>>>>>>>>>>>>>>>>>>>>>>> getattr upcalls. There are also 8 getxattr
>>>>>>>>>>>>>>>>>>>>>>>> calls and 16 lookup calls; however, they don't
>>>>>>>>>>>>>>>>>>>>>>>> reach Java, as the FUSE kernel knows they are
>>>>>>>>>>>>>>>>>>>>>>>> not implemented.
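For readers without a FUSE mount at hand, the measured operation can be approximated with a rough sketch like the one below. It is not the JMH benchmark from this thread: it lists a temporary directory holding eight files instead of /Volumes/foo, uses a plain System.nanoTime() loop, and consumes the stream so the entry count can be checked (the original only opens and closes it). ListDirSketch and its method names are made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

class ListDirSketch {

    /** Creates a temporary directory containing n empty files (stand-in for the FUSE volume). */
    static Path createDirWithFiles(int n) {
        try {
            Path dir = Files.createTempDirectory("listdir-sketch");
            for (int i = 0; i < n; i++) {
                Files.createFile(dir.resolve("file" + i + ".txt"));
            }
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Lists the directory once and returns the number of entries; the stream is always closed. */
    static long listOnce(Path dir) {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Crude average time per listing in microseconds; no warm-up or statistics, unlike JMH. */
    static double averageMicros(Path dir, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            listOnce(dir);
        }
        return (System.nanoTime() - start) / 1_000.0 / iterations;
    }
}
```

Pointing createDirWithFiles at a mounted volume instead of a temporary directory would be needed to exercise the readdir and getattr upcalls described above; on a regular file system the numbers say nothing about either binding.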
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Long story short, here are the results:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>> Benchmark                        Mode  Cnt    Score   Error  Units
>>>>>>>>>>>>>>>>>>>>>>>> BenchmarkTest.testListDirJnr     avgt    5   66,569 ± 3,128  us/op
>>>>>>>>>>>>>>>>>>>>>>>> BenchmarkTest.testListDirPanama  avgt    5  189,340 ± 4,275  us/op
>>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I've been using panama snapshot at commit
>>>>>>>>>>>>>>>>>>>>>>>> 42e03fd7c6a built with: `configure
>>>>>>>>>>>>>>>>>>>>>>>> --with-boot-jdk=/Library/Java/JavaVirtualMachines/adoptopenjdk-16.jdk/Contents/Home/
>>>>>>>>>>>>>>>>>>>>>>>> --with-native-debug-symbols=none
>>>>>>>>>>>>>>>>>>>>>>>> --with-debug-level=release
>>>>>>>>>>>>>>>>>>>>>>>> --with-libclang=/usr/local/opt/llvm
>>>>>>>>>>>>>>>>>>>>>>>> --with-libclang-version=12`
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I can't tell where this overhead comes from.
>>>>>>>>>>>>>>>>>>>>>>>> Maybe creating a newConfinedScope() during each
>>>>>>>>>>>>>>>>>>>>>>>> upcall [3] is "too much"? Maybe JNR is just
>>>>>>>>>>>>>>>>>>>>>>>> negligently skipping some memory boundary
>>>>>>>>>>>>>>>>>>>>>>>> checks to be faster. The results are not
>>>>>>>>>>>>>>>>>>>>>>>> terrible, but I'd hoped for something better.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> [1] https://github.com/SerCeMan/jnr-fuse
>>>>>>>>>>>>>>>>>>>>>>>> [2] https://github.com/skymatic/fuse-panama/blob/develop/src/test/java/de/skymatic/fusepanama/examples/HelloPanamaFileSystem.java#L139-L146
>>>>>>>>>>>>>>>>>>>>>>>> [3] https://github.com/skymatic/fuse-panama/blob/769347575863861063a2347a42b2cbaadb5eacef/src/main/java/de/skymatic/fusepanama/FuseOperations.java#L67-L71
>>>>