Real-Life Benchmark for FUSE's readdir()

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Thu Jul 15 21:09:16 UTC 2021


Hah - hit "Send" too soon - of course the numbers below are almost 
useless because the file system is set up on a separate JVM which the 
benchmark cannot see.

I'll try to change the benchmark to run the file system in the same process.

Maurizio

On 15/07/2021 22:05, Maurizio Cimadamore wrote:
> I managed to do an initial benchmark pass with JMH. The numbers don't 
> look great:
>
> ```
> Benchmark                        Mode  Cnt   Score    Error  Units
> BenchmarkTest.testListDirJnr     avgt    5   9.858 ±  0.702  us/op
> BenchmarkTest.testListDirPanama  avgt    5  38.008 ± 17.573  us/op
> ```
>
> On my machine the Panama fuse seems roughly 4x slower than the JNR one 
> (assuming I got the implementation right, that is :-)).
>
> A quick look at GC reveals that Panama's allocation rate is about 4x 
> _lower_ than JNR's, while the normalized allocation per operation is 
> the same (about 368 B/op):
>
> ```
> Benchmark                                            Mode  Cnt    Score   Error   Units
> BenchmarkTest.testListDirJnr:·gc.alloc.rate          avgt    5   33.255 ± 1.584  MB/sec
> BenchmarkTest.testListDirJnr:·gc.alloc.rate.norm     avgt    5  368.033 ± 0.047    B/op
> BenchmarkTest.testListDirJnr:·gc.count               avgt    5    6.000          counts
> BenchmarkTest.testListDirJnr:·gc.time                avgt    5    9.000              ms
> BenchmarkTest.testListDirPanama:·gc.alloc.rate       avgt    5    8.709 ± 3.887  MB/sec
> BenchmarkTest.testListDirPanama:·gc.alloc.rate.norm  avgt    5  368.046 ± 0.236    B/op
> BenchmarkTest.testListDirPanama:·gc.count            avgt    5    2.000          counts
> BenchmarkTest.testListDirPanama:·gc.time             avgt    5    3.000              ms
> ```
>
> And, looking with perfasm, the distribution of the various methods 
> looks similar, and actually not a lot of time is spent in Java at all:
>
> JNR:
>
> ```
> ...[Hottest Methods (after inlining)]..............................................................
>  90.86%              kernel  [unknown]
>   2.13%         c2, level 4  java.nio.file.Files::list, version 913
>   1.44%         c2, level 4  de.skymatic.fusepanama.jmh_generated.BenchmarkTest_testListDirJnr_jmhTest::testListDirJnr_avgt_jmhStub, version 934
>   0.81%    Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::open0, version 857
>   0.70%        libc-2.31.so  __close_nocancel
>   0.48%        libc-2.31.so  malloc
>   0.39%        libc-2.31.so  _int_free
>   0.32%        libc-2.31.so  _int_malloc
>   0.29%    Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::close0, version 879
>   0.22%        libc-2.31.so  __close
>   0.21%        libc-2.31.so  __GI___libc_open
>   0.19%    Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::fdopendir, version 859
>   0.16%        libc-2.31.so  __libc_enable_asynccancel
>   0.16%        libc-2.31.so  __GI___dup
>   0.15%        libc-2.31.so  __fxstat64
>   0.14%    Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::closedir, version 878
>   0.14%           libnio.so  Java_sun_nio_fs_UnixNativeDispatcher_open0
>   0.13%        libc-2.31.so  __libc_disable_asynccancel
>   0.13%           libnio.so  Java_sun_nio_fs_UnixNativeDispatcher_fdopendir
>   0.13%           libnio.so  Java_sun_nio_fs_UnixNativeDispatcher_close0
>   0.84%  <...other 69 warm methods...>
> ```
>
> Panama:
>
> ```
> ....[Hottest Methods (after inlining)]..............................................................
>  89.07%              kernel  [unknown]
>   2.35%         c2, level 4  java.nio.file.Files::list, version 890
>   1.45%         c2, level 4  de.skymatic.fusepanama.jmh_generated.BenchmarkTest_testListDirPanama_jmhTest::testListDirPanama_avgt_jmhStub, version 917
>   1.21%    Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::open0, version 849
>   0.62%        libc-2.31.so  malloc
>   0.60%        libc-2.31.so  __close_nocancel
>   0.47%        libc-2.31.so  _int_free
>   0.46%        libc-2.31.so  _int_malloc
>   0.41%    Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::close0, version 869
>   0.32%        libc-2.31.so  __close
>   0.32%        libc-2.31.so  __GI___libc_open
>   0.25%    Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::fdopendir, version 851
>   0.23%           libnio.so  Java_sun_nio_fs_UnixNativeDispatcher_open0
>   0.20%    Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::closedir, version 868
>   0.17%        libc-2.31.so  __GI___dup
>   0.15%        libc-2.31.so  __alloc_dir
>   0.15%           libnio.so  Java_sun_nio_fs_UnixNativeDispatcher_fdopendir
>   0.15%        libc-2.31.so  __libc_disable_asynccancel
>   0.13%        libc-2.31.so  __libc_enable_asynccancel
>   0.12%           libnio.so  Java_sun_nio_fs_UnixNativeDispatcher_close0
>   1.17%  <...other 81 warm methods...>
> ```
>
> So, no smoking gun so far. I'll keep looking.
>
> Maurizio
>
> On 15/07/2021 19:14, Maurizio Cimadamore wrote:
>> Ok, I got it working - what fixed it for me was that the offsets in 
>> the "filler" calls have to be set to zero, which, looking around, 
>> seems to be a default value that leaves it to the kernel to manage them.
>>
>> ```
>> $ ls Volumes/foo
>>
>> total 0
>> -r--r--r-- 1 root root  0 Jan  1  1970 aaa
>> -r--r--r-- 1 root root  0 Jan  1  1970 bbb
>> -r--r--r-- 1 root root  0 Jan  1  1970 ccc
>> -r--r--r-- 1 root root  0 Jan  1  1970 ddd
>> -r--r--r-- 1 root root 13 Jan  1  1970 hello.txt
>> -r--r--r-- 1 root root  0 Jan  1  1970 xxx
>> -r--r--r-- 1 root root  0 Jan  1  1970 yyy
>> -r--r--r-- 1 root root  0 Jan  1  1970 zzz
>> ```
>>
>> Something is probably not right (see "total 0" on top), but it should 
>> perhaps be good enough for the benchmark. Applying the same fix to the 
>> JNR version also fixes that (and results in the same output). Hopefully 
>> I'm good to go now :-)
>>
>> Maurizio
>>
>> On 15/07/2021 18:15, Maurizio Cimadamore wrote:
>>> I've re-extracted on Linux, and it seems more lively now. I had to 
>>> fix a couple of type mismatches (e.g. long vs. int and int vs. 
>>> short) in places, and also some of the fields in the "stat" 
>>> structure are different, so the code won't compile as is.
>>>
>>> After fixing these minor issues, I see a lot more output printed when 
>>> I mount, and when I do an ls, I see the following lines reported:
>>>
>>> [Thread-110] DEBUG de.skymatic.fusepanama.examples.HelloPanamaFileSystem - getattr() /
>>> [Thread-111] DEBUG de.skymatic.fusepanama.examples.HelloPanamaFileSystem - opendir() /
>>> [Thread-112] DEBUG de.skymatic.fusepanama.examples.HelloPanamaFileSystem - readdir() /
>>> [Thread-113] DEBUG de.skymatic.fusepanama.examples.HelloPanamaFileSystem - releasedir() /
>>>
>>> But that's pretty much it, and in the terminal I still get the 
>>> input/output error. I can even debug, but I can't see much of what's 
>>> going wrong (and I'm not familiar with this API) - the Java code 
>>> executes fine, for what it's worth.
>>>
>>> Maurizio
>>>
>>> On 15/07/2021 17:51, Sebastian Stenzel wrote:
>>>> I'll fix it for Linux and let you know!
>>>>
>>>>> On 15. Jul 2021, at 18:04, Maurizio Cimadamore 
>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>
>>>>>
>>>>> On 15/07/2021 17:01, Sebastian Stenzel wrote:
>>>>>> Yeah well they're pretty much mac-specific. On macOS, FUSE has 
>>>>>> this magic behaviour where you can tell it to mount to a 
>>>>>> non-existing mountpoint inside of `/Volumes/...` and it'll just 
>>>>>> create these (and destroy them on unmount). I believe on Linux 
>>>>>> you need to define an _existing_ mount point. But it is surely 
>>>>>> possible that the volume isn't working yet on Linux. I'll give it 
>>>>>> a try myself and fix it, if required.
>>>>>
>>>>> I created a folder Volumes under my home folder and am trying to use 
>>>>> that as a mount point, which seems to work ok with JNR.
>>>>>>
>>>>>> The benchmark then needs to be adjusted for the two mountpoints 
>>>>>> respectively. But before the benchmark can actually do anything, 
>>>>>> a plain `ls` on the terminal needs to work.
>>>>>
>>>>> Ok, it seems even JNR fails:
>>>>>
>>>>> ```
>>>>> $ ls Volumes/bar/
>>>>> ls: reading directory 'Volumes/bar/': Input/output error
>>>>> ```
>>>>>
>>>>>>
>>>>>>> On 15. Jul 2021, at 17:55, Maurizio Cimadamore 
>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>
>>>>>>> I tried to reproduce here on Linux, but with no luck - in the 
>>>>>>> sense that I'm not quite sure how to run the benchmark.
>>>>>>>
>>>>>>> I'm able somehow to run the two examples - and I noted that here 
>>>>>>> JNR works ok, while the Panama one doesn't seem to mount things 
>>>>>>> correctly - a new mount appears on my file explorer, but I'm 
>>>>>>> unable to do anything with it (even unmount - which can only be 
>>>>>>> done at sudo level).
>>>>>>>
>>>>>>> When working with the JNR support, the mount works fine, it 
>>>>>>> shows in the file explorer, I can click on that location and 
>>>>>>> browse, and then unmount from there. Everything works.
>>>>>>>
>>>>>>> That said, the benchmarks require the mount points to be up and 
>>>>>>> running - so I've tried first to execute the example (e.g. JNR) 
>>>>>>> and then run the benchmark in two separate terminal windows, all 
>>>>>>> via Maven - but the benchmark doesn't seem to do anything (I've 
>>>>>>> uncommented the benchmarks of course).
>>>>>>>
>>>>>>> How do you run them?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Maurizio
>>>>>>>
>>>>>>> On 15/07/2021 13:40, Maurizio Cimadamore wrote:
>>>>>>>> I believe that it would be more useful to try to run the 
>>>>>>>> perfasm profiler with JMH.
>>>>>>>>
>>>>>>>> This can be done relatively easily, at least on Linux, if you 
>>>>>>>> pass the argument `-prof perfasm` to JMH. (This would need 
>>>>>>>> hsdis-amd64.so on Linux to print readable assembly.)
>>>>>>>>
>>>>>>>> Another thing worth checking is allocation rate: `-prof gc`.
>>>>>>>>
>>>>>>>> Maurizio
>>>>>>>>
>>>>>>>> On 15/07/2021 12:30, Sebastian Stenzel wrote:
>>>>>>>>> Ok it really seems like VisualVM can't deal with these kinds 
>>>>>>>>> of tasks yet. Now it reports the String constructor being the 
>>>>>>>>> culprit [1], however I strongly doubt that, since this is 
>>>>>>>>> probably one of the most heavily optimized parts of the JDK.
>>>>>>>>>
>>>>>>>>> [1]: Screenshot on https://imgur.com/a/SHG8RSQ
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On 15. Jul 2021, at 13:11, Sebastian Stenzel 
>>>>>>>>>> <sebastian.stenzel at gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Yes, must be a sampling error. Do you know of a (publicly 
>>>>>>>>>> available) _profiler_ that is compatible with JDK 17 / 18 
>>>>>>>>>> already?
>>>>>>>>>>
>>>>>>>>>>> On 15. Jul 2021, at 13:08, Maurizio Cimadamore 
>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Aha - it seems like you are seeing what I was seeing: 
>>>>>>>>>>> unrolling now seems to happen more reliably, which 
>>>>>>>>>>> positively affects code like strlen.
>>>>>>>>>>>
>>>>>>>>>>> As for FUSE, I think the reason for the difference has 
>>>>>>>>>>> probably nothing to do with string conversion - the sampling 
>>>>>>>>>>> profiler just happens to hit that code a lot. I checked the JNR 
>>>>>>>>>>> code for string conversion and I couldn't really find 
>>>>>>>>>>> anything uber-optimized in that regard that could explain 
>>>>>>>>>>> the gap.
>>>>>>>>>>>
>>>>>>>>>>> Probably something is not getting optimized as it should - 
>>>>>>>>>>> likely a downcall/upcall intrinsification is failing - maybe 
>>>>>>>>>>> due to a subtle issue with your code, or, possibly because 
>>>>>>>>>>> you are hitting a non-implemented case (e.g. we do not 
>>>>>>>>>>> intrinsify calls which pass arguments on the stack, yet), or 
>>>>>>>>>>> because of some other bug.
>>>>>>>>>>>
>>>>>>>>>>> Maurizio
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 15/07/2021 12:03, Sebastian Stenzel wrote:
>>>>>>>>>>>> Wow, I stand corrected. I just re-ran the benchmark and 
>>>>>>>>>>>> `benchmarkStrlenBase` just got a lot faster!! Your change in 
>>>>>>>>>>>> https://github.com/openjdk/jdk17/commit/2db9005c07585b580b3ec0889b8b5e3ed0d0ca6a 
>>>>>>>>>>>> DID have an effect after all.
>>>>>>>>>>>>
>>>>>>>>>>>> Just doesn't impress FUSE very much...
>>>>>>>>>>>>
>>>>>>>>>>>>> On 15. Jul 2021, at 13:00, Sebastian Stenzel 
>>>>>>>>>>>>> <sebastian.stenzel at gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yup, I tried the int-approach as well, but with worse 
>>>>>>>>>>>>> results... Here is the full test: 
>>>>>>>>>>>>> https://gist.github.com/overheadhunter/86e7baae7dfe47c49ff364590a4f3ea6
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 15. Jul 2021, at 12:51, Maurizio Cimadamore 
>>>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ok. Thanks.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I tried similar experiments where instead of reading 4 
>>>>>>>>>>>>>> bytes separately I'd read a single int value, and then 
>>>>>>>>>>>>>> use shifts and bitmasking to check for terminators. On 
>>>>>>>>>>>>>> paper good, but benchmark results were always worse than 
>>>>>>>>>>>>>> the version we have now (at least on Linux).
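
The int-based variant described above (read four bytes as one int, then use bitmasking to look for a terminator) can be sketched as follows. This is only an illustration under stated assumptions, not the code that was benchmarked in this thread: it uses a plain `ByteBuffer` instead of a `MemorySegment` so that it runs on any JDK, and the class and method names (`StrlenInt`, `strlenInt`) are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class StrlenInt {
    /**
     * Returns the number of bytes between {@code start} and the first zero byte.
     * Hypothetical helper; a ByteBuffer stands in for the native memory segment.
     */
    static int strlenInt(ByteBuffer buf, int start) {
        // Read little-endian so that the byte at the lowest address is the lowest-order byte.
        ByteBuffer le = buf.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        int limit = le.limit() - start;
        int offset = 0;
        for (; offset <= limit - 4; offset += 4) {
            int v = le.getInt(start + offset);
            // Well-known zero-byte test: the result is non-zero iff some byte of v is 0x00.
            int zeroMask = (v - 0x01010101) & ~v & 0x80808080;
            if (zeroMask != 0) {
                // The lowest set bit marks the first zero byte (higher bits may be false positives).
                return offset + (Integer.numberOfTrailingZeros(zeroMask) >>> 3);
            }
        }
        // Tail: check the remaining (fewer than four) bytes one at a time.
        for (; offset < limit; offset++) {
            if (le.get(start + offset) == 0) {
                return offset;
            }
        }
        throw new IllegalArgumentException("No null terminator found");
    }

    public static void main(String[] args) {
        byte[] data = "hello\0junk".getBytes(StandardCharsets.US_ASCII);
        System.out.println(strlenInt(ByteBuffer.wrap(data), 0));
    }
}
```

Whether this beats a byte-at-a-time loop depends on what C2 does with each version, which is exactly what the benchmarks in this thread are trying to establish.
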
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That said, if you could please share the full string 
>>>>>>>>>>>>>> benchmark you have, that'd be helpful, so we can take a 
>>>>>>>>>>>>>> look at that, and see what's going wrong (ideally, C2 
>>>>>>>>>>>>>> should be the one doing unrolling).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 15/07/2021 11:28, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>> I just did a quick synthetic test on a "manually 
>>>>>>>>>>>>>>> unrolled" strlen() without any FUSE context.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I experimented with an implementation that looked like 
>>>>>>>>>>>>>>> the following and benchmarked it using a 259 byte memory 
>>>>>>>>>>>>>>> segment containing a 239 byte string (null byte at index 
>>>>>>>>>>>>>>> 240):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>> private static int strlenUnroll4(MemorySegment segment, long start) {
>>>>>>>>>>>>>>>     // Bytes available between `start` and the end of the segment.
>>>>>>>>>>>>>>>     long limit = segment.byteSize() - start;
>>>>>>>>>>>>>>>     int offset;
>>>>>>>>>>>>>>>     // Main loop: look at four bytes per iteration.
>>>>>>>>>>>>>>>     for (offset = 0; offset < limit - 3; offset += 4) {
>>>>>>>>>>>>>>>         byte b0 = MemoryAccess.getByteAtOffset(segment, start + offset + 0);
>>>>>>>>>>>>>>>         byte b1 = MemoryAccess.getByteAtOffset(segment, start + offset + 1);
>>>>>>>>>>>>>>>         byte b2 = MemoryAccess.getByteAtOffset(segment, start + offset + 2);
>>>>>>>>>>>>>>>         byte b3 = MemoryAccess.getByteAtOffset(segment, start + offset + 3);
>>>>>>>>>>>>>>>         if (b0 == 0 || b1 == 0 || b2 == 0 || b3 == 0) { // is this even faster than directly having 4 different branches?
>>>>>>>>>>>>>>>             if (b0 == 0) {
>>>>>>>>>>>>>>>                 return offset;
>>>>>>>>>>>>>>>             } else if (b1 == 0) {
>>>>>>>>>>>>>>>                 return offset + 1;
>>>>>>>>>>>>>>>             } else if (b2 == 0) {
>>>>>>>>>>>>>>>                 return offset + 2;
>>>>>>>>>>>>>>>             } else if (b3 == 0) {
>>>>>>>>>>>>>>>                 return offset + 3;
>>>>>>>>>>>>>>>             }
>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>     // Tail loop for the remaining <4 bytes.
>>>>>>>>>>>>>>>     while (offset < limit) { // TODO: maybe no loop required for the remaining <4 bytes?
>>>>>>>>>>>>>>>         byte b = MemoryAccess.getByteAtOffset(segment, start + offset);
>>>>>>>>>>>>>>>         if (b == 0) {
>>>>>>>>>>>>>>>             return offset;
>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>         offset++; // advance; without this the loop never ends on a non-zero byte
>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>     throw new IllegalArgumentException("String too large");
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm not even sure how reliable my results are, since I 
>>>>>>>>>>>>>>> have no clue about how branch prediction works here... 
>>>>>>>>>>>>>>> Neither have I tested the correctness of this 
>>>>>>>>>>>>>>> implementation.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 15. Jul 2021, at 12:18, Maurizio Cimadamore 
>>>>>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for reporting back.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We probably need to investigate this a bit more deeply 
>>>>>>>>>>>>>>>> and try and reproduce on our side.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> One last question: you said that with manual unrolling 
>>>>>>>>>>>>>>>> you managed to get 2x faster: did you mean that string 
>>>>>>>>>>>>>>>> conversion got 2x faster or that you actually saw your 
>>>>>>>>>>>>>>>> FUSE benchmark going 2x faster because of the manual 
>>>>>>>>>>>>>>>> unrolling with strings?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 15/07/2021 11:03, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>> That, surprisingly, didn't change anything either. But 
>>>>>>>>>>>>>>>>> don't worry too much, the performance isn't bad (in 
>>>>>>>>>>>>>>>>> absolute figures) and it is by far not the only reason 
>>>>>>>>>>>>>>>>> why I consider panama the best solution to create java 
>>>>>>>>>>>>>>>>> bindings for c libs.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 12. Jul 2021, at 15:33, Maurizio Cimadamore 
>>>>>>>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Actually, after some bisecting, I found out that 
>>>>>>>>>>>>>>>>>> converting a memory segment into a string became 2x 
>>>>>>>>>>>>>>>>>> faster with this fix:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> https://github.com/openjdk/jdk17/commit/2db9005c07585b580b3ec0889b8b5e3ed0d0ca6a
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Which was integrated after the one I originally 
>>>>>>>>>>>>>>>>>> pointed at. They both seem to touch loop optimization 
>>>>>>>>>>>>>>>>>> in case of overflows, which the strlen code is 
>>>>>>>>>>>>>>>>>> triggering (since the loop limit checks for loop 
>>>>>>>>>>>>>>>>>> variable being positive).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This is a simple patch which adds a string conversion 
>>>>>>>>>>>>>>>>>> test:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>> diff --git a/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java b/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>>>>>>>>> index ec4da5ffc88..5b3fb1a2b2a 100644
>>>>>>>>>>>>>>>>>> --- a/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>>>>>>>>> +++ b/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>>>>>>>>> @@ -93,10 +93,13 @@ public class StrLenTest {
>>>>>>>>>>>>>>>>>> FunctionDescriptor.ofVoid(C_POINTER).withAttribute(FunctionDescriptor.TRIVIAL_ATTRIBUTE_NAME, true));
>>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> +    MemorySegment segment;
>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>      @Setup
>>>>>>>>>>>>>>>>>>      public void setup() {
>>>>>>>>>>>>>>>>>>          str = makeString(size);
>>>>>>>>>>>>>>>>>>          segmentAllocator = SegmentAllocator.ofSegment(MemorySegment.allocateNative(size + 1, ResourceScope.newImplicitScope()));
>>>>>>>>>>>>>>>>>> +        segment = toCString(str, segmentAllocator);
>>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>      @TearDown
>>>>>>>>>>>>>>>>>> @@ -104,6 +107,11 @@ public class StrLenTest {
>>>>>>>>>>>>>>>>>>          scope.close();
>>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> +    @Benchmark
>>>>>>>>>>>>>>>>>> +    public String panama_str_conv() throws Throwable {
>>>>>>>>>>>>>>>>>> +        return CLinker.toJavaString(segment);
>>>>>>>>>>>>>>>>>> +    }
>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>      @Benchmark
>>>>>>>>>>>>>>>>>>      public int jni_strlen() throws Throwable {
>>>>>>>>>>>>>>>>>>          return strlen(str);
>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Before the above fix, the numbers are as follows:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>> Benchmark                   (size)  Mode  Cnt    Score   Error  Units
>>>>>>>>>>>>>>>>>> StrLenTest.panama_str_conv     100  avgt   30  106.613 ± 7.060  ns/op
>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> While after the fix I get this:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>> Benchmark                   (size)  Mode  Cnt   Score   Error  Units
>>>>>>>>>>>>>>>>>> StrLenTest.panama_str_conv     100  avgt   30  48.120 ± 0.557  ns/op
>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> So, as you can see, a pretty sizeable jump. 
>>>>>>>>>>>>>>>>>> Eyeballing, the shape of generated code doesn't look 
>>>>>>>>>>>>>>>>>> too different, which makes me think of another case 
>>>>>>>>>>>>>>>>>> where loop is unrolled, but main loop never executed 
>>>>>>>>>>>>>>>>>> (similar to JDK-8269230), but we'll need to look deeper.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 12/07/2021 14:12, Maurizio Cimadamore wrote:
>>>>>>>>>>>>>>>>>>> On 12/07/2021 13:18, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>>>>> Hey Maurizio,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> All tests have been done on commit 42e03fd7c6a (for 
>>>>>>>>>>>>>>>>>>>> details how I built the JDK, see my initial email). 
>>>>>>>>>>>>>>>>>>>> Maybe I'm missing some compiler flags to enable all 
>>>>>>>>>>>>>>>>>>>> optimizations?
>>>>>>>>>>>>>>>>>>> I see - you do have the latest panama changes, but 
>>>>>>>>>>>>>>>>>>> there has been a sync with upstream after that 
>>>>>>>>>>>>>>>>>>> changeset, I believe - can you please try to resync 
>>>>>>>>>>>>>>>>>>> with the latest foreign-jextract commit - which 
>>>>>>>>>>>>>>>>>>> should be:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> https://github.com/openjdk/panama-foreign/commit/b2a284f6678c6a0d78fbdc8655695119ccb0dadb
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Are you sure about loop vectorization being applied 
>>>>>>>>>>>>>>>>>>>> to strlen? I'm not an expert on this field, but I 
>>>>>>>>>>>>>>>>>>>> had the impression this wasn't possible when the 
>>>>>>>>>>>>>>>>>>>> loop terminates "from within".
>>>>>>>>>>>>>>>>>>> Vlad is the expert here - when chatting offline he 
>>>>>>>>>>>>>>>>>>> did mention that loop should have single exit - 
>>>>>>>>>>>>>>>>>>> which I guess also takes into account the "normal" 
>>>>>>>>>>>>>>>>>>> exit - so the strlen routine would seem to have two 
>>>>>>>>>>>>>>>>>>> exits...
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On 12. Jul 2021, at 13:50, Maurizio Cimadamore 
>>>>>>>>>>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Sebastian,
>>>>>>>>>>>>>>>>>>>>> thanks for sharing your findings - I've done some 
>>>>>>>>>>>>>>>>>>>>> attempts here with a targeted microbenchmark which 
>>>>>>>>>>>>>>>>>>>>> measures the performance of string conversion and 
>>>>>>>>>>>>>>>>>>>>> I'm seeing unrolling and vectorization being 
>>>>>>>>>>>>>>>>>>>>> applied on the strlen computation.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> May I ask if, by any chance, your HEAD has not 
>>>>>>>>>>>>>>>>>>>>> been updated in the last few weeks? There has been 
>>>>>>>>>>>>>>>>>>>>> a C2 optimization fix which has been added 
>>>>>>>>>>>>>>>>>>>>> recently, which I think might be related to this:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8269230
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Do you have this fix in the JDK you are using?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On 10/07/2021 15:58, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> good idea, but it makes no difference beyond 
>>>>>>>>>>>>>>>>>>>>>> statistical error.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I started sampling the application with VisualVM 
>>>>>>>>>>>>>>>>>>>>>> (which is quite hard, since native threads are 
>>>>>>>>>>>>>>>>>>>>>> extremely short-lived). What I noticed is that, 
>>>>>>>>>>>>>>>>>>>>>> regardless of where the sampler interrupts a 
>>>>>>>>>>>>>>>>>>>>>> thread, in nearly all cases 100% of CPU time is 
>>>>>>>>>>>>>>>>>>>>>> spent in 
>>>>>>>>>>>>>>>>>>>>>> jdk.internal.foreign.abi.SharedUtils.toJavaStringInternal() 
>>>>>>>>>>>>>>>>>>>>>> → jdk.internal.foreign.abi.SharedUtils.strlen().
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I know that strlen can hardly be optimized due to 
>>>>>>>>>>>>>>>>>>>>>> the nature of null termination, but maybe we can 
>>>>>>>>>>>>>>>>>>>>>> make use of the fact that we're dealing with 
>>>>>>>>>>>>>>>>>>>>>> MemorySegments here: Since they protect us from 
>>>>>>>>>>>>>>>>>>>>>> overflows, maybe there is no need to look at only 
>>>>>>>>>>>>>>>>>>>>>> a single byte at a time. Maybe the strlen()-loop 
>>>>>>>>>>>>>>>>>>>>>> can be unrolled or even be vectorized.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I just did a quick test and observed a x2 speedup 
>>>>>>>>>>>>>>>>>>>>>> when doing a x4 loop unroll.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On 9. Jul 2021, at 20:30, Jorn Vernee 
>>>>>>>>>>>>>>>>>>>>>>> <jorn.vernee at oracle.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi Sebastian,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks for testing this. Looking at your code, 
>>>>>>>>>>>>>>>>>>>>>>> one possible explanation for the discrepancy I 
>>>>>>>>>>>>>>>>>>>>>>> can think of is that the DirFiller ends up using 
>>>>>>>>>>>>>>>>>>>>>>> virtual downcalls to do its work, which are 
>>>>>>>>>>>>>>>>>>>>>>> currently not intrinsified. This is mostly a case 
>>>>>>>>>>>>>>>>>>>>>>> of 'not implemented yet', i.e. it is a known issue.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>      static fuse_fill_dir_t ofAddress(MemoryAddress addr) {
>>>>>>>>>>>>>>>>>>>>>>>          return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>>>>>>>>>              try {
>>>>>>>>>>>>>>>>>>>>>>>                  return (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, x0, x1, x2, x3); // <--------- 'addr' here is not a constant, so the call is virtual
>>>>>>>>>>>>>>>>>>>>>>>              } catch (Throwable ex$) {
>>>>>>>>>>>>>>>>>>>>>>>                  throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>>>>>>>>>              }
>>>>>>>>>>>>>>>>>>>>>>>          };
>>>>>>>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> For testing purposes, a possible workaround 
>>>>>>>>>>>>>>>>>>>>>>> could be to have a cache that maps the callback 
>>>>>>>>>>>>>>>>>>>>>>> address to a method handle that has the address 
>>>>>>>>>>>>>>>>>>>>>>> bound to the first parameter. Assuming readdir 
>>>>>>>>>>>>>>>>>>>>>>> always gets the same filler callback address, 
>>>>>>>>>>>>>>>>>>>>>>> the same MethodHandle will be reused and 
>>>>>>>>>>>>>>>>>>>>>>> eventually customized which means the callback 
>>>>>>>>>>>>>>>>>>>>>>> address will become constant, and the downcall 
>>>>>>>>>>>>>>>>>>>>>>> should then be intrinsified.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I don't currently have access to a Mac machine 
>>>>>>>>>>>>>>>>>>>>>>> to test this, but if you want to try it out, the 
>>>>>>>>>>>>>>>>>>>>>>> patch should be this:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>> diff --git a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>>>>>>>>> index bfd4655..4c68d4c 100644
>>>>>>>>>>>>>>>>>>>>>>> --- a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>>>>>>>>> +++ b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>>>>>>>>> @@ -3,8 +3,12 @@
>>>>>>>>>>>>>>>>>>>>>>>   package de.skymatic.fusepanama.lowlevel;
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>   import java.lang.invoke.MethodHandle;
>>>>>>>>>>>>>>>>>>>>>>> +import java.lang.invoke.MethodHandles;
>>>>>>>>>>>>>>>>>>>>>>>   import java.lang.invoke.VarHandle;
>>>>>>>>>>>>>>>>>>>>>>>   import java.nio.ByteOrder;
>>>>>>>>>>>>>>>>>>>>>>> +import java.util.Map;
>>>>>>>>>>>>>>>>>>>>>>> +import java.util.concurrent.ConcurrentHashMap;
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>>   import jdk.incubator.foreign.*;
>>>>>>>>>>>>>>>>>>>>>>>   import static jdk.incubator.foreign.CLinker.*;
>>>>>>>>>>>>>>>>>>>>>>>   public interface fuse_fill_dir_t {
>>>>>>>>>>>>>>>>>>>>>>> @@ -17,13 +21,19 @@ public interface fuse_fill_dir_t {
>>>>>>>>>>>>>>>>>>>>>>>           return RuntimeHelper.upcallStub(fuse_fill_dir_t.class, fi, constants$0.fuse_fill_dir_t$FUNC, "(Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;J)I", scope);
>>>>>>>>>>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>>>>>>>>>>       static fuse_fill_dir_t ofAddress(MemoryAddress addr) {
>>>>>>>>>>>>>>>>>>>>>>> -        return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>>>>>>>>> -            try {
>>>>>>>>>>>>>>>>>>>>>>> -                return (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, x0, x1, x2, x3);
>>>>>>>>>>>>>>>>>>>>>>> -            } catch (Throwable ex$) {
>>>>>>>>>>>>>>>>>>>>>>> -                throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>>>>>>>>> -            }
>>>>>>>>>>>>>>>>>>>>>>> -        };
>>>>>>>>>>>>>>>>>>>>>>> +        class CacheHolder {
>>>>>>>>>>>>>>>>>>>>>>> +            static final Map<MemoryAddress, fuse_fill_dir_t> CACHE = new ConcurrentHashMap<>();
>>>>>>>>>>>>>>>>>>>>>>> +        }
>>>>>>>>>>>>>>>>>>>>>>> +        return CacheHolder.CACHE.computeIfAbsent(addr, addrK -> {
>>>>>>>>>>>>>>>>>>>>>>> +            final MethodHandle target = MethodHandles.insertArguments(constants$0.fuse_fill_dir_t$MH, 0, addrK);
>>>>>>>>>>>>>>>>>>>>>>> +            return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>>>>>>>>> +                try {
>>>>>>>>>>>>>>>>>>>>>>> +                    return (int)target.invokeExact(x0, x1, x2, x3);
>>>>>>>>>>>>>>>>>>>>>>> +                } catch (Throwable ex$) {
>>>>>>>>>>>>>>>>>>>>>>> +                    throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>>>>>>>>> +                }
>>>>>>>>>>>>>>>>>>>>>>> +            };
>>>>>>>>>>>>>>>>>>>>>>> +        });
>>>>>>>>>>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>> (I hope these code blocks don't get mangled too 
>>>>>>>>>>>>>>>>>>>>>>> much by line wrapping)
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> HTH,
>>>>>>>>>>>>>>>>>>>>>>> Jorn
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On 09/07/2021 10:58, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I wanted to share the results of a benchmark 
>>>>>>>>>>>>>>>>>>>>>>>> test that includes several down- and upcalls. 
>>>>>>>>>>>>>>>>>>>>>>>> First, let me explain what I'm testing here:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I'm working on a panama-based FUSE binding, 
>>>>>>>>>>>>>>>>>>>>>>>> mostly for experimental purposes right now, and 
>>>>>>>>>>>>>>>>>>>>>>>> I'm trying to beat fuse-jnr [1].
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> While there are some other interesting metrics, 
>>>>>>>>>>>>>>>>>>>>>>>> such as read/write performance (both 
>>>>>>>>>>>>>>>>>>>>>>>> sequential and random access), I focused on 
>>>>>>>>>>>>>>>>>>>>>>>> directory listings for now. Directory listings 
>>>>>>>>>>>>>>>>>>>>>>>> are the most complex operation with regard to 
>>>>>>>>>>>>>>>>>>>>>>>> the number of down- and upcalls:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> 1. FUSE upcalls readdir and provides a callback 
>>>>>>>>>>>>>>>>>>>>>>>> function
>>>>>>>>>>>>>>>>>>>>>>>> 2. Java downcalls the callback for each item in 
>>>>>>>>>>>>>>>>>>>>>>>> the directory
>>>>>>>>>>>>>>>>>>>>>>>> 3. FUSE upcalls getattr for each item (no 
>>>>>>>>>>>>>>>>>>>>>>>> longer required with "readdirplus" in FUSE 3.x)
>>>>>>>>>>>>>>>>>>>>>>>> (4. I'm testing on macOS, which introduces 
>>>>>>>>>>>>>>>>>>>>>>>> additional noise (such as readxattr and trying 
>>>>>>>>>>>>>>>>>>>>>>>> to access files that I didn't report in readdir))
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> So, what I'm testing is essentially this: 
>>>>>>>>>>>>>>>>>>>>>>>> `Files.list(Path.of("/Volumes/foo")).close();` 
>>>>>>>>>>>>>>>>>>>>>>>> with the volume reporting eight files [2]. When 
>>>>>>>>>>>>>>>>>>>>>>>> mounting with debug logs enabled, I can see 
>>>>>>>>>>>>>>>>>>>>>>>> that the exact same operations in the same 
>>>>>>>>>>>>>>>>>>>>>>>> order are invoked on both fuse-jnr and 
>>>>>>>>>>>>>>>>>>>>>>>> fuse-panama. One single dir listing results in 
>>>>>>>>>>>>>>>>>>>>>>>> 2 readdir upcalls, 10 callback downcalls, 16 
>>>>>>>>>>>>>>>>>>>>>>>> getattr upcalls. There are also 8 getxattr 
>>>>>>>>>>>>>>>>>>>>>>>> calls and 16 lookup calls; however, they don't 
>>>>>>>>>>>>>>>>>>>>>>>> reach Java, as the FUSE kernel knows they are 
>>>>>>>>>>>>>>>>>>>>>>>> not implemented.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Long story short, here are the results:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>> Benchmark                        Mode  Cnt    Score   Error  Units
>>>>>>>>>>>>>>>>>>>>>>>> BenchmarkTest.testListDirJnr     avgt    5   66,569 ± 3,128  us/op
>>>>>>>>>>>>>>>>>>>>>>>> BenchmarkTest.testListDirPanama  avgt    5  189,340 ± 4,275  us/op
>>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I've been using panama snapshot at commit 
>>>>>>>>>>>>>>>>>>>>>>>> 42e03fd7c6a built with: `configure 
>>>>>>>>>>>>>>>>>>>>>>>> --with-boot-jdk=/Library/Java/JavaVirtualMachines/adoptopenjdk-16.jdk/Contents/Home/ 
>>>>>>>>>>>>>>>>>>>>>>>> --with-native-debug-symbols=none 
>>>>>>>>>>>>>>>>>>>>>>>> --with-debug-level=release 
>>>>>>>>>>>>>>>>>>>>>>>> --with-libclang=/usr/local/opt/llvm 
>>>>>>>>>>>>>>>>>>>>>>>> --with-libclang-version=12`
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I can't tell where this overhead comes from. 
>>>>>>>>>>>>>>>>>>>>>>>> Maybe creating a newConfinedScope() during each 
>>>>>>>>>>>>>>>>>>>>>>>> upcall [3] is "too much"? Maybe JNR is just 
>>>>>>>>>>>>>>>>>>>>>>>> negligently skipping some memory boundary 
>>>>>>>>>>>>>>>>>>>>>>>> checks to be faster. The results are not 
>>>>>>>>>>>>>>>>>>>>>>>> terrible, but I'd hoped for something better.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> [1] https://github.com/SerCeMan/jnr-fuse
>>>>>>>>>>>>>>>>>>>>>>>> [2] https://github.com/skymatic/fuse-panama/blob/develop/src/test/java/de/skymatic/fusepanama/examples/HelloPanamaFileSystem.java#L139-L146
>>>>>>>>>>>>>>>>>>>>>>>> [3] https://github.com/skymatic/fuse-panama/blob/769347575863861063a2347a42b2cbaadb5eacef/src/main/java/de/skymatic/fusepanama/FuseOperations.java#L67-L71
>>>>
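
The workaround described in the quoted message from Jorn (a cache that maps the callback address to a method handle with that address bound as the first parameter) can be sketched without any Panama API at all. The following is only an illustration under stated assumptions: `BoundHandleCache`, `fill`, `forAddress` and `invoke` are hypothetical names, the callback address is modelled as a plain `long`, and an ordinary static Java method stands in for the native downcall handle. It shows the caching and binding pattern only; whether the JIT then treats the bound address as a constant depends on the JVM and is not demonstrated here.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class BoundHandleCache {

    // Hypothetical stand-in for the native filler callback. A real binding
    // would use the jextract-generated downcall handle instead.
    public static int fill(long callbackAddress, String name, long offset) {
        return (int) (callbackAddress + name.length() + offset);
    }

    // Generic handle of type (long,String,long)int; the first parameter
    // plays the role of the varying callback address.
    static final MethodHandle GENERIC;
    static {
        try {
            GENERIC = MethodHandles.lookup().findStatic(
                    BoundHandleCache.class, "fill",
                    MethodType.methodType(int.class, long.class, String.class, long.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // One bound handle per address, created lazily and reused afterwards.
    static final Map<Long, MethodHandle> CACHE = new ConcurrentHashMap<>();

    // Returns a handle of type (String,long)int with the address pre-bound.
    public static MethodHandle forAddress(long address) {
        return CACHE.computeIfAbsent(address,
                a -> MethodHandles.insertArguments(GENERIC, 0, a));
    }

    // Invokes the cached, bound handle for the given address.
    public static int invoke(long address, String name, long offset) {
        try {
            return (int) forAddress(address).invokeExact(name, offset);
        } catch (Throwable t) {
            throw new AssertionError("should not reach here", t);
        }
    }

    public static void main(String[] args) {
        System.out.println(invoke(1000L, "foo.txt", 1L));            // prints 1008
        System.out.println(forAddress(1000L) == forAddress(1000L));  // prints true
    }
}
```

As in the quoted patch, the point of the cache is that repeated calls with the same address get the very same `MethodHandle` instance back, rather than a fresh lambda capturing a non-constant address each time.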


More information about the panama-dev mailing list