Real-Life Benchmark for FUSE's readdir()

Thu Jul 15 21:05:38 UTC 2021

I managed to do an initial benchmark pass with JMH. The numbers don't 
look great:

```
Benchmark                        Mode  Cnt   Score    Error  Units
BenchmarkTest.testListDirJnr     avgt    5   9.858 ±  0.702  us/op
BenchmarkTest.testListDirPanama  avgt    5  38.008 ± 17.573  us/op
```

On my machine the Panama FUSE binding seems about 4x slower than the JNR 
one (assuming I got the implementation right, that is :-)).

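A quick look at GC reveals that the Panama allocation rate is about 4x 
_lower_ than JNR's, while the normalized allocation per operation is the 
same (about 368 B/op), so the lower rate mostly reflects the lower 
throughput: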

```
Benchmark                                            Mode  Cnt    Score    Error   Units
BenchmarkTest.testListDirJnr:·gc.alloc.rate          avgt    5   33.255 ±  1.584  MB/sec
BenchmarkTest.testListDirJnr:·gc.alloc.rate.norm     avgt    5  368.033 ±  0.047    B/op
BenchmarkTest.testListDirJnr:·gc.count               avgt    5    6.000           counts
BenchmarkTest.testListDirJnr:·gc.time                avgt    5    9.000               ms
BenchmarkTest.testListDirPanama:·gc.alloc.rate       avgt    5    8.709 ±  3.887  MB/sec
BenchmarkTest.testListDirPanama:·gc.alloc.rate.norm  avgt    5  368.046 ±  0.236    B/op
BenchmarkTest.testListDirPanama:·gc.count            avgt    5    2.000           counts
BenchmarkTest.testListDirPanama:·gc.time             avgt    5    3.000               ms
```
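
As a sanity check on how to read these numbers: with roughly 368 B/op in both cases, the allocation rate should scale with the number of operations per second. A minimal sketch of that arithmetic, assuming a single benchmark thread and that "MB" means 2^20 bytes (both are assumptions on my side; the class and method names are made up for illustration):

```java
public class AllocRateCheck {
    // Expected allocation rate in MB/sec for one benchmark thread,
    // given bytes allocated per operation and average time per operation.
    // Assumption: 1 MB = 1024 * 1024 bytes.
    static double expectedAllocRateMb(double bytesPerOp, double microsPerOp) {
        double opsPerSecond = 1_000_000.0 / microsPerOp;
        return bytesPerOp * opsPerSecond / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // Scores taken from the two JMH tables above.
        double jnr = expectedAllocRateMb(368.033, 9.858);
        double panama = expectedAllocRateMb(368.046, 38.008);
        System.out.printf("JNR:    %.1f MB/sec%n", jnr);
        System.out.printf("Panama: %.1f MB/sec%n", panama);
        System.out.printf("ratio:  %.2f%n", jnr / panama);
    }
}
```

The predicted rates land in the same ballpark as the measured `gc.alloc.rate` values, which suggests the difference in allocation rate is a consequence of the slower operations rather than of different allocation behaviour.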

And, looking with perfasm, the distribution of the various methods looks 
similar, and actually not a lot of time is spent in Java at all:

JNR:

```
....[Hottest Methods (after inlining)]..............................................................
  90.86%              kernel  [unknown]
   2.13%         c2, level 4  java.nio.file.Files::list, version 913
   1.44%         c2, level 4  de.skymatic.fusepanama.jmh_generated.BenchmarkTest_testListDirJnr_jmhTest::testListDirJnr_avgt_jmhStub, version 934
   0.81%    Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::open0, version 857
   0.70%        libc-2.31.so  __close_nocancel
   0.48%        libc-2.31.so  malloc
   0.39%        libc-2.31.so  _int_free
   0.32%        libc-2.31.so  _int_malloc
   0.29%    Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::close0, version 879
   0.22%        libc-2.31.so  __close
   0.21%        libc-2.31.so  __GI___libc_open
   0.19%    Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::fdopendir, version 859
   0.16%        libc-2.31.so  __libc_enable_asynccancel
   0.16%        libc-2.31.so  __GI___dup
   0.15%        libc-2.31.so  __fxstat64
   0.14%    Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::closedir, version 878
   0.14%           libnio.so  Java_sun_nio_fs_UnixNativeDispatcher_open0
   0.13%        libc-2.31.so  __libc_disable_asynccancel
   0.13%           libnio.so  Java_sun_nio_fs_UnixNativeDispatcher_fdopendir
   0.13%           libnio.so  Java_sun_nio_fs_UnixNativeDispatcher_close0
   0.84%  <...other 69 warm methods...>
```

Panama:

```
....[Hottest Methods (after inlining)]..............................................................
  89.07%              kernel  [unknown]
   2.35%         c2, level 4  java.nio.file.Files::list, version 890
   1.45%         c2, level 4  de.skymatic.fusepanama.jmh_generated.BenchmarkTest_testListDirPanama_jmhTest::testListDirPanama_avgt_jmhStub, version 917
   1.21%    Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::open0, version 849
   0.62%        libc-2.31.so  malloc
   0.60%        libc-2.31.so  __close_nocancel
   0.47%        libc-2.31.so  _int_free
   0.46%        libc-2.31.so  _int_malloc
   0.41%    Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::close0, version 869
   0.32%        libc-2.31.so  __close
   0.32%        libc-2.31.so  __GI___libc_open
   0.25%    Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::fdopendir, version 851
   0.23%           libnio.so  Java_sun_nio_fs_UnixNativeDispatcher_open0
   0.20%    Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::closedir, version 868
   0.17%        libc-2.31.so  __GI___dup
   0.15%        libc-2.31.so  __alloc_dir
   0.15%           libnio.so  Java_sun_nio_fs_UnixNativeDispatcher_fdopendir
   0.15%        libc-2.31.so  __libc_disable_asynccancel
   0.13%        libc-2.31.so  __libc_enable_asynccancel
   0.12%           libnio.so  Java_sun_nio_fs_UnixNativeDispatcher_close0
   1.17%  <...other 81 warm methods...>
```
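
For context, the Java frames that show up in both profiles (`Files::list`, `open0`, `fdopendir`, `closedir`, `close0`) are what one would expect from a benchmark body that opens a directory stream, drains it and closes it again. A minimal sketch of such a body, assuming the benchmark does roughly this (the class and method names are hypothetical and not taken from the actual `BenchmarkTest`):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class ListDirSketch {
    // Opens the directory, counts its entries and closes the stream again.
    // try-with-resources makes sure the underlying directory handle is released.
    static long countEntries(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.count();
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the mount point as first argument; defaults to the current directory.
        Path dir = Path.of(args.length > 0 ? args[0] : ".");
        System.out.println(countEntries(dir) + " entries in " + dir);
    }
}
```

Each call goes through the kernel and, for a FUSE mount, back into the user-space file system, which would be consistent with most samples being attributed to the kernel.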

So, no smoking gun so far. I'll keep looking.

Maurizio

On 15/07/2021 19:14, Maurizio Cimadamore wrote:
> Ok, I got it working - what fixed it for me was that the offsets in 
> the "filler" calls have to be set to zero, which, looking around, seems 
> to be a default value that leaves it to the kernel to take care of them.
>
> ```
> $ ls Volumes/foo
>
> total 0
> -r--r--r-- 1 root root  0 Jan  1  1970 aaa
> -r--r--r-- 1 root root  0 Jan  1  1970 bbb
> -r--r--r-- 1 root root  0 Jan  1  1970 ccc
> -r--r--r-- 1 root root  0 Jan  1  1970 ddd
> -r--r--r-- 1 root root 13 Jan  1  1970 hello.txt
> -r--r--r-- 1 root root  0 Jan  1  1970 xxx
> -r--r--r-- 1 root root  0 Jan  1  1970 yyy
> -r--r--r-- 1 root root  0 Jan  1  1970 zzz
> ```
>
> Something is probably not right (see "total 0" on top), but it should 
> perhaps be good enough for the benchmark. Applying the same fix to the 
> JNR version also fixes that (and results in the same output). Hopefully 
> I'm good to go now :-)
>
> Maurizio
>
> On 15/07/2021 18:15, Maurizio Cimadamore wrote:
>> I've re-extracted on Linux, and it seems more lively now. I had to 
>> fix a couple of type mismatches (e.g. long vs. int and int vs. short) 
>> in places, and also some of the fields in the "stat" structure are 
>> different, so the code won't compile as is.
>>
>> After fixing these minor issues, I see a lot more output printed when 
>> I mount, and when I do an ls, I see the following lines reported:
>>
>> [Thread-110] DEBUG de.skymatic.fusepanama.examples.HelloPanamaFileSystem - getattr() /
>> [Thread-111] DEBUG de.skymatic.fusepanama.examples.HelloPanamaFileSystem - opendir() /
>> [Thread-112] DEBUG de.skymatic.fusepanama.examples.HelloPanamaFileSystem - readdir() /
>> [Thread-113] DEBUG de.skymatic.fusepanama.examples.HelloPanamaFileSystem - releasedir() /
>>
>> But that's pretty much it, and in the terminal I still get the 
>> input/output error. I can even debug, but I can't see much of what's 
>> going wrong (and I'm not familiar with this API) - the Java code 
>> executes fine, for what it's worth.
>>
>> Maurizio
>>
>> On 15/07/2021 17:51, Sebastian Stenzel wrote:
>>> I'll fix it for Linux and let you know!
>>>
>>>> On 15. Jul 2021, at 18:04, Maurizio Cimadamore 
>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>
>>>>
>>>> On 15/07/2021 17:01, Sebastian Stenzel wrote:
>>>>> Yeah well they're pretty much mac-specific. On macOS, FUSE has 
>>>>> this magic behaviour where you can tell it to mount to a 
>>>>> non-existing mountpoint inside of `/Volumes/...` and it'll just 
>>>>> create these (and destroy them on unmount). I believe on Linux you 
>>>>> need to define an _existing_ mount point. But it is surely 
>>>>> possible that the volume isn't working yet on Linux. I'll give it 
>>>>> a try myself and fix it, if required.
>>>>
>>>> I created a folder Volumes under my home folder and am trying to use 
>>>> that as a mount point, which seems to work ok with JNR.
>>>>>
>>>>> The benchmark then needs to be adjusted for the two mountpoints 
>>>>> respectively. But before the benchmark can actually do anything, a 
>>>>> plain `ls` on the terminal needs to work.
>>>>
>>>> Ok, it seems even JNR fails:
>>>>
>>>> ```
>>>> $ ls Volumes/bar/
>>>> ls: reading directory 'Volumes/bar/': Input/output error
>>>> ```
>>>>
>>>>>
>>>>>> On 15. Jul 2021, at 17:55, Maurizio Cimadamore 
>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>
>>>>>> I tried to reproduce here on Linux, but with no luck - in the 
>>>>>> sense that I'm not super sure on how to run the benchmark.
>>>>>>
>>>>>> I'm able somehow to run the two examples - and I noted that here 
>>>>>> JNR works ok, while the Panama one doesn't seem to mount things 
>>>>>> correctly - a new mount appears on my file explorer, but I'm 
>>>>>> unable to do anything with it (even unmount - which can only be 
>>>>>> done at sudo level).
>>>>>>
>>>>>> When working with the JNR support, the mount works fine, it shows 
>>>>>> in the file explorer, I can click on that location and browse, 
>>>>>> and then unmount from there. Everything works.
>>>>>>
>>>>>> That said, the benchmarks require the mount points to be up and 
>>>>>> running - so I've tried first to execute the example (e.g. JNR) 
>>>>>> and then run the benchmark in two separate terminal windows, all 
>>>>>> via Maven - but the benchmark doesn't seem to do anything (I've 
>>>>>> uncommented the benchmarks of course).
>>>>>>
>>>>>> How do you run them?
>>>>>>
>>>>>> Thanks
>>>>>> Maurizio
>>>>>>
>>>>>> On 15/07/2021 13:40, Maurizio Cimadamore wrote:
>>>>>>> I believe that it would be more useful to try to run the perfasm 
>>>>>>> profiler with JMH.
>>>>>>>
>>>>>>> This can be done relatively easily, at least on Linux, if you 
>>>>>>> pass the argument `-prof perfasm` to JMH (this needs 
>>>>>>> hsdis-amd64.so on Linux to print readable assembly).
>>>>>>>
>>>>>>> Another thing worth checking is allocation rate: `-prof gc`.
>>>>>>>
>>>>>>> Maurizio
>>>>>>>
>>>>>>> On 15/07/2021 12:30, Sebastian Stenzel wrote:
>>>>>>>> Ok it really seems like VisualVM can't deal with these kinds of 
>>>>>>>> tasks yet. Now it reports the String constructor being the 
>>>>>>>> culprit [1], however I strongly doubt that, since this is 
>>>>>>>> probably one of the most heavily optimized parts of the JDK.
>>>>>>>>
>>>>>>>> [1]: Screenshot on https://imgur.com/a/SHG8RSQ
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 15. Jul 2021, at 13:11, Sebastian Stenzel 
>>>>>>>>> <sebastian.stenzel at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Yes, must be a sampling error. Do you know of a (publicly 
>>>>>>>>> available) _profiler_ that is compatible with JDK 17 / 18 
>>>>>>>>> already?
>>>>>>>>>
>>>>>>>>>> On 15. Jul 2021, at 13:08, Maurizio Cimadamore 
>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Aha - it seems like you are seeing what I was seeing: 
>>>>>>>>>> unrolling now seems to happen more reliably, which positively 
>>>>>>>>>> affects code like strlen.
>>>>>>>>>>
>>>>>>>>>> As for FUSE, I think the reason for the difference has 
>>>>>>>>>> probably nothing to do with string conversion - the sampler 
>>>>>>>>>> profiler just happens to hit that code a lot. I checked JNR 
>>>>>>>>>> code for string conversion and I couldn't really find 
>>>>>>>>>> anything uber optimized in that regard that could explain the 
>>>>>>>>>> gap.
>>>>>>>>>>
>>>>>>>>>> Probably something is not getting optimized as it should - 
>>>>>>>>>> likely a downcall/upcall intrinsification is failing - maybe 
>>>>>>>>>> due to a subtle issue with your code, or, possibly because 
>>>>>>>>>> you are hitting a non-implemented case (e.g. we do not 
>>>>>>>>>> intrinsify calls which pass arguments on the stack, yet), or 
>>>>>>>>>> because of some other bug.
>>>>>>>>>>
>>>>>>>>>> Maurizio
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 15/07/2021 12:03, Sebastian Stenzel wrote:
>>>>>>>>>>> Wow, I stand corrected. I just re-ran the benchmark and 
>>>>>>>>>>> `benchmarkStrlenBase` just got a lot faster!! Your change in 
>>>>>>>>>>> https://github.com/openjdk/jdk17/commit/2db9005c07585b580b3ec0889b8b5e3ed0d0ca6a 
>>>>>>>>>>> DID have an effect after all.
>>>>>>>>>>>
>>>>>>>>>>> Just doesn't impress FUSE very much...
>>>>>>>>>>>
>>>>>>>>>>>> On 15. Jul 2021, at 13:00, Sebastian Stenzel 
>>>>>>>>>>>> <sebastian.stenzel at gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Yup, I tried the int-approach as well, but with worse 
>>>>>>>>>>>> results... Here is the full test: 
>>>>>>>>>>>> https://gist.github.com/overheadhunter/86e7baae7dfe47c49ff364590a4f3ea6
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On 15. Jul 2021, at 12:51, Maurizio Cimadamore 
>>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ok. Thanks.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I tried similar experiments where instead of reading 4 
>>>>>>>>>>>>> bytes separately I'd read a single int value, and then use 
>>>>>>>>>>>>> shifts and bitmasking to check for terminators. On paper 
>>>>>>>>>>>>> good, but benchmark results were always worse than the 
>>>>>>>>>>>>> version we have now (at least on Linux).
>>>>>>>>>>>>>
>>>>>>>>>>>>> That said, if you could please share the full string 
>>>>>>>>>>>>> benchmark you have, that'd be helpful, so we can take a 
>>>>>>>>>>>>> look at that, and see what's going wrong (ideally, C2 
>>>>>>>>>>>>> should be the one doing unrolling).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 15/07/2021 11:28, Sebastian Stenzel wrote:
>>>>>>>>>>>>>> I just did a quick synthetic test on a "manually 
>>>>>>>>>>>>>> unrolled" strlen() without any FUSE context.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I experimented with an implementation that looked like 
>>>>>>>>>>>>>> the following and benchmarked it using a 259 byte memory 
>>>>>>>>>>>>>> segment containing a 239 byte string (null byte at index 
>>>>>>>>>>>>>> 240):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>> private static int strlenUnroll4(MemorySegment segment, long start) {
>>>>>>>>>>>>>>     // Number of bytes available from 'start' to the end of the segment.
>>>>>>>>>>>>>>     long limit = segment.byteSize() - start;
>>>>>>>>>>>>>>     int offset;
>>>>>>>>>>>>>>     for (offset = 0; offset < limit - 3; offset += 4) {
>>>>>>>>>>>>>>         byte b0 = MemoryAccess.getByteAtOffset(segment, start + offset + 0);
>>>>>>>>>>>>>>         byte b1 = MemoryAccess.getByteAtOffset(segment, start + offset + 1);
>>>>>>>>>>>>>>         byte b2 = MemoryAccess.getByteAtOffset(segment, start + offset + 2);
>>>>>>>>>>>>>>         byte b3 = MemoryAccess.getByteAtOffset(segment, start + offset + 3);
>>>>>>>>>>>>>>         if (b0 == 0 || b1 == 0 || b2 == 0 || b3 == 0) { // is this even faster than directly having 4 different branches?
>>>>>>>>>>>>>>             if (b0 == 0) {
>>>>>>>>>>>>>>                 return offset;
>>>>>>>>>>>>>>             } else if (b1 == 0) {
>>>>>>>>>>>>>>                 return offset + 1;
>>>>>>>>>>>>>>             } else if (b2 == 0) {
>>>>>>>>>>>>>>                 return offset + 2;
>>>>>>>>>>>>>>             } else if (b3 == 0) {
>>>>>>>>>>>>>>                 return offset + 3;
>>>>>>>>>>>>>>             }
>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>     while (offset < limit) { // TODO: maybe no loop required for the remaining <4 bytes?
>>>>>>>>>>>>>>         byte b = MemoryAccess.getByteAtOffset(segment, start + offset);
>>>>>>>>>>>>>>         if (b == 0) {
>>>>>>>>>>>>>>             return offset;
>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>         offset++; // advance, otherwise this loop never terminates
>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>     throw new IllegalArgumentException("String too large");
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm not even sure how reliable my results are, since I 
>>>>>>>>>>>>>> have no clue about how branch prediction works here... 
>>>>>>>>>>>>>> Neither have I tested the correctness of this 
>>>>>>>>>>>>>> implementation.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 15. Jul 2021, at 12:18, Maurizio Cimadamore 
>>>>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for reporting back.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We probably need to investigate this a bit more deeply 
>>>>>>>>>>>>>>> and try and reproduce on our side.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> One last question: you said that with manual unrolling 
>>>>>>>>>>>>>>> you managed to get 2x faster: did you mean that string 
>>>>>>>>>>>>>>> conversion got 2x faster or that you actually saw your 
>>>>>>>>>>>>>>> FUSE benchmark going 2x faster because of the manual 
>>>>>>>>>>>>>>> unrolling with strings?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 15/07/2021 11:03, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>> That, surprisingly, didn't change anything either. But 
>>>>>>>>>>>>>>>> don't worry too much, the performance isn't bad (in 
>>>>>>>>>>>>>>>> absolute figures) and it is by far not the only reason 
>>>>>>>>>>>>>>>> why I consider Panama the best solution to create Java 
>>>>>>>>>>>>>>>> bindings for C libs.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 12. Jul 2021, at 15:33, Maurizio Cimadamore 
>>>>>>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Actually, after some bisecting, I found out that the 
>>>>>>>>>>>>>>>>> performance of converting a memory segment into a 
>>>>>>>>>>>>>>>>> string jumped 2x faster with this fix:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://github.com/openjdk/jdk17/commit/2db9005c07585b580b3ec0889b8b5e3ed0d0ca6a
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Which was integrated after the one I originally 
>>>>>>>>>>>>>>>>> pointed at. They both seem to touch loop optimization 
>>>>>>>>>>>>>>>>> in case of overflows, which the strlen code is 
>>>>>>>>>>>>>>>>> triggering (since the loop limit checks for loop 
>>>>>>>>>>>>>>>>> variable being positive).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This is a simple patch which adds a string conversion 
>>>>>>>>>>>>>>>>> test:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>> diff --git a/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java b/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>>>>>>>> index ec4da5ffc88..5b3fb1a2b2a 100644
>>>>>>>>>>>>>>>>> --- a/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>>>>>>>> +++ b/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>>>>>>>> @@ -93,10 +93,13 @@ public class StrLenTest {
>>>>>>>>>>>>>>>>> FunctionDescriptor.ofVoid(C_POINTER).withAttribute(FunctionDescriptor.TRIVIAL_ATTRIBUTE_NAME, true));
>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> +    MemorySegment segment;
>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>      @Setup
>>>>>>>>>>>>>>>>>      public void setup() {
>>>>>>>>>>>>>>>>>          str = makeString(size);
>>>>>>>>>>>>>>>>>          segmentAllocator = SegmentAllocator.ofSegment(MemorySegment.allocateNative(size + 1, ResourceScope.newImplicitScope()));
>>>>>>>>>>>>>>>>> +        segment = toCString(str, segmentAllocator);
>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>      @TearDown
>>>>>>>>>>>>>>>>> @@ -104,6 +107,11 @@ public class StrLenTest {
>>>>>>>>>>>>>>>>>          scope.close();
>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> +    @Benchmark
>>>>>>>>>>>>>>>>> +    public String panama_str_conv() throws Throwable {
>>>>>>>>>>>>>>>>> +        return CLinker.toJavaString(segment);
>>>>>>>>>>>>>>>>> +    }
>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>      @Benchmark
>>>>>>>>>>>>>>>>>      public int jni_strlen() throws Throwable {
>>>>>>>>>>>>>>>>>          return strlen(str);
>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Before the above fix, the numbers are as follows:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>> Benchmark                   (size)  Mode  Cnt    Score   Error  Units
>>>>>>>>>>>>>>>>> StrLenTest.panama_str_conv     100  avgt   30  106.613 ± 7.060  ns/op
>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> While after the fix I get this:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>> Benchmark                   (size)  Mode  Cnt   Score   Error  Units
>>>>>>>>>>>>>>>>> StrLenTest.panama_str_conv     100  avgt   30  48.120 ± 0.557  ns/op
>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So, as you can see, a pretty sizeable jump. 
>>>>>>>>>>>>>>>>> Eyeballing, the shape of the generated code doesn't look 
>>>>>>>>>>>>>>>>> too different, which makes me think of another case 
>>>>>>>>>>>>>>>>> where the loop is unrolled but the main loop is never 
>>>>>>>>>>>>>>>>> executed (similar to JDK-8269230), but we'll need to 
>>>>>>>>>>>>>>>>> look deeper.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 12/07/2021 14:12, Maurizio Cimadamore wrote:
>>>>>>>>>>>>>>>>>> On 12/07/2021 13:18, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>>>> Hey Maurizio,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> All tests have been done on commit 42e03fd7c6a (for 
>>>>>>>>>>>>>>>>>>> details how I built the JDK, see my initial email). 
>>>>>>>>>>>>>>>>>>> Maybe I'm missing some compiler flags to enable all 
>>>>>>>>>>>>>>>>>>> optimizations?
>>>>>>>>>>>>>>>>>> I see - you do have the latest panama changes, but 
>>>>>>>>>>>>>>>>>> there has been a sync with upstream after that 
>>>>>>>>>>>>>>>>>> changeset, I believe - can you please try to resync 
>>>>>>>>>>>>>>>>>> with the latest foreign-jextract commit - which 
>>>>>>>>>>>>>>>>>> should be:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> https://github.com/openjdk/panama-foreign/commit/b2a284f6678c6a0d78fbdc8655695119ccb0dadb
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Are you sure about loop vectorization being applied 
>>>>>>>>>>>>>>>>>>> to strlen? I'm not an expert on this field, but I 
>>>>>>>>>>>>>>>>>>> had the impression this wasn't possible when the 
>>>>>>>>>>>>>>>>>>> loop terminates "from within".
>>>>>>>>>>>>>>>>>> Vlad is the expert here - when chatting offline he 
>>>>>>>>>>>>>>>>>> did mention that the loop should have a single exit - 
>>>>>>>>>>>>>>>>>> which I guess also takes into account the "normal" 
>>>>>>>>>>>>>>>>>> exit - so the strlen routine would seem to have two 
>>>>>>>>>>>>>>>>>> exits...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 12. Jul 2021, at 13:50, Maurizio Cimadamore 
>>>>>>>>>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Sebastian,
>>>>>>>>>>>>>>>>>>>> thanks for sharing your findings - I've done some 
>>>>>>>>>>>>>>>>>>>> attempts here with a targeted microbenchmark which 
>>>>>>>>>>>>>>>>>>>> measures the performance of string conversion and 
>>>>>>>>>>>>>>>>>>>> I'm seeing unrolling and vectorization being 
>>>>>>>>>>>>>>>>>>>> applied on the strlen computation.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> May I ask if, by any chance, your HEAD has not been 
>>>>>>>>>>>>>>>>>>>> updated in the last few weeks? There has been a C2 
>>>>>>>>>>>>>>>>>>>> optimization fix which has been added recently, 
>>>>>>>>>>>>>>>>>>>> which I think might be related to this:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8269230
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Do you have this fix in the JDK you are using?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 10/07/2021 15:58, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> good idea, but it makes no difference beyond 
>>>>>>>>>>>>>>>>>>>>> statistical error.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I started sampling the application with VisualVM 
>>>>>>>>>>>>>>>>>>>>> (which is quite hard, since native threads are 
>>>>>>>>>>>>>>>>>>>>> extremely short-lived). What I noticed is that, 
>>>>>>>>>>>>>>>>>>>>> regardless of where the sampler interrupts a 
>>>>>>>>>>>>>>>>>>>>> thread, in nearly all cases 100% of CPU time is 
>>>>>>>>>>>>>>>>>>>>> spent in 
>>>>>>>>>>>>>>>>>>>>> jdk.internal.foreign.abi.SharedUtils.toJavaStringInternal() 
>>>>>>>>>>>>>>>>>>>>> → jdk.internal.foreign.abi.SharedUtils.strlen().
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I know that strlen can hardly be optimized due to 
>>>>>>>>>>>>>>>>>>>>> the nature of null termination, but maybe we can 
>>>>>>>>>>>>>>>>>>>>> make use of the fact that we're dealing with 
>>>>>>>>>>>>>>>>>>>>> MemorySegments here: Since they protect us from 
>>>>>>>>>>>>>>>>>>>>> overflows, maybe there is no need to look at only 
>>>>>>>>>>>>>>>>>>>>> a single byte at a time. Maybe the strlen()-loop 
>>>>>>>>>>>>>>>>>>>>> can be unrolled or even be vectorized.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I just did a quick test and observed a x2 speedup 
>>>>>>>>>>>>>>>>>>>>> when doing a x4 loop unroll.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On 9. Jul 2021, at 20:30, Jorn Vernee 
>>>>>>>>>>>>>>>>>>>>>> <jorn.vernee at oracle.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Sebastian,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks for testing this. Looking at your code, 
>>>>>>>>>>>>>>>>>>>>>> one possible explanation for the discrepancy I 
>>>>>>>>>>>>>>>>>>>>>> can think of is that the DirFiller ends up using 
>>>>>>>>>>>>>>>>>>>>>> virtual downcalls to do its work, which are 
>>>>>>>>>>>>>>>>>>>>>> currently not intrinsified. This is mostly a case 
>>>>>>>>>>>>>>>>>>>>>> of 'not implemented yet', i.e. it is a known issue.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>      static fuse_fill_dir_t ofAddress(MemoryAddress addr) {
>>>>>>>>>>>>>>>>>>>>>>          return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>>>>>>>>              try {
>>>>>>>>>>>>>>>>>>>>>>                  return (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, x0, x1, x2, x3); // <--------- 'addr' here is not a constant, so the call is virtual
>>>>>>>>>>>>>>>>>>>>>>              } catch (Throwable ex$) {
>>>>>>>>>>>>>>>>>>>>>>                  throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>>>>>>>>              }
>>>>>>>>>>>>>>>>>>>>>>          };
>>>>>>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> For testing purposes, a possible workaround could 
>>>>>>>>>>>>>>>>>>>>>> be to have a cache that maps the callback address 
>>>>>>>>>>>>>>>>>>>>>> to a method handle that has the address bound to 
>>>>>>>>>>>>>>>>>>>>>> the first parameter. Assuming readdir always gets 
>>>>>>>>>>>>>>>>>>>>>> the same filler callback address, the same 
>>>>>>>>>>>>>>>>>>>>>> MethodHandle will be reused and eventually 
>>>>>>>>>>>>>>>>>>>>>> customized which means the callback address will 
>>>>>>>>>>>>>>>>>>>>>> become constant, and the downcall should then be 
>>>>>>>>>>>>>>>>>>>>>> intrinsified.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I don't currently have access to a Mac machine to 
>>>>>>>>>>>>>>>>>>>>>> test this, but if you want to try it out, the 
>>>>>>>>>>>>>>>>>>>>>> patch should be this:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>> diff --git a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>>>>>>>> index bfd4655..4c68d4c 100644
>>>>>>>>>>>>>>>>>>>>>> --- a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>>>>>>>> +++ b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>>>>>>>> @@ -3,8 +3,12 @@
>>>>>>>>>>>>>>>>>>>>>>   package de.skymatic.fusepanama.lowlevel;
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>   import java.lang.invoke.MethodHandle;
>>>>>>>>>>>>>>>>>>>>>> +import java.lang.invoke.MethodHandles;
>>>>>>>>>>>>>>>>>>>>>>   import java.lang.invoke.VarHandle;
>>>>>>>>>>>>>>>>>>>>>>   import java.nio.ByteOrder;
>>>>>>>>>>>>>>>>>>>>>> +import java.util.Map;
>>>>>>>>>>>>>>>>>>>>>> +import java.util.concurrent.ConcurrentHashMap;
>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>   import jdk.incubator.foreign.*;
>>>>>>>>>>>>>>>>>>>>>>   import static jdk.incubator.foreign.CLinker.*;
>>>>>>>>>>>>>>>>>>>>>>   public interface fuse_fill_dir_t {
>>>>>>>>>>>>>>>>>>>>>> @@ -17,13 +21,19 @@ public interface 
>>>>>>>>>>>>>>>>>>>>>> fuse_fill_dir_t {
>>>>>>>>>>>>>>>>>>>>>>           return 
>>>>>>>>>>>>>>>>>>>>>> RuntimeHelper.upcallStub(fuse_fill_dir_t.class, 
>>>>>>>>>>>>>>>>>>>>>> fi, constants$0.fuse_fill_dir_t$FUNC, 
>>>>>>>>>>>>>>>>>>>>>> "(Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;J)I", 
>>>>>>>>>>>>>>>>>>>>>> scope);
>>>>>>>>>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>>>>>>>>>       static fuse_fill_dir_t 
>>>>>>>>>>>>>>>>>>>>>> ofAddress(MemoryAddress addr) {
>>>>>>>>>>>>>>>>>>>>>> -        return 
>>>>>>>>>>>>>>>>>>>>>> (jdk.incubator.foreign.MemoryAddress x0, 
>>>>>>>>>>>>>>>>>>>>>> jdk.incubator.foreign.MemoryAddress x1, 
>>>>>>>>>>>>>>>>>>>>>> jdk.incubator.foreign.MemoryAddress x2, long x3) 
>>>>>>>>>>>>>>>>>>>>>> -> {
>>>>>>>>>>>>>>>>>>>>>> -            try {
>>>>>>>>>>>>>>>>>>>>>> -                return 
>>>>>>>>>>>>>>>>>>>>>> (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, 
>>>>>>>>>>>>>>>>>>>>>> x0, x1, x2, x3);
>>>>>>>>>>>>>>>>>>>>>> -            } catch (Throwable ex$) {
>>>>>>>>>>>>>>>>>>>>>> -                throw new AssertionError("should 
>>>>>>>>>>>>>>>>>>>>>> not reach here", ex$);
>>>>>>>>>>>>>>>>>>>>>> -            }
>>>>>>>>>>>>>>>>>>>>>> -        };
>>>>>>>>>>>>>>>>>>>>>> +        class CacheHolder {
>>>>>>>>>>>>>>>>>>>>>> +            static final Map<MemoryAddress, 
>>>>>>>>>>>>>>>>>>>>>> fuse_fill_dir_t> CACHE = new ConcurrentHashMap<>();
>>>>>>>>>>>>>>>>>>>>>> +        }
>>>>>>>>>>>>>>>>>>>>>> +        return 
>>>>>>>>>>>>>>>>>>>>>> CacheHolder.CACHE.computeIfAbsent(addr, addrK -> {
>>>>>>>>>>>>>>>>>>>>>> +            final MethodHandle target = 
>>>>>>>>>>>>>>>>>>>>>> MethodHandles.insertArguments(constants$0.fuse_fill_dir_t$MH, 
>>>>>>>>>>>>>>>>>>>>>> 0, addrK);
>>>>>>>>>>>>>>>>>>>>>> +            return 
>>>>>>>>>>>>>>>>>>>>>> (jdk.incubator.foreign.MemoryAddress x0, 
>>>>>>>>>>>>>>>>>>>>>> jdk.incubator.foreign.MemoryAddress x1, 
>>>>>>>>>>>>>>>>>>>>>> jdk.incubator.foreign.MemoryAddress x2, long x3) 
>>>>>>>>>>>>>>>>>>>>>> -> {
>>>>>>>>>>>>>>>>>>>>>> +                try {
>>>>>>>>>>>>>>>>>>>>>> +                    return 
>>>>>>>>>>>>>>>>>>>>>> (int)target.invokeExact(x0, x1, x2, x3);
>>>>>>>>>>>>>>>>>>>>>> +                } catch (Throwable ex$) {
>>>>>>>>>>>>>>>>>>>>>> +                    throw new 
>>>>>>>>>>>>>>>>>>>>>> AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>>>>>>>> +                }
>>>>>>>>>>>>>>>>>>>>>> +            };
>>>>>>>>>>>>>>>>>>>>>> +        });
>>>>>>>>>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>> (I hope these code blocks don't get mangled too 
>>>>>>>>>>>>>>>>>>>>>> much by line wrapping)
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> HTH,
>>>>>>>>>>>>>>>>>>>>>> Jorn
>>>>>>>>>>>>>>>>>>>>>>
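
For readers without a Panama build at hand, the caching idea in the patch above can be tried in plain Java. This is only a sketch under assumptions: `BoundHandleCache`, `Filler` and `fill` are made-up names, and an ordinary static method stands in for the native `fuse_fill_dir_t` downcall handle, with a `long` playing the role of the callback address. It shows the bind-once-and-cache pattern with `MethodHandles.insertArguments`, nothing more. It does not demonstrate whether the JIT customizes or intrinsifies anything.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class BoundHandleCache {

    // Hypothetical stand-in for the native callback: takes its "address" first.
    static int fill(long address, String name, long offset) {
        return (int) (address + name.length() + offset);
    }

    // Hypothetical functional interface, analogous to fuse_fill_dir_t.
    interface Filler {
        int apply(String name, long offset);
    }

    // Unbound handle of type (long, String, long)int.
    private static final MethodHandle FILL_MH;
    static {
        try {
            FILL_MH = MethodHandles.lookup().findStatic(
                    BoundHandleCache.class, "fill",
                    MethodType.methodType(int.class, long.class, String.class, long.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // One Filler per address, so repeated lookups reuse the same bound handle.
    private static final Map<Long, Filler> CACHE = new ConcurrentHashMap<>();

    static Filler ofAddress(long address) {
        return CACHE.computeIfAbsent(address, key -> {
            // Bind the address as argument 0; the result has type (String, long)int.
            final MethodHandle target = MethodHandles.insertArguments(FILL_MH, 0, key);
            return (name, offset) -> {
                try {
                    return (int) target.invokeExact(name, offset);
                } catch (Throwable t) {
                    throw new AssertionError("should not reach here", t);
                }
            };
        });
    }

    public static void main(String[] args) {
        Filler a = ofAddress(100L);
        Filler b = ofAddress(100L);
        System.out.println(a == b);             // true: same cached instance
        System.out.println(a.apply("foo", 1L)); // 104 = 100 + 3 + 1
    }
}
```

Note that the map is never cleared, which is acceptable for a test but would need some eviction policy if many distinct addresses could show up.
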
>>>>>>>>>>>>>>>>>>>>>> On 09/07/2021 10:58, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I wanted to share the results of a benchmark 
>>>>>>>>>>>>>>>>>>>>>>> test that includes several down- and upcalls. 
>>>>>>>>>>>>>>>>>>>>>>> First, let me explain what I'm testing here:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I'm working on a panama-based FUSE binding, 
>>>>>>>>>>>>>>>>>>>>>>> mostly for experimental purposes right now, and 
>>>>>>>>>>>>>>>>>>>>>>> I'm trying to beat fuse-jnr [1].
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> While there are some other interesting metrics, 
>>>>>>>>>>>>>>>>>>>>>>> such as read/write performance (both sequential 
>>>>>>>>>>>>>>>>>>>>>>> and random access), I focused on directory 
>>>>>>>>>>>>>>>>>>>>>>> listings for now. Directory listings are the 
>>>>>>>>>>>>>>>>>>>>>>> most complex operation with regard to the 
>>>>>>>>>>>>>>>>>>>>>>> number of down- and upcalls:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> 1. FUSE upcalls readdir and provides a callback 
>>>>>>>>>>>>>>>>>>>>>>> function
>>>>>>>>>>>>>>>>>>>>>>> 2. Java downcalls the callback for each item in 
>>>>>>>>>>>>>>>>>>>>>>> the directory
>>>>>>>>>>>>>>>>>>>>>>> 3. FUSE upcalls getattr for each item (no longer 
>>>>>>>>>>>>>>>>>>>>>>> required with "readdirplus" in FUSE 3.x)
>>>>>>>>>>>>>>>>>>>>>>> (4. I'm testing on macOS, which introduces 
>>>>>>>>>>>>>>>>>>>>>>> additional noise (such as readxattr and trying 
>>>>>>>>>>>>>>>>>>>>>>> to access files that I didn't report in readdir))
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> So, what I'm testing is essentially this: 
>>>>>>>>>>>>>>>>>>>>>>> `Files.list(Path.of("/Volumes/foo")).close();` 
>>>>>>>>>>>>>>>>>>>>>>> with the volume reporting eight files [2]. When 
>>>>>>>>>>>>>>>>>>>>>>> mounting with debug logs enabled, I can see that 
>>>>>>>>>>>>>>>>>>>>>>> the exact same operations in the same order are 
>>>>>>>>>>>>>>>>>>>>>>> invoked on both fuse-jnr and fuse-panama. One 
>>>>>>>>>>>>>>>>>>>>>>> single dir listing results in 2 readdir upcalls, 
>>>>>>>>>>>>>>>>>>>>>>> 10 callback downcalls, 16 getattr upcalls. There 
>>>>>>>>>>>>>>>>>>>>>>> are also 8 getxattr calls and 16 lookup calls, 
>>>>>>>>>>>>>>>>>>>>>>> however they don't reach Java, as the FUSE 
>>>>>>>>>>>>>>>>>>>>>>> kernel knows they are not implemented.
>>>>>>>>>>>>>>>>>>>>>>>
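
The measured operation described above can be reproduced roughly without JMH. The following is a minimal sketch, not the benchmark from this thread: `ListDirTiming` and its methods are made-up names, it lists a temporary directory on the local file system rather than a FUSE mount, and a hand-rolled `System.nanoTime()` loop is far less rigorous than JMH. To exercise a FUSE binding, the path would have to point at the mounted volume instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

final class ListDirTiming {

    // Creates a temporary directory holding `count` empty files.
    static Path createDirWithFiles(int count) throws IOException {
        Path dir = Files.createTempDirectory("listdir-timing");
        for (int i = 0; i < count; i++) {
            Files.createFile(dir.resolve("file" + i + ".txt"));
        }
        return dir;
    }

    // Counts directory entries, closing the stream afterwards.
    static long countEntries(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.count();
        }
    }

    // Average time in microseconds for opening and closing a directory listing.
    static double averageListMicros(Path dir, int iterations) throws IOException {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            Files.list(dir).close(); // the operation under test
        }
        return (System.nanoTime() - start) / 1_000.0 / iterations;
    }

    public static void main(String[] args) throws IOException {
        Path dir = createDirWithFiles(8);
        averageListMicros(dir, 10_000); // crude warm-up, result discarded
        System.out.printf("%d entries, avg %.3f us/op%n",
                countEntries(dir), averageListMicros(dir, 10_000));
    }
}
```
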
>>>>>>>>>>>>>>>>>>>>>>> Long story short, here are the results:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>> Benchmark                        Mode  Cnt    Score   Error  Units
>>>>>>>>>>>>>>>>>>>>>>> BenchmarkTest.testListDirJnr     avgt    5   66,569 ± 3,128  us/op
>>>>>>>>>>>>>>>>>>>>>>> BenchmarkTest.testListDirPanama  avgt    5  189,340 ± 4,275  us/op
>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I've been using panama snapshot at commit 
>>>>>>>>>>>>>>>>>>>>>>> 42e03fd7c6a built with: `configure 
>>>>>>>>>>>>>>>>>>>>>>> --with-boot-jdk=/Library/Java/JavaVirtualMachines/adoptopenjdk-16.jdk/Contents/Home/ 
>>>>>>>>>>>>>>>>>>>>>>> --with-native-debug-symbols=none 
>>>>>>>>>>>>>>>>>>>>>>> --with-debug-level=release 
>>>>>>>>>>>>>>>>>>>>>>> --with-libclang=/usr/local/opt/llvm 
>>>>>>>>>>>>>>>>>>>>>>> --with-libclang-version=12`
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I can't tell where this overhead comes from. 
>>>>>>>>>>>>>>>>>>>>>>> Maybe creating a newConfinedScope() during each 
>>>>>>>>>>>>>>>>>>>>>>> upcall [3] is "too much"? Maybe JNR is just 
>>>>>>>>>>>>>>>>>>>>>>> negligently skipping some memory boundary checks 
>>>>>>>>>>>>>>>>>>>>>>> to be faster. The results are not terrible, but 
>>>>>>>>>>>>>>>>>>>>>>> I'd hoped for something better.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> [1]https://urldefense.com/v3/__https://github.com/SerCeMan/jnr-fuse__;!!ACWV5N9M2RV99hQ!deSMwGndYEdMIZ2Fn4rLom81ulNtUdkK-4zBkp_0YUNnjKszGqmKu404ru2DZGfZKIAfyrY$ 
>>>>>>>>>>>>>>>>>>>>>>> [2]https://urldefense.com/v3/__https://github.com/skymatic/fuse-panama/blob/develop/src/test/java/de/skymatic/fusepanama/examples/HelloPanamaFileSystem.java*L139-L146__;Iw!!ACWV5N9M2RV99hQ!deSMwGndYEdMIZ2Fn4rLom81ulNtUdkK-4zBkp_0YUNnjKszGqmKu404ru2DZGfZrbEQfzQ$ 
>>>>>>>>>>>>>>>>>>>>>>> [3]https://urldefense.com/v3/__https://github.com/skymatic/fuse-panama/blob/769347575863861063a2347a42b2cbaadb5eacef/src/main/java/de/skymatic/fusepanama/FuseOperations.java*L67-L71__;Iw!!ACWV5N9M2RV99hQ!deSMwGndYEdMIZ2Fn4rLom81ulNtUdkK-4zBkp_0YUNnjKszGqmKu404ru2DZGfZ9Xy3UhQ$ 
>>>