Real-Life Benchmark for FUSE's readdir()

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Fri Jul 16 10:18:16 UTC 2021


Here are some observations, based on more testing I did overnight:

* the benchmark is relatively insensitive to intrinsification being 
applied or not - a lot of time is spent in the kernel anyway
* Panama has huge GC activity - a lot of garbage is generated compared to 
JNR - a lot of int[] arrays, it seems
* Tracking where the allocation is coming from is proving problematic, 
as execution originates in asynchronous threads, so most profilers have 
issues with that
* Removing support for the OPEN_DIR operation brings the allocation back 
under control, and the benchmark numbers back to normal (at least here)

The last bullet doesn't make a lot of sense, but that's what I'm seeing 
consistently. Note that it's not the _implementation_ of OPEN_DIR that 
is creating the garbage, or doing something strange - simply 
_registering_ an empty callback for OPEN_DIR will cause the GC issues 
and the slowdown.
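
When profilers struggle to attribute allocation to short-lived threads, one 
option might be to measure it directly around the callback body. The sketch 
below uses HotSpot's com.sun.management.ThreadMXBean; the class name 
ThreadAllocProbe and the `sink` field are made up for illustration, and 
whether the counters behave the same way on natively attached upcall threads 
is an assumption I have not verified.

```java
import java.lang.management.ManagementFactory;

final class ThreadAllocProbe {
    // HotSpot-specific extension of the standard ThreadMXBean.
    private static final com.sun.management.ThreadMXBean THREADS =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

    // Keeps allocations reachable so the JIT cannot optimize them away.
    static volatile Object sink;

    /** Returns the bytes allocated by the current thread while running the task. */
    static long allocatedBy(Runnable task) {
        long id = Thread.currentThread().getId();
        long before = THREADS.getThreadAllocatedBytes(id);
        task.run();
        return THREADS.getThreadAllocatedBytes(id) - before;
    }

    public static void main(String[] args) {
        // Stand-in for a callback body (e.g. a readdir or opendir handler).
        long bytes = allocatedBy(() -> sink = new int[1024]);
        System.out.println("allocated roughly " + bytes + " bytes");
    }
}
```

Wrapping each registered callback this way and logging the deltas could show 
which operation the int[] garbage is tied to, without relying on a sampling 
profiler.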

I'll look more into this.

Maurizio

On 15/07/2021 22:09, Maurizio Cimadamore wrote:
> Hah - hit "Send" too soon - of course the numbers below are almost 
> useless because the file system is set up on a separate JVM which the 
> benchmark cannot see.
>
> I'll try to change the benchmark to run the file system in the same 
> process.
>
> Maurizio
>
> On 15/07/2021 22:05, Maurizio Cimadamore wrote:
>> I managed to do an initial benchmark pass with JMH. The numbers don't 
>> look great:
>>
>> ```
>> Benchmark                        Mode  Cnt   Score    Error Units
>> BenchmarkTest.testListDirJnr     avgt    5   9.858 ±  0.702 us/op
>> BenchmarkTest.testListDirPanama  avgt    5  38.008 ± 17.573 us/op
>> ```
>>
>> On my machine the Panama fuse seems 4x slower than the JNR one 
>> (assuming I got the implementation right, that is :-)).
>>
>> A quick look at GC reveals that Panama allocates 4x _less_ memory 
>> than JNR:
>>
>> ```
>> Benchmark                                           Mode  Cnt    Score   Error   Units
>> BenchmarkTest.testListDirJnr:·gc.alloc.rate         avgt    5   33.255 ± 1.584  MB/sec
>> BenchmarkTest.testListDirJnr:·gc.alloc.rate.norm    avgt    5  368.033 ± 0.047    B/op
>> BenchmarkTest.testListDirJnr:·gc.count              avgt    5    6.000          counts
>> BenchmarkTest.testListDirJnr:·gc.time               avgt    5    9.000              ms
>> BenchmarkTest.testListDirPanama:·gc.alloc.rate      avgt    5    8.709 ± 3.887  MB/sec
>> BenchmarkTest.testListDirPanama:·gc.alloc.rate.norm avgt    5  368.046 ± 0.236    B/op
>> BenchmarkTest.testListDirPanama:·gc.count           avgt    5    2.000          counts
>> BenchmarkTest.testListDirPanama:·gc.time            avgt    5    3.000              ms
>> ```
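
A rough cross-check of the GC figures above: gc.alloc.rate.norm is about 
368 B/op in both runs, so the lower gc.alloc.rate for Panama may simply follow 
from fewer operations per second. The sketch below recomputes the rate from 
the per-op allocation and the average time per op. It assumes that JMH's 
"MB" means 2^20 bytes, which is my assumption; the class name AllocRate is 
made up.

```java
final class AllocRate {
    /** Converts bytes per operation and microseconds per operation to MiB per second. */
    static double mibPerSec(double bytesPerOp, double microsPerOp) {
        double bytesPerSec = bytesPerOp / (microsPerOp * 1e-6);
        return bytesPerSec / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // Scores copied from the two JMH tables in this message.
        System.out.printf("JNR:    %.1f MB/sec%n", mibPerSec(368.033, 9.858));
        System.out.printf("Panama: %.1f MB/sec%n", mibPerSec(368.046, 38.008));
    }
}
```

This gives roughly 35.6 and 9.2 MB/sec, in the same ballpark as the reported 
33.3 and 8.7, which would be consistent with both bindings allocating the same 
amount per operation in the benchmark JVM.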
>>
>> And, looking with perfasm, the distribution of the various methods 
>> looks similar, and actually not a lot of time is spent in Java at all:
>>
>> JNR:
>>
>> ```
>> ....[Hottest Methods (after inlining)]..............................................................
>>  90.86%              kernel  [unknown]
>>   2.13%         c2, level 4  java.nio.file.Files::list, version 913
>>   1.44%         c2, level 4  de.skymatic.fusepanama.jmh_generated.BenchmarkTest_testListDirJnr_jmhTest::testListDirJnr_avgt_jmhStub, version 934
>>   0.81%    Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::open0, version 857
>>   0.70%        libc-2.31.so  __close_nocancel
>>   0.48%        libc-2.31.so  malloc
>>   0.39%        libc-2.31.so  _int_free
>>   0.32%        libc-2.31.so  _int_malloc
>>   0.29%    Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::close0, version 879
>>   0.22%        libc-2.31.so  __close
>>   0.21%        libc-2.31.so  __GI___libc_open
>>   0.19%    Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::fdopendir, version 859
>>   0.16%        libc-2.31.so  __libc_enable_asynccancel
>>   0.16%        libc-2.31.so  __GI___dup
>>   0.15%        libc-2.31.so  __fxstat64
>>   0.14%    Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::closedir, version 878
>>   0.14%           libnio.so  Java_sun_nio_fs_UnixNativeDispatcher_open0
>>   0.13%        libc-2.31.so  __libc_disable_asynccancel
>>   0.13%           libnio.so  Java_sun_nio_fs_UnixNativeDispatcher_fdopendir
>>   0.13%           libnio.so  Java_sun_nio_fs_UnixNativeDispatcher_close0
>>   0.84%  <...other 69 warm methods...>
>> ```
>>
>> Panama:
>>
>> ```
>> ....[Hottest Methods (after inlining)]..............................................................
>>  89.07%              kernel  [unknown]
>>   2.35%         c2, level 4  java.nio.file.Files::list, version 890
>>   1.45%         c2, level 4  de.skymatic.fusepanama.jmh_generated.BenchmarkTest_testListDirPanama_jmhTest::testListDirPanama_avgt_jmhStub, version 917
>>   1.21%    Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::open0, version 849
>>   0.62%        libc-2.31.so  malloc
>>   0.60%        libc-2.31.so  __close_nocancel
>>   0.47%        libc-2.31.so  _int_free
>>   0.46%        libc-2.31.so  _int_malloc
>>   0.41%    Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::close0, version 869
>>   0.32%        libc-2.31.so  __close
>>   0.32%        libc-2.31.so  __GI___libc_open
>>   0.25%    Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::fdopendir, version 851
>>   0.23%           libnio.so  Java_sun_nio_fs_UnixNativeDispatcher_open0
>>   0.20%    Unknown, level 0  sun.nio.fs.UnixNativeDispatcher::closedir, version 868
>>   0.17%        libc-2.31.so  __GI___dup
>>   0.15%        libc-2.31.so  __alloc_dir
>>   0.15%           libnio.so  Java_sun_nio_fs_UnixNativeDispatcher_fdopendir
>>   0.15%        libc-2.31.so  __libc_disable_asynccancel
>>   0.13%        libc-2.31.so  __libc_enable_asynccancel
>>   0.12%           libnio.so  Java_sun_nio_fs_UnixNativeDispatcher_close0
>>   1.17%  <...other 81 warm methods...>
>> ```
>>
>> So, no smoking gun so far. I'll keep looking.
>>
>> Maurizio
>>
>> On 15/07/2021 19:14, Maurizio Cimadamore wrote:
>>> Ok, I got it working - what fixed it for me was that the offsets in 
>>> the "filler" calls have to be set to zero, which, looking around, 
>>> seems to be a default value that leaves the kernel to take care of it.
>>>
>>> ```
>>> $ ls Volumes/foo
>>>
>>> total 0
>>> -r--r--r-- 1 root root  0 Jan  1  1970 aaa
>>> -r--r--r-- 1 root root  0 Jan  1  1970 bbb
>>> -r--r--r-- 1 root root  0 Jan  1  1970 ccc
>>> -r--r--r-- 1 root root  0 Jan  1  1970 ddd
>>> -r--r--r-- 1 root root 13 Jan  1  1970 hello.txt
>>> -r--r--r-- 1 root root  0 Jan  1  1970 xxx
>>> -r--r--r-- 1 root root  0 Jan  1  1970 yyy
>>> -r--r--r-- 1 root root  0 Jan  1  1970 zzz
>>> ```
>>>
>>> Something is probably not right (see "total 0" on top), but it should 
>>> perhaps be good enough for the benchmark. Applying the same fix to the 
>>> JNR version also fixes that (and results in the same output). Hopefully 
>>> I'm good to go now :-)
>>>
>>> Maurizio
>>>
>>> On 15/07/2021 18:15, Maurizio Cimadamore wrote:
>>>> I've re-extracted on Linux, and it seems more lively now. I had to 
>>>> fix a couple of type mismatches (e.g. long vs. int and int vs. 
>>>> short) in places, and also some of the fields in the "stat" 
>>>> structure are different, so the code won't compile as is.
>>>>
>>>> After fixing these minor issues, I see a lot more output printed 
>>>> when I mount, and when I do an ls, I see the following lines reported:
>>>>
>>>> [Thread-110] DEBUG de.skymatic.fusepanama.examples.HelloPanamaFileSystem - getattr() /
>>>> [Thread-111] DEBUG de.skymatic.fusepanama.examples.HelloPanamaFileSystem - opendir() /
>>>> [Thread-112] DEBUG de.skymatic.fusepanama.examples.HelloPanamaFileSystem - readdir() /
>>>> [Thread-113] DEBUG de.skymatic.fusepanama.examples.HelloPanamaFileSystem - releasedir() /
>>>>
>>>> But that's pretty much it, and in the terminal I still get the 
>>>> input/output error. I can even debug, but I can't see much of 
>>>> what's going wrong (and I'm not familiar with this API) - the Java 
>>>> code executes fine, for what it's worth.
>>>>
>>>> Maurizio
>>>>
>>>> On 15/07/2021 17:51, Sebastian Stenzel wrote:
>>>>> I'll fix it for Linux and let you know!
>>>>>
>>>>>> On 15. Jul 2021, at 18:04, Maurizio Cimadamore 
>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>
>>>>>>
>>>>>> On 15/07/2021 17:01, Sebastian Stenzel wrote:
>>>>>>> Yeah well they're pretty much mac-specific. On macOS, FUSE has 
>>>>>>> this magic behaviour where you can tell it to mount to a 
>>>>>>> non-existing mountpoint inside of `/Volumes/...` and it'll just 
>>>>>>> create these (and destroy them on unmount). I believe on Linux 
>>>>>>> you need to define an _existing_ mount point. But it is surely 
>>>>>>> possible that the volume isn't working yet on Linux. I'll give 
>>>>>>> it a try myself and fix it, if required.
>>>>>> I created a folder Volumes under my home folder and am trying to use 
>>>>>> that as a mount point, which seems to work ok with JNR.
>>>>>>>
>>>>>>> The benchmark then needs to be adjusted for the two mountpoints 
>>>>>>> respectively. But before the benchmark can actually do anything, 
>>>>>>> a plain `ls` on the terminal needs to work.
>>>>>>
>>>>>> Ok, it seems even JNR fails:
>>>>>>
>>>>>> ```
>>>>>> $ ls Volumes/bar/
>>>>>> ls: reading directory 'Volumes/bar/': Input/output error
>>>>>> ```
>>>>>>
>>>>>>>
>>>>>>>> On 15. Jul 2021, at 17:55, Maurizio Cimadamore 
>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>
>>>>>>>> I tried to reproduce here on Linux, but with no luck - in the 
>>>>>>>> sense that I'm not super sure on how to run the benchmark.
>>>>>>>>
>>>>>>>> I'm able somehow to run the two examples - and I noted that 
>>>>>>>> here JNR works ok, while the Panama one doesn't seem to mount 
>>>>>>>> things correctly - a new mount appears on my file explorer, but 
>>>>>>>> I'm unable to do anything with it (even unmount - which can 
>>>>>>>> only be done at sudo level).
>>>>>>>>
>>>>>>>> When working with the JNR support, the mount works fine, it 
>>>>>>>> shows in the file explorer, I can click on that location and 
>>>>>>>> browse, and then unmount from there. Everything works.
>>>>>>>>
>>>>>>>> That said, the benchmarks require the mount points to be up and 
>>>>>>>> running - so I've tried first to execute the example (e.g. JNR) 
>>>>>>>> and then run the benchmark in two separate terminal windows, 
>>>>>>>> all via Maven - but the benchmark doesn't seem to do anything 
>>>>>>>> (I've uncommented the benchmarks of course).
>>>>>>>>
>>>>>>>> How do you run them?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Maurizio
>>>>>>>>
>>>>>>>> On 15/07/2021 13:40, Maurizio Cimadamore wrote:
>>>>>>>>> I believe that it would be more useful to try to run the 
>>>>>>>>> perfasm profiler with JMH.
>>>>>>>>>
>>>>>>>>> This can be done relatively easily, at least on Linux, if you 
>>>>>>>>> pass the argument `-prof perfasm` to JMH (this would need 
>>>>>>>>> hsdis-amd64.so on Linux to print readable assembly).
>>>>>>>>>
>>>>>>>>> Another thing worth checking is allocation rate: `-prof gc`.
>>>>>>>>>
>>>>>>>>> Maurizio
>>>>>>>>>
>>>>>>>>> On 15/07/2021 12:30, Sebastian Stenzel wrote:
>>>>>>>>>> Ok it really seems like VisualVM can't deal with these kinds 
>>>>>>>>>> of tasks yet. Now it reports the String constructor being the 
>>>>>>>>>> culprit [1], however I strongly doubt that, since this is 
>>>>>>>>>> probably one of the most heavily optimized parts of the JDK.
>>>>>>>>>>
>>>>>>>>>> [1]: Screenshot on https://imgur.com/a/SHG8RSQ
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On 15. Jul 2021, at 13:11, Sebastian Stenzel 
>>>>>>>>>>> <sebastian.stenzel at gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Yes, must be a sampling error. Do you know of a (publicly 
>>>>>>>>>>> available) _profiler_ that is compatible with JDK 17 / 18 
>>>>>>>>>>> already?
>>>>>>>>>>>
>>>>>>>>>>>> On 15. Jul 2021, at 13:08, Maurizio Cimadamore 
>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Aha - it seems like you are seeing what I was seeing: 
>>>>>>>>>>>> unrolling now seems to happen more reliably, which 
>>>>>>>>>>>> positively affects code like strlen.
>>>>>>>>>>>>
>>>>>>>>>>>> As for FUSE, I think the reason for the difference has 
>>>>>>>>>>>> probably nothing to do with string conversion - the sampler 
>>>>>>>>>>>> profiler just happens to hit that code a lot. I checked JNR 
>>>>>>>>>>>> code for string conversion and I couldn't really find 
>>>>>>>>>>>> anything uber optimized in that regard that could explain 
>>>>>>>>>>>> the gap.
>>>>>>>>>>>>
>>>>>>>>>>>> Probably something is not getting optimized as it should - 
>>>>>>>>>>>> likely a downcall/upcall intrinsification is failing - 
>>>>>>>>>>>> maybe due to a subtle issue with your code, or, possibly 
>>>>>>>>>>>> because you are hitting a non-implemented case (e.g. we do 
>>>>>>>>>>>> not intrinsify calls which pass arguments on the stack, 
>>>>>>>>>>>> yet), or because of some other bug.
>>>>>>>>>>>>
>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 15/07/2021 12:03, Sebastian Stenzel wrote:
>>>>>>>>>>>>> Wow, I stand corrected. I just re-ran the benchmark and 
>>>>>>>>>>>>> `benchmarkStrlenBase` just got a lot faster!! Your change 
>>>>>>>>>>>>> in https://github.com/openjdk/jdk17/commit/2db9005c07585b580b3ec0889b8b5e3ed0d0ca6a 
>>>>>>>>>>>>> DID have an effect after all.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Just doesn't impress FUSE very much...
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 15. Jul 2021, at 13:00, Sebastian Stenzel 
>>>>>>>>>>>>>> <sebastian.stenzel at gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yup, I tried the int-approach as well, but with worse 
>>>>>>>>>>>>>> results... Here is the full test: 
>>>>>>>>>>>>>> https://gist.github.com/overheadhunter/86e7baae7dfe47c49ff364590a4f3ea6
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 15. Jul 2021, at 12:51, Maurizio Cimadamore 
>>>>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ok. Thanks.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I tried similar experiments where instead of reading 4 
>>>>>>>>>>>>>>> bytes separately I'd read a single int value, and then 
>>>>>>>>>>>>>>> use shifts and bitmasking to check for terminators. On 
>>>>>>>>>>>>>>> paper good, but benchmark results were always worse than 
>>>>>>>>>>>>>>> the version we have now (at least on Linux).
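
For readers unfamiliar with the "single int plus shifts and bit masks" idea, 
here is a minimal sketch of one well-known variant of it. It works on a plain 
java.nio.ByteBuffer rather than a MemorySegment so that it runs without the 
incubator module; the class name SwarStrlen is made up, and this is not the 
code that was benchmarked in this thread.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class SwarStrlen {
    /** Returns the length of the NUL-terminated string starting at index 0 of buf. */
    static int strlen(ByteBuffer buf) {
        // Little-endian view: the byte at the lowest address is the least significant.
        ByteBuffer le = buf.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        int limit = le.limit();
        int offset = 0;
        // Check four bytes per iteration.
        for (; offset <= limit - 4; offset += 4) {
            int word = le.getInt(offset);
            // Non-zero if and only if at least one byte of word is zero.
            int zeros = (word - 0x01010101) & ~word & 0x80808080;
            if (zeros != 0) {
                // The lowest marker bit belongs to the first zero byte.
                return offset + (Integer.numberOfTrailingZeros(zeros) >>> 3);
            }
        }
        // Tail: fewer than four bytes remain.
        for (; offset < limit; offset++) {
            if (le.get(offset) == 0) {
                return offset;
            }
        }
        throw new IllegalArgumentException("String too large");
    }

    public static void main(String[] args) {
        byte[] bytes = "hello\0xx".getBytes(StandardCharsets.US_ASCII);
        System.out.println(strlen(ByteBuffer.wrap(bytes)));
    }
}
```

The mask can flag bytes above the first zero byte as well, which is why only 
the lowest marker bit is used to locate the terminator.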
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> That said, if you could please share the full string 
>>>>>>>>>>>>>>> benchmark you have, that'd be helpful, so we can take a 
>>>>>>>>>>>>>>> look at that, and see what's going wrong (ideally, C2 
>>>>>>>>>>>>>>> should be the one doing unrolling).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 15/07/2021 11:28, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>> I just did a quick synthetic test on a "manually 
>>>>>>>>>>>>>>>> unrolled" strlen() without any FUSE context.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I experimented with an implementation that looked like 
>>>>>>>>>>>>>>>> the following and benchmarked it using a 259 byte 
>>>>>>>>>>>>>>>> memory segment containing a 239 byte string (null byte 
>>>>>>>>>>>>>>>> at index 240):
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>> private static int strlenUnroll4(MemorySegment segment, long start) {
>>>>>>>>>>>>>>>>     int offset;
>>>>>>>>>>>>>>>>     // main loop: look at four bytes per iteration
>>>>>>>>>>>>>>>>     for (offset = 0; offset < segment.byteSize() - 3; offset += 4) {
>>>>>>>>>>>>>>>>         byte b0 = MemoryAccess.getByteAtOffset(segment, start + offset + 0);
>>>>>>>>>>>>>>>>         byte b1 = MemoryAccess.getByteAtOffset(segment, start + offset + 1);
>>>>>>>>>>>>>>>>         byte b2 = MemoryAccess.getByteAtOffset(segment, start + offset + 2);
>>>>>>>>>>>>>>>>         byte b3 = MemoryAccess.getByteAtOffset(segment, start + offset + 3);
>>>>>>>>>>>>>>>>         // is this even faster than directly having 4 different branches?
>>>>>>>>>>>>>>>>         if (b0 == 0 || b1 == 0 || b2 == 0 || b3 == 0) {
>>>>>>>>>>>>>>>>             if (b0 == 0) {
>>>>>>>>>>>>>>>>                 return offset;
>>>>>>>>>>>>>>>>             } else if (b1 == 0) {
>>>>>>>>>>>>>>>>                 return offset + 1;
>>>>>>>>>>>>>>>>             } else if (b2 == 0) {
>>>>>>>>>>>>>>>>                 return offset + 2;
>>>>>>>>>>>>>>>>             } else if (b3 == 0) {
>>>>>>>>>>>>>>>>                 return offset + 3;
>>>>>>>>>>>>>>>>             }
>>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>>     // TODO: maybe no loop required for the remaining <4 bytes?
>>>>>>>>>>>>>>>>     while (offset < segment.byteSize()) {
>>>>>>>>>>>>>>>>         byte b = MemoryAccess.getByteAtOffset(segment, start + offset);
>>>>>>>>>>>>>>>>         if (b == 0) {
>>>>>>>>>>>>>>>>             return offset;
>>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>>         offset++; // advance to the next byte
>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>>     throw new IllegalArgumentException("String too large");
>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm not even sure how reliable my results are, since I 
>>>>>>>>>>>>>>>> have no clue about how branch prediction works here... 
>>>>>>>>>>>>>>>> Neither have I tested the correctness of this 
>>>>>>>>>>>>>>>> implementation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 15. Jul 2021, at 12:18, Maurizio Cimadamore 
>>>>>>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for reporting back.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We probably need to investigate this a bit more deeply 
>>>>>>>>>>>>>>>>> and try and reproduce on our side.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> One last question: you said that with manual unrolling 
>>>>>>>>>>>>>>>>> you managed to get 2x faster: did you mean that string 
>>>>>>>>>>>>>>>>> conversion got 2x faster or that you actually saw your 
>>>>>>>>>>>>>>>>> FUSE benchmark going 2x faster because of the manual 
>>>>>>>>>>>>>>>>> unrolling with strings?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 15/07/2021 11:03, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>>> That, surprisingly, didn't change anything either. 
>>>>>>>>>>>>>>>>>> But don't worry too much, the performance isn't bad 
>>>>>>>>>>>>>>>>>> (in absolute figures) and it is by far not the only 
>>>>>>>>>>>>>>>>>> reason why I consider panama the best solution to 
>>>>>>>>>>>>>>>>>> create java bindings for c libs.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 12. Jul 2021, at 15:33, Maurizio Cimadamore 
>>>>>>>>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Actually, after some bisecting, I found out that the 
>>>>>>>>>>>>>>>>>>> performance of converting a memory segment into a 
>>>>>>>>>>>>>>>>>>> string jumped 2x faster with this fix:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> https://github.com/openjdk/jdk17/commit/2db9005c07585b580b3ec0889b8b5e3ed0d0ca6a
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Which was integrated after the one I originally 
>>>>>>>>>>>>>>>>>>> pointed at. They both seem to touch loop 
>>>>>>>>>>>>>>>>>>> optimization in case of overflows, which the strlen 
>>>>>>>>>>>>>>>>>>> code is triggering (since the loop limit checks for 
>>>>>>>>>>>>>>>>>>> loop variable being positive).
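
For reference, a strlen loop of the shape described here (a loop condition 
that only checks the loop variable for being non-negative, plus an early 
return at the terminator) could look like the sketch below. It operates on a 
plain java.nio.ByteBuffer so that it runs without the incubator module; the 
class name SimpleStrlen is made up, and the actual JDK routine works on a 
MemorySegment.

```java
import java.nio.ByteBuffer;

final class SimpleStrlen {
    /** Returns the length of the NUL-terminated string starting at index start. */
    static int strlen(ByteBuffer buf, int start) {
        // The condition only guards against int overflow of the loop variable;
        // running past the end of the buffer is caught by the bounds check in get().
        for (int offset = 0; offset >= 0; offset++) {
            if (buf.get(start + offset) == 0) {
                return offset; // second exit, from within the loop body
            }
        }
        throw new IllegalArgumentException("String too large");
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap(new byte[] {'a', 'b', 'c', 0});
        System.out.println(strlen(buf, 0));
    }
}
```

The loop has two ways out, the overflow check and the return, which is the 
property that matters for the loop optimizations discussed in this thread.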
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This is a simple patch which adds a string 
>>>>>>>>>>>>>>>>>>> conversion test:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>> diff --git a/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java b/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>>>>>>>>>> index ec4da5ffc88..5b3fb1a2b2a 100644
>>>>>>>>>>>>>>>>>>> --- a/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>>>>>>>>>> +++ b/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>>>>>>>>>> @@ -93,10 +93,13 @@ public class StrLenTest {
>>>>>>>>>>>>>>>>>>> FunctionDescriptor.ofVoid(C_POINTER).withAttribute(FunctionDescriptor.TRIVIAL_ATTRIBUTE_NAME, true));
>>>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> +    MemorySegment segment;
>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>      @Setup
>>>>>>>>>>>>>>>>>>>      public void setup() {
>>>>>>>>>>>>>>>>>>>          str = makeString(size);
>>>>>>>>>>>>>>>>>>>          segmentAllocator = SegmentAllocator.ofSegment(MemorySegment.allocateNative(size + 1, ResourceScope.newImplicitScope()));
>>>>>>>>>>>>>>>>>>> +        segment = toCString(str, segmentAllocator);
>>>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>      @TearDown
>>>>>>>>>>>>>>>>>>> @@ -104,6 +107,11 @@ public class StrLenTest {
>>>>>>>>>>>>>>>>>>>          scope.close();
>>>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> +    @Benchmark
>>>>>>>>>>>>>>>>>>> +    public String panama_str_conv() throws Throwable {
>>>>>>>>>>>>>>>>>>> +        return CLinker.toJavaString(segment);
>>>>>>>>>>>>>>>>>>> +    }
>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>      @Benchmark
>>>>>>>>>>>>>>>>>>>      public int jni_strlen() throws Throwable {
>>>>>>>>>>>>>>>>>>>          return strlen(str);
>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Before the above fix, the numbers are as follows:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>> Benchmark                   (size)  Mode  Cnt    Score   Error  Units
>>>>>>>>>>>>>>>>>>> StrLenTest.panama_str_conv     100  avgt   30  106.613 ± 7.060  ns/op
>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> While after the fix I get this:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>> Benchmark                   (size)  Mode  Cnt   Score   Error  Units
>>>>>>>>>>>>>>>>>>> StrLenTest.panama_str_conv     100  avgt   30  48.120 ± 0.557  ns/op
>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> So, as you can see, a pretty sizeable jump. 
>>>>>>>>>>>>>>>>>>> Eyeballing, the shape of generated code doesn't look 
>>>>>>>>>>>>>>>>>>> too different, which makes me think of another case 
>>>>>>>>>>>>>>>>>>> where loop is unrolled, but main loop never executed 
>>>>>>>>>>>>>>>>>>> (similar to JDK-8269230), but we'll need to look 
>>>>>>>>>>>>>>>>>>> deeper.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 12/07/2021 14:12, Maurizio Cimadamore wrote:
>>>>>>>>>>>>>>>>>>>> On 12/07/2021 13:18, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>>>>>> Hey Maurizio,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> All tests have been done on commit 42e03fd7c6a 
>>>>>>>>>>>>>>>>>>>>> (for details how I built the JDK, see my initial 
>>>>>>>>>>>>>>>>>>>>> email). Maybe I'm missing some compiler flags to 
>>>>>>>>>>>>>>>>>>>>> enable all optimizations?
>>>>>>>>>>>>>>>>>>>> I see - you do have the latest panama changes, but 
>>>>>>>>>>>>>>>>>>>> there has been a sync with upstream after that 
>>>>>>>>>>>>>>>>>>>> changeset, I believe - can you please try to resync 
>>>>>>>>>>>>>>>>>>>> with the latest foreign-jextract commit - which 
>>>>>>>>>>>>>>>>>>>> should be:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> https://github.com/openjdk/panama-foreign/commit/b2a284f6678c6a0d78fbdc8655695119ccb0dadb
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Are you sure about loop vectorization being 
>>>>>>>>>>>>>>>>>>>>> applied to strlen? I'm not an expert on this 
>>>>>>>>>>>>>>>>>>>>> field, but I had the impression this wasn't 
>>>>>>>>>>>>>>>>>>>>> possible when the loop terminates "from within".
>>>>>>>>>>>>>>>>>>>> Vlad is the expert here - when chatting offline he 
>>>>>>>>>>>>>>>>>>>> did mention that loop should have single exit - 
>>>>>>>>>>>>>>>>>>>> which I guess also takes into account the "normal" 
>>>>>>>>>>>>>>>>>>>> exit - so the strlen routine would seem to have two 
>>>>>>>>>>>>>>>>>>>> exits...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On 12. Jul 2021, at 13:50, Maurizio Cimadamore 
>>>>>>>>>>>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Sebastian,
>>>>>>>>>>>>>>>>>>>>>> thanks for sharing your findings - I've done some 
>>>>>>>>>>>>>>>>>>>>>> attempts here with a targeted microbenchmark 
>>>>>>>>>>>>>>>>>>>>>> which measures the performance of string 
>>>>>>>>>>>>>>>>>>>>>> conversion and I'm seeing unrolling and 
>>>>>>>>>>>>>>>>>>>>>> vectorization being applied on the strlen 
>>>>>>>>>>>>>>>>>>>>>> computation.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> May I ask if, by any chance, your HEAD has not 
>>>>>>>>>>>>>>>>>>>>>> been updated in the last few weeks? There has 
>>>>>>>>>>>>>>>>>>>>>> been a C2 optimization fix which has been added 
>>>>>>>>>>>>>>>>>>>>>> recently, which I think might be related to this:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8269230
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Do you have this fix in the JDK you are using?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On 10/07/2021 15:58, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> good idea, but it makes no difference beyond 
>>>>>>>>>>>>>>>>>>>>>>> statistical error.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I started sampling the application with VisualVM 
>>>>>>>>>>>>>>>>>>>>>>> (which is quite hard, since native threads are 
>>>>>>>>>>>>>>>>>>>>>>> extremely short-lived). What I noticed is that, 
>>>>>>>>>>>>>>>>>>>>>>> regardless of where the sampler interrupts a 
>>>>>>>>>>>>>>>>>>>>>>> thread, in nearly all cases 100% of CPU time is 
>>>>>>>>>>>>>>>>>>>>>>> caused by 
>>>>>>>>>>>>>>>>>>>>>>> jdk.internal.foreign.abi.SharedUtils.toJavaStringInternal() 
>>>>>>>>>>>>>>>>>>>>>>> → jdk.internal.foreign.abi.SharedUtils.strlen().
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I know that strlen can hardly be optimized due 
>>>>>>>>>>>>>>>>>>>>>>> to the nature of null termination, but maybe we 
>>>>>>>>>>>>>>>>>>>>>>> can make use of the fact that we're dealing with 
>>>>>>>>>>>>>>>>>>>>>>> MemorySegments here: Since they protect us from 
>>>>>>>>>>>>>>>>>>>>>>> overflows, maybe there is no need to look at 
>>>>>>>>>>>>>>>>>>>>>>> only a single byte at a time. Maybe the 
>>>>>>>>>>>>>>>>>>>>>>> strlen()-loop can be unrolled or even be 
>>>>>>>>>>>>>>>>>>>>>>> vectorized.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I just did a quick test and observed a x2 
>>>>>>>>>>>>>>>>>>>>>>> speedup when doing a x4 loop unroll.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On 9. Jul 2021, at 20:30, Jorn Vernee 
>>>>>>>>>>>>>>>>>>>>>>>> <jorn.vernee at oracle.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi Sebastian,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks for testing this. Looking at your code, 
>>>>>>>>>>>>>>>>>>>>>>>> one possible explanation for the discrepancy I 
>>>>>>>>>>>>>>>>>>>>>>>> can think of is that the DirFiller ends up 
>>>>>>>>>>>>>>>>>>>>>>>> using virtual downcalls to do its work, which 
>>>>>>>>>>>>>>>>>>>>>>>> are currently not intrinsified. This is mostly a 
>>>>>>>>>>>>>>>>>>>>>>>> case of 'not implemented yet', i.e. it is a 
>>>>>>>>>>>>>>>>>>>>>>>> known issue.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>>      static fuse_fill_dir_t ofAddress(MemoryAddress addr) {
>>>>>>>>>>>>>>>>>>>>>>>>          return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>>>>>>>>>>              try {
>>>>>>>>>>>>>>>>>>>>>>>>                  // 'addr' here is not a constant, so the call is virtual
>>>>>>>>>>>>>>>>>>>>>>>>                  return (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, x0, x1, x2, x3);
>>>>>>>>>>>>>>>>>>>>>>>>              } catch (Throwable ex$) {
>>>>>>>>>>>>>>>>>>>>>>>>                  throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>>>>>>>>>>              }
>>>>>>>>>>>>>>>>>>>>>>>>          };
>>>>>>>>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> For testing purposes, a possible workaround 
>>>>>>>>>>>>>>>>>>>>>>>> could be to have a cache that maps the callback 
>>>>>>>>>>>>>>>>>>>>>>>> address to a method handle that has the address 
>>>>>>>>>>>>>>>>>>>>>>>> bound to the first parameter. Assuming readdir 
>>>>>>>>>>>>>>>>>>>>>>>> always gets the same filler callback address, 
>>>>>>>>>>>>>>>>>>>>>>>> the same MethodHandle will be reused and 
>>>>>>>>>>>>>>>>>>>>>>>> eventually customized which means the callback 
>>>>>>>>>>>>>>>>>>>>>>>> address will become constant, and the downcall 
>>>>>>>>>>>>>>>>>>>>>>>> should then be intrinsified.
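
The caching idea described above can be illustrated with a self-contained sketch that uses only java.lang.invoke. Here `callAt` is a hypothetical stand-in for the generic downcall handle (its first parameter plays the role of the callback address), and `BoundHandleCache`, `forAddress` and `invoke` are illustrative names, not part of fuse-panama or the JDK. The sketch only shows the binding and reuse of the handle; it makes no claim about when the JIT treats the bound value as a constant.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public final class BoundHandleCache {

    // Hypothetical target: the first argument stands in for the callback address.
    public static int callAt(long address, int value) {
        return (int) (address + value);
    }

    // Generic handle of type (long,int)int, analogous to a handle that takes the address per call.
    private static final MethodHandle GENERIC;

    static {
        try {
            GENERIC = MethodHandles.lookup().findStatic(
                    BoundHandleCache.class, "callAt",
                    MethodType.methodType(int.class, long.class, int.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // One bound handle per address, so repeated calls reuse the same MethodHandle instance.
    private static final Map<Long, MethodHandle> CACHE = new ConcurrentHashMap<>();

    public static MethodHandle forAddress(long address) {
        // insertArguments binds the address as argument 0, leaving a handle of type (int)int.
        return CACHE.computeIfAbsent(address,
                a -> MethodHandles.insertArguments(GENERIC, 0, a));
    }

    public static int invoke(long address, int value) {
        try {
            return (int) forAddress(address).invokeExact(value);
        } catch (Throwable t) {
            throw new AssertionError("should not reach here", t);
        }
    }
}
```

Calling `BoundHandleCache.invoke(40L, 2)` twice goes through the same cached handle both times, which is the property the workaround relies on.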
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I don't currently have access to a Mac machine 
>>>>>>>>>>>>>>>>>>>>>>>> to test this, but if you want to try it out, 
>>>>>>>>>>>>>>>>>>>>>>>> the patch should be this:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>> diff --git a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>>>>>>>>>> index bfd4655..4c68d4c 100644
>>>>>>>>>>>>>>>>>>>>>>>> --- a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>>>>>>>>>> +++ b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>>>>>>>>>> @@ -3,8 +3,12 @@
>>>>>>>>>>>>>>>>>>>>>>>>   package de.skymatic.fusepanama.lowlevel;
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>   import java.lang.invoke.MethodHandle;
>>>>>>>>>>>>>>>>>>>>>>>> +import java.lang.invoke.MethodHandles;
>>>>>>>>>>>>>>>>>>>>>>>>   import java.lang.invoke.VarHandle;
>>>>>>>>>>>>>>>>>>>>>>>>   import java.nio.ByteOrder;
>>>>>>>>>>>>>>>>>>>>>>>> +import java.util.Map;
>>>>>>>>>>>>>>>>>>>>>>>> +import java.util.concurrent.ConcurrentHashMap;
>>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>>>   import jdk.incubator.foreign.*;
>>>>>>>>>>>>>>>>>>>>>>>>   import static jdk.incubator.foreign.CLinker.*;
>>>>>>>>>>>>>>>>>>>>>>>>   public interface fuse_fill_dir_t {
>>>>>>>>>>>>>>>>>>>>>>>> @@ -17,13 +21,19 @@ public interface fuse_fill_dir_t {
>>>>>>>>>>>>>>>>>>>>>>>>           return RuntimeHelper.upcallStub(fuse_fill_dir_t.class, fi, constants$0.fuse_fill_dir_t$FUNC, "(Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;J)I", scope);
>>>>>>>>>>>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>>>>>>>>>>>       static fuse_fill_dir_t ofAddress(MemoryAddress addr) {
>>>>>>>>>>>>>>>>>>>>>>>> -        return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>>>>>>>>>> -            try {
>>>>>>>>>>>>>>>>>>>>>>>> -                return (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, x0, x1, x2, x3);
>>>>>>>>>>>>>>>>>>>>>>>> -            } catch (Throwable ex$) {
>>>>>>>>>>>>>>>>>>>>>>>> -                throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>>>>>>>>>> -            }
>>>>>>>>>>>>>>>>>>>>>>>> -        };
>>>>>>>>>>>>>>>>>>>>>>>> +        class CacheHolder {
>>>>>>>>>>>>>>>>>>>>>>>> +            static final Map<MemoryAddress, fuse_fill_dir_t> CACHE = new ConcurrentHashMap<>();
>>>>>>>>>>>>>>>>>>>>>>>> +        }
>>>>>>>>>>>>>>>>>>>>>>>> +        return CacheHolder.CACHE.computeIfAbsent(addr, addrK -> {
>>>>>>>>>>>>>>>>>>>>>>>> +            final MethodHandle target = MethodHandles.insertArguments(constants$0.fuse_fill_dir_t$MH, 0, addrK);
>>>>>>>>>>>>>>>>>>>>>>>> +            return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>>>>>>>>>> +                try {
>>>>>>>>>>>>>>>>>>>>>>>> +                    return (int)target.invokeExact(x0, x1, x2, x3);
>>>>>>>>>>>>>>>>>>>>>>>> +                } catch (Throwable ex$) {
>>>>>>>>>>>>>>>>>>>>>>>> +                    throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>>>>>>>>>> +                }
>>>>>>>>>>>>>>>>>>>>>>>> +            };
>>>>>>>>>>>>>>>>>>>>>>>> +        });
>>>>>>>>>>>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>> (I hope these code blocks don't get mangled too 
>>>>>>>>>>>>>>>>>>>>>>>> much by line wrapping)
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> HTH,
>>>>>>>>>>>>>>>>>>>>>>>> Jorn
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On 09/07/2021 10:58, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I wanted to share the results of a benchmark 
>>>>>>>>>>>>>>>>>>>>>>>>> test that includes several down- and upcalls. 
>>>>>>>>>>>>>>>>>>>>>>>>> First, let me explain what I'm testing here:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I'm working on a panama-based FUSE binding, 
>>>>>>>>>>>>>>>>>>>>>>>>> mostly for experimental purposes right now, 
>>>>>>>>>>>>>>>>>>>>>>>>> and I'm trying to beat fuse-jnr [1].
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> While there are some other interesting 
>>>>>>>>>>>>>>>>>>>>>>>>> metrics, such as read/write performance (both 
>>>>>>>>>>>>>>>>>>>>>>>>> sequential and random access), I focused on 
>>>>>>>>>>>>>>>>>>>>>>>>> directory listings for now. Directory listings 
>>>>>>>>>>>>>>>>>>>>>>>>> are the most complex operation with regard to 
>>>>>>>>>>>>>>>>>>>>>>>>> the number of down- and upcalls:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> 1. FUSE upcalls readdir and provides a 
>>>>>>>>>>>>>>>>>>>>>>>>> callback function
>>>>>>>>>>>>>>>>>>>>>>>>> 2. Java downcalls the callback for each item 
>>>>>>>>>>>>>>>>>>>>>>>>> in the directory
>>>>>>>>>>>>>>>>>>>>>>>>> 3. FUSE upcalls getattr for each item (no 
>>>>>>>>>>>>>>>>>>>>>>>>> longer required with "readdirplus" in FUSE 3.x)
>>>>>>>>>>>>>>>>>>>>>>>>> (4. I'm testing on macOS, which introduces 
>>>>>>>>>>>>>>>>>>>>>>>>> additional noise (such as readxattr and trying 
>>>>>>>>>>>>>>>>>>>>>>>>> to access files that I didn't report in readdir))
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> So, what I'm testing is essentially this: 
>>>>>>>>>>>>>>>>>>>>>>>>> `Files.list(Path.of("/Volumes/foo")).close();` 
>>>>>>>>>>>>>>>>>>>>>>>>> with the volume reporting eight files [2]. 
>>>>>>>>>>>>>>>>>>>>>>>>> When mounting with debug logs enabled, I can 
>>>>>>>>>>>>>>>>>>>>>>>>> see that the exact same operations in the same 
>>>>>>>>>>>>>>>>>>>>>>>>> order are invoked on both fuse-jnr and 
>>>>>>>>>>>>>>>>>>>>>>>>> fuse-panama. One single dir listing results in 
>>>>>>>>>>>>>>>>>>>>>>>>> 2 readdir upcalls, 10 callback downcalls, 16 
>>>>>>>>>>>>>>>>>>>>>>>>> getattr upcalls. There are also 8 getxattr 
>>>>>>>>>>>>>>>>>>>>>>>>> calls and 16 lookup calls, however they don't 
>>>>>>>>>>>>>>>>>>>>>>>>> reach Java, as the FUSE kernel knows they are 
>>>>>>>>>>>>>>>>>>>>>>>>> not implemented.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Long story short, here are the results:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>>> Benchmark                        Mode  Cnt    Score   Error  Units
>>>>>>>>>>>>>>>>>>>>>>>>> BenchmarkTest.testListDirJnr     avgt    5   66,569 ± 3,128  us/op
>>>>>>>>>>>>>>>>>>>>>>>>> BenchmarkTest.testListDirPanama  avgt    5  189,340 ± 4,275  us/op
>>>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I've been using panama snapshot at commit 
>>>>>>>>>>>>>>>>>>>>>>>>> 42e03fd7c6a built with: `configure 
>>>>>>>>>>>>>>>>>>>>>>>>> --with-boot-jdk=/Library/Java/JavaVirtualMachines/adoptopenjdk-16.jdk/Contents/Home/ 
>>>>>>>>>>>>>>>>>>>>>>>>> --with-native-debug-symbols=none 
>>>>>>>>>>>>>>>>>>>>>>>>> --with-debug-level=release 
>>>>>>>>>>>>>>>>>>>>>>>>> --with-libclang=/usr/local/opt/llvm 
>>>>>>>>>>>>>>>>>>>>>>>>> --with-libclang-version=12`
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I can't tell where this overhead comes from. 
>>>>>>>>>>>>>>>>>>>>>>>>> Maybe creating a newConfinedScope() during 
>>>>>>>>>>>>>>>>>>>>>>>>> each upcall [3] is "too much"? Maybe JNR is 
>>>>>>>>>>>>>>>>>>>>>>>>> just negligently skipping some memory boundary 
>>>>>>>>>>>>>>>>>>>>>>>>> checks to be faster. The results are not 
>>>>>>>>>>>>>>>>>>>>>>>>> terrible, but I'd hoped for something better.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> [1] https://github.com/SerCeMan/jnr-fuse
>>>>>>>>>>>>>>>>>>>>>>>>> [2] https://github.com/skymatic/fuse-panama/blob/develop/src/test/java/de/skymatic/fusepanama/examples/HelloPanamaFileSystem.java#L139-L146
>>>>>>>>>>>>>>>>>>>>>>>>> [3] https://github.com/skymatic/fuse-panama/blob/769347575863861063a2347a42b2cbaadb5eacef/src/main/java/de/skymatic/fusepanama/FuseOperations.java#L67-L71


More information about the panama-dev mailing list