Real-Life Benchmark for FUSE's readdir()

Sebastian Stenzel sebastian.stenzel at gmail.com
Fri Jul 16 12:17:29 UTC 2021


Yes! Confirmed!

> On 16. Jul 2021, at 13:51, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
> 
> Ha,
> you beat me to it.
> 
> After running a lot of different tools, we realized that the issue was that the benchmark was causing some 400GB of Thread objects to be allocated (!!).
> 
> Jorn was quick in identifying the cause: when a thread in native code wants to play with the JVM, it has to do an attach operation, to associate that unknown native thread with a Java thread. The dual operation, detach, is used to disassociate a native thread from the JVM.
> 
> In JNI the native code has control as to when to attach and detach native threads to the VM. So you could, for instance, attach a native thread once, and then use that thread to make 1000 Java upcalls, and then detach it.
> 
> In Panama, since we don't control _how_ the native code calls back into Java code, we have taken a conservative approach of attaching a thread when it starts executing an upcall, and we detach it as soon as the upcall is finished. So, 1000 upcalls will create 1000 Java threads (potentially, for the same underlying native thread). Which is where the memory thrashing is coming from.
> 
> This simple patch:
> 
> diff --git a/src/hotspot/share/prims/universalUpcallHandler.cpp b/src/hotspot/share/prims/universalUpcallHandler.cpp
> index d63017b20de..590f610e8d0 100644
> --- a/src/hotspot/share/prims/universalUpcallHandler.cpp
> +++ b/src/hotspot/share/prims/universalUpcallHandler.cpp
> @@ -68,8 +68,8 @@ Thread* ProgrammableUpcallHandler::maybe_attach_and_get_thread(bool* should_deta
>  }
> 
>  void ProgrammableUpcallHandler::detach_thread(Thread* thread) {
> -  JavaVM_ *vm = (JavaVM *)(&main_vm);
> -  vm->functions->DetachCurrentThread(vm);
> +  //JavaVM_ *vm = (JavaVM *)(&main_vm);
> +  //vm->functions->DetachCurrentThread(vm);
>  }
> 
>  // modelled after JavaCallWrapper::JavaCallWrapper
> 
> 
> This basically sidesteps the issue by NOT detaching threads after an upcall is complete. In cases like FUSE, where the same pool of threads keeps hitting Java code, this is beneficial, as we will "remember" native threads across upcalls. On my system, this brings performance back to normal.
> 
> Of course this is not a "safe" fix, in the sense that it has the potential of leaking memory (e.g. if the native code creates a new thread every time and then calls the VM with it, these threads will never be GC-able). But it would be very useful to know if this patch solves the issue for you as well - can you please give that a try?
> 
> Thanks!
> Maurizio
> 
> On 16/07/2021 12:33, Sebastian Stenzel wrote:
>> Hey,
>> 
>> thanks for all the testing so far. And for finding the bug with the wrong offset in the dirfiller. There are actually two variations of how to use the filler, but explaining this would lead too far.
>> 
>> In the meantime, I've added a Linux implementation as well.
>> 
>> And I made an interesting observation: When running FUSE in single-threaded mode (add "-s" flag to mount options), there is no performance difference any longer:
>> 
>> ```
>> Benchmark                        Mode  Cnt   Score   Error  Units
>> BenchmarkTest.testListDirJnr     avgt    5  67,229 ± 1,968  us/op
>> BenchmarkTest.testListDirPanama  avgt    5  66,955 ± 1,979  us/op
>> ```
>> 
>> So there seems to be some overhead in attaching native threads to the JVM that JNR avoids.
>> 
>> I still have to look into your results regarding allocations and the weird OPEN_DIR behaviour. I don't have any explanation for that just yet.
>> 
>> Thanks again!
>> 
>> Sebastian
>> 
>> 
>>> On 16. Jul 2021, at 12:18, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>>> 
>>> Here are some observations, based on more testing I did overnight:
>>> 
>>> * the benchmark is relatively insensitive to intrinsifications being applied or not - a lot of time is spent in the kernel anyway
>>> * Panama has huge GC activity - a lot of garbage is generated, compared to JNR - a lot of int[] arrays, it seems
>>> * Tracking down where the allocation is coming from is proving problematic, as execution originates in asynchronous threads, so most profilers have issues with that
>>> * Removing support for the OPEN_DIR operation brings the allocation back under control, and the benchmark numbers back to normal (at least here)
>>> 
>>> The last bullet doesn't make a lot of sense, but that's what I'm seeing consistently. Note that it's not the _implementation_ of OPEN_DIR that is creating the garbage, or doing something strange - simply _registering_ an empty callback for OPEN_DIR will cause the GC issues and the slowdown.
>>> 
>>> I'll look more into this.
>>> 
>>> Maurizio
>>> 
>>> On 15/07/2021 22:09, Maurizio Cimadamore wrote:
>>>> Hah - hit "Send" too soon - of course the numbers below are almost useless, because the file system is set up in a separate JVM which the benchmark cannot see.
>>>> 
>>>> I'll try to change the benchmark to run the file system in the same process.
>>>> 
>>>> Maurizio
>>>> 
>>>> On 15/07/2021 22:05, Maurizio Cimadamore wrote:
>>>>> I managed to do an initial benchmark pass with JMH. The numbers don't look great:
>>>>> 
>>>>> ```
>>>>> Benchmark                        Mode  Cnt   Score    Error Units
>>>>> BenchmarkTest.testListDirJnr     avgt    5   9.858 ±  0.702 us/op
>>>>> BenchmarkTest.testListDirPanama  avgt    5  38.008 ± 17.573 us/op
>>>>> ```
>>>>> 
>>>>> On my machine the Panama fuse seems 4x slower than the JNR one (assuming I got the implementation correctly, that is :-)).
>>>>> 
>>>>> A quick look at GC reveals that Panama allocates 4x _less_ memory than JNR:
>>>>> 
>>>>> ```
>>>>> Benchmark                                            Mode  Cnt    Score      Error   Units
>>>>> BenchmarkTest.testListDirJnr:·gc.alloc.rate          avgt    5   33.255 ±    1.584  MB/sec
>>>>> BenchmarkTest.testListDirJnr:·gc.alloc.rate.norm     avgt    5  368.033 ±    0.047    B/op
>>>>> BenchmarkTest.testListDirJnr:·gc.count               avgt    5    6.000             counts
>>>>> BenchmarkTest.testListDirJnr:·gc.time                avgt    5    9.000                 ms
>>>>> BenchmarkTest.testListDirPanama:·gc.alloc.rate       avgt    5    8.709 ±    3.887  MB/sec
>>>>> BenchmarkTest.testListDirPanama:·gc.alloc.rate.norm  avgt    5  368.046 ±    0.236    B/op
>>>>> BenchmarkTest.testListDirPanama:·gc.count            avgt    5    2.000             counts
>>>>> BenchmarkTest.testListDirPanama:·gc.time             avgt    5    3.000                 ms
>>>>> ```
>>>>> 
>>>>> And, looking with perfasm, the distribution of the various methods looks similar, and actually not a lot of time is spent in Java at all:
>>>>> 
>>>>> JNR:
>>>>> 
>>>>> ```
>>>>> ...[Hottest Methods (after inlining)]..............................................................
>>>>>  90.86%              kernel  [unknown]
>>>>>   2.13%         c2, level 4  java.nio.file.Files::list, version 913
>>>>>   1.44%         c2, level 4 de.skymatic.fusepanama.jmh_generated.BenchmarkTest_testListDirJnr_jmhTest::testListDirJnr_avgt_jmhStub, version 934
>>>>>   0.81%    Unknown, level 0 sun.nio.fs.UnixNativeDispatcher::open0, version 857
>>>>>   0.70%        libc-2.31.so  __close_nocancel
>>>>>   0.48%        libc-2.31.so  malloc
>>>>>   0.39%        libc-2.31.so  _int_free
>>>>>   0.32%        libc-2.31.so  _int_malloc
>>>>>   0.29%    Unknown, level 0 sun.nio.fs.UnixNativeDispatcher::close0, version 879
>>>>>   0.22%        libc-2.31.so  __close
>>>>>   0.21%        libc-2.31.so  __GI___libc_open
>>>>>   0.19%    Unknown, level 0 sun.nio.fs.UnixNativeDispatcher::fdopendir, version 859
>>>>>   0.16%        libc-2.31.so  __libc_enable_asynccancel
>>>>>   0.16%        libc-2.31.so  __GI___dup
>>>>>   0.15%        libc-2.31.so  __fxstat64
>>>>>   0.14%    Unknown, level 0 sun.nio.fs.UnixNativeDispatcher::closedir, version 878
>>>>>   0.14%           libnio.so Java_sun_nio_fs_UnixNativeDispatcher_open0
>>>>>   0.13%        libc-2.31.so  __libc_disable_asynccancel
>>>>>   0.13%           libnio.so Java_sun_nio_fs_UnixNativeDispatcher_fdopendir
>>>>>   0.13%           libnio.so Java_sun_nio_fs_UnixNativeDispatcher_close0
>>>>>   0.84%  <...other 69 warm methods...>
>>>>> ```
>>>>> 
>>>>> Panama:
>>>>> 
>>>>> ```
>>>>> ....[Hottest Methods (after inlining)]..............................................................
>>>>>  89.07%              kernel  [unknown]
>>>>>   2.35%         c2, level 4  java.nio.file.Files::list, version 890
>>>>>   1.45%         c2, level 4 de.skymatic.fusepanama.jmh_generated.BenchmarkTest_testListDirPanama_jmhTest::testListDirPanama_avgt_jmhStub, version 917
>>>>>   1.21%    Unknown, level 0 sun.nio.fs.UnixNativeDispatcher::open0, version 849
>>>>>   0.62%        libc-2.31.so  malloc
>>>>>   0.60%        libc-2.31.so  __close_nocancel
>>>>>   0.47%        libc-2.31.so  _int_free
>>>>>   0.46%        libc-2.31.so  _int_malloc
>>>>>   0.41%    Unknown, level 0 sun.nio.fs.UnixNativeDispatcher::close0, version 869
>>>>>   0.32%        libc-2.31.so  __close
>>>>>   0.32%        libc-2.31.so  __GI___libc_open
>>>>>   0.25%    Unknown, level 0 sun.nio.fs.UnixNativeDispatcher::fdopendir, version 851
>>>>>   0.23%           libnio.so Java_sun_nio_fs_UnixNativeDispatcher_open0
>>>>>   0.20%    Unknown, level 0 sun.nio.fs.UnixNativeDispatcher::closedir, version 868
>>>>>   0.17%        libc-2.31.so  __GI___dup
>>>>>   0.15%        libc-2.31.so  __alloc_dir
>>>>>   0.15%           libnio.so Java_sun_nio_fs_UnixNativeDispatcher_fdopendir
>>>>>   0.15%        libc-2.31.so  __libc_disable_asynccancel
>>>>>   0.13%        libc-2.31.so  __libc_enable_asynccancel
>>>>>   0.12%           libnio.so Java_sun_nio_fs_UnixNativeDispatcher_close0
>>>>>   1.17%  <...other 81 warm methods...>
>>>>> ```
>>>>> 
>>>>> So, no smoking gun so far. I'll keep looking.
>>>>> 
>>>>> Maurizio
>>>>> 
>>>>> On 15/07/2021 19:14, Maurizio Cimadamore wrote:
>>>>>> Ok, I got it working - what fixed it for me was that the offsets in the "filler" calls have to be set to zero, which, looking around, seems to be a default value that leaves it to the kernel to take care of offsets.
>>>>>> 
>>>>>> ```
>>>>>> $ ls Volumes/foo
>>>>>> 
>>>>>> total 0
>>>>>> -r--r--r-- 1 root root  0 Jan  1  1970 aaa
>>>>>> -r--r--r-- 1 root root  0 Jan  1  1970 bbb
>>>>>> -r--r--r-- 1 root root  0 Jan  1  1970 ccc
>>>>>> -r--r--r-- 1 root root  0 Jan  1  1970 ddd
>>>>>> -r--r--r-- 1 root root 13 Jan  1  1970 hello.txt
>>>>>> -r--r--r-- 1 root root  0 Jan  1  1970 xxx
>>>>>> -r--r--r-- 1 root root  0 Jan  1  1970 yyy
>>>>>> -r--r--r-- 1 root root  0 Jan  1  1970 zzz
>>>>>> ```
>>>>>> 
>>>>>> Something is probably not right (see "total 0" on top), but perhaps it should be good enough for the benchmark. Applying the same fix to the JNR version also fixes that (and results in the same output). Hopefully I'm good to go now :-)
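>>>>>> 
>>>>>> In code, the zero-offset filler calls look roughly like this - a schematic sketch only (the parameter names, the entry list, and the `apply` call shape are illustrative, modelled on the fuse_fill_dir_t binding quoted later in this thread; the stat pointer is left NULL for brevity):
>>>>>> 
>>>>>> ```
>>>>>> // Schematic readdir body: passing 0 as the last (offset) argument of
>>>>>> // each filler call leaves directory offset handling to the kernel.
>>>>>> try (ResourceScope scope = ResourceScope.newConfinedScope()) {
>>>>>>     for (String name : List.of(".", "..", "hello.txt")) {
>>>>>>         MemorySegment cName = CLinker.toCString(name, scope);
>>>>>>         filler.apply(buf, cName.address(), MemoryAddress.NULL, 0L); // offset = 0
>>>>>>     }
>>>>>> }
>>>>>> ```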
>>>>>> 
>>>>>> Maurizio
>>>>>> 
>>>>>> On 15/07/2021 18:15, Maurizio Cimadamore wrote:
>>>>>>> I've re-extracted on Linux, and it seems more lively now. I had to fix a couple of type mismatches (e.g. long vs. int and int vs. short) in places, and also some of the fields in the "stat" structure are different, so the code won't compile as is.
>>>>>>> 
>>>>>>> After fixing these minor issues, I see a lot more output printed when I mount, and when I do an ls, I see the following lines reported:
>>>>>>> 
>>>>>>> [Thread-110] DEBUG de.skymatic.fusepanama.examples.HelloPanamaFileSystem - getattr() /
>>>>>>> [Thread-111] DEBUG de.skymatic.fusepanama.examples.HelloPanamaFileSystem - opendir() /
>>>>>>> [Thread-112] DEBUG de.skymatic.fusepanama.examples.HelloPanamaFileSystem - readdir() /
>>>>>>> [Thread-113] DEBUG de.skymatic.fusepanama.examples.HelloPanamaFileSystem - releasedir() /
>>>>>>> 
>>>>>>> But that's pretty much it, and in the terminal I still get the input/output error. I can even debug, but I can't see much of what's going wrong (and I'm not familiar with this API) - the Java code executes fine, for what it's worth.
>>>>>>> 
>>>>>>> Maurizio
>>>>>>> 
>>>>>>> On 15/07/2021 17:51, Sebastian Stenzel wrote:
>>>>>>>> I'll fix it for Linux and let you know!
>>>>>>>> 
>>>>>>>>> On 15. Jul 2021, at 18:04, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 15/07/2021 17:01, Sebastian Stenzel wrote:
>>>>>>>>>> Yeah well they're pretty much mac-specific. On macOS, FUSE has this magic behaviour where you can tell it to mount to a non-existing mountpoint inside of `/Volumes/...` and it'll just create these (and destroy them on unmount). I believe on Linux you need to define an _existing_ mount point. But it is surely possible that the volume isn't working yet on Linux. I'll give it a try myself and
>>>>>>>>> I created a folder Volumes under my home folder and I'm trying to use that as a mount point, which seems to work ok with JNR.
>>>>>>>>>> fix it, if required.
>>>>>>>>>> 
>>>>>>>>>> The benchmark then needs to be adjusted for the two mountpoints respectively. But before the benchmark can actually do anything, a plain `ls` on the terminal needs to work.
>>>>>>>>> Ok, it seems even JNR fails:
>>>>>>>>> 
>>>>>>>>> ```
>>>>>>>>> $ ls Volumes/bar/
>>>>>>>>> ls: reading directory 'Volumes/bar/': Input/output error
>>>>>>>>> ```
>>>>>>>>> 
>>>>>>>>>>> On 15. Jul 2021, at 17:55, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> I tried to reproduce here on Linux, but with no luck - in the sense that I'm not super sure how to run the benchmark.
>>>>>>>>>>> 
>>>>>>>>>>> I'm somehow able to run the two examples - and I noted that here JNR works ok, while the Panama one doesn't seem to mount things correctly - a new mount appears in my file explorer, but I'm unable to do anything with it (even unmount - which can only be done at sudo level).
>>>>>>>>>>> 
>>>>>>>>>>> When working with the JNR support, the mount works fine, it shows in the file explorer, I can click on that location and browse, and then unmount from there. Everything works.
>>>>>>>>>>> 
>>>>>>>>>>> That said, the benchmarks require the mount points to be up and running - so I've tried first to execute the example (e.g. JNR) and then run the benchmark in two separate terminal windows, all via Maven - but the benchmark doesn't seem to do anything (I've uncommented the benchmarks of course).
>>>>>>>>>>> 
>>>>>>>>>>> How do you run them?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks
>>>>>>>>>>> Maurizio
>>>>>>>>>>> 
>>>>>>>>>>> On 15/07/2021 13:40, Maurizio Cimadamore wrote:
>>>>>>>>>>>> I believe that it would be more useful to try to run the perfasm profiler with JMH.
>>>>>>>>>>>> 
>>>>>>>>>>>> This can be done relatively easily, at least on Linux, if you pass the argument `-prof perfasm` to JMH. (This would need hsdis-amd64.so on Linux to print readable assembly.)
>>>>>>>>>>>> 
>>>>>>>>>>>> Another thing worth checking is allocation rate: `-prof gc`.
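>>>>>>>>>>>> 
>>>>>>>>>>>> For reference, the same profilers can also be attached programmatically when launching JMH from a main() method - a minimal sketch (the class name and include pattern here are made up; GCProfiler and LinuxPerfAsmProfiler are the JMH profilers behind `-prof gc` and `-prof perfasm`):
>>>>>>>>>>>> 
>>>>>>>>>>>> ```
>>>>>>>>>>>> import org.openjdk.jmh.profile.GCProfiler;
>>>>>>>>>>>> import org.openjdk.jmh.profile.LinuxPerfAsmProfiler;
>>>>>>>>>>>> import org.openjdk.jmh.runner.Runner;
>>>>>>>>>>>> import org.openjdk.jmh.runner.RunnerException;
>>>>>>>>>>>> import org.openjdk.jmh.runner.options.Options;
>>>>>>>>>>>> import org.openjdk.jmh.runner.options.OptionsBuilder;
>>>>>>>>>>>> 
>>>>>>>>>>>> public class ProfiledRun {
>>>>>>>>>>>>     public static void main(String[] args) throws RunnerException {
>>>>>>>>>>>>         Options opts = new OptionsBuilder()
>>>>>>>>>>>>                 .include("BenchmarkTest")                 // hypothetical include pattern
>>>>>>>>>>>>                 .addProfiler(GCProfiler.class)            // equivalent to -prof gc
>>>>>>>>>>>>                 .addProfiler(LinuxPerfAsmProfiler.class)  // equivalent to -prof perfasm (Linux only)
>>>>>>>>>>>>                 .build();
>>>>>>>>>>>>         new Runner(opts).run();
>>>>>>>>>>>>     }
>>>>>>>>>>>> }
>>>>>>>>>>>> ```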
>>>>>>>>>>>> 
>>>>>>>>>>>> Maurizio
>>>>>>>>>>>> 
>>>>>>>>>>>> On 15/07/2021 12:30, Sebastian Stenzel wrote:
>>>>>>>>>>>>> Ok, it really seems like VisualVM can't deal with these kinds of tasks yet. Now it reports the String constructor as being the culprit [1]; however, I strongly doubt that, since this is probably one of the most heavily optimized parts of the JDK.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> [1]: Screenshot at https://imgur.com/a/SHG8RSQ
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 15. Jul 2021, at 13:11, Sebastian Stenzel <sebastian.stenzel at gmail.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Yes, must be a sampling error. Do you know of a (publicly available) _profiler_ that is compatible with JDK 17 / 18 already?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 15. Jul 2021, at 13:08, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Aha - it seems like you are seeing what I was seeing: unrolling now seems to happen more reliably, which positively affects code like strlen.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> As for FUSE, I think the reason for the difference probably has nothing to do with string conversion - the sampling profiler just happens to hit that code a lot. I checked the JNR code for string conversion and I couldn't really find anything uber-optimized in that regard that could explain the gap.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Probably something is not getting optimized as it should - likely a downcall/upcall intrinsification is failing - maybe due to a subtle issue with your code, or possibly because you are hitting a non-implemented case (e.g. we do not intrinsify calls which pass arguments on the stack, yet), or because of some other bug.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 15/07/2021 12:03, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>> Wow, I stand corrected. I just re-ran the benchmark and `benchmarkStrlenBase` just got a lot faster!! Your change in https://github.com/openjdk/jdk17/commit/2db9005c07585b580b3ec0889b8b5e3ed0d0ca6a DID have an effect after all.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Just doesn't impress FUSE very much...
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On 15. Jul 2021, at 13:00, Sebastian Stenzel <sebastian.stenzel at gmail.com> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Yup, I tried the int-approach as well, but with worse results... Here is the full test: https://gist.github.com/overheadhunter/86e7baae7dfe47c49ff364590a4f3ea6
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On 15. Jul 2021, at 12:51, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Ok. Thanks.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I tried similar experiments where, instead of reading 4 bytes separately, I'd read a single int value and then use shifts and bitmasking to check for terminators. On paper this looks good, but the benchmark results were always worse than the version we have now (at least on Linux).
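>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> For reference, a minimal sketch of such an int-at-a-time variant (my reconstruction, not the code from those experiments; it assumes the plain MemoryAccess getters reading in native, little-endian byte order, like the unrolled snippet further down):
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>> private static int strlenSwar(MemorySegment segment, long start) {
>>>>>>>>>>>>>>>>>>     long size = segment.byteSize() - start;
>>>>>>>>>>>>>>>>>>     long offset = 0;
>>>>>>>>>>>>>>>>>>     for (; offset + 4 <= size; offset += 4) {
>>>>>>>>>>>>>>>>>>         int v = MemoryAccess.getIntAtOffset(segment, start + offset);
>>>>>>>>>>>>>>>>>>         // SWAR trick: sets the high bit of every byte of v that is zero.
>>>>>>>>>>>>>>>>>>         int zeros = (v - 0x01010101) & ~v & 0x80808080;
>>>>>>>>>>>>>>>>>>         if (zeros != 0) {
>>>>>>>>>>>>>>>>>>             // On little-endian, the lowest matching byte is the terminator.
>>>>>>>>>>>>>>>>>>             return (int) offset + (Integer.numberOfTrailingZeros(zeros) >>> 3);
>>>>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>>>>     for (; offset < size; offset++) { // remaining 0-3 bytes
>>>>>>>>>>>>>>>>>>         if (MemoryAccess.getByteAtOffset(segment, start + offset) == 0) {
>>>>>>>>>>>>>>>>>>             return (int) offset;
>>>>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>>>>     throw new IllegalArgumentException("String too large");
>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>> ```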
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> That said, if you could please share the full string benchmark you have, that'd be helpful, so we can take a look at that, and see what's going wrong (ideally, C2 should be the one doing unrolling).
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On 15/07/2021 11:28, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>>>> I just did a quick synthetic test on a "manually unrolled" strlen() without any FUSE context.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I experimented with an implementation that looked like the following and benchmarked it using a 259 byte memory segment containing a 239 byte string (null byte at index 240):
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>> private static int strlenUnroll4(MemorySegment segment, long start) {
>>>>>>>>>>>>>>>>>>>     int offset;
>>>>>>>>>>>>>>>>>>>     for (offset = 0; offset < segment.byteSize() - 3; offset += 4) {
>>>>>>>>>>>>>>>>>>>         byte b0 = MemoryAccess.getByteAtOffset(segment, start + offset + 0);
>>>>>>>>>>>>>>>>>>>         byte b1 = MemoryAccess.getByteAtOffset(segment, start + offset + 1);
>>>>>>>>>>>>>>>>>>>         byte b2 = MemoryAccess.getByteAtOffset(segment, start + offset + 2);
>>>>>>>>>>>>>>>>>>>         byte b3 = MemoryAccess.getByteAtOffset(segment, start + offset + 3);
>>>>>>>>>>>>>>>>>>>         if (b0 == 0 || b1 == 0 || b2 == 0 || b3 == 0) { // is this even faster than directly having 4 different branches?
>>>>>>>>>>>>>>>>>>>             if (b0 == 0) {
>>>>>>>>>>>>>>>>>>>                 return offset;
>>>>>>>>>>>>>>>>>>>             } else if (b1 == 0) {
>>>>>>>>>>>>>>>>>>>                 return offset + 1;
>>>>>>>>>>>>>>>>>>>             } else if (b2 == 0) {
>>>>>>>>>>>>>>>>>>>                 return offset + 2;
>>>>>>>>>>>>>>>>>>>             } else { // b3 must be 0 here
>>>>>>>>>>>>>>>>>>>                 return offset + 3;
>>>>>>>>>>>>>>>>>>>             }
>>>>>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>>>>>     for (; offset < segment.byteSize(); offset++) { // TODO: maybe no loop required for the remaining <4 bytes?
>>>>>>>>>>>>>>>>>>>         byte b = MemoryAccess.getByteAtOffset(segment, start + offset);
>>>>>>>>>>>>>>>>>>>         if (b == 0) {
>>>>>>>>>>>>>>>>>>>             return offset;
>>>>>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>>>>>     throw new IllegalArgumentException("String too large");
>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I'm not even sure how reliable my results are, since I have no clue about how branch prediction works here... Neither have I tested the correctness of this implementation.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On 15. Jul 2021, at 12:18, Maurizio Cimadamore <maurizio.cimadamore at oracle.com <mailto:maurizio.cimadamore at oracle.com> <mailto:maurizio.cimadamore at oracle.com <mailto:maurizio.cimadamore at oracle.com>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks for reporting back.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> We probably need to investigate this a bit more deeply and try and reproduce on our side.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> One last question: you said that with manual unrolling you managed to get 2x faster: did you mean that string conversion got 2x faster or that you actually saw your FUSE benchmark going 2x faster because of the manual unrolling with strings?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On 15/07/2021 11:03, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>>>>>> That, surprisingly, didn't change anything either. But don't worry too much, the performance isn't bad (in absolute figures), and it is by far not the only reason why I consider Panama the best solution for creating Java bindings for C libs.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On 12. Jul 2021, at 15:33, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Actually, after some bisecting, I found out that the performance of converting a memory segment into a string jumped 2x with this fix:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> https://github.com/openjdk/jdk17/commit/2db9005c07585b580b3ec0889b8b5e3ed0d0ca6a
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Which was integrated after the one I originally pointed at. They both seem to touch loop optimization in the case of overflows, which the strlen code is triggering (since the loop limit check tests whether the loop variable is positive).
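>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> To illustrate the loop shape in question, the strlen loop is roughly of this form (a paraphrase, not the exact JDK source):
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>> static int strlen(MemorySegment segment) {
>>>>>>>>>>>>>>>>>>>>>>     // Exit 1: the limit check keeps the loop variable positive,
>>>>>>>>>>>>>>>>>>>>>>     // i.e. it guards against int overflow.
>>>>>>>>>>>>>>>>>>>>>>     for (int offset = 0; offset >= 0; offset++) {
>>>>>>>>>>>>>>>>>>>>>>         byte curr = MemoryAccess.getByteAtOffset(segment, offset);
>>>>>>>>>>>>>>>>>>>>>>         if (curr == 0) {
>>>>>>>>>>>>>>>>>>>>>>             return offset; // Exit 2: found the terminator.
>>>>>>>>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>>>>>>>>     throw new IllegalArgumentException("String too large");
>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>> ```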
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> This is a simple patch which adds a string conversion test:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>> diff --git a/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java b/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>>>>>>>>>>>>> index ec4da5ffc88..5b3fb1a2b2a 100644
>>>>>>>>>>>>>>>>>>>>>> --- a/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>>>>>>>>>>>>> +++ b/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>>>>>>>>>>>>> @@ -93,10 +93,13 @@ public class StrLenTest {
>>>>>>>>>>>>>>>>>>>>>> FunctionDescriptor.ofVoid(C_POINTER).withAttribute(FunctionDescriptor.TRIVIAL_ATTRIBUTE_NAME, true));
>>>>>>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> +    MemorySegment segment;
>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>      @Setup
>>>>>>>>>>>>>>>>>>>>>>      public void setup() {
>>>>>>>>>>>>>>>>>>>>>>          str = makeString(size);
>>>>>>>>>>>>>>>>>>>>>>          segmentAllocator = SegmentAllocator.ofSegment(MemorySegment.allocateNative(size + 1, ResourceScope.newImplicitScope()));
>>>>>>>>>>>>>>>>>>>>>> +        segment = toCString(str, segmentAllocator);
>>>>>>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>      @TearDown
>>>>>>>>>>>>>>>>>>>>>> @@ -104,6 +107,11 @@ public class StrLenTest {
>>>>>>>>>>>>>>>>>>>>>>          scope.close();
>>>>>>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> +    @Benchmark
>>>>>>>>>>>>>>>>>>>>>> +    public String panama_str_conv() throws Throwable {
>>>>>>>>>>>>>>>>>>>>>> +        return CLinker.toJavaString(segment);
>>>>>>>>>>>>>>>>>>>>>> +    }
>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>      @Benchmark
>>>>>>>>>>>>>>>>>>>>>>      public int jni_strlen() throws Throwable {
>>>>>>>>>>>>>>>>>>>>>>          return strlen(str);
>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Before the above fix, the numbers are as follows:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>> Benchmark                   (size)  Mode  Cnt    Score   Error  Units
>>>>>>>>>>>>>>>>>>>>>> StrLenTest.panama_str_conv     100  avgt   30  106.613 ± 7.060  ns/op
>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> While after the fix I get this:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>> Benchmark                   (size)  Mode  Cnt   Score   Error  Units
>>>>>>>>>>>>>>>>>>>>>> StrLenTest.panama_str_conv     100  avgt   30  48.120 ± 0.557  ns/op
>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> So, as you can see, a pretty sizeable jump. Eyeballing it, the shape of the generated code doesn't look too different, which makes me think of another case where the loop is unrolled but the main loop is never executed (similar to JDK-8269230) - but we'll need to look deeper.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On 12/07/2021 14:12, Maurizio Cimadamore wrote:
>>>>>>>>>>>>>>>>>>>>>>> On 12/07/2021 13:18, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>>>>>>>>> Hey Maurizio,
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> All tests have been done on commit 42e03fd7c6a (for details on how I built the JDK, see my initial email). Maybe I'm missing some compiler flags to enable all optimizations?
>>>>>>>>>>>>>>>>>>>>>>> I see - you do have the latest panama changes, but there has been a sync with upstream after that changeset, I believe - can you please try to resync with the latest foreign-jextract commit - which should be:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> https://github.com/openjdk/panama-foreign/commit/b2a284f6678c6a0d78fbdc8655695119ccb0dadb
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Are you sure about loop vectorization being applied to strlen? I'm not an expert in this field, but I had the impression this wasn't possible when the loop terminates "from within".
>>>>>>>>>>>>>>>>>>>>>>> Vlad is the expert here - when chatting offline he did mention that the loop should have a single exit - which I guess also takes into account the "normal" exit - so the strlen routine would seem to have two exits...
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On 12. Jul 2021, at 13:50, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Hi Sebastian,
>>>>>>>>>>>>>>>>>>>>>>>>> thanks for sharing your findings - I've made some attempts here with a targeted microbenchmark which measures the performance of string conversion, and I'm seeing unrolling and vectorization being applied to the strlen computation.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> May I ask if, by any chance, your HEAD has not been updated in the last few weeks? There has been a C2 optimization fix which has been added recently, which I think might be related to this:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8269230
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Do you have this fix in the JDK you are using?
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On 10/07/2021 15:58, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> good idea, but it makes no difference beyond statistical error.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> I started sampling the application with VisualVM (which is quite hard, since native threads are extremely short-lived). What I noticed is that, regardless of where the sampler interrupts a thread, in nearly all cases 100% of CPU time is caused by jdk.internal.foreign.abi.SharedUtils.toJavaStringInternal() → jdk.internal.foreign.abi.SharedUtils.strlen().
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> I know that strlen can hardly be optimized due to the nature of null termination, but maybe we can make use of the fact that we're dealing with MemorySegments here: since they protect us from overflows, maybe there is no need to look at only a single byte at a time. Maybe the strlen()-loop can be unrolled or even vectorized.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> I just did a quick test and observed a 2x speedup when doing a 4x loop unroll.
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> On 9. Jul 2021, at 20:30, Jorn Vernee <jorn.vernee at oracle.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi Sebastian,
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks for testing this. Looking at your code, one possible explanation for the discrepancy I can think of is that the DirFiller ends up using virtual downcalls to do its work, which are currently not intrinsified. This is mostly a case of 'not implemented yet', i.e. it is a known issue.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>>>>>      static fuse_fill_dir_t ofAddress(MemoryAddress addr) {
>>>>>>>>>>>>>>>>>>>>>>>>>>>          return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>>>>>>>>>>>>>              try {
>>>>>>>>>>>>>>>>>>>>>>>>>>>                  return (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, x0, x1, x2, x3); // <--------- 'addr' here is not a constant, so the call is virtual
>>>>>>>>>>>>>>>>>>>>>>>>>>>              } catch (Throwable ex$) {
>>>>>>>>>>>>>>>>>>>>>>>>>>>                  throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>>>>>>>>>>>>>              }
>>>>>>>>>>>>>>>>>>>>>>>>>>>          };
>>>>>>>>>>>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> For testing purposes, a possible workaround could be to have a cache that maps the callback address to a method handle that has the address bound to the first parameter. Assuming readdir always gets the same filler callback address, the same MethodHandle will be reused and eventually customized, which means the callback address will become constant, and the downcall should then be intrinsified.
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> I don't currently have access to a Mac machine to test this, but if you want to try it out, the patch should be this:
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>>>>> diff --git a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>>>>>>>>>>>>> index bfd4655..4c68d4c 100644
>>>>>>>>>>>>>>>>>>>>>>>>>>> --- a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>>>>>>>>>>>>> +++ b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>>>>>>>>>>>>> @@ -3,8 +3,12 @@
>>>>>>>>>>>>>>>>>>>>>>>>>>>   package de.skymatic.fusepanama.lowlevel;
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>   import java.lang.invoke.MethodHandle;
>>>>>>>>>>>>>>>>>>>>>>>>>>> +import java.lang.invoke.MethodHandles;
>>>>>>>>>>>>>>>>>>>>>>>>>>>   import java.lang.invoke.VarHandle;
>>>>>>>>>>>>>>>>>>>>>>>>>>>   import java.nio.ByteOrder;
>>>>>>>>>>>>>>>>>>>>>>>>>>> +import java.util.Map;
>>>>>>>>>>>>>>>>>>>>>>>>>>> +import java.util.concurrent.ConcurrentHashMap;
>>>>>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>>>>>>   import jdk.incubator.foreign.*;
>>>>>>>>>>>>>>>>>>>>>>>>>>>   import static jdk.incubator.foreign.CLinker.*;
>>>>>>>>>>>>>>>>>>>>>>>>>>>   public interface fuse_fill_dir_t {
>>>>>>>>>>>>>>>>>>>>>>>>>>> @@ -17,13 +21,19 @@ public interface fuse_fill_dir_t {
>>>>>>>>>>>>>>>>>>>>>>>>>>>           return RuntimeHelper.upcallStub(fuse_fill_dir_t.class, fi, constants$0.fuse_fill_dir_t$FUNC, "(Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;J)I", scope);
>>>>>>>>>>>>>>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>>>>>>>>>>>>>>       static fuse_fill_dir_t ofAddress(MemoryAddress addr) {
>>>>>>>>>>>>>>>>>>>>>>>>>>> -        return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>>>>>>>>>>>>> -            try {
>>>>>>>>>>>>>>>>>>>>>>>>>>> -                return (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, x0, x1, x2, x3);
>>>>>>>>>>>>>>>>>>>>>>>>>>> -            } catch (Throwable ex$) {
>>>>>>>>>>>>>>>>>>>>>>>>>>> -                throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>>>>>>>>>>>>> -            }
>>>>>>>>>>>>>>>>>>>>>>>>>>> -        };
>>>>>>>>>>>>>>>>>>>>>>>>>>> +        class CacheHolder {
>>>>>>>>>>>>>>>>>>>>>>>>>>> +            static final Map<MemoryAddress, fuse_fill_dir_t> CACHE = new ConcurrentHashMap<>();
>>>>>>>>>>>>>>>>>>>>>>>>>>> +        }
>>>>>>>>>>>>>>>>>>>>>>>>>>> +        return CacheHolder.CACHE.computeIfAbsent(addr, addrK -> {
>>>>>>>>>>>>>>>>>>>>>>>>>>> +            final MethodHandle target = MethodHandles.insertArguments(constants$0.fuse_fill_dir_t$MH, 0, addrK);
>>>>>>>>>>>>>>>>>>>>>>>>>>> +            return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>>>>>>>>>>>>> +                try {
>>>>>>>>>>>>>>>>>>>>>>>>>>> +                    return (int)target.invokeExact(x0, x1, x2, x3);
>>>>>>>>>>>>>>>>>>>>>>>>>>> +                } catch (Throwable ex$) {
>>>>>>>>>>>>>>>>>>>>>>>>>>> +                    throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>>>>>>>>>>>>> +                }
>>>>>>>>>>>>>>>>>>>>>>>>>>> +            };
>>>>>>>>>>>>>>>>>>>>>>>>>>> +        });
>>>>>>>>>>>>>>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>>>>> (I hope these code blocks don't get mangled too much by line wrapping)
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> HTH,
>>>>>>>>>>>>>>>>>>>>>>>>>>> Jorn
>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>> On 09/07/2021 10:58, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I wanted to share the results of a benchmark test that includes several down- and upcalls. First, let me explain what I'm testing here:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I'm working on a Panama-based FUSE binding, mostly for experimental purposes right now, and I'm trying to beat fuse-jnr [1].
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> While there are some other interesting metrics, such as read/write performance (both sequential and random access), I focused on directory listings for now. Directory listings are the most complex operation with regard to the number of down- and upcalls:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 1. FUSE upcalls readdir and provides a callback function
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 2. Java downcalls the callback for each item in the directory
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 3. FUSE upcalls getattr for each item (no longer required with "readdirplus" in FUSE 3.x)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> (4. I'm testing on macOS, which introduces additional noise (such as readxattr and trying to access files that I didn't report in readdir))
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> So, what I'm testing is essentially this: `Files.list(Path.of("/Volumes/foo")).close();` with the volume reporting eight files [2]. When mounting with debug logs enabled, I can see that exactly the same operations are invoked, in the same order, on both fuse-jnr and fuse-panama. A single dir listing results in 2 readdir upcalls, 10 callback downcalls, and 16 getattr upcalls. There are also 8 getxattr calls and 16 lookup calls; however, they don't reach Java, as the FUSE kernel knows they are not implemented.
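>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> The benchmark method is essentially that one line wrapped in JMH - a trimmed-down sketch (class name and annotations here are illustrative, not the actual BenchmarkTest from the repo):
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>>>>>> import java.io.IOException;
>>>>>>>>>>>>>>>>>>>>>>>>>>>> import java.nio.file.Files;
>>>>>>>>>>>>>>>>>>>>>>>>>>>> import java.nio.file.Path;
>>>>>>>>>>>>>>>>>>>>>>>>>>>> import java.util.concurrent.TimeUnit;
>>>>>>>>>>>>>>>>>>>>>>>>>>>> import org.openjdk.jmh.annotations.*;
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> @BenchmarkMode(Mode.AverageTime)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> @OutputTimeUnit(TimeUnit.MICROSECONDS)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> @State(Scope.Benchmark)
>>>>>>>>>>>>>>>>>>>>>>>>>>>> public class ListDirBench {
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     // Assumes the example file system is already mounted here.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     private static final Path MOUNT_POINT = Path.of("/Volumes/foo");
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     @Benchmark
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     public void testListDir() throws IOException {
>>>>>>>>>>>>>>>>>>>>>>>>>>>>         // One listing triggers the readdir/getattr sequence above.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>         Files.list(MOUNT_POINT).close();
>>>>>>>>>>>>>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>>>>>>> ```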
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Long story short, here are the results:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Benchmark                        Mode  Cnt    Score Error  Units
>>>>>>>>>>>>>>>>>>>>>>>>>>>> BenchmarkTest.testListDirJnr     avgt    5   66,569 ± 3,128  us/op
>>>>>>>>>>>>>>>>>>>>>>>>>>>> BenchmarkTest.testListDirPanama  avgt    5  189,340 ± 4,275  us/op
>>>>>>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I've been using panama snapshot at commit 42e03fd7c6a built with: `configure --with-boot-jdk=/Library/Java/JavaVirtualMachines/adoptopenjdk-16.jdk/Contents/Home/ --with-native-debug-symbols=none --with-debug-level=release --with-libclang=/usr/local/opt/llvm --with-libclang-version=12`
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I can't tell where this overhead comes from. Maybe creating a newConfinedScope() during each upcall [3] is "too much"? Maybe JNR is just negligently skipping some memory boundary checks to be faster. The results are not terrible, but I'd hoped for something better.
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> [1] https://github.com/SerCeMan/jnr-fuse
>>>>>>>>>>>>>>>>>>>>>>>>>>>> [2] https://github.com/skymatic/fuse-panama/blob/develop/src/test/java/de/skymatic/fusepanama/examples/HelloPanamaFileSystem.java#L139-L146
>>>>>>>>>>>>>>>>>>>>>>>>>>>> [3] https://github.com/skymatic/fuse-panama/blob/769347575863861063a2347a42b2cbaadb5eacef/src/main/java/de/skymatic/fusepanama/FuseOperations.java#L67-L71


