Real-Life Benchmark for FUSE's readdir()

Thu Jul 15 11:08:57 UTC 2021

Aha - it seems like you are seeing what I was seeing: unrolling now 
seems to happen more reliably, which positively affects code like strlen.

As for FUSE, I think the reason for the difference probably has nothing 
to do with string conversion - the sampling profiler just happens to hit 
that code a lot. I checked the JNR code for string conversion and I couldn't 
really find anything uber optimized in that regard that could explain 
the gap.

Probably something is not getting optimized as it should - likely a 
downcall/upcall intrinsification is failing - maybe due to a subtle 
issue with your code, or, possibly because you are hitting a 
non-implemented case (e.g. we do not intrinsify calls which pass 
arguments on the stack, yet), or because of some other bug.

Maurizio

On 15/07/2021 12:03, Sebastian Stenzel wrote:
> Wow, I stand corrected. I just re-ran the benchmark and 
> `benchmarkStrlenBase` just got a lot faster!! Your change in 
> https://github.com/openjdk/jdk17/commit/2db9005c07585b580b3ec0889b8b5e3ed0d0ca6a 
> DID have an effect after all.
>
> Just doesn't impress FUSE very much...
>
>> On 15. Jul 2021, at 13:00, Sebastian Stenzel 
>> <sebastian.stenzel at gmail.com> wrote:
>>
>> Yup, I tried the int-approach as well, but with worse results... Here 
>> is the full test: 
>> https://gist.github.com/overheadhunter/86e7baae7dfe47c49ff364590a4f3ea6 
>>
>>> On 15. Jul 2021, at 12:51, Maurizio Cimadamore 
>>> <maurizio.cimadamore at oracle.com> wrote:
>>>
>>> Ok. Thanks.
>>>
>>> I tried similar experiments where instead of reading 4 bytes 
>>> separately I'd read a single int value, and then use shifts and 
>>> bitmasking to check for terminators. On paper good, but benchmark 
>>> results were always worse than the version we have now (at least on 
>>> Linux).
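For readers who have not seen the int-at-a-time idea before, here is a minimal sketch of it. It is not the JDK implementation and not the code that was benchmarked: it assumes a plain ByteBuffer instead of a MemorySegment so that it runs on any JDK, and the class and method names (StrlenInt, hasZeroByte) are made up for illustration. The zero-byte test is the well-known subtract-and-mask bit trick.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class StrlenInt {
    // True if at least one of the four bytes packed into v is zero.
    static boolean hasZeroByte(int v) {
        return ((v - 0x01010101) & ~v & 0x80808080) != 0;
    }

    // Length of the NUL-terminated string that starts at index 0 of buf.
    static int strlen(ByteBuffer buf) {
        // Little-endian view: the lowest byte of each int is the lowest address.
        ByteBuffer le = buf.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        int limit = le.limit();
        int offset = 0;
        for (; offset + 4 <= limit; offset += 4) {
            int word = le.getInt(offset);
            if (hasZeroByte(word)) {
                // Locate the first zero byte inside the word using shifts and masks.
                for (int i = 0; i < 4; i++) {
                    if (((word >>> (8 * i)) & 0xFF) == 0) {
                        return offset + i;
                    }
                }
            }
        }
        // Fewer than four bytes left: check them one at a time.
        for (; offset < limit; offset++) {
            if (le.get(offset) == 0) {
                return offset;
            }
        }
        throw new IllegalArgumentException("String too large");
    }

    public static void main(String[] args) {
        byte[] data = new byte[16];
        byte[] text = "hello".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(text, 0, data, 0, text.length);
        System.out.println(strlen(ByteBuffer.wrap(data))); // prints 5
    }
}
```

Whether this beats the byte-wise loop depends on what C2 makes of it, which is exactly the point of the measurements above, so treat it as an illustration of the technique rather than a recommendation.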
>>>
>>> That said, if you could please share the full string benchmark you 
>>> have, that'd be helpful, so we can take a look at that, and see 
>>> what's going wrong (ideally, C2 should be the one doing unrolling).
>>>
>>> Maurizio
>>>
>>> On 15/07/2021 11:28, Sebastian Stenzel wrote:
>>>> I just did a quick synthetic test on a "manually unrolled" strlen() 
>>>> without any FUSE context.
>>>>
>>>> I experimented with an implementation that looked like the 
>>>> following and benchmarked it using a 259 byte memory segment 
>>>> containing a 239 byte string (null byte at index 240):
>>>>
>>>> ```
>>>> private static int strlenUnroll4(MemorySegment segment, long start) {
>>>>     int offset;
>>>>     // Main loop: examine four bytes per iteration.
>>>>     for (offset = 0; offset < segment.byteSize() - 3; offset += 4) {
>>>>         byte b0 = MemoryAccess.getByteAtOffset(segment, start + offset + 0);
>>>>         byte b1 = MemoryAccess.getByteAtOffset(segment, start + offset + 1);
>>>>         byte b2 = MemoryAccess.getByteAtOffset(segment, start + offset + 2);
>>>>         byte b3 = MemoryAccess.getByteAtOffset(segment, start + offset + 3);
>>>>         // is this even faster than directly having 4 different branches?
>>>>         if (b0 == 0 || b1 == 0 || b2 == 0 || b3 == 0) {
>>>>             if (b0 == 0) {
>>>>                 return offset;
>>>>             } else if (b1 == 0) {
>>>>                 return offset + 1;
>>>>             } else if (b2 == 0) {
>>>>                 return offset + 2;
>>>>             } else if (b3 == 0) {
>>>>                 return offset + 3;
>>>>             }
>>>>         }
>>>>     }
>>>>     // TODO: maybe no loop required for the remaining <4 bytes?
>>>>     while (offset < segment.byteSize()) {
>>>>         byte b = MemoryAccess.getByteAtOffset(segment, start + offset);
>>>>         if (b == 0) {
>>>>             return offset;
>>>>         }
>>>>         offset++; // advance, otherwise this loop never terminates
>>>>     }
>>>>     throw new IllegalArgumentException("String too large");
>>>> }
>>>> ```
>>>>
>>>> I'm not even sure how reliable my results are, since I have no clue 
>>>> about how branch prediction works here... Neither have I tested the 
>>>> correctness of this implementation.
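On the correctness question, one cheap way to gain some confidence is to compare the unrolled variant against a trivial byte-wise loop for every possible terminator position. The sketch below does that on a plain byte[] rather than a MemorySegment, so it is an approximation of the code above and not the same code. The class name and the helper strlenSimple are made up, and unlike the version above it bounds the loops by the bytes remaining after `start`.

```java
import java.util.Arrays;

final class StrlenUnrollCheck {
    // Reference implementation: one byte per iteration.
    static int strlenSimple(byte[] data, int start) {
        for (int offset = 0; start + offset < data.length; offset++) {
            if (data[start + offset] == 0) {
                return offset;
            }
        }
        throw new IllegalArgumentException("String too large");
    }

    // Four bytes per iteration, followed by a byte-wise tail.
    static int strlenUnroll4(byte[] data, int start) {
        int remaining = data.length - start;
        int offset = 0;
        for (; offset + 4 <= remaining; offset += 4) {
            if (data[start + offset] == 0) return offset;
            if (data[start + offset + 1] == 0) return offset + 1;
            if (data[start + offset + 2] == 0) return offset + 2;
            if (data[start + offset + 3] == 0) return offset + 3;
        }
        for (; offset < remaining; offset++) {
            if (data[start + offset] == 0) return offset;
        }
        throw new IllegalArgumentException("String too large");
    }

    public static void main(String[] args) {
        // Try every terminator position in a 259-byte buffer.
        for (int len = 0; len < 259; len++) {
            byte[] data = new byte[259];
            Arrays.fill(data, (byte) 'a');
            data[len] = 0;
            int expected = strlenSimple(data, 0);
            if (expected != len || strlenUnroll4(data, 0) != expected) {
                throw new AssertionError("mismatch at length " + len);
            }
        }
        System.out.println("all lengths agree");
    }
}
```

This says nothing about performance; it only checks that the unrolled loop and its tail return the same index as the simple loop.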
>>>>
>>>>
>>>>> On 15. Jul 2021, at 12:18, Maurizio Cimadamore 
>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>
>>>>> Thanks for reporting back.
>>>>>
>>>>> We probably need to investigate this a bit more deeply and try and 
>>>>> reproduce on our side.
>>>>>
>>>>> One last question: you said that with manual unrolling you managed 
>>>>> to get 2x faster: did you mean that string conversion got 2x 
>>>>> faster or that you actually saw your FUSE benchmark going 2x 
>>>>> faster because of the manual unrolling with strings?
>>>>>
>>>>> Maurizio
>>>>>
>>>>> On 15/07/2021 11:03, Sebastian Stenzel wrote:
>>>>>> That, surprisingly, didn't change anything either. But don't 
>>>>>> worry too much, the performance isn't bad (in absolute figures) 
>>>>>> and it is by far not the only reason why I consider panama the 
>>>>>> best solution to create java bindings for c libs.
>>>>>>
>>>>>>> On 12. Jul 2021, at 15:33, Maurizio Cimadamore 
>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>
>>>>>>> Actually, after some bisecting, I found out that the performance 
>>>>>>> of converting a memory segment into a string jumped 2x faster 
>>>>>>> with this fix:
>>>>>>>
>>>>>>> https://github.com/openjdk/jdk17/commit/2db9005c07585b580b3ec0889b8b5e3ed0d0ca6a
>>>>>>>
>>>>>>> Which was integrated after the one I originally pointed at. They 
>>>>>>> both seem to touch loop optimization in case of overflows, which 
>>>>>>> the strlen code is triggering (since the loop limit checks for 
>>>>>>> loop variable being positive).
>>>>>>>
>>>>>>> This is a simple patch which adds a string conversion test:
>>>>>>>
>>>>>>> ```
>>>>>>> diff --git a/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java b/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>> index ec4da5ffc88..5b3fb1a2b2a 100644
>>>>>>> --- a/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>> +++ b/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>> @@ -93,10 +93,13 @@ public class StrLenTest {
>>>>>>> FunctionDescriptor.ofVoid(C_POINTER).withAttribute(FunctionDescriptor.TRIVIAL_ATTRIBUTE_NAME, true));
>>>>>>>      }
>>>>>>>
>>>>>>> +    MemorySegment segment;
>>>>>>> +
>>>>>>>      @Setup
>>>>>>>      public void setup() {
>>>>>>>          str = makeString(size);
>>>>>>>          segmentAllocator = SegmentAllocator.ofSegment(MemorySegment.allocateNative(size + 1, ResourceScope.newImplicitScope()));
>>>>>>> +        segment = toCString(str, segmentAllocator);
>>>>>>>      }
>>>>>>>
>>>>>>>      @TearDown
>>>>>>> @@ -104,6 +107,11 @@ public class StrLenTest {
>>>>>>>          scope.close();
>>>>>>>      }
>>>>>>>
>>>>>>> +    @Benchmark
>>>>>>> +    public String panama_str_conv() throws Throwable {
>>>>>>> +        return CLinker.toJavaString(segment);
>>>>>>> +    }
>>>>>>> +
>>>>>>>      @Benchmark
>>>>>>>      public int jni_strlen() throws Throwable {
>>>>>>>          return strlen(str);
>>>>>>> ```
>>>>>>>
>>>>>>> Before the above fix, the numbers are as follows:
>>>>>>>
>>>>>>> ```
>>>>>>> Benchmark                   (size)  Mode  Cnt    Score   Error Units
>>>>>>> StrLenTest.panama_str_conv     100  avgt   30  106.613 ± 7.060 ns/op
>>>>>>> ```
>>>>>>>
>>>>>>> While after the fix I get this:
>>>>>>>
>>>>>>> ```
>>>>>>> Benchmark                   (size)  Mode  Cnt   Score   Error Units
>>>>>>> StrLenTest.panama_str_conv     100  avgt   30  48.120 ± 0.557 ns/op
>>>>>>> ```
>>>>>>>
>>>>>>> So, as you can see, a pretty sizeable jump. Eyeballing, the 
>>>>>>> shape of generated code doesn't look too different, which makes 
>>>>>>> me think of another case where loop is unrolled, but main loop 
>>>>>>> never executed (similar to JDK-8269230), but we'll need to look 
>>>>>>> deeper.
>>>>>>>
>>>>>>> Maurizio
>>>>>>>
>>>>>>> On 12/07/2021 14:12, Maurizio Cimadamore wrote:
>>>>>>>> On 12/07/2021 13:18, Sebastian Stenzel wrote:
>>>>>>>>> Hey Maurizio,
>>>>>>>>>
>>>>>>>>> All tests have been done on commit 42e03fd7c6a (for details 
>>>>>>>>> how I built the JDK, see my initial email). Maybe I'm missing 
>>>>>>>>> some compiler flags to enable all optimizations?
>>>>>>>> I see - you do have the latest panama changes, but there has 
>>>>>>>> been a sync with upstream after that changeset, I believe - can 
>>>>>>>> you please try to resync with the latest foreign-jextract 
>>>>>>>> commit - which should be:
>>>>>>>>
>>>>>>>> https://github.com/openjdk/panama-foreign/commit/b2a284f6678c6a0d78fbdc8655695119ccb0dadb
>>>>>>>>
>>>>>>>>
>>>>>>>>> Are you sure about loop vectorization being applied to strlen? 
>>>>>>>>> I'm not an expert on this field, but I had the impression this 
>>>>>>>>> wasn't possible when the loop terminates "from within".
>>>>>>>> Vlad is the expert here - when chatting offline he did mention 
>>>>>>>> that loop should have single exit - which I guess also takes 
>>>>>>>> into account the "normal" exit - so the strlen routine would 
>>>>>>>> seem to have two exits...
>>>>>>>>
>>>>>>>> Maurizio
>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Sebastian
>>>>>>>>>
>>>>>>>>>> On 12. Jul 2021, at 13:50, Maurizio Cimadamore 
>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Sebastian,
>>>>>>>>>> thanks for sharing your findings - I've done some attempts 
>>>>>>>>>> here with a targeted microbenchmark which measures the 
>>>>>>>>>> performance of string conversion and I'm seeing unrolling and 
>>>>>>>>>> vectorization being applied on the strlen computation.
>>>>>>>>>>
>>>>>>>>>> May I ask if, by any chance, your HEAD has not been updated 
>>>>>>>>>> in the last few weeks? There has been a C2 optimization fix 
>>>>>>>>>> which has been added recently, which I think might be related 
>>>>>>>>>> to this:
>>>>>>>>>>
>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8269230
>>>>>>>>>>
>>>>>>>>>> Do you have this fix in the JDK you are using?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Maurizio
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 10/07/2021 15:58, Sebastian Stenzel wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> good idea, but it makes no difference beyond statistical error.
>>>>>>>>>>>
>>>>>>>>>>> I started sampling the application with VisualVM (which is 
>>>>>>>>>>> quite hard, since native threads are extremely short-lived). 
>>>>>>>>>>> What I noticed is that, regardless of where the sampler 
>>>>>>>>>>> interrupts a thread, in nearly all cases 100% of CPU time 
>>>>>>>>>>> is attributed to 
>>>>>>>>>>> jdk.internal.foreign.abi.SharedUtils.toJavaStringInternal() 
>>>>>>>>>>> → jdk.internal.foreign.abi.SharedUtils.strlen().
>>>>>>>>>>>
>>>>>>>>>>> I know that strlen can hardly be optimized due to the nature 
>>>>>>>>>>> of null termination, but maybe we can make use of the fact 
>>>>>>>>>>> that we're dealing with MemorySegments here: Since they 
>>>>>>>>>>> protect us from overflows, maybe there is no need to look at 
>>>>>>>>>>> only a single byte at a time. Maybe the strlen()-loop can be 
>>>>>>>>>>> unrolled or even be vectorized.
>>>>>>>>>>>
>>>>>>>>>>> I just did a quick test and observed a x2 speedup when doing 
>>>>>>>>>>> a x4 loop unroll.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Sebastian
>>>>>>>>>>>
>>>>>>>>>>>> On 9. Jul 2021, at 20:30, Jorn Vernee 
>>>>>>>>>>>> <jorn.vernee at oracle.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Sebastian,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for testing this. Looking at your code, one possible 
>>>>>>>>>>>> explanation for the discrepancy I can think of is that the 
>>>>>>>>>>>> DirFiller ends up using virtual downcalls to do its work, 
>>>>>>>>>>>> which are currently not intrinsified. This is mostly a case 
>>>>>>>>>>>> of 'not implemented yet', i.e. it is a known issue.
>>>>>>>>>>>>
>>>>>>>>>>>> ```
>>>>>>>>>>>>      static fuse_fill_dir_t ofAddress(MemoryAddress addr) {
>>>>>>>>>>>>          return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>              try {
>>>>>>>>>>>>                  // 'addr' here is not a constant, so the call is virtual
>>>>>>>>>>>>                  return (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, x0, x1, x2, x3);
>>>>>>>>>>>>              } catch (Throwable ex$) {
>>>>>>>>>>>>                  throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>              }
>>>>>>>>>>>>          };
>>>>>>>>>>>>      }
>>>>>>>>>>>> ```
>>>>>>>>>>>>
>>>>>>>>>>>> For testing purposes, a possible workaround could be to 
>>>>>>>>>>>> have a cache that maps the callback address to a method 
>>>>>>>>>>>> handle that has the address bound to the first parameter. 
>>>>>>>>>>>> Assuming readdir always gets the same filler callback 
>>>>>>>>>>>> address, the same MethodHandle will be reused and 
>>>>>>>>>>>> eventually customized which means the callback address will 
>>>>>>>>>>>> become constant, and the downcall should then be intrinsified.
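The binding-plus-cache idea can be tried out without any native code. The following is a minimal sketch using only java.lang.invoke. The `fill` method is a hypothetical stand-in for the filler downcall handle, a `long` plays the role of the callback address, and the class name BoundHandleCache is made up. It shows the mechanics of insertArguments and the cache only. It makes no claim about what C2 does with the resulting handle.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class BoundHandleCache {
    // Hypothetical stand-in for the native filler; 'address' mimics the callback address.
    static int fill(long address, String name, long offset) {
        return (int) (address + name.length() + offset);
    }

    // Unbound handle of type (long, String, long)int.
    static final MethodHandle FILL;
    static {
        try {
            FILL = MethodHandles.lookup().findStatic(BoundHandleCache.class, "fill",
                    MethodType.methodType(int.class, long.class, String.class, long.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // One bound handle per address, created on first use and reused afterwards.
    static final Map<Long, MethodHandle> CACHE = new ConcurrentHashMap<>();

    static MethodHandle forAddress(long address) {
        // Binding argument 0 yields a handle of type (String, long)int.
        return CACHE.computeIfAbsent(address, a -> MethodHandles.insertArguments(FILL, 0, a));
    }

    static int call(long address, String name, long offset) {
        try {
            return (int) forAddress(address).invokeExact(name, offset);
        } catch (Throwable t) {
            throw new AssertionError("should not reach here", t);
        }
    }

    public static void main(String[] args) {
        System.out.println(call(40L, "foo", 2L));                // prints 45
        System.out.println(forAddress(40L) == forAddress(40L));  // prints true
    }
}
```

The second line of output is the relevant one: repeated lookups for the same address return the same MethodHandle instance, which is the precondition for the handle being customized over time.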
>>>>>>>>>>>>
>>>>>>>>>>>> I don't currently have access to a Mac machine to test 
>>>>>>>>>>>> this, but if you want to try it out, the patch should be this:
>>>>>>>>>>>>
>>>>>>>>>>>> ```
>>>>>>>>>>>> diff --git a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>> index bfd4655..4c68d4c 100644
>>>>>>>>>>>> --- a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>> +++ b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>> @@ -3,8 +3,12 @@
>>>>>>>>>>>>   package de.skymatic.fusepanama.lowlevel;
>>>>>>>>>>>>
>>>>>>>>>>>>   import java.lang.invoke.MethodHandle;
>>>>>>>>>>>> +import java.lang.invoke.MethodHandles;
>>>>>>>>>>>>   import java.lang.invoke.VarHandle;
>>>>>>>>>>>>   import java.nio.ByteOrder;
>>>>>>>>>>>> +import java.util.Map;
>>>>>>>>>>>> +import java.util.concurrent.ConcurrentHashMap;
>>>>>>>>>>>> +
>>>>>>>>>>>>   import jdk.incubator.foreign.*;
>>>>>>>>>>>>   import static jdk.incubator.foreign.CLinker.*;
>>>>>>>>>>>>   public interface fuse_fill_dir_t {
>>>>>>>>>>>> @@ -17,13 +21,19 @@ public interface fuse_fill_dir_t {
>>>>>>>>>>>>           return RuntimeHelper.upcallStub(fuse_fill_dir_t.class, fi, constants$0.fuse_fill_dir_t$FUNC, "(Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;J)I", scope);
>>>>>>>>>>>>       }
>>>>>>>>>>>>       static fuse_fill_dir_t ofAddress(MemoryAddress addr) {
>>>>>>>>>>>> -        return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>> -            try {
>>>>>>>>>>>> -                return (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, x0, x1, x2, x3);
>>>>>>>>>>>> -            } catch (Throwable ex$) {
>>>>>>>>>>>> -                throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>> -            }
>>>>>>>>>>>> -        };
>>>>>>>>>>>> +        class CacheHolder {
>>>>>>>>>>>> +            static final Map<MemoryAddress, fuse_fill_dir_t> CACHE = new ConcurrentHashMap<>();
>>>>>>>>>>>> +        }
>>>>>>>>>>>> +        return CacheHolder.CACHE.computeIfAbsent(addr, addrK -> {
>>>>>>>>>>>> +            final MethodHandle target = MethodHandles.insertArguments(constants$0.fuse_fill_dir_t$MH, 0, addrK);
>>>>>>>>>>>> +            return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>> +                try {
>>>>>>>>>>>> +                    return (int)target.invokeExact(x0, x1, x2, x3);
>>>>>>>>>>>> +                } catch (Throwable ex$) {
>>>>>>>>>>>> +                    throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>> +                }
>>>>>>>>>>>> +            };
>>>>>>>>>>>> +        });
>>>>>>>>>>>>       }
>>>>>>>>>>>>   }
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> ```
>>>>>>>>>>>> (I hope these code blocks don't get mangled too much by 
>>>>>>>>>>>> line wrapping)
>>>>>>>>>>>>
>>>>>>>>>>>> HTH,
>>>>>>>>>>>> Jorn
>>>>>>>>>>>>
>>>>>>>>>>>> On 09/07/2021 10:58, Sebastian Stenzel wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I wanted to share the results of a benchmark test that 
>>>>>>>>>>>>> includes several down- and upcalls. First, let me explain 
>>>>>>>>>>>>> what I'm testing here:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm working on a panama-based FUSE binding, mostly for 
>>>>>>>>>>>>> experimental purposes right now, and I'm trying to beat 
>>>>>>>>>>>>> fuse-jnr [1].
>>>>>>>>>>>>>
>>>>>>>>>>>>> While there are some other interesting metrics, such as 
>>>>>>>>>>>>> read/write performance (both sequential and random 
>>>>>>>>>>>>> access), I focused on directory listings for now. 
>>>>>>>>>>>>> Directory listings are the most complex operation with 
>>>>>>>>>>>>> regard to the number of down- and upcalls:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. FUSE upcalls readdir and provides a callback function
>>>>>>>>>>>>> 2. java downcalls the callback for each item in the directory
>>>>>>>>>>>>> 3. FUSE upcalls getattr for each item (no longer required 
>>>>>>>>>>>>> with "readdirplus" in FUSE 3.x)
>>>>>>>>>>>>> (4. I'm testing on macOS, which introduces additional 
>>>>>>>>>>>>> noise (such as readxattr and trying to access files that I 
>>>>>>>>>>>>> didn't report in readdir))
>>>>>>>>>>>>>
>>>>>>>>>>>>> So, what I'm testing is essentially this: 
>>>>>>>>>>>>> `Files.list(Path.of("/Volumes/foo")).close();` with the 
>>>>>>>>>>>>> volume reporting eight files [2]. When mounting with debug 
>>>>>>>>>>>>> logs enabled, I can see that the exact same operations in 
>>>>>>>>>>>>> the same order are invoked on both fuse-jnr and 
>>>>>>>>>>>>> fuse-panama. One single dir listing results in 2 readdir 
>>>>>>>>>>>>> upcalls, 10 callback downcalls, 16 getattr upcalls. There 
>>>>>>>>>>>>> are also 8 getxattr calls and 16 lookup calls, however 
>>>>>>>>>>>>> they don't reach Java, as the FUSE kernel knows they are 
>>>>>>>>>>>>> not implemented.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Long story short, here are the results:
>>>>>>>>>>>>>
>>>>>>>>>>>>> ```
>>>>>>>>>>>>> Benchmark                        Mode  Cnt    Score   Error  Units
>>>>>>>>>>>>> BenchmarkTest.testListDirJnr     avgt    5   66,569 ± 3,128  us/op
>>>>>>>>>>>>> BenchmarkTest.testListDirPanama  avgt    5  189,340 ± 4,275  us/op
>>>>>>>>>>>>> ```
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've been using panama snapshot at commit 42e03fd7c6a 
>>>>>>>>>>>>> built with: `configure 
>>>>>>>>>>>>> --with-boot-jdk=/Library/Java/JavaVirtualMachines/adoptopenjdk-16.jdk/Contents/Home/ 
>>>>>>>>>>>>> --with-native-debug-symbols=none 
>>>>>>>>>>>>> --with-debug-level=release 
>>>>>>>>>>>>> --with-libclang=/usr/local/opt/llvm 
>>>>>>>>>>>>> --with-libclang-version=12`
>>>>>>>>>>>>>
>>>>>>>>>>>>> I can't tell where this overhead comes from. Maybe 
>>>>>>>>>>>>> creating a newConfinedScope() during each upcall [3] is 
>>>>>>>>>>>>> "too much"? Maybe JNR is just negligently skipping some 
>>>>>>>>>>>>> memory boundary checks to be faster. The results are not 
>>>>>>>>>>>>> terrible, but I'd hoped for something better.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] https://github.com/SerCeMan/jnr-fuse
>>>>>>>>>>>>> [2] https://github.com/skymatic/fuse-panama/blob/develop/src/test/java/de/skymatic/fusepanama/examples/HelloPanamaFileSystem.java#L139-L146
>>>>>>>>>>>>> [3] https://github.com/skymatic/fuse-panama/blob/769347575863861063a2347a42b2cbaadb5eacef/src/main/java/de/skymatic/fusepanama/FuseOperations.java#L67-L71
>>
>