Real-Life Benchmark for FUSE's readdir()

Thu Jul 15 15:55:34 UTC 2021

I tried to reproduce this on Linux, but without luck, in the sense that 
I'm not sure how to run the benchmark.

I'm able to run the two examples, and I noticed that JNR works fine 
here, while the Panama one doesn't seem to mount things correctly: 
a new mount appears in my file explorer, but I'm unable to do anything 
with it (even unmounting, which can only be done with sudo).

When working with the JNR support, the mount works fine, it shows in the 
file explorer, I can click on that location and browse, and then unmount 
from there. Everything works.

That said, the benchmarks require the mount points to be up and running 
- so I've tried first to execute the example (e.g. JNR) and then run the 
benchmark in two separate terminal windows, all via Maven - but the 
benchmark doesn't seem to do anything (I've uncommented the benchmarks 
of course).

How do you run them?

Thanks
Maurizio

On 15/07/2021 13:40, Maurizio Cimadamore wrote:
> I believe that it would be more useful to try to run the perfasm 
> profiler with JMH.
>
> This can be done relatively easily, at least on Linux, if you pass the 
> argument `-prof perfasm` to JMH (this needs hsdis-amd64.so on 
> Linux to print readable assembly).
>
> Another thing worth checking is allocation rate: `-prof gc`.
>
> Maurizio
>
> On 15/07/2021 12:30, Sebastian Stenzel wrote:
>> Ok it really seems like VisualVM can't deal with these kinds of tasks 
>> yet. Now it reports the String constructor being the culprit [1], 
>> however I strongly doubt that, since this is probably one of the most 
>> heavily optimized parts of the JDK.
>>
>> [1]: Screenshot on https://imgur.com/a/SHG8RSQ 
>>
>>> On 15. Jul 2021, at 13:11, Sebastian Stenzel 
>>> <sebastian.stenzel at gmail.com> wrote:
>>>
>>> Yes, must be a sampling error. Do you know of a (publicly available) 
>>> _profiler_ that is compatible with JDK 17 / 18 already?
>>>
>>>> On 15. Jul 2021, at 13:08, Maurizio Cimadamore 
>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>
>>>> Aha - it seems like you are seeing what I was seeing: unrolling now 
>>>> seems to happen more reliably, which positively affects code like 
>>>> strlen.
>>>>
>>>> As for FUSE, I think the reason for the difference has probably 
>>>> nothing to do with string conversion - the sampling profiler just 
>>>> happens to hit that code a lot. I checked the JNR code for string 
>>>> conversion and I couldn't really find anything uber-optimized in 
>>>> that regard that could explain the gap.
>>>>
>>>> Probably something is not getting optimized as it should - likely a 
>>>> downcall/upcall intrinsification is failing - maybe due to a subtle 
>>>> issue with your code, or, possibly because you are hitting a 
>>>> non-implemented case (e.g. we do not intrinsify calls which pass 
>>>> arguments on the stack, yet), or because of some other bug.
>>>>
>>>> Maurizio
>>>>
>>>>
>>>> On 15/07/2021 12:03, Sebastian Stenzel wrote:
>>>>> Wow, I stand corrected. I just re-ran the benchmark and 
>>>>> `benchmarkStrlenBase` just got a lot faster!! Your change in 
>>>>> https://github.com/openjdk/jdk17/commit/2db9005c07585b580b3ec0889b8b5e3ed0d0ca6a 
>>>>> DID have an effect after all.
>>>>>
>>>>> Just doesn't impress FUSE very much...
>>>>>
>>>>>> On 15. Jul 2021, at 13:00, Sebastian Stenzel 
>>>>>> <sebastian.stenzel at gmail.com> wrote:
>>>>>>
>>>>>> Yup, I tried the int-approach as well, but with worse results... 
>>>>>> Here is the full test: 
>>>>>> https://gist.github.com/overheadhunter/86e7baae7dfe47c49ff364590a4f3ea6 
>>>>>>
>>>>>>
>>>>>>> On 15. Jul 2021, at 12:51, Maurizio Cimadamore 
>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>
>>>>>>> Ok. Thanks.
>>>>>>>
>>>>>>> I tried similar experiments where instead of reading 4 bytes 
>>>>>>> separately I'd read a single int value, and then use shifts and 
>>>>>>> bitmasking to check for terminators. On paper good, but 
>>>>>>> benchmark results were always worse than the version we have now 
>>>>>>> (at least on Linux).
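
A minimal sketch of the single-int approach described above, not the code that was actually benchmarked. It reads four bytes as one little-endian int from a plain ByteBuffer (standing in for a MemorySegment) and uses the well-known zero-byte bit trick to check for a terminator. The class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class SwarStrlen {

    // Returns the index of the first zero byte in buf (scanning from index 0).
    static int strlen(ByteBuffer buf) {
        // duplicate() shares the content but lets us pick a byte order locally
        ByteBuffer le = buf.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        int limit = le.limit();
        int offset = 0;
        for (; offset + 4 <= limit; offset += 4) {
            int v = le.getInt(offset);
            // Non-zero iff at least one of the four bytes in v is zero.
            int mask = (v - 0x01010101) & ~v & 0x80808080;
            if (mask != 0) {
                // The lowest set bit marks the first zero byte (little-endian).
                return offset + (Integer.numberOfTrailingZeros(mask) >>> 3);
            }
        }
        // Fewer than 4 bytes left: fall back to a byte-wise scan.
        for (; offset < limit; offset++) {
            if (le.get(offset) == 0) {
                return offset;
            }
        }
        throw new IllegalArgumentException("String too large");
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(259); // zero-filled
        for (int i = 0; i < 239; i++) {
            buf.put(i, (byte) 'a');
        }
        System.out.println(strlen(buf)); // prints 239
    }
}
```

Whether this beats the byte-wise loop depends on what C2 makes of it; as noted above, the measured results on Linux were worse than the existing version.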
>>>>>>>
>>>>>>> That said, if you could please share the full string benchmark 
>>>>>>> you have, that'd be helpful, so we can take a look at that, and 
>>>>>>> see what's going wrong (ideally, C2 should be the one doing 
>>>>>>> unrolling).
>>>>>>>
>>>>>>> Maurizio
>>>>>>>
>>>>>>> On 15/07/2021 11:28, Sebastian Stenzel wrote:
>>>>>>>> I just did a quick synthetic test on a "manually unrolled" 
>>>>>>>> strlen() without any FUSE context.
>>>>>>>>
>>>>>>>> I experimented with an implementation that looked like the 
>>>>>>>> following and benchmarked it using a 259 byte memory segment 
>>>>>>>> containing a 239 byte string (null byte at index 240):
>>>>>>>>
>>>>>>>> ```
>>>>>>>> private static int strlenUnroll4(MemorySegment segment, long start) {
>>>>>>>>     int offset;
>>>>>>>>     for (offset = 0; offset < segment.byteSize()-3; offset+=4) {
>>>>>>>>         byte b0 = MemoryAccess.getByteAtOffset(segment, start + offset + 0);
>>>>>>>>         byte b1 = MemoryAccess.getByteAtOffset(segment, start + offset + 1);
>>>>>>>>         byte b2 = MemoryAccess.getByteAtOffset(segment, start + offset + 2);
>>>>>>>>         byte b3 = MemoryAccess.getByteAtOffset(segment, start + offset + 3);
>>>>>>>>         if (b0 == 0 || b1 == 0 || b2 == 0 || b3 == 0) { // is this even faster than directly having 4 different branches?
>>>>>>>>             if (b0 == 0) {
>>>>>>>>                 return offset;
>>>>>>>>             } else if (b1 == 0) {
>>>>>>>>                 return offset + 1;
>>>>>>>>             } else if (b2 == 0) {
>>>>>>>>                 return offset + 2;
>>>>>>>>             } else if (b3 == 0) {
>>>>>>>>                 return offset + 3;
>>>>>>>>             }
>>>>>>>>         }
>>>>>>>>     }
>>>>>>>>     while (offset < segment.byteSize()) { // TODO: maybe no loop required for the remaining <4 bytes?
>>>>>>>>         byte b = MemoryAccess.getByteAtOffset(segment, start + offset);
>>>>>>>>         if (b == 0) {
>>>>>>>>             return offset;
>>>>>>>>         }
>>>>>>>>         offset++; // advance, otherwise this tail loop never terminates
>>>>>>>>     }
>>>>>>>>     throw new IllegalArgumentException("String too large");
>>>>>>>> }
>>>>>>>> ```
>>>>>>>>
>>>>>>>> I'm not even sure how reliable my results are, since I have no 
>>>>>>>> clue about how branch prediction works here... Neither have I 
>>>>>>>> tested the correctness of this implementation.
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 15. Jul 2021, at 12:18, Maurizio Cimadamore 
>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>
>>>>>>>>> Thanks for reporting back.
>>>>>>>>>
>>>>>>>>> We probably need to investigate this a bit more deeply and try 
>>>>>>>>> and reproduce on our side.
>>>>>>>>>
>>>>>>>>> One last question: you said that with manual unrolling you 
>>>>>>>>> managed to get 2x faster: did you mean that string conversion 
>>>>>>>>> got 2x faster or that you actually saw your FUSE benchmark 
>>>>>>>>> going 2x faster because of the manual unrolling with strings?
>>>>>>>>>
>>>>>>>>> Maurizio
>>>>>>>>>
>>>>>>>>> On 15/07/2021 11:03, Sebastian Stenzel wrote:
>>>>>>>>>> That, surprisingly, didn't change anything either. But don't 
>>>>>>>>>> worry too much, the performance isn't bad (in absolute 
>>>>>>>>>> figures) and it is by far not the only reason why I consider 
>>>>>>>>>> panama the best solution to create java bindings for c libs.
>>>>>>>>>>
>>>>>>>>>>> On 12. Jul 2021, at 15:33, Maurizio Cimadamore 
>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Actually, after some bisecting, I found out that the 
>>>>>>>>>>> performance of converting a memory segment into a string 
>>>>>>>>>>> jumped 2x faster with this fix:
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/openjdk/jdk17/commit/2db9005c07585b580b3ec0889b8b5e3ed0d0ca6a
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Which was integrated after the one I originally pointed at. 
>>>>>>>>>>> They both seem to touch loop optimization in case of 
>>>>>>>>>>> overflows, which the strlen code is triggering (since the 
>>>>>>>>>>> loop limit checks for loop variable being positive).
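
A small sketch of the loop shape described above, written against a plain byte array instead of a MemorySegment. This is an assumption about the shape based on the description in this thread, not a copy of the JDK source: the loop condition only checks that the int counter is still non-negative, so the loop ends either on the terminator or on int overflow.

```java
import java.nio.charset.StandardCharsets;

final class StrlenLoopShape {

    // Scans for the first zero byte starting at 'start'. The loop limit is
    // "offset >= 0", i.e. it only stops on overflow of the int counter.
    static int strlen(byte[] data, int start) {
        for (int offset = 0; offset >= 0; offset++) {
            if (data[start + offset] == 0) {
                return offset;
            }
        }
        throw new IllegalArgumentException("String too large");
    }

    public static void main(String[] args) {
        byte[] data = "hello\0".getBytes(StandardCharsets.US_ASCII);
        System.out.println(strlen(data, 0)); // prints 5
    }
}
```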
>>>>>>>>>>>
>>>>>>>>>>> This is a simple patch which adds a string conversion test:
>>>>>>>>>>>
>>>>>>>>>>> ```
>>>>>>>>>>> diff --git a/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java b/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>> index ec4da5ffc88..5b3fb1a2b2a 100644
>>>>>>>>>>> --- a/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>> +++ b/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>> @@ -93,10 +93,13 @@ public class StrLenTest {
>>>>>>>>>>>  FunctionDescriptor.ofVoid(C_POINTER).withAttribute(FunctionDescriptor.TRIVIAL_ATTRIBUTE_NAME, true));
>>>>>>>>>>>      }
>>>>>>>>>>>
>>>>>>>>>>> +    MemorySegment segment;
>>>>>>>>>>> +
>>>>>>>>>>>      @Setup
>>>>>>>>>>>      public void setup() {
>>>>>>>>>>>          str = makeString(size);
>>>>>>>>>>>          segmentAllocator = SegmentAllocator.ofSegment(MemorySegment.allocateNative(size + 1, ResourceScope.newImplicitScope()));
>>>>>>>>>>> +        segment = toCString(str, segmentAllocator);
>>>>>>>>>>>      }
>>>>>>>>>>>
>>>>>>>>>>>      @TearDown
>>>>>>>>>>> @@ -104,6 +107,11 @@ public class StrLenTest {
>>>>>>>>>>>          scope.close();
>>>>>>>>>>>      }
>>>>>>>>>>>
>>>>>>>>>>> +    @Benchmark
>>>>>>>>>>> +    public String panama_str_conv() throws Throwable {
>>>>>>>>>>> +        return CLinker.toJavaString(segment);
>>>>>>>>>>> +    }
>>>>>>>>>>> +
>>>>>>>>>>>      @Benchmark
>>>>>>>>>>>      public int jni_strlen() throws Throwable {
>>>>>>>>>>>          return strlen(str);
>>>>>>>>>>> ```
>>>>>>>>>>>
>>>>>>>>>>> Before the above fix, the numbers are as follows:
>>>>>>>>>>>
>>>>>>>>>>> ```
>>>>>>>>>>> Benchmark                   (size)  Mode  Cnt    Score   Error  Units
>>>>>>>>>>> StrLenTest.panama_str_conv     100  avgt   30  106.613 ± 7.060  ns/op
>>>>>>>>>>> ```
>>>>>>>>>>>
>>>>>>>>>>> While after the fix I get this:
>>>>>>>>>>>
>>>>>>>>>>> ```
>>>>>>>>>>> Benchmark                   (size)  Mode  Cnt   Score   Error  Units
>>>>>>>>>>> StrLenTest.panama_str_conv     100  avgt   30  48.120 ± 0.557  ns/op
>>>>>>>>>>> ```
>>>>>>>>>>>
>>>>>>>>>>> So, as you can see, a pretty sizeable jump. Eyeballing, the 
>>>>>>>>>>> shape of the generated code doesn't look too different, which 
>>>>>>>>>>> makes me think of another case where the loop is unrolled but 
>>>>>>>>>>> the main loop is never executed (similar to JDK-8269230), but 
>>>>>>>>>>> we'll need to look deeper.
>>>>>>>>>>>
>>>>>>>>>>> Maurizio
>>>>>>>>>>>
>>>>>>>>>>> On 12/07/2021 14:12, Maurizio Cimadamore wrote:
>>>>>>>>>>>> On 12/07/2021 13:18, Sebastian Stenzel wrote:
>>>>>>>>>>>>> Hey Maurizio,
>>>>>>>>>>>>>
>>>>>>>>>>>>> All tests have been done on commit 42e03fd7c6a (for 
>>>>>>>>>>>>> details how I built the JDK, see my initial email). Maybe 
>>>>>>>>>>>>> I'm missing some compiler flags to enable all optimizations?
>>>>>>>>>>>> I see - you do have the latest panama changes, but there 
>>>>>>>>>>>> has been a sync with upstream after that changeset, I 
>>>>>>>>>>>> believe - can you please try to resync with the latest 
>>>>>>>>>>>> foreign-jextract commit - which should be:
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/openjdk/panama-foreign/commit/b2a284f6678c6a0d78fbdc8655695119ccb0dadb
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Are you sure about loop vectorization being applied to 
>>>>>>>>>>>>> strlen? I'm not an expert on this field, but I had the 
>>>>>>>>>>>>> impression this wasn't possible when the loop terminates 
>>>>>>>>>>>>> "from within".
>>>>>>>>>>>> Vlad is the expert here - when chatting offline he did 
>>>>>>>>>>>> mention that loop should have single exit - which I guess 
>>>>>>>>>>>> also takes into account the "normal" exit - so the strlen 
>>>>>>>>>>>> routine would seem to have two exits...
>>>>>>>>>>>>
>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 12. Jul 2021, at 13:50, Maurizio Cimadamore 
>>>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Sebastian,
>>>>>>>>>>>>>> thanks for sharing your findings - I've done some 
>>>>>>>>>>>>>> attempts here with a targeted microbenchmark which 
>>>>>>>>>>>>>> measures the performance of string conversion and I'm 
>>>>>>>>>>>>>> seeing unrolling and vectorization being applied on the 
>>>>>>>>>>>>>> strlen computation.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> May I ask if, by any chance, your HEAD has not been 
>>>>>>>>>>>>>> updated in the last few weeks? There has been a C2 
>>>>>>>>>>>>>> optimization fix which has been added recently, which I 
>>>>>>>>>>>>>> think might be related to this:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8269230
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do you have this fix in the JDK you are using?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 10/07/2021 15:58, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> good idea, but it makes no difference beyond statistical 
>>>>>>>>>>>>>>> error.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I started sampling the application with VisualVM (which 
>>>>>>>>>>>>>>> is quite hard, since native threads are extremely 
>>>>>>>>>>>>>>> short-lived). What I noticed is that regardless of where 
>>>>>>>>>>>>>>> the sampler interrupts a thread, in nearly all cases 
>>>>>>>>>>>>>>> 100% of CPU time is spent in 
>>>>>>>>>>>>>>> jdk.internal.foreign.abi.SharedUtils.toJavaStringInternal() 
>>>>>>>>>>>>>>> → jdk.internal.foreign.abi.SharedUtils.strlen().
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I know that strlen can hardly be optimized due to the 
>>>>>>>>>>>>>>> nature of null termination, but maybe we can make use of 
>>>>>>>>>>>>>>> the fact that we're dealing with MemorySegments here: 
>>>>>>>>>>>>>>> Since they protect us from overflows, maybe there is no 
>>>>>>>>>>>>>>> need to look at only a single byte at a time. Maybe the 
>>>>>>>>>>>>>>> strlen()-loop can be unrolled or even be vectorized.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I just did a quick test and observed a x2 speedup when 
>>>>>>>>>>>>>>> doing a x4 loop unroll.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 9. Jul 2021, at 20:30, Jorn Vernee 
>>>>>>>>>>>>>>>> <jorn.vernee at oracle.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Sebastian,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for testing this. Looking at your code, one 
>>>>>>>>>>>>>>>> possible explanation for the discrepancy I can think of 
>>>>>>>>>>>>>>>> is that the DirFiller ends up using virtual downcalls 
>>>>>>>>>>>>>>>> to do its work, which are currently not intrinsified. 
>>>>>>>>>>>>>>>> This is mostly a case of 'not implemented yet', i.e. it 
>>>>>>>>>>>>>>>> is a known issue.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>      static fuse_fill_dir_t ofAddress(MemoryAddress addr) {
>>>>>>>>>>>>>>>>          return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>>              try {
>>>>>>>>>>>>>>>>                  return (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, x0, x1, x2, x3); // <--------- 'addr' here is not a constant, so the call is virtual
>>>>>>>>>>>>>>>>              } catch (Throwable ex$) {
>>>>>>>>>>>>>>>>                  throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>>              }
>>>>>>>>>>>>>>>>          };
>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For testing purposes, a possible workaround could be to 
>>>>>>>>>>>>>>>> have a cache that maps the callback address to a method 
>>>>>>>>>>>>>>>> handle that has the address bound to the first 
>>>>>>>>>>>>>>>> parameter. Assuming readdir always gets the same filler 
>>>>>>>>>>>>>>>> callback address, the same MethodHandle will be reused 
>>>>>>>>>>>>>>>> and eventually customized which means the callback 
>>>>>>>>>>>>>>>> address will become constant, and the downcall should 
>>>>>>>>>>>>>>>> then be intrinsified.
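
The caching idea in the paragraph above can be sketched with plain java.lang.invoke, without any Panama types. This is only an illustration under stated assumptions: `fill` is a hypothetical static method standing in for the native filler callback, a `long` stands in for the callback address, and all names are made up. It shows the mechanics of binding the first argument and reusing the bound handle per address; whether the JIT then treats the address as a constant is up to the VM.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class BoundHandleCache {

    // Hypothetical stand-in for the downcall target: first parameter is the
    // callback address, the rest are the per-call arguments.
    static int fill(long callbackAddr, String name, long offset) {
        return (int) (callbackAddr + name.length() + offset);
    }

    static final MethodHandle FILL;
    static {
        try {
            FILL = MethodHandles.lookup().findStatic(BoundHandleCache.class, "fill",
                    MethodType.methodType(int.class, long.class, String.class, long.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // One bound handle per callback address, created on first use.
    static final Map<Long, MethodHandle> CACHE = new ConcurrentHashMap<>();

    static MethodHandle bound(long callbackAddr) {
        // insertArguments fixes parameter 0, leaving a (String, long)int handle.
        return CACHE.computeIfAbsent(callbackAddr,
                addr -> MethodHandles.insertArguments(FILL, 0, addr));
    }

    static int call(long callbackAddr, String name, long offset) {
        try {
            return (int) bound(callbackAddr).invokeExact(name, offset);
        } catch (Throwable t) {
            throw new AssertionError("should not reach here", t);
        }
    }

    public static void main(String[] args) {
        System.out.println(call(42L, "foo", 1L));       // prints 46
        System.out.println(bound(42L) == bound(42L));   // prints true
    }
}
```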
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I don't currently have access to a Mac machine to test 
>>>>>>>>>>>>>>>> this, but if you want to try it out, the patch should 
>>>>>>>>>>>>>>>> be this:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>> diff --git a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>> index bfd4655..4c68d4c 100644
>>>>>>>>>>>>>>>> --- a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>> +++ b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>> @@ -3,8 +3,12 @@
>>>>>>>>>>>>>>>>   package de.skymatic.fusepanama.lowlevel;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>   import java.lang.invoke.MethodHandle;
>>>>>>>>>>>>>>>> +import java.lang.invoke.MethodHandles;
>>>>>>>>>>>>>>>>   import java.lang.invoke.VarHandle;
>>>>>>>>>>>>>>>>   import java.nio.ByteOrder;
>>>>>>>>>>>>>>>> +import java.util.Map;
>>>>>>>>>>>>>>>> +import java.util.concurrent.ConcurrentHashMap;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>   import jdk.incubator.foreign.*;
>>>>>>>>>>>>>>>>   import static jdk.incubator.foreign.CLinker.*;
>>>>>>>>>>>>>>>>   public interface fuse_fill_dir_t {
>>>>>>>>>>>>>>>> @@ -17,13 +21,19 @@ public interface fuse_fill_dir_t {
>>>>>>>>>>>>>>>>           return RuntimeHelper.upcallStub(fuse_fill_dir_t.class, fi, constants$0.fuse_fill_dir_t$FUNC, "(Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;J)I", scope);
>>>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>>>       static fuse_fill_dir_t ofAddress(MemoryAddress addr) {
>>>>>>>>>>>>>>>> -        return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>> -            try {
>>>>>>>>>>>>>>>> -                return (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, x0, x1, x2, x3);
>>>>>>>>>>>>>>>> -            } catch (Throwable ex$) {
>>>>>>>>>>>>>>>> -                throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>> -            }
>>>>>>>>>>>>>>>> -        };
>>>>>>>>>>>>>>>> +        class CacheHolder {
>>>>>>>>>>>>>>>> +            static final Map<MemoryAddress, fuse_fill_dir_t> CACHE = new ConcurrentHashMap<>();
>>>>>>>>>>>>>>>> +        }
>>>>>>>>>>>>>>>> +        return CacheHolder.CACHE.computeIfAbsent(addr, addrK -> {
>>>>>>>>>>>>>>>> +            final MethodHandle target = MethodHandles.insertArguments(constants$0.fuse_fill_dir_t$MH, 0, addrK);
>>>>>>>>>>>>>>>> +            return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>> +                try {
>>>>>>>>>>>>>>>> +                    return (int)target.invokeExact(x0, x1, x2, x3);
>>>>>>>>>>>>>>>> +                } catch (Throwable ex$) {
>>>>>>>>>>>>>>>> +                    throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>> +                }
>>>>>>>>>>>>>>>> +            };
>>>>>>>>>>>>>>>> +        });
>>>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>> (I hope these code blocks don't get mangled too much by 
>>>>>>>>>>>>>>>> line wrapping)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> HTH,
>>>>>>>>>>>>>>>> Jorn
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 09/07/2021 10:58, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I wanted to share the results of a benchmark test 
>>>>>>>>>>>>>>>>> that includes several down- and upcalls. First, let me 
>>>>>>>>>>>>>>>>> explain what I'm testing here:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm working on a panama-based FUSE binding, mostly for 
>>>>>>>>>>>>>>>>> experimental purposes right now, and I'm trying to 
>>>>>>>>>>>>>>>>> beat fuse-jnr [1].
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> While there are some other interesting metrics, such 
>>>>>>>>>>>>>>>>> as read/write performance (both sequential and 
>>>>>>>>>>>>>>>>> random access), I focused on directory listings for 
>>>>>>>>>>>>>>>>> now. Directory listings are the most complex operation 
>>>>>>>>>>>>>>>>> with regard to the number of down- and upcalls:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1. FUSE upcalls readdir and provides a callback function
>>>>>>>>>>>>>>>>> 2. Java downcalls the callback for each item in the 
>>>>>>>>>>>>>>>>> directory
>>>>>>>>>>>>>>>>> 3. FUSE upcalls getattr for each item (no longer 
>>>>>>>>>>>>>>>>> required with "readdirplus" in FUSE 3.x)
>>>>>>>>>>>>>>>>> (4. I'm testing on macOS, which introduces additional 
>>>>>>>>>>>>>>>>> noise (such as readxattr and trying to access files 
>>>>>>>>>>>>>>>>> that I didn't report in readdir))
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So, what I'm testing is essentially this: 
>>>>>>>>>>>>>>>>> `Files.list(Path.of("/Volumes/foo")).close();` with 
>>>>>>>>>>>>>>>>> the volume reporting eight files [2]. When mounting 
>>>>>>>>>>>>>>>>> with debug logs enabled, I can see that the exact same 
>>>>>>>>>>>>>>>>> operations in the same order are invoked on both 
>>>>>>>>>>>>>>>>> fuse-jnr and fuse-panama. One single dir listing 
>>>>>>>>>>>>>>>>> results in 2 readdir upcalls, 10 callback downcalls, 
>>>>>>>>>>>>>>>>> 16 getattr upcalls. There are also 8 getxattr calls 
>>>>>>>>>>>>>>>>> and 16 lookup calls, however they don't reach Java, as 
>>>>>>>>>>>>>>>>> the FUSE kernel knows they are not implemented.
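
For readers who want to see the measured operation in isolation, here is a rough sketch that times `Files.list(dir).close()` on an ordinary temporary directory with eight files. It is not the JMH benchmark from this thread and involves no FUSE mount; the class and method names are made up, and a naive timing loop like this is only good for a sanity check, not for real numbers.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

final class ListDirTiming {

    // Creates a temp directory containing the given number of empty files.
    static Path makeDir(int files) {
        try {
            Path dir = Files.createTempDirectory("listdir");
            for (int i = 0; i < files; i++) {
                Files.createFile(dir.resolve("file" + i));
            }
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Lists the directory once and counts the entries.
    static long countEntries(Path dir) {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Average wall-clock time of Files.list(dir).close() in microseconds.
    static double avgMicros(Path dir, int iterations) {
        try {
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                Files.list(dir).close();
            }
            return (System.nanoTime() - start) / 1000.0 / iterations;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = makeDir(8);
        System.out.println("entries: " + countEntries(dir));
        System.out.println("avg us/op: " + avgMicros(dir, 10_000));
    }
}
```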
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Long story short, here are the results:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>> Benchmark                        Mode  Cnt    Score   Error  Units
>>>>>>>>>>>>>>>>> BenchmarkTest.testListDirJnr     avgt    5   66,569 ± 3,128  us/op
>>>>>>>>>>>>>>>>> BenchmarkTest.testListDirPanama  avgt    5  189,340 ± 4,275  us/op
>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I've been using panama snapshot at commit 42e03fd7c6a 
>>>>>>>>>>>>>>>>> built with: `configure 
>>>>>>>>>>>>>>>>> --with-boot-jdk=/Library/Java/JavaVirtualMachines/adoptopenjdk-16.jdk/Contents/Home/ 
>>>>>>>>>>>>>>>>> --with-native-debug-symbols=none 
>>>>>>>>>>>>>>>>> --with-debug-level=release 
>>>>>>>>>>>>>>>>> --with-libclang=/usr/local/opt/llvm 
>>>>>>>>>>>>>>>>> --with-libclang-version=12`
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I can't tell where this overhead comes from. Maybe 
>>>>>>>>>>>>>>>>> creating a newConfinedScope() during each upcall [3] 
>>>>>>>>>>>>>>>>> is "too much"? Maybe JNR is just negligently skipping 
>>>>>>>>>>>>>>>>> some memory boundary checks to be faster. The results 
>>>>>>>>>>>>>>>>> are not terrible, but I'd hoped for something better.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1] https://github.com/SerCeMan/jnr-fuse
>>>>>>>>>>>>>>>>> [2] https://github.com/skymatic/fuse-panama/blob/develop/src/test/java/de/skymatic/fusepanama/examples/HelloPanamaFileSystem.java#L139-L146
>>>>>>>>>>>>>>>>> [3] https://github.com/skymatic/fuse-panama/blob/769347575863861063a2347a42b2cbaadb5eacef/src/main/java/de/skymatic/fusepanama/FuseOperations.java#L67-L71