Real-Life Benchmark for FUSE's readdir()

Thu Jul 15 16:51:59 UTC 2021

I'll fix it for Linux and let you know!

> On 15. Jul 2021, at 18:04, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
> 
> 
> On 15/07/2021 17:01, Sebastian Stenzel wrote:
>> Yeah well they're pretty much mac-specific. On macOS, FUSE has this magic behaviour where you can tell it to mount to a non-existing mountpoint inside of `/Volumes/...` and it'll just create these (and destroy them on unmount). I believe on Linux you need to define an _existing_ mount point. But it is surely possible that the volume isn't working yet on Linux. I'll give it a try myself and fix it, if required.
> I created a folder Volumes under my home folder and am trying to use that as a mount point, which seems to work ok with JNR.
>> 
>> The benchmark then needs to be adjusted for the two mountpoints respectively. But before the benchmark can actually do anything, a plain `ls` on the terminal needs to work.
> 
> Ok, it seems even JNR fails:
> 
> ```
> $ ls Volumes/bar/
> ls: reading directory 'Volumes/bar/': Input/output error
> ```
> 
>> 
>>> On 15. Jul 2021, at 17:55, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>>> 
>>> I tried to reproduce here on Linux, but with no luck - in the sense that I'm not super sure on how to run the benchmark.
>>> 
>>> I'm able somehow to run the two examples - and I noted that here JNR works ok, while the Panama one doesn't seem to mount things correctly - a new mount appears on my file explorer, but I'm unable to do anything with it (even unmount - which can only be done at sudo level).
>>> 
>>> When working with the JNR support, the mount works fine, it shows in the file explorer, I can click on that location and browse, and then unmount from there. Everything works.
>>> 
>>> That said, the benchmarks require the mount points to be up and running - so I've tried first to execute the example (e.g. JNR) and then run the benchmark in two separate terminal windows, all via Maven - but the benchmark doesn't seem to do anything (I've uncommented the benchmarks of course).
>>> 
>>> How do you run them?
>>> 
>>> Thanks
>>> Maurizio
>>> 
>>> On 15/07/2021 13:40, Maurizio Cimadamore wrote:
>>>> I believe that it would be more useful to try to run the perfasm profiler with JMH.
>>>> 
>>>> This can be done relatively easily, at least on Linux, if you pass the argument `-prof perfasm` to JMH (this needs hsdis-amd64.so on Linux to print readable assembly).
>>>> 
>>>> Another thing worth checking is allocation rate: `-prof gc`.
>>>> 
>>>> Maurizio
>>>> 
>>>> On 15/07/2021 12:30, Sebastian Stenzel wrote:
>>>>> Ok it really seems like VisualVM can't deal with these kinds of tasks yet. Now it reports the String constructor being the culprit [1], however I strongly doubt that, since this is probably one of the most heavily optimized parts of the JDK.
>>>>> 
>>>>> [1]: Screenshot on https://imgur.com/a/SHG8RSQ
>>>>> 
>>>>>> On 15. Jul 2021, at 13:11, Sebastian Stenzel <sebastian.stenzel at gmail.com> wrote:
>>>>>> 
>>>>>> Yes, must be a sampling error. Do you know of a (publicly available) _profiler_ that is compatible with JDK 17 / 18 already?
>>>>>> 
>>>>>>> On 15. Jul 2021, at 13:08, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>>>>>>> 
>>>>>>> Aha - it seems like you are seeing what I was seeing: unrolling now seems to happen more reliably, which positively affects code like strlen.
>>>>>>> 
>>>>>>> As for FUSE, I think the reason for the difference has probably nothing to do with string conversion - the sampler profiler just happens to hit that code a lot. I checked JNR code for string conversion and I couldn't really find anything uber optimized in that regard that could explain the gap.
>>>>>>> 
>>>>>>> Probably something is not getting optimized as it should - likely a downcall/upcall intrinsification is failing - maybe due to a subtle issue with your code, or, possibly because you are hitting a non-implemented case (e.g. we do not intrinsify calls which pass arguments on the stack, yet), or because of some other bug.
>>>>>>> 
>>>>>>> Maurizio
>>>>>>> 
>>>>>>> 
>>>>>>> On 15/07/2021 12:03, Sebastian Stenzel wrote:
>>>>>>>> Wow, I stand corrected. I just re-ran the benchmark and `benchmarkStrlenBase` just got a lot faster!! Your change in https://github.com/openjdk/jdk17/commit/2db9005c07585b580b3ec0889b8b5e3ed0d0ca6a DID have an effect after all.
>>>>>>>> 
>>>>>>>> Just doesn't impress FUSE very much...
>>>>>>>> 
>>>>>>>>> On 15. Jul 2021, at 13:00, Sebastian Stenzel <sebastian.stenzel at gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Yup, I tried the int-approach as well, but with worse results... Here is the full test: https://gist.github.com/overheadhunter/86e7baae7dfe47c49ff364590a4f3ea6
>>>>>>>>> 
>>>>>>>>>> On 15. Jul 2021, at 12:51, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Ok. Thanks.
>>>>>>>>>> 
>>>>>>>>>> I tried similar experiments where instead of reading 4 bytes separately I'd read a single int value, and then use shifts and bitmasking to check for terminators. On paper good, but benchmark results were always worse than the version we have now (at least on Linux).
>>>>>>>>>> 
>>>>>>>>>> That said, if you could please share the full string benchmark you have, that'd be helpful, so we can take a look at that, and see what's going wrong (ideally, C2 should be the one doing unrolling).
>>>>>>>>>> 
>>>>>>>>>> Maurizio
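
For readers unfamiliar with the "single int plus shifts and bitmasking" idea mentioned above, here is a minimal sketch of one common variant. It is an assumption about what such an experiment could look like, not the code that was actually benchmarked: it uses a plain `ByteBuffer` instead of a `MemorySegment`, and the class and method names (`SwarStrlen`, `hasZeroByte`) are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class SwarStrlen {
    // True if at least one of the four bytes packed into v is zero.
    // Well-known bit trick; it only tells us that a zero byte exists, not where.
    static boolean hasZeroByte(int v) {
        return ((v - 0x01010101) & ~v & 0x80808080) != 0;
    }

    // Length of the NUL-terminated string that starts at index 0 of buf.
    static int strlen(ByteBuffer buf) {
        int limit = buf.limit();
        int offset = 0;
        // Word-at-a-time scan: one 4-byte read per iteration.
        while (offset + 4 <= limit && !hasZeroByte(buf.getInt(offset))) {
            offset += 4;
        }
        // Byte-wise scan to locate the terminator inside the last word or the tail.
        for (; offset < limit; offset++) {
            if (buf.get(offset) == 0) {
                return offset;
            }
        }
        throw new IllegalArgumentException("String too large");
    }

    public static void main(String[] args) {
        byte[] data = new byte[259];
        byte[] text = "x".repeat(239).getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(text, 0, data, 0, text.length);
        System.out.println(strlen(ByteBuffer.wrap(data)));
    }
}
```

Whether this beats a simple byte loop depends on what C2 does with each version, which is exactly the open question in this thread.
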
>>>>>>>>>> 
>>>>>>>>>> On 15/07/2021 11:28, Sebastian Stenzel wrote:
>>>>>>>>>>> I just did a quick synthetic test on a "manually unrolled" strlen() without any FUSE context.
>>>>>>>>>>> 
>>>>>>>>>>> I experimented with an implementation that looked like the following and benchmarked it using a 259 byte memory segment containing a 239 byte string (null byte at index 240):
>>>>>>>>>>> 
>>>>>>>>>>> ```
>>>>>>>>>>> private static int strlenUnroll4(MemorySegment segment, long start) {
>>>>>>>>>>>     // number of bytes available from 'start' to the end of the segment
>>>>>>>>>>>     long length = segment.byteSize() - start;
>>>>>>>>>>>     int offset;
>>>>>>>>>>>     // main loop: look at four bytes per iteration
>>>>>>>>>>>     for (offset = 0; offset < length - 3; offset += 4) {
>>>>>>>>>>>         byte b0 = MemoryAccess.getByteAtOffset(segment, start + offset + 0);
>>>>>>>>>>>         byte b1 = MemoryAccess.getByteAtOffset(segment, start + offset + 1);
>>>>>>>>>>>         byte b2 = MemoryAccess.getByteAtOffset(segment, start + offset + 2);
>>>>>>>>>>>         byte b3 = MemoryAccess.getByteAtOffset(segment, start + offset + 3);
>>>>>>>>>>>         if (b0 == 0 || b1 == 0 || b2 == 0 || b3 == 0) { // is this even faster than directly having 4 different branches?
>>>>>>>>>>>             if (b0 == 0) {
>>>>>>>>>>>                 return offset;
>>>>>>>>>>>             } else if (b1 == 0) {
>>>>>>>>>>>                 return offset + 1;
>>>>>>>>>>>             } else if (b2 == 0) {
>>>>>>>>>>>                 return offset + 2;
>>>>>>>>>>>             } else if (b3 == 0) {
>>>>>>>>>>>                 return offset + 3;
>>>>>>>>>>>             }
>>>>>>>>>>>         }
>>>>>>>>>>>     }
>>>>>>>>>>>     // tail loop: handle the remaining <4 bytes one at a time
>>>>>>>>>>>     while (offset < length) { // TODO: maybe no loop required for the remaining <4 bytes?
>>>>>>>>>>>         byte b = MemoryAccess.getByteAtOffset(segment, start + offset);
>>>>>>>>>>>         if (b == 0) {
>>>>>>>>>>>             return offset;
>>>>>>>>>>>         }
>>>>>>>>>>>         offset++; // advance, otherwise a non-zero byte makes this loop spin forever
>>>>>>>>>>>     }
>>>>>>>>>>>     throw new IllegalArgumentException("String too large");
>>>>>>>>>>> }
>>>>>>>>>>> ```
>>>>>>>>>>> 
>>>>>>>>>>> I'm not even sure how reliable my results are, since I have no clue about how branch prediction works here... Neither have I tested the correctness of this implementation.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On 15. Jul 2021, at 12:18, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks for reporting back.
>>>>>>>>>>>> 
>>>>>>>>>>>> We probably need to investigate this a bit more deeply and try and reproduce on our side.
>>>>>>>>>>>> 
>>>>>>>>>>>> One last question: you said that with manual unrolling you managed to get 2x faster: did you mean that string conversion got 2x faster or that you actually saw your FUSE benchmark going 2x faster because of the manual unrolling with strings?
>>>>>>>>>>>> 
>>>>>>>>>>>> Maurizio
>>>>>>>>>>>> 
>>>>>>>>>>>> On 15/07/2021 11:03, Sebastian Stenzel wrote:
>>>>>>>>>>>>> That, surprisingly, didn't change anything either. But don't worry too much, the performance isn't bad (in absolute figures) and it is by far not the only reason why I consider panama the best solution to create java bindings for c libs.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 12. Jul 2021, at 15:33, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Actually, after some bisecting, I found out that the performance of converting a memory segment into a string jumped 2x faster with this fix:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> https://github.com/openjdk/jdk17/commit/2db9005c07585b580b3ec0889b8b5e3ed0d0ca6a
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Which was integrated after the one I originally pointed at. They both seem to touch loop optimization in case of overflows, which the strlen code is triggering (since the loop limit checks for loop variable being positive).
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This is a simple patch which adds a string conversion test:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>> diff --git a/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java b/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>>>>> index ec4da5ffc88..5b3fb1a2b2a 100644
>>>>>>>>>>>>>> --- a/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>>>>> +++ b/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>>>>> @@ -93,10 +93,13 @@ public class StrLenTest {
>>>>>>>>>>>>>> FunctionDescriptor.ofVoid(C_POINTER).withAttribute(FunctionDescriptor.TRIVIAL_ATTRIBUTE_NAME, true));
>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> +    MemorySegment segment;
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>      @Setup
>>>>>>>>>>>>>>      public void setup() {
>>>>>>>>>>>>>>          str = makeString(size);
>>>>>>>>>>>>>>          segmentAllocator = SegmentAllocator.ofSegment(MemorySegment.allocateNative(size + 1, ResourceScope.newImplicitScope()));
>>>>>>>>>>>>>> +        segment = toCString(str, segmentAllocator);
>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>      @TearDown
>>>>>>>>>>>>>> @@ -104,6 +107,11 @@ public class StrLenTest {
>>>>>>>>>>>>>>          scope.close();
>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> +    @Benchmark
>>>>>>>>>>>>>> +    public String panama_str_conv() throws Throwable {
>>>>>>>>>>>>>> +        return CLinker.toJavaString(segment);
>>>>>>>>>>>>>> +    }
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>      @Benchmark
>>>>>>>>>>>>>>      public int jni_strlen() throws Throwable {
>>>>>>>>>>>>>>          return strlen(str);
>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Before the above fix, the numbers are as follows:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>> Benchmark                   (size)  Mode  Cnt    Score   Error Units
>>>>>>>>>>>>>> StrLenTest.panama_str_conv     100  avgt   30  106.613 ± 7.060 ns/op
>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> While after the fix I get this:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>> Benchmark                   (size)  Mode  Cnt   Score   Error Units
>>>>>>>>>>>>>> StrLenTest.panama_str_conv     100  avgt   30  48.120 ± 0.557 ns/op
>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> So, as you can see, a pretty sizeable jump. Eyeballing, the shape of generated code doesn't look too different, which makes me think of another case where loop is unrolled, but main loop never executed (similar to JDK-8269230), but we'll need to look deeper.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 12/07/2021 14:12, Maurizio Cimadamore wrote:
>>>>>>>>>>>>>>> On 12/07/2021 13:18, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>> Hey Maurizio,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> All tests have been done on commit 42e03fd7c6a (for details how I built the JDK, see my initial email). Maybe I'm missing some compiler flags to enable all optimizations?
>>>>>>>>>>>>>>> I see - you do have the latest panama changes, but there has been a sync with upstream after that changeset, I believe - can you please try to resync with the latest foreign-jextract commit - which should be:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> https://github.com/openjdk/panama-foreign/commit/b2a284f6678c6a0d78fbdc8655695119ccb0dadb
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Are you sure about loop vectorization being applied to strlen? I'm not an expert on this field, but I had the impression this wasn't possible when the loop terminates "from within".
>>>>>>>>>>>>>>> Vlad is the expert here - when chatting offline he did mention that loop should have single exit - which I guess also takes into account the "normal" exit - so the strlen routine would seem to have two exits...
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On 12. Jul 2021, at 13:50, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi Sebastian,
>>>>>>>>>>>>>>>>> thanks for sharing your findings - I've done some attempts here with a targeted microbenchmark which measures the performance of string conversion and I'm seeing unrolling and vectorization being applied on the strlen computation.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> May I ask if, by any chance, your HEAD has not been updated in the last few weeks? There has been a C2 optimization fix which has been added recently, which I think might be related to this:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8269230
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Do you have this fix in the JDK you are using?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On 10/07/2021 15:58, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> good idea, but it makes no difference beyond statistical error.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I started sampling the application with VisualVM (which is quite hard, since native threads are extremely short-lived). What I noticed is that, regardless of where the sampler interrupts a thread, in nearly all cases 100% of the CPU time is attributed to jdk.internal.foreign.abi.SharedUtils.toJavaStringInternal() → jdk.internal.foreign.abi.SharedUtils.strlen().
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I know that strlen can hardly be optimized due to the nature of null termination, but maybe we can make use of the fact that we're dealing with MemorySegments here: Since they protect us from overflows, maybe there is no need to look at only a single byte at a time. Maybe the strlen()-loop can be unrolled or even be vectorized.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I just did a quick test and observed a x2 speedup when doing a x4 loop unroll.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On 9. Jul 2021, at 20:30, Jorn Vernee <jorn.vernee at oracle.com> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi Sebastian,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks for testing this. Looking at your code, one possible explanation for the discrepancy I can think of is that the DirFiller ends up using virtual downcalls to do its work, which are currently not intrinsified. This is mostly a case of 'not implemented yet', i.e. it is a known issue.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>      static fuse_fill_dir_t ofAddress(MemoryAddress addr) {
>>>>>>>>>>>>>>>>>>>          return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>>>>>              try {
>>>>>>>>>>>>>>>>>>>                  return (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, x0, x1, x2, x3); // <--------- 'addr' here is not a constant, so the call is virtual
>>>>>>>>>>>>>>>>>>>              } catch (Throwable ex$) {
>>>>>>>>>>>>>>>>>>>                  throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>>>>>              }
>>>>>>>>>>>>>>>>>>>          };
>>>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> For testing purposes, a possible workaround could be to have a cache that maps the callback address to a method handle that has the address bound to the first parameter. Assuming readdir always gets the same filler callback address, the same MethodHandle will be reused and eventually customized which means the callback address will become constant, and the downcall should then be intrinsified.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I don't currently have access to a Mac machine to test this, but if you want to try it out, the patch should be this:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>> diff --git a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>>>>> index bfd4655..4c68d4c 100644
>>>>>>>>>>>>>>>>>>> --- a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>>>>> +++ b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>>>>> @@ -3,8 +3,12 @@
>>>>>>>>>>>>>>>>>>>   package de.skymatic.fusepanama.lowlevel;
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>   import java.lang.invoke.MethodHandle;
>>>>>>>>>>>>>>>>>>> +import java.lang.invoke.MethodHandles;
>>>>>>>>>>>>>>>>>>>   import java.lang.invoke.VarHandle;
>>>>>>>>>>>>>>>>>>>   import java.nio.ByteOrder;
>>>>>>>>>>>>>>>>>>> +import java.util.Map;
>>>>>>>>>>>>>>>>>>> +import java.util.concurrent.ConcurrentHashMap;
>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>   import jdk.incubator.foreign.*;
>>>>>>>>>>>>>>>>>>>   import static jdk.incubator.foreign.CLinker.*;
>>>>>>>>>>>>>>>>>>>   public interface fuse_fill_dir_t {
>>>>>>>>>>>>>>>>>>> @@ -17,13 +21,19 @@ public interface fuse_fill_dir_t {
>>>>>>>>>>>>>>>>>>>           return RuntimeHelper.upcallStub(fuse_fill_dir_t.class, fi, constants$0.fuse_fill_dir_t$FUNC, "(Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;J)I", scope);
>>>>>>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>>>>>>       static fuse_fill_dir_t ofAddress(MemoryAddress addr) {
>>>>>>>>>>>>>>>>>>> -        return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>>>>> -            try {
>>>>>>>>>>>>>>>>>>> -                return (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, x0, x1, x2, x3);
>>>>>>>>>>>>>>>>>>> -            } catch (Throwable ex$) {
>>>>>>>>>>>>>>>>>>> -                throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>>>>> -            }
>>>>>>>>>>>>>>>>>>> -        };
>>>>>>>>>>>>>>>>>>> +        class CacheHolder {
>>>>>>>>>>>>>>>>>>> +            static final Map<MemoryAddress, fuse_fill_dir_t> CACHE = new ConcurrentHashMap<>();
>>>>>>>>>>>>>>>>>>> +        }
>>>>>>>>>>>>>>>>>>> +        return CacheHolder.CACHE.computeIfAbsent(addr, addrK -> {
>>>>>>>>>>>>>>>>>>> +            final MethodHandle target = MethodHandles.insertArguments(constants$0.fuse_fill_dir_t$MH, 0, addrK);
>>>>>>>>>>>>>>>>>>> +            return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>>>>> +                try {
>>>>>>>>>>>>>>>>>>> +                    return (int)target.invokeExact(x0, x1, x2, x3);
>>>>>>>>>>>>>>>>>>> +                } catch (Throwable ex$) {
>>>>>>>>>>>>>>>>>>> +                    throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>>>>> +                }
>>>>>>>>>>>>>>>>>>> +            };
>>>>>>>>>>>>>>>>>>> +        });
>>>>>>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>> (I hope these code blocks don't get mangled too much by line wrapping)
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> HTH,
>>>>>>>>>>>>>>>>>>> Jorn
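
The caching idea above can also be tried in isolation, without any FUSE or Panama classes. The following is a minimal sketch under stated assumptions: a plain static Java method (`fill`, hypothetical) stands in for the native filler callback, and a `long` stands in for the `MemoryAddress`. It only shows how `MethodHandles.insertArguments` and `computeIfAbsent` combine so that the same bound handle is reused for the same address; it says nothing about whether the JIT then intrinsifies a real downcall.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class BoundHandleCache {
    // Hypothetical stand-in for the native callback: (address, name, offset) -> int.
    static int fill(long address, String name, long offset) {
        return name.length() + (int) offset;
    }

    // Unbound handle, comparable in role to constants$0.fuse_fill_dir_t$MH in the patch.
    static final MethodHandle FILL;
    static {
        try {
            FILL = MethodHandles.lookup().findStatic(BoundHandleCache.class, "fill",
                    MethodType.methodType(int.class, long.class, String.class, long.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // One bound handle per callback address.
    static final Map<Long, MethodHandle> CACHE = new ConcurrentHashMap<>();

    // Returns a handle of type (String, long) -> int with the address bound as first argument.
    static MethodHandle forAddress(long address) {
        return CACHE.computeIfAbsent(address, a -> MethodHandles.insertArguments(FILL, 0, a));
    }

    public static void main(String[] args) throws Throwable {
        MethodHandle h1 = forAddress(0x1000L);
        MethodHandle h2 = forAddress(0x1000L);
        System.out.println(h1 == h2); // same address, same cached handle
        System.out.println((int) h1.invokeExact("foo", 2L));
    }
}
```
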
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On 09/07/2021 10:58, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I wanted to share the results of a benchmark test, that includes several down- and upcalls. First, let me explain, what I'm testing here:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I'm working on a panama-based FUSE binding, mostly for experimental purposes right now, and I'm trying to beat fuse-jnr [1].
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> While there are some other interesting metrics, such as read/write performance (both sequentially and random access), I focused on directory listings for now. Directory listings are the most complex operation in regards to the number of down- and upcalls:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 1. FUSE upcalls readdir and provides a callback function
>>>>>>>>>>>>>>>>>>>> 2. java downcalls the callback for each item in the directory
>>>>>>>>>>>>>>>>>>>> 3. FUSE upcalls getattr for each item (no longer required with "readdirplus" in FUSE 3.x)
>>>>>>>>>>>>>>>>>>>> (4. I'm testing on macOS, which introduces additional noise (such as readxattr and trying to access files that I didn't report in readdir))
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> So, what I'm testing is essentially this: `Files.list(Path.of("/Volumes/foo")).close();` with the volume reporting eight files [2]. When mounting with debug logs enabled, I can see that the exact same operations in the same order are invoked on both fuse-jnr and fuse-panama. One single dir listing results in 2 readdir upcalls, 10 callback downcalls, 16 getattr upcalls. There are also 8 getxattr calls and 16 lookup calls, however they don't reach Java, as the FUSE kernel knows they are not implemented.
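
To make the measured operation concrete, here is a minimal sketch that lists a directory containing eight files. It is not the JMH benchmark from the repository: it runs against a temporary directory on the local file system instead of a FUSE mount, it counts the entries (the benchmark only opens and closes the stream), and the names `ListDirSketch`, `listOnce` and `createDirWithFiles` are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

class ListDirSketch {
    // Lists the directory once and returns the number of entries.
    // try-with-resources closes the stream, like the close() call in the benchmark.
    static long listOnce(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.count();
        }
    }

    // Creates a temporary directory holding n empty files.
    static Path createDirWithFiles(int n) throws IOException {
        Path dir = Files.createTempDirectory("listdir-sketch");
        for (int i = 0; i < n; i++) {
            Files.createFile(dir.resolve("file" + i + ".txt"));
        }
        return dir;
    }

    public static void main(String[] args) throws IOException {
        Path dir = createDirWithFiles(8);
        long start = System.nanoTime();
        long count = listOnce(dir);
        long elapsedMicros = (System.nanoTime() - start) / 1_000;
        // A single unwarmed run; use JMH for anything resembling a real measurement.
        System.out.println(count + " entries in " + elapsedMicros + " us");
    }
}
```
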
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Long story short, here are the results:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>> Benchmark                        Mode  Cnt    Score Error  Units
>>>>>>>>>>>>>>>>>>>> BenchmarkTest.testListDirJnr     avgt    5   66,569 ± 3,128  us/op
>>>>>>>>>>>>>>>>>>>> BenchmarkTest.testListDirPanama  avgt    5  189,340 ± 4,275  us/op
>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I've been using panama snapshot at commit 42e03fd7c6a built with: `configure --with-boot-jdk=/Library/Java/JavaVirtualMachines/adoptopenjdk-16.jdk/Contents/Home/ --with-native-debug-symbols=none --with-debug-level=release --with-libclang=/usr/local/opt/llvm --with-libclang-version=12`
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I can't tell where this overhead comes from. Maybe creating a newConfinedScope() during each upcall [3] is "too much"? Maybe JNR is just negligently skipping some memory boundary checks to be faster. The results are not terrible, but I'd hoped for something better.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> [1] https://github.com/SerCeMan/jnr-fuse
>>>>>>>>>>>>>>>>>>>> [2] https://github.com/skymatic/fuse-panama/blob/develop/src/test/java/de/skymatic/fusepanama/examples/HelloPanamaFileSystem.java#L139-L146
>>>>>>>>>>>>>>>>>>>> [3] https://github.com/skymatic/fuse-panama/blob/769347575863861063a2347a42b2cbaadb5eacef/src/main/java/de/skymatic/fusepanama/FuseOperations.java#L67-L71