Real-Life Benchmark for FUSE's readdir()

Thu Jul 15 12:40:21 UTC 2021

I believe that it would be more useful to try to run the perfasm 
profiler with JMH.

This can be done relatively easily, at least on Linux, if you pass the 
argument `-prof perfasm` to JMH (this would need hsdis-amd64.so on 
Linux to print readable assembly).

Another thing worth checking is allocation rate: `-prof gc`.

Maurizio

On 15/07/2021 12:30, Sebastian Stenzel wrote:
> Ok it really seems like VisualVM can't deal with these kinds of tasks 
> yet. Now it reports the String constructor being the culprit [1], 
> however I strongly doubt that, since this is probably one of the most 
> heavily optimized parts of the JDK.
>
> [1]: Screenshot on https://imgur.com/a/SHG8RSQ
>
>> On 15. Jul 2021, at 13:11, Sebastian Stenzel 
>> <sebastian.stenzel at gmail.com> wrote:
>>
>> Yes, must be a sampling error. Do you know of a (publicly available) 
>> _profiler_ that is compatible with JDK 17 / 18 already?
>>
>>> On 15. Jul 2021, at 13:08, Maurizio Cimadamore 
>>> <maurizio.cimadamore at oracle.com> wrote:
>>>
>>> Aha - it seems like you are seeing what I was seeing: unrolling now 
>>> seems to happen more reliably, which positively affects code like strlen.
>>>
>>> As for FUSE, I think the reason for the difference has probably 
>>> nothing to do with string conversion - the sampler profiler just 
>>> happens to hit that code a lot. I checked JNR code for string 
>>> conversion and I couldn't really find anything uber optimized in 
>>> that regard that could explain the gap.
>>>
>>> Probably something is not getting optimized as it should - likely a 
>>> downcall/upcall intrinsification is failing - maybe due to a subtle 
>>> issue with your code, or, possibly because you are hitting a 
>>> non-implemented case (e.g. we do not intrinsify calls which pass 
>>> arguments on the stack, yet), or because of some other bug.
>>>
>>> Maurizio
>>>
>>>
>>> On 15/07/2021 12:03, Sebastian Stenzel wrote:
>>>> Wow, I stand corrected. I just re-ran the benchmark and 
>>>> `benchmarkStrlenBase` just got a lot faster!! Your change in 
>>>> https://github.com/openjdk/jdk17/commit/2db9005c07585b580b3ec0889b8b5e3ed0d0ca6a 
>>>> DID have an effect after all.
>>>>
>>>> Just doesn't impress FUSE very much...
>>>>
>>>>> On 15. Jul 2021, at 13:00, Sebastian Stenzel 
>>>>> <sebastian.stenzel at gmail.com> wrote:
>>>>>
>>>>> Yup, I tried the int-approach as well, but with worse results... 
>>>>> Here is the full test: 
>>>>> https://gist.github.com/overheadhunter/86e7baae7dfe47c49ff364590a4f3ea6
>>>>>
>>>>>> On 15. Jul 2021, at 12:51, Maurizio Cimadamore 
>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>
>>>>>> Ok. Thanks.
>>>>>>
>>>>>> I tried similar experiments where instead of reading 4 bytes 
>>>>>> separately I'd read a single int value, and then use shifts and 
>>>>>> bitmasking to check for terminators. On paper good, but benchmark 
>>>>>> results were always worse than the version we have now (at least 
>>>>>> on Linux).
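
For illustration, a minimal sketch of the "read one int, then use bit tricks to find the terminator" idea described above. This is not the code that was benchmarked in this thread: a `ByteBuffer` stands in for the memory segment, the class and method names are made up for the example, and the zero-byte test is the well-known `(v - 0x01010101) & ~v & 0x80808080` trick rather than whatever shifts and masks were actually tried.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical example class; not part of the JDK or of fuse-panama.
class StrlenIntRead {

    // Returns the number of bytes between 'start' and the first zero byte.
    static int strlenInt(ByteBuffer buf, int start) {
        // duplicate() resets the byte order, so set little-endian explicitly:
        // the lowest-order byte of each int is then the lowest address.
        ByteBuffer le = buf.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        int limit = le.limit();
        int pos = start;
        while (pos + 4 <= limit) {
            int v = le.getInt(pos);
            // Non-zero iff at least one of the four bytes is zero; the lowest
            // set marker bit (7, 15, 23 or 31) belongs to the first zero byte.
            int mask = (v - 0x01010101) & ~v & 0x80808080;
            if (mask != 0) {
                return pos - start + (Integer.numberOfTrailingZeros(mask) >>> 3);
            }
            pos += 4;
        }
        // Fewer than four bytes left: fall back to a byte-wise scan.
        for (; pos < limit; pos++) {
            if (le.get(pos) == 0) {
                return pos - start;
            }
        }
        throw new IllegalArgumentException("No terminator found");
    }
}
```

Whether this beats a byte-wise loop depends on what C2 does with each variant, which is exactly the open question here, so it should be measured rather than assumed.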
>>>>>>
>>>>>> That said, if you could please share the full string benchmark 
>>>>>> you have, that'd be helpful, so we can take a look at that, and 
>>>>>> see what's going wrong (ideally, C2 should be the one doing 
>>>>>> unrolling).
>>>>>>
>>>>>> Maurizio
>>>>>>
>>>>>> On 15/07/2021 11:28, Sebastian Stenzel wrote:
>>>>>>> I just did a quick synthetic test on a "manually unrolled" 
>>>>>>> strlen() without any FUSE context.
>>>>>>>
>>>>>>> I experimented with an implementation that looked like the 
>>>>>>> following and benchmarked it using a 259 byte memory segment 
>>>>>>> containing a 239 byte string (null byte at index 240):
>>>>>>>
>>>>>>> ```
>>>>>>> private static int strlenUnroll4(MemorySegment segment, long start) {
>>>>>>>     int offset;
>>>>>>>     // main loop: inspect four bytes per iteration
>>>>>>>     for (offset = 0; offset < segment.byteSize() - 3; offset += 4) {
>>>>>>>         byte b0 = MemoryAccess.getByteAtOffset(segment, start + offset + 0);
>>>>>>>         byte b1 = MemoryAccess.getByteAtOffset(segment, start + offset + 1);
>>>>>>>         byte b2 = MemoryAccess.getByteAtOffset(segment, start + offset + 2);
>>>>>>>         byte b3 = MemoryAccess.getByteAtOffset(segment, start + offset + 3);
>>>>>>>         if (b0 == 0 || b1 == 0 || b2 == 0 || b3 == 0) { // is this even faster than directly having 4 different branches?
>>>>>>>             if (b0 == 0) {
>>>>>>>                 return offset;
>>>>>>>             } else if (b1 == 0) {
>>>>>>>                 return offset + 1;
>>>>>>>             } else if (b2 == 0) {
>>>>>>>                 return offset + 2;
>>>>>>>             } else if (b3 == 0) {
>>>>>>>                 return offset + 3;
>>>>>>>             }
>>>>>>>         }
>>>>>>>     }
>>>>>>>     while (offset < segment.byteSize()) { // TODO: maybe no loop required for the remaining <4 bytes?
>>>>>>>         byte b = MemoryAccess.getByteAtOffset(segment, start + offset);
>>>>>>>         if (b == 0) {
>>>>>>>             return offset;
>>>>>>>         }
>>>>>>>         offset++; // advance, otherwise this tail loop never terminates
>>>>>>>     }
>>>>>>>     throw new IllegalArgumentException("String too large");
>>>>>>> }
>>>>>>> ```
>>>>>>>
>>>>>>> I'm not even sure how reliable my results are, since I have no 
>>>>>>> clue about how branch prediction works here... Neither have I 
>>>>>>> tested the correctness of this implementation.
>>>>>>>
>>>>>>>
>>>>>>>> On 15. Jul 2021, at 12:18, Maurizio Cimadamore 
>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>
>>>>>>>> Thanks for reporting back.
>>>>>>>>
>>>>>>>> We probably need to investigate this a bit more deeply and try 
>>>>>>>> and reproduce on our side.
>>>>>>>>
>>>>>>>> One last question: you said that with manual unrolling you 
>>>>>>>> managed to get 2x faster: did you mean that string conversion 
>>>>>>>> got 2x faster or that you actually saw your FUSE benchmark 
>>>>>>>> going 2x faster because of the manual unrolling with strings?
>>>>>>>>
>>>>>>>> Maurizio
>>>>>>>>
>>>>>>>> On 15/07/2021 11:03, Sebastian Stenzel wrote:
>>>>>>>>> That, surprisingly, didn't change anything either. But don't 
>>>>>>>>> worry too much, the performance isn't bad (in absolute 
>>>>>>>>> figures) and it is by far not the only reason why I consider 
>>>>>>>>> panama the best solution to create java bindings for c libs.
>>>>>>>>>
>>>>>>>>>> On 12. Jul 2021, at 15:33, Maurizio Cimadamore 
>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Actually, after some bisecting, I found out that the 
>>>>>>>>>> performance of converting a memory segment into a string 
>>>>>>>>>> jumped 2x faster with this fix:
>>>>>>>>>>
>>>>>>>>>> https://github.com/openjdk/jdk17/commit/2db9005c07585b580b3ec0889b8b5e3ed0d0ca6a
>>>>>>>>>>
>>>>>>>>>> Which was integrated after the one I originally pointed at. 
>>>>>>>>>> They both seem to touch loop optimization in case of 
>>>>>>>>>> overflows, which the strlen code is triggering (since the 
>>>>>>>>>> loop limit checks for loop variable being positive).
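
To make the loop shape concrete, here is a small sketch of a strlen whose only loop condition is that the counter has not overflowed into negative values, as described above. It is an assumption about the general shape, not a copy of the JDK source: a `ByteBuffer` stands in for the memory segment (its bounds check plays the role of the segment's), and the class name is made up.

```java
import java.nio.ByteBuffer;

// Hypothetical example class illustrating the loop shape only.
class StrlenLoopShape {

    static int strlen(ByteBuffer buf, int start) {
        // No explicit upper bound: the loop ends on a zero byte, on an
        // out-of-bounds access (IndexOutOfBoundsException), or when
        // 'offset' overflows and becomes negative.
        for (int offset = 0; offset >= 0; offset++) {
            if (buf.get(start + offset) == 0) {
                return offset;
            }
        }
        throw new IllegalArgumentException("String too large");
    }
}
```

The loop has two ways out (the early return and the overflow check), which is the kind of structure the loop optimizations discussed in this thread have to cope with.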
>>>>>>>>>>
>>>>>>>>>> This is a simple patch which adds a string conversion test:
>>>>>>>>>>
>>>>>>>>>> ```
>>>>>>>>>> diff --git a/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java b/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>> index ec4da5ffc88..5b3fb1a2b2a 100644
>>>>>>>>>> --- a/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>> +++ b/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>> @@ -93,10 +93,13 @@ public class StrLenTest {
>>>>>>>>>> FunctionDescriptor.ofVoid(C_POINTER).withAttribute(FunctionDescriptor.TRIVIAL_ATTRIBUTE_NAME, true));
>>>>>>>>>>      }
>>>>>>>>>>
>>>>>>>>>> +    MemorySegment segment;
>>>>>>>>>> +
>>>>>>>>>>      @Setup
>>>>>>>>>>      public void setup() {
>>>>>>>>>>          str = makeString(size);
>>>>>>>>>>          segmentAllocator = SegmentAllocator.ofSegment(MemorySegment.allocateNative(size + 1, ResourceScope.newImplicitScope()));
>>>>>>>>>> +        segment = toCString(str, segmentAllocator);
>>>>>>>>>>      }
>>>>>>>>>>
>>>>>>>>>>      @TearDown
>>>>>>>>>> @@ -104,6 +107,11 @@ public class StrLenTest {
>>>>>>>>>>          scope.close();
>>>>>>>>>>      }
>>>>>>>>>>
>>>>>>>>>> +    @Benchmark
>>>>>>>>>> +    public String panama_str_conv() throws Throwable {
>>>>>>>>>> +        return CLinker.toJavaString(segment);
>>>>>>>>>> +    }
>>>>>>>>>> +
>>>>>>>>>>      @Benchmark
>>>>>>>>>>      public int jni_strlen() throws Throwable {
>>>>>>>>>>          return strlen(str);
>>>>>>>>>> ```
>>>>>>>>>>
>>>>>>>>>> Before the above fix, the numbers are as follows:
>>>>>>>>>>
>>>>>>>>>> ```
>>>>>>>>>> Benchmark                   (size)  Mode  Cnt    Score   Error  Units
>>>>>>>>>> StrLenTest.panama_str_conv     100  avgt   30  106.613 ± 7.060  ns/op
>>>>>>>>>> ```
>>>>>>>>>>
>>>>>>>>>> While after the fix I get this:
>>>>>>>>>>
>>>>>>>>>> ```
>>>>>>>>>> Benchmark                   (size)  Mode  Cnt   Score   Error  Units
>>>>>>>>>> StrLenTest.panama_str_conv     100  avgt   30  48.120 ± 0.557  ns/op
>>>>>>>>>> ```
>>>>>>>>>>
>>>>>>>>>> So, as you can see, a pretty sizeable jump. Eyeballing, the 
>>>>>>>>>> shape of generated code doesn't look too different, which 
>>>>>>>>>> makes me think of another case where loop is unrolled, but 
>>>>>>>>>> main loop never executed (similar to JDK-8269230), but we'll 
>>>>>>>>>> need to look deeper.
>>>>>>>>>>
>>>>>>>>>> Maurizio
>>>>>>>>>>
>>>>>>>>>> On 12/07/2021 14:12, Maurizio Cimadamore wrote:
>>>>>>>>>>> On 12/07/2021 13:18, Sebastian Stenzel wrote:
>>>>>>>>>>>> Hey Maurizio,
>>>>>>>>>>>>
>>>>>>>>>>>> All tests have been done on commit 42e03fd7c6a (for details 
>>>>>>>>>>>> how I built the JDK, see my initial email). Maybe I'm 
>>>>>>>>>>>> missing some compiler flags to enable all optimizations?
>>>>>>>>>>> I see - you do have the latest panama changes, but there has 
>>>>>>>>>>> been a sync with upstream after that changeset, I believe - 
>>>>>>>>>>> can you please try to resync with the latest 
>>>>>>>>>>> foreign-jextract commit - which should be:
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/openjdk/panama-foreign/commit/b2a284f6678c6a0d78fbdc8655695119ccb0dadb
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Are you sure about loop vectorization being applied to 
>>>>>>>>>>>> strlen? I'm not an expert on this field, but I had the 
>>>>>>>>>>>> impression this wasn't possible when the loop terminates 
>>>>>>>>>>>> "from within".
>>>>>>>>>>> Vlad is the expert here - when chatting offline he did 
>>>>>>>>>>> mention that loop should have single exit - which I guess 
>>>>>>>>>>> also takes into account the "normal" exit - so the strlen 
>>>>>>>>>>> routine would seem to have two exits...
>>>>>>>>>>>
>>>>>>>>>>> Maurizio
>>>>>>>>>>>
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>
>>>>>>>>>>>>> On 12. Jul 2021, at 13:50, Maurizio Cimadamore 
>>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Sebastian,
>>>>>>>>>>>>> thanks for sharing your findings - I've done some attempts 
>>>>>>>>>>>>> here with a targeted microbenchmark which measures the 
>>>>>>>>>>>>> performance of string conversion and I'm seeing unrolling 
>>>>>>>>>>>>> and vectorization being applied on the strlen computation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> May I ask if, by any chance, your HEAD has not been 
>>>>>>>>>>>>> updated in the last few weeks? There has been a C2 
>>>>>>>>>>>>> optimization fix which has been added recently, which I 
>>>>>>>>>>>>> think might be related to this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8269230
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do you have this fix in the JDK you are using?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/07/2021 15:58, Sebastian Stenzel wrote:
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> good idea, but it makes no difference beyond statistical 
>>>>>>>>>>>>>> error.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I started sampling the application with VisualVM (which 
>>>>>>>>>>>>>> is quite hard, since native threads are extremely 
>>>>>>>>>>>>>> short-lived). What I noticed is that, regardless of where 
>>>>>>>>>>>>>> the sampler interrupts a thread, in nearly all cases 100% 
>>>>>>>>>>>>>> of CPU time is spent in 
>>>>>>>>>>>>>> jdk.internal.foreign.abi.SharedUtils.toJavaStringInternal() 
>>>>>>>>>>>>>> → jdk.internal.foreign.abi.SharedUtils.strlen().
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I know that strlen can hardly be optimized due to the 
>>>>>>>>>>>>>> nature of null termination, but maybe we can make use of 
>>>>>>>>>>>>>> the fact that we're dealing with MemorySegments here: 
>>>>>>>>>>>>>> Since they protect us from overflows, maybe there is no 
>>>>>>>>>>>>>> need to look at only a single byte at a time. Maybe the 
>>>>>>>>>>>>>> strlen()-loop can be unrolled or even be vectorized.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I just did a quick test and observed a x2 speedup when 
>>>>>>>>>>>>>> doing a x4 loop unroll.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 9. Jul 2021, at 20:30, Jorn Vernee 
>>>>>>>>>>>>>>> <jorn.vernee at oracle.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Sebastian,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for testing this. Looking at your code, one 
>>>>>>>>>>>>>>> possible explanation for the discrepancy I can think of 
>>>>>>>>>>>>>>> is that the DirFiller ends up using virtual downcalls to 
>>>>>>>>>>>>>>> do its work, which are currently not intrinsified. 
>>>>>>>>>>>>>>> This is mostly a case of 'not implemented yet', i.e. it is 
>>>>>>>>>>>>>>> a known issue.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>      static fuse_fill_dir_t ofAddress(MemoryAddress addr) {
>>>>>>>>>>>>>>>          return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>              try {
>>>>>>>>>>>>>>>                  // <--------- 'addr' here is not a constant, so the call is virtual
>>>>>>>>>>>>>>>                  return (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, x0, x1, x2, x3);
>>>>>>>>>>>>>>>              } catch (Throwable ex$) {
>>>>>>>>>>>>>>>                  throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>              }
>>>>>>>>>>>>>>>          };
>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For testing purposes, a possible workaround could be to 
>>>>>>>>>>>>>>> have a cache that maps the callback address to a method 
>>>>>>>>>>>>>>> handle that has the address bound to the first 
>>>>>>>>>>>>>>> parameter. Assuming readdir always gets the same filler 
>>>>>>>>>>>>>>> callback address, the same MethodHandle will be reused 
>>>>>>>>>>>>>>> and eventually customized which means the callback 
>>>>>>>>>>>>>>> address will become constant, and the downcall should 
>>>>>>>>>>>>>>> then be intrinsified.
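
As a library-independent illustration of the caching idea above, the following sketch binds a leading "address" argument into a method handle with `MethodHandles.insertArguments` and reuses the bound handle per address. Everything here is a stand-in: `fill` is a plain Java method playing the role of the native filler callback, a `long` replaces `MemoryAddress`, and the class and method names are hypothetical. It only shows the binding and caching mechanics, not whether the JIT then treats the address as a constant.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical example class; names are not taken from fuse-panama.
class BoundHandleCache {

    // Stand-in for the downcall target: (address, name) -> int
    static int fill(long address, String name) {
        return (int) address + name.length();
    }

    static final MethodHandle FILL_MH;
    static {
        try {
            FILL_MH = MethodHandles.lookup().findStatic(BoundHandleCache.class, "fill",
                    MethodType.methodType(int.class, long.class, String.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // One bound handle per callback address, created on first use.
    static final Map<Long, MethodHandle> CACHE = new ConcurrentHashMap<>();

    static MethodHandle forAddress(long address) {
        // insertArguments fixes parameter 0, leaving a handle of type (String)int
        return CACHE.computeIfAbsent(address,
                a -> MethodHandles.insertArguments(FILL_MH, 0, a));
    }

    static int call(long address, String name) {
        try {
            return (int) forAddress(address).invokeExact(name);
        } catch (Throwable t) {
            throw new AssertionError("should not reach here", t);
        }
    }
}
```

Calling `forAddress` twice with the same address returns the same handle instance, which is the property the workaround relies on.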
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I don't currently have access to a Mac machine to test 
>>>>>>>>>>>>>>> this, but if you want to try it out, the patch should be 
>>>>>>>>>>>>>>> this:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>> diff --git a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>> index bfd4655..4c68d4c 100644
>>>>>>>>>>>>>>> --- a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>> +++ b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>> @@ -3,8 +3,12 @@
>>>>>>>>>>>>>>>   package de.skymatic.fusepanama.lowlevel;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   import java.lang.invoke.MethodHandle;
>>>>>>>>>>>>>>> +import java.lang.invoke.MethodHandles;
>>>>>>>>>>>>>>>   import java.lang.invoke.VarHandle;
>>>>>>>>>>>>>>>   import java.nio.ByteOrder;
>>>>>>>>>>>>>>> +import java.util.Map;
>>>>>>>>>>>>>>> +import java.util.concurrent.ConcurrentHashMap;
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>   import jdk.incubator.foreign.*;
>>>>>>>>>>>>>>>   import static jdk.incubator.foreign.CLinker.*;
>>>>>>>>>>>>>>>   public interface fuse_fill_dir_t {
>>>>>>>>>>>>>>> @@ -17,13 +21,19 @@ public interface fuse_fill_dir_t {
>>>>>>>>>>>>>>>           return RuntimeHelper.upcallStub(fuse_fill_dir_t.class, fi, constants$0.fuse_fill_dir_t$FUNC, "(Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;J)I", scope);
>>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>>       static fuse_fill_dir_t ofAddress(MemoryAddress addr) {
>>>>>>>>>>>>>>> -        return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>> -            try {
>>>>>>>>>>>>>>> -                return (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, x0, x1, x2, x3);
>>>>>>>>>>>>>>> -            } catch (Throwable ex$) {
>>>>>>>>>>>>>>> -                throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>> -            }
>>>>>>>>>>>>>>> -        };
>>>>>>>>>>>>>>> +        class CacheHolder {
>>>>>>>>>>>>>>> +            static final Map<MemoryAddress, fuse_fill_dir_t> CACHE = new ConcurrentHashMap<>();
>>>>>>>>>>>>>>> +        }
>>>>>>>>>>>>>>> +        return CacheHolder.CACHE.computeIfAbsent(addr, addrK -> {
>>>>>>>>>>>>>>> +            final MethodHandle target = MethodHandles.insertArguments(constants$0.fuse_fill_dir_t$MH, 0, addrK);
>>>>>>>>>>>>>>> +            return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>> +                try {
>>>>>>>>>>>>>>> +                    return (int)target.invokeExact(x0, x1, x2, x3);
>>>>>>>>>>>>>>> +                } catch (Throwable ex$) {
>>>>>>>>>>>>>>> +                    throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>> +                }
>>>>>>>>>>>>>>> +            };
>>>>>>>>>>>>>>> +        });
>>>>>>>>>>>>>>>       }
>>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>> (I hope these code blocks don't get mangled too much by 
>>>>>>>>>>>>>>> line wrapping)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> HTH,
>>>>>>>>>>>>>>> Jorn
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 09/07/2021 10:58, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I wanted to share the results of a benchmark test that 
>>>>>>>>>>>>>>>> includes several down- and upcalls. First, let me 
>>>>>>>>>>>>>>>> explain what I'm testing here:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm working on a panama-based FUSE binding, mostly for 
>>>>>>>>>>>>>>>> experimental purposes right now, and I'm trying to beat 
>>>>>>>>>>>>>>>> fuse-jnr [1].
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> While there are some other interesting metrics, such as 
>>>>>>>>>>>>>>>> read/write performance (both sequentially and random 
>>>>>>>>>>>>>>>> access), I focused on directory listings for now. 
>>>>>>>>>>>>>>>> Directory listings are the most complex operation in 
>>>>>>>>>>>>>>>> regards to the number of down- and upcalls:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. FUSE upcalls readdir and provides a callback function
>>>>>>>>>>>>>>>> 2. java downcalls the callback for each item in the 
>>>>>>>>>>>>>>>> directory
>>>>>>>>>>>>>>>> 3. FUSE upcalls getattr for each item (no longer 
>>>>>>>>>>>>>>>> required with "readdirplus" in FUSE 3.x)
>>>>>>>>>>>>>>>> (4. I'm testing on macOS, which introduces additional 
>>>>>>>>>>>>>>>> noise (such as readxattr and trying to access files 
>>>>>>>>>>>>>>>> that I didn't report in readdir))
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So, what I'm testing is essentially this: 
>>>>>>>>>>>>>>>> `Files.list(Path.of("/Volumes/foo")).close();` with the 
>>>>>>>>>>>>>>>> volume reporting eight files [2]. When mounting with 
>>>>>>>>>>>>>>>> debug logs enabled, I can see that the exact same 
>>>>>>>>>>>>>>>> operations in the same order are invoked on both 
>>>>>>>>>>>>>>>> fuse-jnr and fuse-panama. One single dir listing 
>>>>>>>>>>>>>>>> results in 2 readdir upcalls, 10 callback downcalls, 16 
>>>>>>>>>>>>>>>> getattr upcalls. There are also 8 getxattr calls and 16 
>>>>>>>>>>>>>>>> lookup calls, however they don't reach Java, as the 
>>>>>>>>>>>>>>>> FUSE kernel knows they are not implemented.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Long story short, here are the results:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>> Benchmark                        Mode  Cnt    Score   Error  Units
>>>>>>>>>>>>>>>> BenchmarkTest.testListDirJnr     avgt    5   66,569 ± 3,128  us/op
>>>>>>>>>>>>>>>> BenchmarkTest.testListDirPanama  avgt    5  189,340 ± 4,275  us/op
>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I've been using panama snapshot at commit 42e03fd7c6a 
>>>>>>>>>>>>>>>> built with: `configure 
>>>>>>>>>>>>>>>> --with-boot-jdk=/Library/Java/JavaVirtualMachines/adoptopenjdk-16.jdk/Contents/Home/ 
>>>>>>>>>>>>>>>> --with-native-debug-symbols=none 
>>>>>>>>>>>>>>>> --with-debug-level=release 
>>>>>>>>>>>>>>>> --with-libclang=/usr/local/opt/llvm 
>>>>>>>>>>>>>>>> --with-libclang-version=12`
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I can't tell where this overhead comes from. Maybe 
>>>>>>>>>>>>>>>> creating a newConfinedScope() during each upcall [3] is 
>>>>>>>>>>>>>>>> "too much"? Maybe JNR is just negligently skipping some 
>>>>>>>>>>>>>>>> memory boundary checks to be faster. The results are 
>>>>>>>>>>>>>>>> not terrible, but I'd hoped for something better.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1] https://github.com/SerCeMan/jnr-fuse
>>>>>>>>>>>>>>>> [2] https://github.com/skymatic/fuse-panama/blob/develop/src/test/java/de/skymatic/fusepanama/examples/HelloPanamaFileSystem.java#L139-L146
>>>>>>>>>>>>>>>> [3] https://github.com/skymatic/fuse-panama/blob/769347575863861063a2347a42b2cbaadb5eacef/src/main/java/de/skymatic/fusepanama/FuseOperations.java#L67-L71