Real-Life Benchmark for FUSE's readdir()

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Thu Jul 15 10:18:02 UTC 2021


Thanks for reporting back.

We probably need to investigate this a bit more deeply and try and 
reproduce on our side.

One last question: you said that with manual unrolling you managed to 
get 2x faster: did you mean that string conversion got 2x faster or that 
you actually saw your FUSE benchmark going 2x faster because of the 
manual unrolling with strings?

Maurizio
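
As an illustration of the manual unrolling discussed in this thread, here is a minimal sketch of a 4x unrolled NUL-terminator scan. It is not the JDK's SharedUtils.strlen() implementation: it assumes a plain byte[] instead of a MemorySegment, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a strlen over bounded memory (byte[] instead of MemorySegment).
class UnrolledStrlen {

    // Baseline: inspect one byte per iteration and stop at the first NUL byte.
    static int strlenSimple(byte[] buf) {
        for (int i = 0; i < buf.length; i++) {
            if (buf[i] == 0) {
                return i;
            }
        }
        throw new IllegalArgumentException("no NUL terminator found");
    }

    // Manually unrolled 4x: inspect four bytes per iteration of the main loop.
    static int strlenUnrolled4(byte[] buf) {
        int i = 0;
        // The main loop runs only while a full group of four bytes fits in the buffer.
        int limit = buf.length - 3;
        for (; i < limit; i += 4) {
            if (buf[i] == 0) return i;
            if (buf[i + 1] == 0) return i + 1;
            if (buf[i + 2] == 0) return i + 2;
            if (buf[i + 3] == 0) return i + 3;
        }
        // Tail loop: handle the remaining zero to three bytes one at a time.
        for (; i < buf.length; i++) {
            if (buf[i] == 0) return i;
        }
        throw new IllegalArgumentException("no NUL terminator found");
    }

    public static void main(String[] args) {
        byte[] cString = "readdir\0".getBytes(StandardCharsets.US_ASCII);
        System.out.println(strlenSimple(cString));
        System.out.println(strlenUnrolled4(cString));
    }
}
```

Because the buffer has a known bound, the main loop can check four bytes per iteration without overrunning it, and a scalar tail loop handles the remaining bytes. Whether this beats the code C2 generates on its own depends on the JDK build, as the rest of the thread shows.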

On 15/07/2021 11:03, Sebastian Stenzel wrote:
> That, surprisingly, didn't change anything either. But don't worry too much: the performance isn't bad in absolute terms, and it is far from the only reason I consider Panama the best solution for creating Java bindings for C libraries.
>
>> On 12. Jul 2021, at 15:33, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>>
>> Actually, after some bisecting, I found that converting a memory segment into a string became about 2x faster with this fix:
>>
>> https://github.com/openjdk/jdk17/commit/2db9005c07585b580b3ec0889b8b5e3ed0d0ca6a
>>
>> It was integrated after the fix I originally pointed at. Both seem to touch loop optimization in the presence of possible overflow, which the strlen code triggers (since the loop limit check tests that the loop variable is positive).
>>
>> This is a simple patch which adds a string conversion test:
>>
>> ```
>> diff --git a/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java b/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>> index ec4da5ffc88..5b3fb1a2b2a 100644
>> --- a/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>> +++ b/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>> @@ -93,10 +93,13 @@ public class StrLenTest {
>> FunctionDescriptor.ofVoid(C_POINTER).withAttribute(FunctionDescriptor.TRIVIAL_ATTRIBUTE_NAME, true));
>>       }
>>
>> +    MemorySegment segment;
>> +
>>       @Setup
>>       public void setup() {
>>           str = makeString(size);
>>           segmentAllocator = SegmentAllocator.ofSegment(MemorySegment.allocateNative(size + 1, ResourceScope.newImplicitScope()));
>> +        segment = toCString(str, segmentAllocator);
>>       }
>>
>>       @TearDown
>> @@ -104,6 +107,11 @@ public class StrLenTest {
>>           scope.close();
>>       }
>>
>> +    @Benchmark
>> +    public String panama_str_conv() throws Throwable {
>> +        return CLinker.toJavaString(segment);
>> +    }
>> +
>>       @Benchmark
>>       public int jni_strlen() throws Throwable {
>>           return strlen(str);
>> ```
>>
>> Before the above fix, the numbers are as follows:
>>
>> ```
>> Benchmark                   (size)  Mode  Cnt    Score   Error Units
>> StrLenTest.panama_str_conv     100  avgt   30  106.613 ± 7.060 ns/op
>> ```
>>
>> While after the fix I get this:
>>
>> ```
>> Benchmark                   (size)  Mode  Cnt   Score   Error Units
>> StrLenTest.panama_str_conv     100  avgt   30  48.120 ± 0.557 ns/op
>> ```
>>
>> So, as you can see, a pretty sizeable jump. From eyeballing it, the shape of the generated code doesn't look too different, which makes me suspect another case where the loop is unrolled but the main loop is never executed (similar to JDK-8269230), but we'll need to look deeper.
>>
>> Maurizio
>>
>> On 12/07/2021 14:12, Maurizio Cimadamore wrote:
>>> On 12/07/2021 13:18, Sebastian Stenzel wrote:
>>>> Hey Maurizio,
>>>>
>>>> All tests have been done on commit 42e03fd7c6a (for details on how I built the JDK, see my initial email). Maybe I'm missing some compiler flags to enable all optimizations?
>>> I see. You do have the latest Panama changes, but I believe there has been a sync with upstream after that changeset. Can you please try to resync with the latest foreign-jextract commit, which should be:
>>>
>>> https://github.com/openjdk/panama-foreign/commit/b2a284f6678c6a0d78fbdc8655695119ccb0dadb
>>>
>>>
>>>> Are you sure about loop vectorization being applied to strlen? I'm not an expert in this field, but I had the impression this wasn't possible when the loop terminates "from within".
>>> Vlad is the expert here. When chatting offline he did mention that the loop should have a single exit, which I guess also counts the "normal" exit, so the strlen routine would seem to have two exits...
>>>
>>> Maurizio
>>>
>>>> Best regards,
>>>> Sebastian
>>>>
>>>>> On 12. Jul 2021, at 13:50, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>>>>>
>>>>> Hi Sebastian,
>>>>> thanks for sharing your findings. I've made some attempts here with a targeted microbenchmark that measures the performance of string conversion, and I'm seeing unrolling and vectorization being applied to the strlen computation.
>>>>>
>>>>> May I ask if, by any chance, your HEAD has not been updated in the last few weeks? A C2 optimization fix was added recently, which I think might be related to this:
>>>>>
>>>>> https://bugs.openjdk.java.net/browse/JDK-8269230
>>>>>
>>>>> Do you have this fix in the JDK you are using?
>>>>>
>>>>> Thanks
>>>>> Maurizio
>>>>>
>>>>>
>>>>> On 10/07/2021 15:58, Sebastian Stenzel wrote:
>>>>>> Hi,
>>>>>>
>>>>>> good idea, but it makes no difference beyond statistical error.
>>>>>>
>>>>>> I started sampling the application with VisualVM (which is quite hard, since native threads are extremely short-lived). What I noticed is that, regardless of where the sampler interrupts a thread, in nearly all cases 100% of the CPU time is spent in jdk.internal.foreign.abi.SharedUtils.toJavaStringInternal() → jdk.internal.foreign.abi.SharedUtils.strlen().
>>>>>>
>>>>>> I know that strlen can hardly be optimized due to the nature of null termination, but maybe we can make use of the fact that we're dealing with MemorySegments here: Since they protect us from overflows, maybe there is no need to look at only a single byte at a time. Maybe the strlen()-loop can be unrolled or even be vectorized.
>>>>>>
>>>>>> I just did a quick test and observed a 2x speedup when unrolling the loop 4x.
>>>>>>
>>>>>> Cheers,
>>>>>> Sebastian
>>>>>>
>>>>>>> On 9. Jul 2021, at 20:30, Jorn Vernee <jorn.vernee at oracle.com> wrote:
>>>>>>>
>>>>>>> Hi Sebastian,
>>>>>>>
>>>>>>> Thanks for testing this. Looking at your code, one possible explanation for the discrepancy I can think of is that the DirFiller ends up using virtual downcalls to do its work, which are currently not intrinsified. This is mostly a case of 'not implemented yet', i.e. it is a known issue.
>>>>>>>
>>>>>>> ```
>>>>>>>       static fuse_fill_dir_t ofAddress(MemoryAddress addr) {
>>>>>>>           return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>               try {
>>>>>>>                   return (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, x0, x1, x2, x3); // <--------- 'addr' here is not a constant, so the call is virtual
>>>>>>>               } catch (Throwable ex$) {
>>>>>>>                   throw new AssertionError("should not reach here", ex$);
>>>>>>>               }
>>>>>>>           };
>>>>>>>       }
>>>>>>> ```
>>>>>>>
>>>>>>> For testing purposes, a possible workaround could be to have a cache that maps the callback address to a method handle that has the address bound to the first parameter. Assuming readdir always gets the same filler callback address, the same MethodHandle will be reused and eventually customized which means the callback address will become constant, and the downcall should then be intrinsified.
>>>>>>>
>>>>>>> I don't currently have access to a Mac machine to test this, but if you want to try it out, the patch should be this:
>>>>>>>
>>>>>>> ```
>>>>>>> diff --git a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>> index bfd4655..4c68d4c 100644
>>>>>>> --- a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>> +++ b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>> @@ -3,8 +3,12 @@
>>>>>>>    package de.skymatic.fusepanama.lowlevel;
>>>>>>>
>>>>>>>    import java.lang.invoke.MethodHandle;
>>>>>>> +import java.lang.invoke.MethodHandles;
>>>>>>>    import java.lang.invoke.VarHandle;
>>>>>>>    import java.nio.ByteOrder;
>>>>>>> +import java.util.Map;
>>>>>>> +import java.util.concurrent.ConcurrentHashMap;
>>>>>>> +
>>>>>>>    import jdk.incubator.foreign.*;
>>>>>>>    import static jdk.incubator.foreign.CLinker.*;
>>>>>>>    public interface fuse_fill_dir_t {
>>>>>>> @@ -17,13 +21,19 @@ public interface fuse_fill_dir_t {
>>>>>>>            return RuntimeHelper.upcallStub(fuse_fill_dir_t.class, fi, constants$0.fuse_fill_dir_t$FUNC, "(Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;J)I", scope);
>>>>>>>        }
>>>>>>>        static fuse_fill_dir_t ofAddress(MemoryAddress addr) {
>>>>>>> -        return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>> -            try {
>>>>>>> -                return (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, x0, x1, x2, x3);
>>>>>>> -            } catch (Throwable ex$) {
>>>>>>> -                throw new AssertionError("should not reach here", ex$);
>>>>>>> -            }
>>>>>>> -        };
>>>>>>> +        class CacheHolder {
>>>>>>> +            static final Map<MemoryAddress, fuse_fill_dir_t> CACHE = new ConcurrentHashMap<>();
>>>>>>> +        }
>>>>>>> +        return CacheHolder.CACHE.computeIfAbsent(addr, addrK -> {
>>>>>>> +            final MethodHandle target = MethodHandles.insertArguments(constants$0.fuse_fill_dir_t$MH, 0, addrK);
>>>>>>> +            return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>> +                try {
>>>>>>> +                    return (int)target.invokeExact(x0, x1, x2, x3);
>>>>>>> +                } catch (Throwable ex$) {
>>>>>>> +                    throw new AssertionError("should not reach here", ex$);
>>>>>>> +                }
>>>>>>> +            };
>>>>>>> +        });
>>>>>>>        }
>>>>>>>    }
>>>>>>>
>>>>>>>
>>>>>>> ```
>>>>>>> (I hope these code blocks don't get mangled too much by line wrapping)
>>>>>>>
>>>>>>> HTH,
>>>>>>> Jorn
>>>>>>>
>>>>>>> On 09/07/2021 10:58, Sebastian Stenzel wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I wanted to share the results of a benchmark that includes several down- and upcalls. First, let me explain what I'm testing here:
>>>>>>>>
>>>>>>>> I'm working on a Panama-based FUSE binding, mostly for experimental purposes right now, and I'm trying to beat fuse-jnr [1].
>>>>>>>>
>>>>>>>> While there are other interesting metrics, such as read/write performance (both sequential and random access), I focused on directory listings for now. Directory listings are the most complex operation with regard to the number of down- and upcalls:
>>>>>>>>
>>>>>>>> 1. FUSE upcalls readdir and provides a callback function
>>>>>>>> 2. Java downcalls the callback for each item in the directory
>>>>>>>> 3. FUSE upcalls getattr for each item (no longer required with "readdirplus" in FUSE 3.x)
>>>>>>>> (4. I'm testing on macOS, which introduces additional noise (such as readxattr and trying to access files that I didn't report in readdir))
>>>>>>>>
>>>>>>>> So, what I'm testing is essentially this: `Files.list(Path.of("/Volumes/foo")).close();` with the volume reporting eight files [2]. When mounting with debug logs enabled, I can see that the exact same operations are invoked in the same order on both fuse-jnr and fuse-panama. A single dir listing results in 2 readdir upcalls, 10 callback downcalls, and 16 getattr upcalls. There are also 8 getxattr calls and 16 lookup calls; however, they don't reach Java, as the FUSE kernel knows they are not implemented.
>>>>>>>>
>>>>>>>> Long story short, here are the results:
>>>>>>>>
>>>>>>>> ```
>>>>>>>> Benchmark                        Mode  Cnt    Score   Error  Units
>>>>>>>> BenchmarkTest.testListDirJnr     avgt    5   66,569 ± 3,128  us/op
>>>>>>>> BenchmarkTest.testListDirPanama  avgt    5  189,340 ± 4,275  us/op
>>>>>>>> ```
>>>>>>>>
>>>>>>>> I've been using panama snapshot at commit 42e03fd7c6a built with: `configure --with-boot-jdk=/Library/Java/JavaVirtualMachines/adoptopenjdk-16.jdk/Contents/Home/ --with-native-debug-symbols=none --with-debug-level=release --with-libclang=/usr/local/opt/llvm --with-libclang-version=12`
>>>>>>>>
>>>>>>>> I can't tell where this overhead comes from. Maybe creating a newConfinedScope() during each upcall [3] is "too much"? Maybe JNR is just negligently skipping some memory boundary checks to be faster. The results are not terrible, but I'd hoped for something better.
>>>>>>>>
>>>>>>>> Sebastian
>>>>>>>>
>>>>>>>> [1] https://github.com/SerCeMan/jnr-fuse
>>>>>>>> [2] https://github.com/skymatic/fuse-panama/blob/develop/src/test/java/de/skymatic/fusepanama/examples/HelloPanamaFileSystem.java#L139-L146
>>>>>>>> [3] https://github.com/skymatic/fuse-panama/blob/769347575863861063a2347a42b2cbaadb5eacef/src/main/java/de/skymatic/fusepanama/FuseOperations.java#L67-L71
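
The caching workaround Jorn describes earlier in the thread can be tried outside FUSE with plain java.lang.invoke. The following is a minimal sketch, assuming a static Java method as a stand-in for the native filler callback and a long as a stand-in for MemoryAddress; BoundHandleCache, Filler and fill are hypothetical names, not part of fuse-panama or the Panama API.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in: caches one bound MethodHandle per callback "address".
class BoundHandleCache {

    // Functional interface playing the role of fuse_fill_dir_t in this sketch.
    interface Filler {
        int apply(int x);
    }

    // Stand-in for the native callback target: takes the address plus one argument.
    static int fill(long addr, int x) {
        return (int) (addr + x);
    }

    // Unbound handle with type (long, int) -> int, analogous to the generic downcall handle.
    static final MethodHandle FILL_MH;

    static {
        try {
            FILL_MH = MethodHandles.lookup().findStatic(
                    BoundHandleCache.class, "fill",
                    MethodType.methodType(int.class, long.class, int.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static final Map<Long, Filler> CACHE = new ConcurrentHashMap<>();

    // Returns the same Filler for the same address; the address is bound as argument 0.
    static Filler ofAddress(long addr) {
        return CACHE.computeIfAbsent(addr, addrK -> {
            final MethodHandle target = MethodHandles.insertArguments(FILL_MH, 0, addrK);
            return x -> {
                try {
                    return (int) target.invokeExact(x);
                } catch (Throwable ex) {
                    throw new AssertionError("should not reach here", ex);
                }
            };
        });
    }

    public static void main(String[] args) {
        Filler first = ofAddress(40L);
        Filler second = ofAddress(40L);
        System.out.println(first.apply(2));
        System.out.println(first == second);
    }
}
```

The point is only the shape: one bound MethodHandle per distinct address, so repeated calls with the same address reuse the same handle. Whether the JIT then treats the address as a constant is a property of the real downcall handle and is not demonstrated by this stand-in.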

