Real-Life Benchmark for FUSE's readdir()
Sebastian Stenzel
sebastian.stenzel at gmail.com
Sat Jul 10 14:58:58 UTC 2021
Hi,
good idea, but it makes no difference beyond statistical error.
I started sampling the application with VisualVM (which is quite hard, since native threads are extremely short-lived). What I noticed is that, regardless of where the sampler interrupts a thread, in nearly all cases 100% of the CPU time is spent in jdk.internal.foreign.abi.SharedUtils.toJavaStringInternal() → jdk.internal.foreign.abi.SharedUtils.strlen().
I know that strlen can hardly be optimized due to the nature of null termination, but maybe we can make use of the fact that we're dealing with MemorySegments here: since they protect us from overflows, maybe there is no need to look at only a single byte at a time. Maybe the strlen() loop can be unrolled or even vectorized.
I just did a quick test and observed a 2x speedup with a 4x loop unroll.
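To illustrate what I mean by a 4x unroll, here is a minimal sketch. It is not the JDK's SharedUtils.strlen() and not the exact code I tested: it assumes a java.nio.ByteBuffer as a stand-in for a MemorySegment (both give bounds-checked access), and the class and method names (UnrolledStrlen, strlenSimple, strlenUnrolled, cString) are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public final class UnrolledStrlen {

    /** Baseline: inspects one byte per loop iteration. */
    public static int strlenSimple(ByteBuffer buf, int start) {
        for (int i = start; i < buf.limit(); i++) {
            if (buf.get(i) == 0) {
                return i - start;
            }
        }
        throw new IllegalArgumentException("no NUL terminator found");
    }

    /** 4x unrolled variant: four byte reads per loop iteration. */
    public static int strlenUnrolled(ByteBuffer buf, int start) {
        final int limit = buf.limit();
        int i = start;
        // Main loop: only entered while four more bytes are in bounds.
        for (; i + 4 <= limit; i += 4) {
            if (buf.get(i) == 0) return i - start;
            if (buf.get(i + 1) == 0) return i + 1 - start;
            if (buf.get(i + 2) == 0) return i + 2 - start;
            if (buf.get(i + 3) == 0) return i + 3 - start;
        }
        // Tail loop: handles the remaining zero to three bytes.
        for (; i < limit; i++) {
            if (buf.get(i) == 0) return i - start;
        }
        throw new IllegalArgumentException("no NUL terminator found");
    }

    /** Hypothetical helper: encodes s as UTF-8 and appends a NUL terminator. */
    public static ByteBuffer cString(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        // copyOf pads with zeros, so the extra byte is the terminator.
        return ByteBuffer.wrap(Arrays.copyOf(bytes, bytes.length + 1));
    }

    public static void main(String[] args) {
        ByteBuffer name = cString("readdir.txt");
        System.out.println(strlenSimple(name, 0));
        System.out.println(strlenUnrolled(name, 0));
    }
}
```

Both variants return the same length; whether the unrolled one is actually faster depends on the JIT and on the access path, so it would have to be measured against the real MemorySegment-based code.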
Cheers,
Sebastian
> On 9. Jul 2021, at 20:30, Jorn Vernee <jorn.vernee at oracle.com> wrote:
>
> Hi Sebastian,
>
> Thanks for testing this. Looking at your code, one possible explanation for the discrepancy I can think of is that the DirFiller ends up using virtual downcalls to do its work, which are currently not intrinsified. This is mostly a case of 'not implemented yet', i.e. it is a known issue.
>
> ```
> static fuse_fill_dir_t ofAddress(MemoryAddress addr) {
> return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
> try {
> return (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, x0, x1, x2, x3); // <--------- 'addr' here is not a constant, so the call is virtual
> } catch (Throwable ex$) {
> throw new AssertionError("should not reach here", ex$);
> }
> };
> }
> ```
>
> For testing purposes, a possible workaround could be to have a cache that maps the callback address to a method handle that has the address bound to the first parameter. Assuming readdir always gets the same filler callback address, the same MethodHandle will be reused and eventually customized which means the callback address will become constant, and the downcall should then be intrinsified.
>
> I don't currently have access to a Mac machine to test this, but if you want to try it out, the patch should be this:
>
> ```
> diff --git a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
> index bfd4655..4c68d4c 100644
> --- a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
> +++ b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
> @@ -3,8 +3,12 @@
> package de.skymatic.fusepanama.lowlevel;
>
> import java.lang.invoke.MethodHandle;
> +import java.lang.invoke.MethodHandles;
> import java.lang.invoke.VarHandle;
> import java.nio.ByteOrder;
> +import java.util.Map;
> +import java.util.concurrent.ConcurrentHashMap;
> +
> import jdk.incubator.foreign.*;
> import static jdk.incubator.foreign.CLinker.*;
> public interface fuse_fill_dir_t {
> @@ -17,13 +21,19 @@ public interface fuse_fill_dir_t {
> return RuntimeHelper.upcallStub(fuse_fill_dir_t.class, fi, constants$0.fuse_fill_dir_t$FUNC, "(Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;J)I", scope);
> }
> static fuse_fill_dir_t ofAddress(MemoryAddress addr) {
> - return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
> - try {
> - return (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, x0, x1, x2, x3);
> - } catch (Throwable ex$) {
> - throw new AssertionError("should not reach here", ex$);
> - }
> - };
> + class CacheHolder {
> + static final Map<MemoryAddress, fuse_fill_dir_t> CACHE = new ConcurrentHashMap<>();
> + }
> + return CacheHolder.CACHE.computeIfAbsent(addr, addrK -> {
> + final MethodHandle target = MethodHandles.insertArguments(constants$0.fuse_fill_dir_t$MH, 0, addrK);
> + return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
> + try {
> + return (int)target.invokeExact(x0, x1, x2, x3);
> + } catch (Throwable ex$) {
> + throw new AssertionError("should not reach here", ex$);
> + }
> + };
> + });
> }
> }
>
>
> ```
> (I hope these code blocks don't get mangled too much by line wrapping)
>
> HTH,
> Jorn
>
> On 09/07/2021 10:58, Sebastian Stenzel wrote:
>> Hi,
>>
>> I wanted to share the results of a benchmark test that includes several down- and upcalls. First, let me explain what I'm testing here:
>>
>> I'm working on a panama-based FUSE binding, mostly for experimental purposes right now, and I'm trying to beat fuse-jnr [1].
>>
>> While there are some other interesting metrics, such as read/write performance (both sequential and random access), I focused on directory listings for now. Directory listings are the most complex operation with regard to the number of down- and upcalls:
>>
>> 1. FUSE upcalls readdir and provides a callback function
>> 2. Java downcalls the callback for each item in the directory
>> 3. FUSE upcalls getattr for each item (no longer required with "readdirplus" in FUSE 3.x)
>> (4. I'm testing on macOS, which introduces additional noise (such as readxattr and trying to access files that I didn't report in readdir))
>>
>> So, what I'm testing is essentially this: `Files.list(Path.of("/Volumes/foo")).close();` with the volume reporting eight files [2]. When mounting with debug logs enabled, I can see that the exact same operations in the same order are invoked on both fuse-jnr and fuse-panama. One single dir listing results in 2 readdir upcalls, 10 callback downcalls, 16 getattr upcalls. There are also 8 getxattr calls and 16 lookup calls, however they don't reach Java, as the FUSE kernel knows they are not implemented.
>>
>> Long story short, here are the results:
>>
>> ```
>> Benchmark                        Mode  Cnt    Score   Error  Units
>> BenchmarkTest.testListDirJnr     avgt    5   66,569 ± 3,128  us/op
>> BenchmarkTest.testListDirPanama  avgt    5  189,340 ± 4,275  us/op
>> ```
>>
>> I've been using panama snapshot at commit 42e03fd7c6a built with: `configure --with-boot-jdk=/Library/Java/JavaVirtualMachines/adoptopenjdk-16.jdk/Contents/Home/ --with-native-debug-symbols=none --with-debug-level=release --with-libclang=/usr/local/opt/llvm --with-libclang-version=12`
>>
>> I can't tell where this overhead comes from. Maybe creating a newConfinedScope() during each upcall [3] is "too much"? Maybe JNR is just negligently skipping some memory boundary checks to be faster. The results are not terrible, but I'd hoped for something better.
>>
>> Sebastian
>>
>> [1] https://github.com/SerCeMan/jnr-fuse
>> [2] https://github.com/skymatic/fuse-panama/blob/develop/src/test/java/de/skymatic/fusepanama/examples/HelloPanamaFileSystem.java#L139-L146
>> [3] https://github.com/skymatic/fuse-panama/blob/769347575863861063a2347a42b2cbaadb5eacef/src/main/java/de/skymatic/fusepanama/FuseOperations.java#L67-L71