Real-Life Benchmark for FUSE's readdir()
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Thu Jul 15 15:55:34 UTC 2021
I tried to reproduce here on Linux, but with no luck - in the sense that
I'm not sure how to run the benchmark.
I'm able somehow to run the two examples - and I noted that here JNR
works ok, while the Panama one doesn't seem to mount things correctly -
a new mount appears on my file explorer, but I'm unable to do anything
with it (even unmount - which can only be done at sudo level).
When working with the JNR support, the mount works fine, it shows in the
file explorer, I can click on that location and browse, and then unmount
from there. Everything works.
That said, the benchmarks require the mount points to be up and running
- so I've tried first to execute the example (e.g. JNR) and then run the
benchmark in two separate terminal windows, all via Maven - but the
benchmark doesn't seem to do anything (I've uncommented the benchmarks
of course).
How do you run them?
Thanks
Maurizio
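
A quick way to check, outside JMH, that a mount actually answers directory listings is sketched below. It runs the same `Files.list(...).close()` call that the original benchmark description further down in this thread uses. The class name `ListDirCheck` and the fallback to the temp directory are assumptions made for illustration, and the timing is only a rough sanity check, not a substitute for the JMH numbers.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class ListDirCheck {

    // One directory listing, the same operation the benchmark measures.
    static void listOnce(Path dir) {
        try {
            Files.list(dir).close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Rough wall-clock average in microseconds; no fork or warmup control,
    // so the result is not comparable to JMH scores.
    static double averageMicros(Path dir, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            listOnce(dir);
        }
        return (System.nanoTime() - start) / 1_000.0 / iterations;
    }

    public static void main(String[] args) {
        // Hypothetical default; pass the real mount point as the first argument.
        Path dir = Path.of(args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir"));
        if (!Files.isDirectory(dir)) {
            System.err.println(dir + " is not a directory; is the file system mounted?");
            return;
        }
        averageMicros(dir, 1_000); // crude warmup
        System.out.printf("%s: %.1f us per listing%n", dir, averageMicros(dir, 10_000));
    }
}
```

If this hangs or reports that the path is not a directory, the problem is with the mount itself rather than with the benchmark setup.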
On 15/07/2021 13:40, Maurizio Cimadamore wrote:
> I believe that it would be more useful to try to run the perfasm
> profiler with JMH.
>
> This can be done relatively easily, at least on Linux, if you pass the
> argument `-prof perfasm` to JMH (this would need hsdis-amd64.so on
> Linux to print readable assembly).
>
> Another thing worth checking is allocation rate: `-prof gc`.
>
> Maurizio
>
> On 15/07/2021 12:30, Sebastian Stenzel wrote:
>> Ok it really seems like VisualVM can't deal with these kinds of tasks
>> yet. Now it reports the String constructor being the culprit [1],
>> however I strongly doubt that, since this is probably one of the most
>> heavily optimized parts of the JDK.
>>
>> [1]: Screenshot on https://imgur.com/a/SHG8RSQ
>>
>>> On 15. Jul 2021, at 13:11, Sebastian Stenzel
>>> <sebastian.stenzel at gmail.com> wrote:
>>>
>>> Yes, must be a sampling error. Do you know of a (publicly available)
>>> _profiler_ that is compatible with JDK 17 / 18 already?
>>>
>>>> On 15. Jul 2021, at 13:08, Maurizio Cimadamore
>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>
>>>> Aha - it seems like you are seeing what I was seeing: unrolling now
>>>> seems to happen more reliably, which positively affects code like
>>>> strlen.
>>>>
>>>> As for FUSE, I think the reason for the difference has probably
>>>> nothing to do with string conversion - the sampling profiler just
>>>> happens to hit that code a lot. I checked JNR code for string
>>>> conversion and I couldn't really find anything uber optimized in
>>>> that regard that could explain the gap.
>>>>
>>>> Probably something is not getting optimized as it should - likely a
>>>> downcall/upcall intrinsification is failing - maybe due to a subtle
>>>> issue with your code, or, possibly because you are hitting a
>>>> non-implemented case (e.g. we do not intrinsify calls which pass
>>>> arguments on the stack, yet), or because of some other bug.
>>>>
>>>> Maurizio
>>>>
>>>>
>>>> On 15/07/2021 12:03, Sebastian Stenzel wrote:
>>>>> Wow, I stand corrected. I just re-ran the benchmark and
>>>>> `benchmarkStrlenBase` just got a lot faster!! Your change in
>>>>> https://github.com/openjdk/jdk17/commit/2db9005c07585b580b3ec0889b8b5e3ed0d0ca6a
>>>>> DID have an effect after all.
>>>>>
>>>>> Just doesn't impress FUSE very much...
>>>>>
>>>>>> On 15. Jul 2021, at 13:00, Sebastian Stenzel
>>>>>> <sebastian.stenzel at gmail.com> wrote:
>>>>>>
>>>>>> Yup, I tried the int-approach as well, but with worse results...
>>>>>> Here is the full test:
>>>>>> https://gist.github.com/overheadhunter/86e7baae7dfe47c49ff364590a4f3ea6
>>>>>>
>>>>>>
>>>>>>> On 15. Jul 2021, at 12:51, Maurizio Cimadamore
>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>
>>>>>>> Ok. Thanks.
>>>>>>>
>>>>>>> I tried similar experiments where instead of reading 4 bytes
>>>>>>> separately I'd read a single int value, and then use shifts and
>>>>>>> bitmasking to check for terminators. On paper good, but
>>>>>>> benchmark results were always worse than the version we have now
>>>>>>> (at least on Linux).
>>>>>>>
>>>>>>> That said, if you could please share the full string benchmark
>>>>>>> you have, that'd be helpful, so we can take a look at that, and
>>>>>>> see what's going wrong (ideally, C2 should be the one doing
>>>>>>> unrolling).
>>>>>>>
>>>>>>> Maurizio
>>>>>>>
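
The int-at-a-time idea mentioned above (read one int, then use shifts and bitmasking to find a terminator) can be sketched in plain Java as follows. This is a minimal illustration on a `ByteBuffer` rather than a `MemorySegment`, so it runs on any recent JDK; the class and method names are hypothetical, the bitmask is the well-known "has zero byte" trick, and it is not necessarily the variant that was actually benchmarked in this thread.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

class StrlenIntSketch {

    // Reference version: one byte per iteration.
    static int strlenBytewise(ByteBuffer buf, int start) {
        for (int i = start; i < buf.limit(); i++) {
            if (buf.get(i) == 0) {
                return i - start;
            }
        }
        throw new IllegalArgumentException("String too large");
    }

    // Four bytes per iteration: read an int and test it for a zero byte.
    static int strlenIntwise(ByteBuffer buf, int start) {
        // Little-endian view, so the lowest-order byte of the int is the lowest address.
        ByteBuffer le = buf.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        int i = start;
        for (; i <= le.limit() - 4; i += 4) {
            int v = le.getInt(i);
            // Non-zero if and only if at least one of the four bytes of v is zero.
            int zeroMask = (v - 0x01010101) & ~v & 0x80808080;
            if (zeroMask != 0) {
                // The lowest set bit sits in the first zero byte.
                return i - start + (Integer.numberOfTrailingZeros(zeroMask) >>> 3);
            }
        }
        for (; i < le.limit(); i++) { // tail of fewer than four bytes
            if (le.get(i) == 0) {
                return i - start;
            }
        }
        throw new IllegalArgumentException("String too large");
    }

    public static void main(String[] args) {
        // Same shape as the test in this thread: 259-byte buffer, 239-byte string.
        ByteBuffer buf = ByteBuffer.allocate(259);
        buf.put("x".repeat(239).getBytes(StandardCharsets.US_ASCII));
        System.out.println(strlenBytewise(buf, 0) + " " + strlenIntwise(buf, 0));
    }
}
```

Both methods should report the same length for any input; whether the int version is actually faster depends on what C2 does with each loop, which is exactly the open question here.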
>>>>>>> On 15/07/2021 11:28, Sebastian Stenzel wrote:
>>>>>>>> I just did a quick synthetic test on a "manually unrolled"
>>>>>>>> strlen() without any FUSE context.
>>>>>>>>
>>>>>>>> I experimented with an implementation that looked like the
>>>>>>>> following and benchmarked it using a 259 byte memory segment
>>>>>>>> containing a 239 byte string (null byte at index 240):
>>>>>>>>
>>>>>>>> ```
>>>>>>>> private static int strlenUnroll4(MemorySegment segment, long start) {
>>>>>>>>     int offset;
>>>>>>>>     // Main loop: inspect four bytes per iteration.
>>>>>>>>     for (offset = 0; offset < segment.byteSize() - 3; offset += 4) {
>>>>>>>>         byte b0 = MemoryAccess.getByteAtOffset(segment, start + offset + 0);
>>>>>>>>         byte b1 = MemoryAccess.getByteAtOffset(segment, start + offset + 1);
>>>>>>>>         byte b2 = MemoryAccess.getByteAtOffset(segment, start + offset + 2);
>>>>>>>>         byte b3 = MemoryAccess.getByteAtOffset(segment, start + offset + 3);
>>>>>>>>         if (b0 == 0 || b1 == 0 || b2 == 0 || b3 == 0) { // is this even faster than directly having 4 different branches?
>>>>>>>>             if (b0 == 0) {
>>>>>>>>                 return offset;
>>>>>>>>             } else if (b1 == 0) {
>>>>>>>>                 return offset + 1;
>>>>>>>>             } else if (b2 == 0) {
>>>>>>>>                 return offset + 2;
>>>>>>>>             } else if (b3 == 0) {
>>>>>>>>                 return offset + 3;
>>>>>>>>             }
>>>>>>>>         }
>>>>>>>>     }
>>>>>>>>     // Tail: fewer than four bytes remain.
>>>>>>>>     while (offset < segment.byteSize()) { // TODO: maybe no loop required for the remaining <4 bytes?
>>>>>>>>         byte b = MemoryAccess.getByteAtOffset(segment, start + offset);
>>>>>>>>         if (b == 0) {
>>>>>>>>             return offset;
>>>>>>>>         }
>>>>>>>>         offset++; // advance, otherwise this loop never terminates
>>>>>>>>     }
>>>>>>>>     throw new IllegalArgumentException("String too large");
>>>>>>>> }
>>>>>>>> ```
>>>>>>>>
>>>>>>>> I'm not even sure how reliable my results are, since I have no
>>>>>>>> clue about how branch prediction works here... Neither have I
>>>>>>>> tested the correctness of this implementation.
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 15. Jul 2021, at 12:18, Maurizio Cimadamore
>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>
>>>>>>>>> Thanks for reporting back.
>>>>>>>>>
>>>>>>>>> We probably need to investigate this a bit more deeply and try
>>>>>>>>> and reproduce on our side.
>>>>>>>>>
>>>>>>>>> One last question: you said that with manual unrolling you
>>>>>>>>> managed to get 2x faster: did you mean that string conversion
>>>>>>>>> got 2x faster or that you actually saw your FUSE benchmark
>>>>>>>>> going 2x faster because of the manual unrolling with strings?
>>>>>>>>>
>>>>>>>>> Maurizio
>>>>>>>>>
>>>>>>>>> On 15/07/2021 11:03, Sebastian Stenzel wrote:
>>>>>>>>>> That, surprisingly, didn't change anything either. But don't
>>>>>>>>>> worry too much, the performance isn't bad (in absolute
>>>>>>>>>> figures) and it is by far not the only reason why I consider
>>>>>>>>>> panama the best solution to create java bindings for c libs.
>>>>>>>>>>
>>>>>>>>>>> On 12. Jul 2021, at 15:33, Maurizio Cimadamore
>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Actually, after some bisecting, I found out that the
>>>>>>>>>>> performance of converting a memory segment into a string
>>>>>>>>>>> jumped 2x faster with this fix:
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/openjdk/jdk17/commit/2db9005c07585b580b3ec0889b8b5e3ed0d0ca6a
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Which was integrated after the one I originally pointed at.
>>>>>>>>>>> They both seem to touch loop optimization in case of
>>>>>>>>>>> overflows, which the strlen code is triggering (since the
>>>>>>>>>>> loop limit checks for loop variable being positive).
>>>>>>>>>>>
>>>>>>>>>>> This is a simple patch which adds a string conversion test:
>>>>>>>>>>>
>>>>>>>>>>> ```
>>>>>>>>>>> diff --git a/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java b/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>> index ec4da5ffc88..5b3fb1a2b2a 100644
>>>>>>>>>>> --- a/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>> +++ b/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>> @@ -93,10 +93,13 @@ public class StrLenTest {
>>>>>>>>>>> FunctionDescriptor.ofVoid(C_POINTER).withAttribute(FunctionDescriptor.TRIVIAL_ATTRIBUTE_NAME, true));
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> + MemorySegment segment;
>>>>>>>>>>> +
>>>>>>>>>>> @Setup
>>>>>>>>>>> public void setup() {
>>>>>>>>>>> str = makeString(size);
>>>>>>>>>>> segmentAllocator = SegmentAllocator.ofSegment(MemorySegment.allocateNative(size + 1, ResourceScope.newImplicitScope()));
>>>>>>>>>>> + segment = toCString(str, segmentAllocator);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> @TearDown
>>>>>>>>>>> @@ -104,6 +107,11 @@ public class StrLenTest {
>>>>>>>>>>> scope.close();
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> + @Benchmark
>>>>>>>>>>> + public String panama_str_conv() throws Throwable {
>>>>>>>>>>> + return CLinker.toJavaString(segment);
>>>>>>>>>>> + }
>>>>>>>>>>> +
>>>>>>>>>>> @Benchmark
>>>>>>>>>>> public int jni_strlen() throws Throwable {
>>>>>>>>>>> return strlen(str);
>>>>>>>>>>> ```
>>>>>>>>>>>
>>>>>>>>>>> Before the above fix, the numbers are as follows:
>>>>>>>>>>>
>>>>>>>>>>> ```
>>>>>>>>>>> Benchmark                   (size)  Mode  Cnt    Score   Error  Units
>>>>>>>>>>> StrLenTest.panama_str_conv     100  avgt   30  106.613 ± 7.060  ns/op
>>>>>>>>>>> ```
>>>>>>>>>>>
>>>>>>>>>>> While after the fix I get this:
>>>>>>>>>>>
>>>>>>>>>>> ```
>>>>>>>>>>> Benchmark                   (size)  Mode  Cnt   Score   Error  Units
>>>>>>>>>>> StrLenTest.panama_str_conv     100  avgt   30  48.120 ± 0.557  ns/op
>>>>>>>>>>> ```
>>>>>>>>>>>
>>>>>>>>>>> So, as you can see, a pretty sizeable jump. Eyeballing, the
>>>>>>>>>>> shape of generated code doesn't look too different, which
>>>>>>>>>>> makes me think of another case where loop is unrolled, but
>>>>>>>>>>> main loop never executed (similar to JDK-8269230), but we'll
>>>>>>>>>>> need to look deeper.
>>>>>>>>>>>
>>>>>>>>>>> Maurizio
>>>>>>>>>>>
>>>>>>>>>>> On 12/07/2021 14:12, Maurizio Cimadamore wrote:
>>>>>>>>>>>> On 12/07/2021 13:18, Sebastian Stenzel wrote:
>>>>>>>>>>>>> Hey Maurizio,
>>>>>>>>>>>>>
>>>>>>>>>>>>> All tests have been done on commit 42e03fd7c6a (for
>>>>>>>>>>>>> details how I built the JDK, see my initial email). Maybe
>>>>>>>>>>>>> I'm missing some compiler flags to enable all optimizations?
>>>>>>>>>>>> I see - you do have the latest panama changes, but there
>>>>>>>>>>>> has been a sync with upstream after that changeset, I
>>>>>>>>>>>> believe - can you please try to resync with the latest
>>>>>>>>>>>> foreign-jextract commit - which should be:
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/openjdk/panama-foreign/commit/b2a284f6678c6a0d78fbdc8655695119ccb0dadb
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Are you sure about loop vectorization being applied to
>>>>>>>>>>>>> strlen? I'm not an expert on this field, but I had the
>>>>>>>>>>>>> impression this wasn't possible when the loop terminates
>>>>>>>>>>>>> "from within".
>>>>>>>>>>>> Vlad is the expert here - when chatting offline he did
>>>>>>>>>>>> mention that loop should have single exit - which I guess
>>>>>>>>>>>> also takes into account the "normal" exit - so the strlen
>>>>>>>>>>>> routine would seem to have two exits...
>>>>>>>>>>>>
>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 12. Jul 2021, at 13:50, Maurizio Cimadamore
>>>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Sebastian,
>>>>>>>>>>>>>> thanks for sharing your findings - I've done some
>>>>>>>>>>>>>> attempts here with a targeted microbenchmark which
>>>>>>>>>>>>>> measures the performance of string conversion and I'm
>>>>>>>>>>>>>> seeing unrolling and vectorization being applied on the
>>>>>>>>>>>>>> strlen computation.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> May I ask if, by any chance, your HEAD has not been
>>>>>>>>>>>>>> updated in the last few weeks? There has been a C2
>>>>>>>>>>>>>> optimization fix which has been added recently, which I
>>>>>>>>>>>>>> think might be related to this:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8269230
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do you have this fix in the JDK you are using?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 10/07/2021 15:58, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> good idea, but it makes no difference beyond statistical
>>>>>>>>>>>>>>> error.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I started sampling the application with VisualVM (which
>>>>>>>>>>>>>>> is quite hard, since native threads are extremely
>>>>>>>>>>>>>>> short-lived). What I noticed is that, regardless of where
>>>>>>>>>>>>>>> the sampler interrupts a thread, in nearly all cases
>>>>>>>>>>>>>>> 100% of CPU time is attributed to
>>>>>>>>>>>>>>> jdk.internal.foreign.abi.SharedUtils.toJavaStringInternal()
>>>>>>>>>>>>>>> → jdk.internal.foreign.abi.SharedUtils.strlen().
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I know that strlen can hardly be optimized due to the
>>>>>>>>>>>>>>> nature of null termination, but maybe we can make use of
>>>>>>>>>>>>>>> the fact that we're dealing with MemorySegments here:
>>>>>>>>>>>>>>> Since they protect us from overflows, maybe there is no
>>>>>>>>>>>>>>> need to look at only a single byte at a time. Maybe the
>>>>>>>>>>>>>>> strlen()-loop can be unrolled or even be vectorized.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I just did a quick test and observed a 2x speedup when
>>>>>>>>>>>>>>> unrolling the loop 4x.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 9. Jul 2021, at 20:30, Jorn Vernee
>>>>>>>>>>>>>>>> <jorn.vernee at oracle.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Sebastian,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for testing this. Looking at your code, one
>>>>>>>>>>>>>>>> possible explanation for the discrepancy I can think of
>>>>>>>>>>>>>>>> is that the DirFiller ends up using virtual downcalls
>>>>>>>>>>>>>>>> to do its work, which are currently not intrinsified.
>>>>>>>>>>>>>>>> This is mostly a case of 'not implemented yet', i.e. it
>>>>>>>>>>>>>>>> is a known issue.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>> static fuse_fill_dir_t ofAddress(MemoryAddress addr) {
>>>>>>>>>>>>>>>>     return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>>         try {
>>>>>>>>>>>>>>>>             // 'addr' here is not a constant, so the call is virtual
>>>>>>>>>>>>>>>>             return (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, x0, x1, x2, x3);
>>>>>>>>>>>>>>>>         } catch (Throwable ex$) {
>>>>>>>>>>>>>>>>             throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>>     };
>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For testing purposes, a possible workaround could be to
>>>>>>>>>>>>>>>> have a cache that maps the callback address to a method
>>>>>>>>>>>>>>>> handle that has the address bound to the first
>>>>>>>>>>>>>>>> parameter. Assuming readdir always gets the same filler
>>>>>>>>>>>>>>>> callback address, the same MethodHandle will be reused
>>>>>>>>>>>>>>>> and eventually customized which means the callback
>>>>>>>>>>>>>>>> address will become constant, and the downcall should
>>>>>>>>>>>>>>>> then be intrinsified.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I don't currently have access to a Mac machine to test
>>>>>>>>>>>>>>>> this, but if you want to try it out, the patch should
>>>>>>>>>>>>>>>> be this:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>> diff --git a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>> index bfd4655..4c68d4c 100644
>>>>>>>>>>>>>>>> --- a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>> +++ b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>> @@ -3,8 +3,12 @@
>>>>>>>>>>>>>>>> package de.skymatic.fusepanama.lowlevel;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> import java.lang.invoke.MethodHandle;
>>>>>>>>>>>>>>>> +import java.lang.invoke.MethodHandles;
>>>>>>>>>>>>>>>> import java.lang.invoke.VarHandle;
>>>>>>>>>>>>>>>> import java.nio.ByteOrder;
>>>>>>>>>>>>>>>> +import java.util.Map;
>>>>>>>>>>>>>>>> +import java.util.concurrent.ConcurrentHashMap;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> import jdk.incubator.foreign.*;
>>>>>>>>>>>>>>>> import static jdk.incubator.foreign.CLinker.*;
>>>>>>>>>>>>>>>> public interface fuse_fill_dir_t {
>>>>>>>>>>>>>>>> @@ -17,13 +21,19 @@ public interface fuse_fill_dir_t {
>>>>>>>>>>>>>>>>         return RuntimeHelper.upcallStub(fuse_fill_dir_t.class, fi, constants$0.fuse_fill_dir_t$FUNC, "(Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;J)I", scope);
>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>>     static fuse_fill_dir_t ofAddress(MemoryAddress addr) {
>>>>>>>>>>>>>>>> -        return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>> -            try {
>>>>>>>>>>>>>>>> -                return (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, x0, x1, x2, x3);
>>>>>>>>>>>>>>>> -            } catch (Throwable ex$) {
>>>>>>>>>>>>>>>> -                throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>> -            }
>>>>>>>>>>>>>>>> -        };
>>>>>>>>>>>>>>>> +        class CacheHolder {
>>>>>>>>>>>>>>>> +            static final Map<MemoryAddress, fuse_fill_dir_t> CACHE = new ConcurrentHashMap<>();
>>>>>>>>>>>>>>>> +        }
>>>>>>>>>>>>>>>> +        return CacheHolder.CACHE.computeIfAbsent(addr, addrK -> {
>>>>>>>>>>>>>>>> +            final MethodHandle target = MethodHandles.insertArguments(constants$0.fuse_fill_dir_t$MH, 0, addrK);
>>>>>>>>>>>>>>>> +            return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>> +                try {
>>>>>>>>>>>>>>>> +                    return (int)target.invokeExact(x0, x1, x2, x3);
>>>>>>>>>>>>>>>> +                } catch (Throwable ex$) {
>>>>>>>>>>>>>>>> +                    throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>> +                }
>>>>>>>>>>>>>>>> +            };
>>>>>>>>>>>>>>>> +        });
>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>> (I hope these code blocks don't get mangled too much by
>>>>>>>>>>>>>>>> line wrapping)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> HTH,
>>>>>>>>>>>>>>>> Jorn
>>>>>>>>>>>>>>>>
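
The caching idea in the patch above (bind the callback address as the first argument once, then reuse the bound handle) can be tried without any Panama classes. The following is a minimal sketch using only `java.lang.invoke`: the `fill` method is a hypothetical stand-in for the native filler downcall, and the class, `forAddress` and `call` names are assumptions for illustration. It shows the shape of the cache only and says nothing about whether the real downcall ends up intrinsified.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class BoundHandleCacheSketch {

    // Stand-in for the native filler; the real binding would use a downcall handle here.
    static int fill(long callbackAddress, String name, long offset) {
        return name.length() + (int) offset;
    }

    // Generic handle whose first parameter is the (non-constant) callback address.
    static final MethodHandle GENERIC;
    static {
        try {
            GENERIC = MethodHandles.lookup().findStatic(BoundHandleCacheSketch.class, "fill",
                    MethodType.methodType(int.class, long.class, String.class, long.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // One bound handle per callback address, created on first use.
    static final Map<Long, MethodHandle> CACHE = new ConcurrentHashMap<>();

    static MethodHandle forAddress(long callbackAddress) {
        return CACHE.computeIfAbsent(callbackAddress,
                addr -> MethodHandles.insertArguments(GENERIC, 0, addr));
    }

    // Invokes the cached handle; the address is no longer passed at the call site.
    static int call(long callbackAddress, String name, long offset) {
        try {
            return (int) forAddress(callbackAddress).invokeExact(name, offset);
        } catch (Throwable ex) {
            throw new AssertionError("should not reach here", ex);
        }
    }

    public static void main(String[] args) {
        // The same address yields the same handle instance on every lookup.
        System.out.println(forAddress(0x1000L) == forAddress(0x1000L));
        System.out.println(call(0x1000L, "hello.txt", 0L));
    }
}
```

As long as readdir keeps receiving the same filler address, every call goes through the same bound handle, which is the precondition the patch relies on.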
>>>>>>>>>>>>>>>> On 09/07/2021 10:58, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I wanted to share the results of a benchmark test
>>>>>>>>>>>>>>>>> that includes several down- and upcalls. First, let me
>>>>>>>>>>>>>>>>> explain what I'm testing here:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm working on a panama-based FUSE binding, mostly for
>>>>>>>>>>>>>>>>> experimental purposes right now, and I'm trying to
>>>>>>>>>>>>>>>>> beat fuse-jnr [1].
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> While there are some other interesting metrics, such
>>>>>>>>>>>>>>>>> as read/write performance (both sequentially and
>>>>>>>>>>>>>>>>> random access), I focused on directory listings for
>>>>>>>>>>>>>>>>> now. Directory listings are the most complex operation
>>>>>>>>>>>>>>>>> in regards to the number of down- and upcalls:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 1. FUSE upcalls readdir and provides a callback function
>>>>>>>>>>>>>>>>> 2. Java downcalls the callback for each item in the
>>>>>>>>>>>>>>>>> directory
>>>>>>>>>>>>>>>>> 3. FUSE upcalls getattr for each item (no longer
>>>>>>>>>>>>>>>>> required with "readdirplus" in FUSE 3.x)
>>>>>>>>>>>>>>>>> (4. I'm testing on macOS, which introduces additional
>>>>>>>>>>>>>>>>> noise (such as readxattr and trying to access files
>>>>>>>>>>>>>>>>> that I didn't report in readdir))
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So, what I'm testing is essentially this:
>>>>>>>>>>>>>>>>> `Files.list(Path.of("/Volumes/foo")).close();` with
>>>>>>>>>>>>>>>>> the volume reporting eight files [2]. When mounting
>>>>>>>>>>>>>>>>> with debug logs enabled, I can see that the exact same
>>>>>>>>>>>>>>>>> operations in the same order are invoked on both
>>>>>>>>>>>>>>>>> fuse-jnr and fuse-panama. One single dir listing
>>>>>>>>>>>>>>>>> results in 2 readdir upcalls, 10 callback downcalls,
>>>>>>>>>>>>>>>>> 16 getattr upcalls. There are also 8 getxattr calls
>>>>>>>>>>>>>>>>> and 16 lookup calls, however they don't reach Java, as
>>>>>>>>>>>>>>>>> the FUSE kernel knows they are not implemented.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Long story short, here are the results:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>> Benchmark                        Mode  Cnt    Score   Error  Units
>>>>>>>>>>>>>>>>> BenchmarkTest.testListDirJnr     avgt    5   66,569 ± 3,128  us/op
>>>>>>>>>>>>>>>>> BenchmarkTest.testListDirPanama  avgt    5  189,340 ± 4,275  us/op
>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I've been using panama snapshot at commit 42e03fd7c6a
>>>>>>>>>>>>>>>>> built with: `configure
>>>>>>>>>>>>>>>>> --with-boot-jdk=/Library/Java/JavaVirtualMachines/adoptopenjdk-16.jdk/Contents/Home/
>>>>>>>>>>>>>>>>> --with-native-debug-symbols=none
>>>>>>>>>>>>>>>>> --with-debug-level=release
>>>>>>>>>>>>>>>>> --with-libclang=/usr/local/opt/llvm
>>>>>>>>>>>>>>>>> --with-libclang-version=12`
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I can't tell where this overhead comes from. Maybe
>>>>>>>>>>>>>>>>> creating a newConfinedScope() during each upcall [3]
>>>>>>>>>>>>>>>>> is "too much"? Maybe JNR is just negligently skipping
>>>>>>>>>>>>>>>>> some memory boundary checks to be faster. The results
>>>>>>>>>>>>>>>>> are not terrible, but I'd hoped for something better.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1] https://github.com/SerCeMan/jnr-fuse
>>>>>>>>>>>>>>>>> [2] https://github.com/skymatic/fuse-panama/blob/develop/src/test/java/de/skymatic/fusepanama/examples/HelloPanamaFileSystem.java#L139-L146
>>>>>>>>>>>>>>>>> [3] https://github.com/skymatic/fuse-panama/blob/769347575863861063a2347a42b2cbaadb5eacef/src/main/java/de/skymatic/fusepanama/FuseOperations.java#L67-L71
>>>>>>
>>>>>
>>>
>>
More information about the panama-dev
mailing list