Real-Life Benchmark for FUSE's readdir()
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Fri Jul 16 10:18:16 UTC 2021
Here are some observations, based on more testing I did overnight:
* the benchmark is relatively insensitive to intrinsifications being
applied or not - a lot of time is spent in the kernel anyway
* Panama has huge GC activity - a lot of garbage is generated, compared to
JNR - a lot of int[] arrays, it seems
* Tracking where the allocation is coming from is proving problematic,
as execution originates in asynchronous threads, so most profilers have
issues with that
* Removing support for the OPEN_DIR operation brings the allocation back
under control, and the benchmark numbers back to normal (at least here)
The last bullet doesn't make a lot of sense, but that's what I'm seeing
consistently. Note that it's not the _implementation_ of OPEN_DIR that
is creating the garbage, or doing something strange - simply
_registering_ an empty callback for OPEN_DIR will cause the GC issues
and the slowdown.
I'll look more into this.
Maurizio
On 15/07/2021 22:09, Maurizio Cimadamore wrote:
> Hah - hit "Send" too soon - of course the numbers below are almost
> useless because the file system is set up on a separate JVM which the
> benchmark cannot see.
>
> I'll try to change the benchmark to run the file system in the same
> process.
>
> Maurizio
>
> On 15/07/2021 22:05, Maurizio Cimadamore wrote:
>> I managed to do an initial benchmark pass with JMH. The numbers don't
>> look great:
>>
>> ```
>> Benchmark Mode Cnt Score Error Units
>> BenchmarkTest.testListDirJnr avgt 5 9.858 ± 0.702 us/op
>> BenchmarkTest.testListDirPanama avgt 5 38.008 ± 17.573 us/op
>> ```
>>
>> On my machine the Panama fuse seems 4x slower than the JNR one
>> (assuming I got the implementation right, that is :-)).
>>
>> A quick look at GC reveals that Panama allocates at a 4x _lower_ rate
>> than JNR, while the normalized allocation per operation is the same:
>>
>> ```
>> Benchmark                                           Mode  Cnt    Score   Error   Units
>> BenchmarkTest.testListDirJnr:·gc.alloc.rate         avgt    5   33.255 ± 1.584  MB/sec
>> BenchmarkTest.testListDirJnr:·gc.alloc.rate.norm    avgt    5  368.033 ± 0.047    B/op
>> BenchmarkTest.testListDirJnr:·gc.count              avgt    5    6.000          counts
>> BenchmarkTest.testListDirJnr:·gc.time               avgt    5    9.000              ms
>> BenchmarkTest.testListDirPanama:·gc.alloc.rate      avgt    5    8.709 ± 3.887  MB/sec
>> BenchmarkTest.testListDirPanama:·gc.alloc.rate.norm avgt    5  368.046 ± 0.236    B/op
>> BenchmarkTest.testListDirPanama:·gc.count           avgt    5    2.000          counts
>> BenchmarkTest.testListDirPanama:·gc.time            avgt    5    3.000              ms
>> ```
>>
>> And, looking with perfasm, the distribution of the various methods
>> looks similar, and actually not a lot of time is spent in Java at all:
>>
>> JNR:
>>
>> ```
>> ...[Hottest Methods (after inlining)]..............................................................
>> 90.86% kernel [unknown]
>> 2.13% c2, level 4 java.nio.file.Files::list, version 913
>> 1.44% c2, level 4 de.skymatic.fusepanama.jmh_generated.BenchmarkTest_testListDirJnr_jmhTest::testListDirJnr_avgt_jmhStub, version 934
>> 0.81% Unknown, level 0 sun.nio.fs.UnixNativeDispatcher::open0, version 857
>> 0.70% libc-2.31.so __close_nocancel
>> 0.48% libc-2.31.so malloc
>> 0.39% libc-2.31.so _int_free
>> 0.32% libc-2.31.so _int_malloc
>> 0.29% Unknown, level 0 sun.nio.fs.UnixNativeDispatcher::close0, version 879
>> 0.22% libc-2.31.so __close
>> 0.21% libc-2.31.so __GI___libc_open
>> 0.19% Unknown, level 0 sun.nio.fs.UnixNativeDispatcher::fdopendir, version 859
>> 0.16% libc-2.31.so __libc_enable_asynccancel
>> 0.16% libc-2.31.so __GI___dup
>> 0.15% libc-2.31.so __fxstat64
>> 0.14% Unknown, level 0 sun.nio.fs.UnixNativeDispatcher::closedir, version 878
>> 0.14% libnio.so Java_sun_nio_fs_UnixNativeDispatcher_open0
>> 0.13% libc-2.31.so __libc_disable_asynccancel
>> 0.13% libnio.so Java_sun_nio_fs_UnixNativeDispatcher_fdopendir
>> 0.13% libnio.so Java_sun_nio_fs_UnixNativeDispatcher_close0
>> 0.84% <...other 69 warm methods...>
>> ```
>>
>> Panama:
>>
>> ```
>> ....[Hottest Methods (after inlining)]..............................................................
>> 89.07% kernel [unknown]
>> 2.35% c2, level 4 java.nio.file.Files::list, version 890
>> 1.45% c2, level 4 de.skymatic.fusepanama.jmh_generated.BenchmarkTest_testListDirPanama_jmhTest::testListDirPanama_avgt_jmhStub, version 917
>> 1.21% Unknown, level 0 sun.nio.fs.UnixNativeDispatcher::open0, version 849
>> 0.62% libc-2.31.so malloc
>> 0.60% libc-2.31.so __close_nocancel
>> 0.47% libc-2.31.so _int_free
>> 0.46% libc-2.31.so _int_malloc
>> 0.41% Unknown, level 0 sun.nio.fs.UnixNativeDispatcher::close0, version 869
>> 0.32% libc-2.31.so __close
>> 0.32% libc-2.31.so __GI___libc_open
>> 0.25% Unknown, level 0 sun.nio.fs.UnixNativeDispatcher::fdopendir, version 851
>> 0.23% libnio.so Java_sun_nio_fs_UnixNativeDispatcher_open0
>> 0.20% Unknown, level 0 sun.nio.fs.UnixNativeDispatcher::closedir, version 868
>> 0.17% libc-2.31.so __GI___dup
>> 0.15% libc-2.31.so __alloc_dir
>> 0.15% libnio.so Java_sun_nio_fs_UnixNativeDispatcher_fdopendir
>> 0.15% libc-2.31.so __libc_disable_asynccancel
>> 0.13% libc-2.31.so __libc_enable_asynccancel
>> 0.12% libnio.so Java_sun_nio_fs_UnixNativeDispatcher_close0
>> 1.17% <...other 81 warm methods...>
>> ```
>>
>> So, no smoking gun so far. I'll keep looking.
>>
>> Maurizio
>>
>> On 15/07/2021 19:14, Maurizio Cimadamore wrote:
>>> Ok, I got it working - what fixed it for me was that the offsets in
>>> the "filler" calls have to be set to zero, which, looking around,
>>> seems to be a default value that leaves the kernel to take care of it.
>>>
>>> ```
>>> $ ls Volumes/foo
>>>
>>> total 0
>>> -r--r--r-- 1 root root 0 Jan 1 1970 aaa
>>> -r--r--r-- 1 root root 0 Jan 1 1970 bbb
>>> -r--r--r-- 1 root root 0 Jan 1 1970 ccc
>>> -r--r--r-- 1 root root 0 Jan 1 1970 ddd
>>> -r--r--r-- 1 root root 13 Jan 1 1970 hello.txt
>>> -r--r--r-- 1 root root 0 Jan 1 1970 xxx
>>> -r--r--r-- 1 root root 0 Jan 1 1970 yyy
>>> -r--r--r-- 1 root root 0 Jan 1 1970 zzz
>>> ```
>>>
>>> Something is probably not right (see "total 0" on top), but perhaps
>>> it should be good enough for the benchmark. Applying the same fix to
>>> the JNR one also fixes that (and results in the same output).
>>> Hopefully I'm good to go now :-)
>>>
>>> Maurizio
>>>
>>> On 15/07/2021 18:15, Maurizio Cimadamore wrote:
>>>> I've re-extracted on Linux, and it seems more lively now. I had to
>>>> fix a couple of type mismatches (e.g. long vs. int and int vs.
>>>> short) in places, and also some of the fields in the "stat"
>>>> structure are different, so the code won't compile as is.
>>>>
>>>> After fixing these minor issues, I see a lot more output printed
>>>> when I mount, and when I do an ls, I see the following lines reported:
>>>>
>>>> [Thread-110] DEBUG
>>>> de.skymatic.fusepanama.examples.HelloPanamaFileSystem - getattr() /
>>>> [Thread-111] DEBUG
>>>> de.skymatic.fusepanama.examples.HelloPanamaFileSystem - opendir() /
>>>> [Thread-112] DEBUG
>>>> de.skymatic.fusepanama.examples.HelloPanamaFileSystem - readdir() /
>>>> [Thread-113] DEBUG
>>>> de.skymatic.fusepanama.examples.HelloPanamaFileSystem - releasedir() /
>>>>
>>>> But that's pretty much it, and in the terminal I still get the
>>>> input/output error. I can even debug, but I can't see much of
>>>> what's going wrong (and I'm not familiar with this API) - the Java
>>>> code executes fine, for what it's worth.
>>>>
>>>> Maurizio
>>>>
>>>> On 15/07/2021 17:51, Sebastian Stenzel wrote:
>>>>> I'll fix it for Linux and let you know!
>>>>>
>>>>>> On 15. Jul 2021, at 18:04, Maurizio Cimadamore
>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>
>>>>>>
>>>>>> On 15/07/2021 17:01, Sebastian Stenzel wrote:
>>>>>>> Yeah well they're pretty much mac-specific. On macOS, FUSE has
>>>>>>> this magic behaviour where you can tell it to mount to a
>>>>>>> non-existing mountpoint inside of `/Volumes/...` and it'll just
>>>>>>> create these (and destroy them on unmount). I believe on Linux
>>>>>>> you need to define an _existing_ mount point. But it is surely
>>>>>>> possible that the volume isn't working yet on Linux. I'll give
>>>>>>> it a try myself and fix it, if required.
>>>>>> I created a folder Volumes under my home folder and am trying to use
>>>>>> that as a mount point, which seems to work ok with JNR.
>>>>>>>
>>>>>>> The benchmark then needs to be adjusted for the two mountpoints
>>>>>>> respectively. But before the benchmark can actually do anything,
>>>>>>> a plain `ls` on the terminal needs to work.
>>>>>>
>>>>>> Ok, it seems even JNR fails:
>>>>>>
>>>>>> ```
>>>>>> $ ls Volumes/bar/
>>>>>> ls: reading directory 'Volumes/bar/': Input/output error
>>>>>> ```
>>>>>>
>>>>>>>
>>>>>>>> On 15. Jul 2021, at 17:55, Maurizio Cimadamore
>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>
>>>>>>>> I tried to reproduce here on Linux, but with no luck - in the
>>>>>>>> sense that I'm not super sure on how to run the benchmark.
>>>>>>>>
>>>>>>>> I'm able somehow to run the two examples - and I noted that
>>>>>>>> here JNR works ok, while the Panama one doesn't seem to mount
>>>>>>>> things correctly - a new mount appears on my file explorer, but
>>>>>>>> I'm unable to do anything with it (even unmount - which can
>>>>>>>> only be done at sudo level).
>>>>>>>>
>>>>>>>> When working with the JNR support, the mount works fine, it
>>>>>>>> shows in the file explorer, I can click on that location and
>>>>>>>> browse, and then unmount from there. Everything works.
>>>>>>>>
>>>>>>>> That said, the benchmarks require the mount points to be up and
>>>>>>>> running - so I've tried first to execute the example (e.g. JNR)
>>>>>>>> and then run the benchmark in two separate terminal windows,
>>>>>>>> all via Maven - but the benchmark doesn't seem to do anything
>>>>>>>> (I've uncommented the benchmarks of course).
>>>>>>>>
>>>>>>>> How do you run them?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Maurizio
>>>>>>>>
>>>>>>>> On 15/07/2021 13:40, Maurizio Cimadamore wrote:
>>>>>>>>> I believe that it would be more useful to try to run the
>>>>>>>>> perfasm profiler with JMH.
>>>>>>>>>
>>>>>>>>> This can be done relatively easily, at least on Linux, if you
>>>>>>>>> pass the argument `-prof perfasm` to JMH. (this would need
>>>>>>>>> hsdis-amd64.so on Linux to print readable assembly).
>>>>>>>>>
>>>>>>>>> Another thing worth checking is allocation rate: `-prof gc`.
>>>>>>>>>
>>>>>>>>> Maurizio
>>>>>>>>>
>>>>>>>>> On 15/07/2021 12:30, Sebastian Stenzel wrote:
>>>>>>>>>> Ok it really seems like VisualVM can't deal with these kinds
>>>>>>>>>> of tasks yet. Now it reports the String constructor being the
>>>>>>>>>> culprit [1], however I strongly doubt that, since this is
>>>>>>>>>> probably one of the most heavily optimized parts of the JDK.
>>>>>>>>>>
>>>>>>>>>> [1]: Screenshot on https://imgur.com/a/SHG8RSQ
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On 15. Jul 2021, at 13:11, Sebastian Stenzel
>>>>>>>>>>> <sebastian.stenzel at gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Yes, must be a sampling error. Do you know of a (publicly
>>>>>>>>>>> available) _profiler_ that is compatible with JDK 17 / 18
>>>>>>>>>>> already?
>>>>>>>>>>>
>>>>>>>>>>>> On 15. Jul 2021, at 13:08, Maurizio Cimadamore
>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Aha - it seems like you are seeing what I was seeing:
>>>>>>>>>>>> unrolling now seems to happen more reliably, which
>>>>>>>>>>>> positively affects code like strlen.
>>>>>>>>>>>>
>>>>>>>>>>>> As for FUSE, I think the reason for the difference has
>>>>>>>>>>>> probably nothing to do with string conversion - the sampler
>>>>>>>>>>>> profiler just happens to hit that code a lot. I checked JNR
>>>>>>>>>>>> code for string conversion and I couldn't really find
>>>>>>>>>>>> anything uber optimized in that regard that could explain
>>>>>>>>>>>> the gap.
>>>>>>>>>>>>
>>>>>>>>>>>> Probably something is not getting optimized as it should -
>>>>>>>>>>>> likely a downcall/upcall intrinsification is failing -
>>>>>>>>>>>> maybe due to a subtle issue with your code, or, possibly
>>>>>>>>>>>> because you are hitting a non-implemented case (e.g. we do
>>>>>>>>>>>> not intrinsify calls which pass arguments on the stack,
>>>>>>>>>>>> yet), or because of some other bug.
>>>>>>>>>>>>
>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 15/07/2021 12:03, Sebastian Stenzel wrote:
>>>>>>>>>>>>> Wow, I stand corrected. I just re-ran the benchmark and
>>>>>>>>>>>>> `benchmarkStrlenBase` just got a lot faster!! Your change in
>>>>>>>>>>>>> https://github.com/openjdk/jdk17/commit/2db9005c07585b580b3ec0889b8b5e3ed0d0ca6a
>>>>>>>>>>>>> DID have an effect after all.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Just doesn't impress FUSE very much...
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 15. Jul 2021, at 13:00, Sebastian Stenzel
>>>>>>>>>>>>>> <sebastian.stenzel at gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yup, I tried the int-approach as well, but with worse
>>>>>>>>>>>>>> results... Here is the full test:
>>>>>>>>>>>>>> https://gist.github.com/overheadhunter/86e7baae7dfe47c49ff364590a4f3ea6
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 15. Jul 2021, at 12:51, Maurizio Cimadamore
>>>>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ok. Thanks.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I tried similar experiments where instead of reading 4
>>>>>>>>>>>>>>> bytes separately I'd read a single int value, and then
>>>>>>>>>>>>>>> use shifts and bitmasking to check for terminators. On
>>>>>>>>>>>>>>> paper good, but benchmark results were always worse than
>>>>>>>>>>>>>>> the version we have now (at least on Linux).
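As a rough illustration of the int-based variant described above (not the code that was actually benchmarked), the following sketch reads four bytes as one little-endian int and uses the well-known "has zero byte" bit trick to test for a terminator. The class name `StrlenIntSketch` and the use of a `ByteBuffer` in place of a `MemorySegment` are assumptions made so that the example runs on a plain JDK.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: ByteBuffer stands in for a MemorySegment.
final class StrlenIntSketch {

    // True if any of the four bytes of v is zero (classic bit trick).
    static boolean hasZeroByte(int v) {
        return ((v - 0x01010101) & ~v & 0x80808080) != 0;
    }

    // Length of the zero-terminated string starting at index 0 of mem.
    static int strlenInt(ByteBuffer mem) {
        // Little-endian view: the lowest-addressed byte is the lowest-order byte of the int.
        ByteBuffer le = mem.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        int limit = le.limit();
        int offset = 0;
        // Main loop: one 4-byte read per iteration.
        for (; offset + 3 < limit; offset += 4) {
            int v = le.getInt(offset);
            if (hasZeroByte(v)) {
                // Locate the first zero byte with shifts and masks.
                for (int i = 0; i < 4; i++) {
                    if (((v >>> (8 * i)) & 0xFF) == 0) {
                        return offset + i;
                    }
                }
            }
        }
        // Tail: fewer than four bytes left, check them one by one.
        for (; offset < limit; offset++) {
            if (le.get(offset) == 0) {
                return offset;
            }
        }
        throw new IllegalArgumentException("String too large");
    }

    public static void main(String[] args) {
        ByteBuffer mem = ByteBuffer.wrap(new byte[] {'h', 'e', 'l', 'l', 'o', 0, 'x', 'y'});
        System.out.println(strlenInt(mem));
    }
}
```

Whether this beats the byte-wise loop depends on what C2 does with each shape; the thread reports that the int-based attempts were slower in practice.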
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> That said, if you could please share the full string
>>>>>>>>>>>>>>> benchmark you have, that'd be helpful, so we can take a
>>>>>>>>>>>>>>> look at that, and see what's going wrong (ideally, C2
>>>>>>>>>>>>>>> should be the one doing unrolling).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 15/07/2021 11:28, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>> I just did a quick synthetic test on a "manually
>>>>>>>>>>>>>>>> unrolled" strlen() without any FUSE context.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I experimented with an implementation that looked like
>>>>>>>>>>>>>>>> the following and benchmarked it using a 259 byte
>>>>>>>>>>>>>>>> memory segment containing a 239 byte string (null byte
>>>>>>>>>>>>>>>> at index 240):
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>> // uses jdk.incubator.foreign.MemoryAccess and jdk.incubator.foreign.MemorySegment
>>>>>>>>>>>>>>>> private static int strlenUnroll4(MemorySegment segment, long start) {
>>>>>>>>>>>>>>>>     int offset;
>>>>>>>>>>>>>>>>     // Main loop: read four bytes per iteration while at least four bytes remain.
>>>>>>>>>>>>>>>>     for (offset = 0; start + offset < segment.byteSize() - 3; offset += 4) {
>>>>>>>>>>>>>>>>         byte b0 = MemoryAccess.getByteAtOffset(segment, start + offset + 0);
>>>>>>>>>>>>>>>>         byte b1 = MemoryAccess.getByteAtOffset(segment, start + offset + 1);
>>>>>>>>>>>>>>>>         byte b2 = MemoryAccess.getByteAtOffset(segment, start + offset + 2);
>>>>>>>>>>>>>>>>         byte b3 = MemoryAccess.getByteAtOffset(segment, start + offset + 3);
>>>>>>>>>>>>>>>>         // is this even faster than directly having 4 different branches?
>>>>>>>>>>>>>>>>         if (b0 == 0 || b1 == 0 || b2 == 0 || b3 == 0) {
>>>>>>>>>>>>>>>>             if (b0 == 0) {
>>>>>>>>>>>>>>>>                 return offset;
>>>>>>>>>>>>>>>>             } else if (b1 == 0) {
>>>>>>>>>>>>>>>>                 return offset + 1;
>>>>>>>>>>>>>>>>             } else if (b2 == 0) {
>>>>>>>>>>>>>>>>                 return offset + 2;
>>>>>>>>>>>>>>>>             } else if (b3 == 0) {
>>>>>>>>>>>>>>>>                 return offset + 3;
>>>>>>>>>>>>>>>>             }
>>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>>     // TODO: maybe no loop required for the remaining <4 bytes?
>>>>>>>>>>>>>>>>     while (start + offset < segment.byteSize()) {
>>>>>>>>>>>>>>>>         byte b = MemoryAccess.getByteAtOffset(segment, start + offset);
>>>>>>>>>>>>>>>>         if (b == 0) {
>>>>>>>>>>>>>>>>             return offset;
>>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>>         offset++; // advance, otherwise the loop never ends on a non-zero byte
>>>>>>>>>>>>>>>>     }
>>>>>>>>>>>>>>>>     throw new IllegalArgumentException("String too large");
>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm not even sure how reliable my results are, since I
>>>>>>>>>>>>>>>> have no clue about how branch prediction works here...
>>>>>>>>>>>>>>>> Neither have I tested the correctness of this
>>>>>>>>>>>>>>>> implementation.
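Since the correctness of the unrolled variant is left open above, here is a minimal sketch of how it could be checked against a plain byte-by-byte loop. It is not the code from the thread: a `byte[]` stands in for the `MemorySegment`, and the names `StrlenUnrollSketch`, `strlenSimple` and `cString` are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch: a byte[] stands in for a MemorySegment.
final class StrlenUnrollSketch {

    // Reference implementation: one byte per iteration.
    static int strlenSimple(byte[] mem, int start) {
        for (int offset = 0; start + offset < mem.length; offset++) {
            if (mem[start + offset] == 0) {
                return offset;
            }
        }
        throw new IllegalArgumentException("String too large");
    }

    // Unrolled by four, followed by a byte-wise tail loop for the last <4 bytes.
    static int strlenUnroll4(byte[] mem, int start) {
        int offset = 0;
        for (; start + offset + 3 < mem.length; offset += 4) {
            if (mem[start + offset] == 0) return offset;
            if (mem[start + offset + 1] == 0) return offset + 1;
            if (mem[start + offset + 2] == 0) return offset + 2;
            if (mem[start + offset + 3] == 0) return offset + 3;
        }
        for (; start + offset < mem.length; offset++) {
            if (mem[start + offset] == 0) return offset;
        }
        throw new IllegalArgumentException("String too large");
    }

    // Zero-filled buffer of the given size whose first len bytes are 'a'.
    static byte[] cString(int size, int len) {
        byte[] mem = new byte[size];
        Arrays.fill(mem, 0, len, (byte) 'a');
        return mem;
    }

    public static void main(String[] args) {
        // Compare both variants for every string length that fits a 259-byte buffer.
        for (int len = 0; len < 259; len++) {
            byte[] mem = cString(259, len);
            if (strlenUnroll4(mem, 0) != strlenSimple(mem, 0)) {
                throw new AssertionError("mismatch at length " + len);
            }
        }
        System.out.println("all lengths agree");
    }
}
```

Sweeping every length exercises all four positions in the unrolled body as well as the tail loop, which is where off-by-one mistakes tend to hide.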
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 15. Jul 2021, at 12:18, Maurizio Cimadamore
>>>>>>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for reporting back.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> We probably need to investigate this a bit more deeply
>>>>>>>>>>>>>>>>> and try and reproduce on our side.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> One last question: you said that with manual unrolling
>>>>>>>>>>>>>>>>> you managed to get 2x faster: did you mean that string
>>>>>>>>>>>>>>>>> conversion got 2x faster or that you actually saw your
>>>>>>>>>>>>>>>>> FUSE benchmark going 2x faster because of the manual
>>>>>>>>>>>>>>>>> unrolling with strings?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 15/07/2021 11:03, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>>> That, surprisingly, didn't change anything either.
>>>>>>>>>>>>>>>>>> But don't worry too much, the performance isn't bad
>>>>>>>>>>>>>>>>>> (in absolute figures) and it is by far not the only
>>>>>>>>>>>>>>>>>> reason why I consider panama the best solution to
>>>>>>>>>>>>>>>>>> create java bindings for c libs.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 12. Jul 2021, at 15:33, Maurizio Cimadamore
>>>>>>>>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Actually, after some bisecting, I found out that the
>>>>>>>>>>>>>>>>>>> performance of converting a memory segment into a
>>>>>>>>>>>>>>>>>>> string jumped 2x faster with this fix:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> https://github.com/openjdk/jdk17/commit/2db9005c07585b580b3ec0889b8b5e3ed0d0ca6a
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Which was integrated after the one I originally
>>>>>>>>>>>>>>>>>>> pointed at. They both seem to touch loop
>>>>>>>>>>>>>>>>>>> optimization in case of overflows, which the strlen
>>>>>>>>>>>>>>>>>>> code is triggering (since the loop limit checks for
>>>>>>>>>>>>>>>>>>> loop variable being positive).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This is a simple patch which adds a string
>>>>>>>>>>>>>>>>>>> conversion test:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>> diff --git a/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java b/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>>>>>>>>>> index ec4da5ffc88..5b3fb1a2b2a 100644
>>>>>>>>>>>>>>>>>>> --- a/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>>>>>>>>>> +++ b/test/micro/org/openjdk/bench/jdk/incubator/foreign/StrLenTest.java
>>>>>>>>>>>>>>>>>>> @@ -93,10 +93,13 @@ public class StrLenTest {
>>>>>>>>>>>>>>>>>>> FunctionDescriptor.ofVoid(C_POINTER).withAttribute(FunctionDescriptor.TRIVIAL_ATTRIBUTE_NAME, true));
>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> + MemorySegment segment;
>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>> @Setup
>>>>>>>>>>>>>>>>>>> public void setup() {
>>>>>>>>>>>>>>>>>>> str = makeString(size);
>>>>>>>>>>>>>>>>>>> segmentAllocator = SegmentAllocator.ofSegment(MemorySegment.allocateNative(size + 1, ResourceScope.newImplicitScope()));
>>>>>>>>>>>>>>>>>>> + segment = toCString(str, segmentAllocator);
>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> @TearDown
>>>>>>>>>>>>>>>>>>> @@ -104,6 +107,11 @@ public class StrLenTest {
>>>>>>>>>>>>>>>>>>> scope.close();
>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> + @Benchmark
>>>>>>>>>>>>>>>>>>> + public String panama_str_conv() throws Throwable {
>>>>>>>>>>>>>>>>>>> + return CLinker.toJavaString(segment);
>>>>>>>>>>>>>>>>>>> + }
>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>> @Benchmark
>>>>>>>>>>>>>>>>>>> public int jni_strlen() throws Throwable {
>>>>>>>>>>>>>>>>>>> return strlen(str);
>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Before the above fix, the numbers are as follows:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>> Benchmark                   (size)  Mode  Cnt    Score   Error  Units
>>>>>>>>>>>>>>>>>>> StrLenTest.panama_str_conv     100  avgt   30  106.613 ± 7.060  ns/op
>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> While after the fix I get this:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>> Benchmark                   (size)  Mode  Cnt    Score   Error  Units
>>>>>>>>>>>>>>>>>>> StrLenTest.panama_str_conv     100  avgt   30   48.120 ± 0.557  ns/op
>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> So, as you can see, a pretty sizeable jump.
>>>>>>>>>>>>>>>>>>> Eyeballing, the shape of generated code doesn't look
>>>>>>>>>>>>>>>>>>> too different, which makes me think of another case
>>>>>>>>>>>>>>>>>>> where loop is unrolled, but main loop never executed
>>>>>>>>>>>>>>>>>>> (similar to JDK-8269230), but we'll need to look
>>>>>>>>>>>>>>>>>>> deeper.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 12/07/2021 14:12, Maurizio Cimadamore wrote:
>>>>>>>>>>>>>>>>>>>> On 12/07/2021 13:18, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>>>>>> Hey Maurizio,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> All tests have been done on commit 42e03fd7c6a
>>>>>>>>>>>>>>>>>>>>> (for details how I built the JDK, see my initial
>>>>>>>>>>>>>>>>>>>>> email). Maybe I'm missing some compiler flags to
>>>>>>>>>>>>>>>>>>>>> enable all optimizations?
>>>>>>>>>>>>>>>>>>>> I see - you do have the latest panama changes, but
>>>>>>>>>>>>>>>>>>>> there has been a sync with upstream after that
>>>>>>>>>>>>>>>>>>>> changeset, I believe - can you please try to resync
>>>>>>>>>>>>>>>>>>>> with the latest foreign-jextract commit - which
>>>>>>>>>>>>>>>>>>>> should be:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> https://github.com/openjdk/panama-foreign/commit/b2a284f6678c6a0d78fbdc8655695119ccb0dadb
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Are you sure about loop vectorization being
>>>>>>>>>>>>>>>>>>>>> applied to strlen? I'm not an expert on this
>>>>>>>>>>>>>>>>>>>>> field, but I had the impression this wasn't
>>>>>>>>>>>>>>>>>>>>> possible when the loop terminates "from within".
>>>>>>>>>>>>>>>>>>>> Vlad is the expert here - when chatting offline he
>>>>>>>>>>>>>>>>>>>> did mention that loop should have single exit -
>>>>>>>>>>>>>>>>>>>> which I guess also takes into account the "normal"
>>>>>>>>>>>>>>>>>>>> exit - so the strlen routine would seem to have two
>>>>>>>>>>>>>>>>>>>> exits...
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On 12. Jul 2021, at 13:50, Maurizio Cimadamore
>>>>>>>>>>>>>>>>>>>>>> <maurizio.cimadamore at oracle.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Sebastian,
>>>>>>>>>>>>>>>>>>>>>> thanks for sharing your findings - I've done some
>>>>>>>>>>>>>>>>>>>>>> attempts here with a targeted microbenchmark
>>>>>>>>>>>>>>>>>>>>>> which measures the performance of string
>>>>>>>>>>>>>>>>>>>>>> conversion and I'm seeing unrolling and
>>>>>>>>>>>>>>>>>>>>>> vectorization being applied on the strlen
>>>>>>>>>>>>>>>>>>>>>> computation.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> May I ask if, by any chance, your HEAD has not
>>>>>>>>>>>>>>>>>>>>>> been updated in the last few weeks? There has
>>>>>>>>>>>>>>>>>>>>>> been a C2 optimization fix which has been added
>>>>>>>>>>>>>>>>>>>>>> recently, which I think might be related to this:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8269230
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Do you have this fix in the JDK you are using?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On 10/07/2021 15:58, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> good idea, but it makes no difference beyond
>>>>>>>>>>>>>>>>>>>>>>> statistical error.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I started sampling the application with VisualVM
>>>>>>>>>>>>>>>>>>>>>>> (which is quite hard, since native threads are
>>>>>>>>>>>>>>>>>>>>>>> extremely short-lived). What I noticed is that,
>>>>>>>>>>>>>>>>>>>>>>> regardless of where the sampler interrupts a
>>>>>>>>>>>>>>>>>>>>>>> thread, in nearly all cases 100% of CPU time is
>>>>>>>>>>>>>>>>>>>>>>> attributed to
>>>>>>>>>>>>>>>>>>>>>>> jdk.internal.foreign.abi.SharedUtils.toJavaStringInternal()
>>>>>>>>>>>>>>>>>>>>>>> → jdk.internal.foreign.abi.SharedUtils.strlen().
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I know that strlen can hardly be optimized due
>>>>>>>>>>>>>>>>>>>>>>> to the nature of null termination, but maybe we
>>>>>>>>>>>>>>>>>>>>>>> can make use of the fact that we're dealing with
>>>>>>>>>>>>>>>>>>>>>>> MemorySegments here: Since they protect us from
>>>>>>>>>>>>>>>>>>>>>>> overflows, maybe there is no need to look at
>>>>>>>>>>>>>>>>>>>>>>> only a single byte at a time. Maybe the
>>>>>>>>>>>>>>>>>>>>>>> strlen()-loop can be unrolled or even be
>>>>>>>>>>>>>>>>>>>>>>> vectorized.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I just did a quick test and observed a x2
>>>>>>>>>>>>>>>>>>>>>>> speedup when doing a x4 loop unroll.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On 9. Jul 2021, at 20:30, Jorn Vernee
>>>>>>>>>>>>>>>>>>>>>>>> <jorn.vernee at oracle.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi Sebastian,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks for testing this. Looking at your code,
>>>>>>>>>>>>>>>>>>>>>>>> one possible explanation for the discrepancy I
>>>>>>>>>>>>>>>>>>>>>>>> can think of is that the DirFiller ends up
>>>>>>>>>>>>>>>>>>>>>>>> using virtual downcalls to do its work, which
>>>>>>>>>>>>>>>>>>>>>>>> are currently not intrinsified. This is mostly a
>>>>>>>>>>>>>>>>>>>>>>>> case of 'not implemented yet', i.e. it is a
>>>>>>>>>>>>>>>>>>>>>>>> known issue.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>> static fuse_fill_dir_t ofAddress(MemoryAddress addr) {
>>>>>>>>>>>>>>>>>>>>>>>>     return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>>>>>>>>>>         try {
>>>>>>>>>>>>>>>>>>>>>>>>             // 'addr' here is not a constant, so the call is virtual
>>>>>>>>>>>>>>>>>>>>>>>>             return (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, x0, x1, x2, x3);
>>>>>>>>>>>>>>>>>>>>>>>>         } catch (Throwable ex$) {
>>>>>>>>>>>>>>>>>>>>>>>>             throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>>>>>>>>>>         }
>>>>>>>>>>>>>>>>>>>>>>>>     };
>>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> For testing purposes, a possible workaround
>>>>>>>>>>>>>>>>>>>>>>>> could be to have a cache that maps the callback
>>>>>>>>>>>>>>>>>>>>>>>> address to a method handle that has the address
>>>>>>>>>>>>>>>>>>>>>>>> bound to the first parameter. Assuming readdir
>>>>>>>>>>>>>>>>>>>>>>>> always gets the same filler callback address,
>>>>>>>>>>>>>>>>>>>>>>>> the same MethodHandle will be reused and
>>>>>>>>>>>>>>>>>>>>>>>> eventually customized, which means the callback
>>>>>>>>>>>>>>>>>>>>>>>> address will become constant, and the downcall
>>>>>>>>>>>>>>>>>>>>>>>> should then be intrinsified.
>>>>>>>>>>>>>>>>>>>>>>>>
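The binding-plus-caching pattern described above can be tried in isolation with plain java.lang.invoke. A minimal sketch, assuming an ordinary static method (the hypothetical fill) in place of the native downcall handle and a long in place of MemoryAddress; it only shows how MethodHandles.insertArguments and a ConcurrentHashMap fit together, and does not demonstrate customization or intrinsification by the JIT.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class BoundHandleCache {

    // Hypothetical stand-in for the native filler function:
    // the first parameter plays the role of the callback address.
    static int fill(long address, long offset) {
        return (int) (address + offset);
    }

    // Unbound handle of type (long, long)int, analogous to fuse_fill_dir_t$MH.
    static final MethodHandle FILL_MH;
    static {
        try {
            FILL_MH = MethodHandles.lookup().findStatic(
                    BoundHandleCache.class, "fill",
                    MethodType.methodType(int.class, long.class, long.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // One bound handle per distinct address, created on first use.
    static final Map<Long, MethodHandle> CACHE = new ConcurrentHashMap<>();

    // Returns a handle of type (long)int with the address bound as argument 0.
    static MethodHandle boundFor(long address) {
        return CACHE.computeIfAbsent(address,
                a -> MethodHandles.insertArguments(FILL_MH, 0, a));
    }

    static int call(long address, long offset) {
        try {
            return (int) boundFor(address).invokeExact(offset);
        } catch (Throwable t) {
            throw new AssertionError("should not reach here", t);
        }
    }

    public static void main(String[] args) {
        System.out.println(call(100L, 5L));
        // The same address yields the same cached handle instance.
        System.out.println(boundFor(100L) == boundFor(100L));
    }
}
```

As long as the same address keeps arriving, the same bound handle instance is reused, which is the precondition named above for the handle to be customized over time.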
>>>>>>>>>>>>>>>>>>>>>>>> I don't currently have access to a Mac machine
>>>>>>>>>>>>>>>>>>>>>>>> to test this, but if you want to try it out,
>>>>>>>>>>>>>>>>>>>>>>>> the patch should be this:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>> diff --git a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>>>>>>>>>> index bfd4655..4c68d4c 100644
>>>>>>>>>>>>>>>>>>>>>>>> --- a/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>>>>>>>>>> +++ b/src/main/java/de/skymatic/fusepanama/lowlevel/fuse_fill_dir_t.java
>>>>>>>>>>>>>>>>>>>>>>>> @@ -3,8 +3,12 @@
>>>>>>>>>>>>>>>>>>>>>>>>  package de.skymatic.fusepanama.lowlevel;
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>  import java.lang.invoke.MethodHandle;
>>>>>>>>>>>>>>>>>>>>>>>> +import java.lang.invoke.MethodHandles;
>>>>>>>>>>>>>>>>>>>>>>>>  import java.lang.invoke.VarHandle;
>>>>>>>>>>>>>>>>>>>>>>>>  import java.nio.ByteOrder;
>>>>>>>>>>>>>>>>>>>>>>>> +import java.util.Map;
>>>>>>>>>>>>>>>>>>>>>>>> +import java.util.concurrent.ConcurrentHashMap;
>>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>>>  import jdk.incubator.foreign.*;
>>>>>>>>>>>>>>>>>>>>>>>>  import static jdk.incubator.foreign.CLinker.*;
>>>>>>>>>>>>>>>>>>>>>>>>  public interface fuse_fill_dir_t {
>>>>>>>>>>>>>>>>>>>>>>>> @@ -17,13 +21,19 @@ public interface fuse_fill_dir_t {
>>>>>>>>>>>>>>>>>>>>>>>>          return RuntimeHelper.upcallStub(fuse_fill_dir_t.class, fi, constants$0.fuse_fill_dir_t$FUNC, "(Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;Ljdk/incubator/foreign/MemoryAddress;J)I", scope);
>>>>>>>>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>>>>>>>>      static fuse_fill_dir_t ofAddress(MemoryAddress addr) {
>>>>>>>>>>>>>>>>>>>>>>>> -        return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>>>>>>>>>> -            try {
>>>>>>>>>>>>>>>>>>>>>>>> -                return (int)constants$0.fuse_fill_dir_t$MH.invokeExact((Addressable)addr, x0, x1, x2, x3);
>>>>>>>>>>>>>>>>>>>>>>>> -            } catch (Throwable ex$) {
>>>>>>>>>>>>>>>>>>>>>>>> -                throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>>>>>>>>>> -            }
>>>>>>>>>>>>>>>>>>>>>>>> -        };
>>>>>>>>>>>>>>>>>>>>>>>> +        class CacheHolder {
>>>>>>>>>>>>>>>>>>>>>>>> +            static final Map<MemoryAddress, fuse_fill_dir_t> CACHE = new ConcurrentHashMap<>();
>>>>>>>>>>>>>>>>>>>>>>>> +        }
>>>>>>>>>>>>>>>>>>>>>>>> +        return CacheHolder.CACHE.computeIfAbsent(addr, addrK -> {
>>>>>>>>>>>>>>>>>>>>>>>> +            final MethodHandle target = MethodHandles.insertArguments(constants$0.fuse_fill_dir_t$MH, 0, addrK);
>>>>>>>>>>>>>>>>>>>>>>>> +            return (jdk.incubator.foreign.MemoryAddress x0, jdk.incubator.foreign.MemoryAddress x1, jdk.incubator.foreign.MemoryAddress x2, long x3) -> {
>>>>>>>>>>>>>>>>>>>>>>>> +                try {
>>>>>>>>>>>>>>>>>>>>>>>> +                    return (int)target.invokeExact(x0, x1, x2, x3);
>>>>>>>>>>>>>>>>>>>>>>>> +                } catch (Throwable ex$) {
>>>>>>>>>>>>>>>>>>>>>>>> +                    throw new AssertionError("should not reach here", ex$);
>>>>>>>>>>>>>>>>>>>>>>>> +                }
>>>>>>>>>>>>>>>>>>>>>>>> +            };
>>>>>>>>>>>>>>>>>>>>>>>> +        });
>>>>>>>>>>>>>>>>>>>>>>>>      }
>>>>>>>>>>>>>>>>>>>>>>>>  }
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>> (I hope these code blocks don't get mangled too
>>>>>>>>>>>>>>>>>>>>>>>> much by line wrapping)
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> HTH,
>>>>>>>>>>>>>>>>>>>>>>>> Jorn
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On 09/07/2021 10:58, Sebastian Stenzel wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I wanted to share the results of a benchmark
>>>>>>>>>>>>>>>>>>>>>>>>> test that includes several down- and upcalls.
>>>>>>>>>>>>>>>>>>>>>>>>> First, let me explain what I'm testing here:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I'm working on a panama-based FUSE binding,
>>>>>>>>>>>>>>>>>>>>>>>>> mostly for experimental purposes right now,
>>>>>>>>>>>>>>>>>>>>>>>>> and I'm trying to beat fuse-jnr [1].
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> While there are some other interesting
>>>>>>>>>>>>>>>>>>>>>>>>> metrics, such as read/write performance (both
>>>>>>>>>>>>>>>>>>>>>>>>> sequential and random access), I focused on
>>>>>>>>>>>>>>>>>>>>>>>>> directory listings for now. Directory listings
>>>>>>>>>>>>>>>>>>>>>>>>> are the most complex operation with regard to
>>>>>>>>>>>>>>>>>>>>>>>>> the number of down- and upcalls:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> 1. FUSE upcalls readdir and provides a
>>>>>>>>>>>>>>>>>>>>>>>>> callback function
>>>>>>>>>>>>>>>>>>>>>>>>> 2. Java downcalls the callback for each item
>>>>>>>>>>>>>>>>>>>>>>>>> in the directory
>>>>>>>>>>>>>>>>>>>>>>>>> 3. FUSE upcalls getattr for each item (no
>>>>>>>>>>>>>>>>>>>>>>>>> longer required with "readdirplus" in FUSE 3.x)
>>>>>>>>>>>>>>>>>>>>>>>>> (4. I'm testing on macOS, which introduces
>>>>>>>>>>>>>>>>>>>>>>>>> additional noise (such as readxattr and trying
>>>>>>>>>>>>>>>>>>>>>>>>> to access files that I didn't report in readdir))
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> So, what I'm testing is essentially this:
>>>>>>>>>>>>>>>>>>>>>>>>> `Files.list(Path.of("/Volumes/foo")).close();`
>>>>>>>>>>>>>>>>>>>>>>>>> with the volume reporting eight files [2].
>>>>>>>>>>>>>>>>>>>>>>>>> When mounting with debug logs enabled, I can
>>>>>>>>>>>>>>>>>>>>>>>>> see that the exact same operations in the same
>>>>>>>>>>>>>>>>>>>>>>>>> order are invoked on both fuse-jnr and
>>>>>>>>>>>>>>>>>>>>>>>>> fuse-panama. One single dir listing results in
>>>>>>>>>>>>>>>>>>>>>>>>> 2 readdir upcalls, 10 callback downcalls, 16
>>>>>>>>>>>>>>>>>>>>>>>>> getattr upcalls. There are also 8 getxattr
>>>>>>>>>>>>>>>>>>>>>>>>> calls and 16 lookup calls, however they don't
>>>>>>>>>>>>>>>>>>>>>>>>> reach Java, as the FUSE kernel knows they are
>>>>>>>>>>>>>>>>>>>>>>>>> not implemented.
>>>>>>>>>>>>>>>>>>>>>>>>>
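For readers without a FUSE mount at hand, the measured operation can be approximated against an ordinary directory. A rough sketch, assuming a temporary directory with eight empty files in place of /Volumes/foo and a plain System.nanoTime() loop in place of JMH (so there is no warmup or error estimation); the class and method names are hypothetical, and the numbers it prints are not comparable to the results in this thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

final class ListDirTiming {

    // Creates a temporary directory holding `count` empty files
    // (a stand-in for the mounted volume reporting eight files).
    static Path createDirWithFiles(int count) {
        try {
            Path dir = Files.createTempDirectory("listdir-sketch");
            for (int i = 0; i < count; i++) {
                Files.createFile(dir.resolve("file" + i + ".txt"));
            }
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Counts directory entries, closing the stream so the directory handle is released.
    static long countEntries(Path dir) {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Average wall-clock time in microseconds of the benchmarked operation:
    // open the directory listing and close it again.
    static double averageMicros(Path dir, int iterations) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            try {
                Files.list(dir).close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return (System.nanoTime() - start) / 1_000.0 / iterations;
    }

    public static void main(String[] args) {
        Path dir = createDirWithFiles(8);
        System.out.println("entries: " + countEntries(dir));
        System.out.println("avg us/op: " + averageMicros(dir, 10_000));
    }
}
```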
>>>>>>>>>>>>>>>>>>>>>>>>> Long story short, here are the results:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>>> Benchmark                        Mode  Cnt    Score   Error  Units
>>>>>>>>>>>>>>>>>>>>>>>>> BenchmarkTest.testListDirJnr     avgt    5   66,569 ± 3,128  us/op
>>>>>>>>>>>>>>>>>>>>>>>>> BenchmarkTest.testListDirPanama  avgt    5  189,340 ± 4,275  us/op
>>>>>>>>>>>>>>>>>>>>>>>>> ```
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I've been using panama snapshot at commit
>>>>>>>>>>>>>>>>>>>>>>>>> 42e03fd7c6a built with: `configure
>>>>>>>>>>>>>>>>>>>>>>>>> --with-boot-jdk=/Library/Java/JavaVirtualMachines/adoptopenjdk-16.jdk/Contents/Home/
>>>>>>>>>>>>>>>>>>>>>>>>> --with-native-debug-symbols=none
>>>>>>>>>>>>>>>>>>>>>>>>> --with-debug-level=release
>>>>>>>>>>>>>>>>>>>>>>>>> --with-libclang=/usr/local/opt/llvm
>>>>>>>>>>>>>>>>>>>>>>>>> --with-libclang-version=12`
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I can't tell where this overhead comes from.
>>>>>>>>>>>>>>>>>>>>>>>>> Maybe creating a newConfinedScope() during
>>>>>>>>>>>>>>>>>>>>>>>>> each upcall [3] is "too much"? Maybe JNR is
>>>>>>>>>>>>>>>>>>>>>>>>> just negligently skipping some memory boundary
>>>>>>>>>>>>>>>>>>>>>>>>> checks to be faster. The results are not
>>>>>>>>>>>>>>>>>>>>>>>>> terrible, but I'd hoped for something better.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Sebastian
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> [1] https://github.com/SerCeMan/jnr-fuse
>>>>>>>>>>>>>>>>>>>>>>>>> [2] https://github.com/skymatic/fuse-panama/blob/develop/src/test/java/de/skymatic/fusepanama/examples/HelloPanamaFileSystem.java#L139-L146
>>>>>>>>>>>>>>>>>>>>>>>>> [3] https://github.com/skymatic/fuse-panama/blob/769347575863861063a2347a42b2cbaadb5eacef/src/main/java/de/skymatic/fusepanama/FuseOperations.java#L67-L71
>>>>>
More information about the panama-dev
mailing list