[foreign] Measuring downcall stub performance

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Mon Feb 7 13:55:34 UTC 2022


On 07/02/2022 13:36, Cleber Muramoto wrote:
> Hi Maurizio.
>
> Probably git LFS messed up. The keys and trie files should be 95MB and 
> 280MB, respectively.

I downloaded the file directly from github and it now works.

Here's what I get:

unsafe
ns/q: 335.05

native:
ns/q: 339.13

segment:
ns/q: 329.15

(for each, I only listed a single result - I'm not sure how to parse all 
the numbers the benchmark generates, so I just used the very first one :-)).

In general, using the latest Panama EA, I couldn't really tell the three 
benchmarks apart, and I also noticed that the numbers can vary quite a bit 
from run to run.

Maurizio

>
> You'll find the files in this drive link:
>
> https://drive.google.com/drive/folders/1x-imW6MLZNdcJSolmG2B0B06OCYKLU8i 
>
> The benchmark code for C++ can be found under native folder 
> (bench.cpp), just run
>
> ./compile.bench.sh shared
> ./run.shared.sh
>
> I found it odd too that Unsafe got slower in comparison to jdk-17. 
> I'll create a branch with the corresponding code.
>
> Regards
>
> On Mon, Feb 7, 2022 at 8:16 AM Maurizio Cimadamore 
> <maurizio.cimadamore at oracle.com> wrote:
>
>     Hi,
>     your estimate of ~18ns/call seems accurate. We have a JMH benchmark
>     which measures no-op calls, and the results are similar.
>
>     We'd need to investigate the bounds checks (there is some work going
>     on in that area, see [1]) and, more importantly, the performance
>     regression w.r.t. 17.
>
>     There are no VM changes from 17 to 18 - the only thing that changed
>     is that we now also check by-reference parameters in downcalls for
>     liveness - but in this case you're using MemoryAddress parameters,
>     so your code is not impacted by that.
>
>     It's also interesting that the Unsafe path got slower in the
>     transition from 17 to 19, which is odd.
>
>     It would also help if the native benchmark had corresponding C/C++
>     code, for portability.
>
>     Thanks
>     Maurizio
>
>     On 05/02/2022 17:01, Cleber Muramoto wrote:
>     > Hello, I'm trying to understand the overhead of calling a
>     downcall stub
>     > generated by jextract.
>     >
>     > The source I'm using can be found on
>     > https://github.com/cmuramoto/test-jextract
>     >
>     > Basically, I have the following piece of code
>     >
>     > // mmaps a text file with keywords, one per line (\n = separator)
>     > MemorySegment data = loadFile();
>     > // mmaps a trie data structure that maps string->int
>     > MemorySegment trie = loadTrie();
>     >
>     > The trie is a contiguous array of (int, int) pairs, which are
>     > fetched using
>     >
>     > int base(MemorySegment ms, long offset) {
>     >   return ms.get(JAVA_INT, offset << 3);
>     > }
>     >
>     > int check(MemorySegment ms, long offset) {
>     >   return ms.get(JAVA_INT, 4L + (offset << 3));
>     > }
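[As an aside, the (base, check) pair layout above can be illustrated off-heap without the foreign API: a node index, shifted left by 3, selects an 8-byte record whose first int is base and second is check. The sketch below uses a direct ByteBuffer as a hypothetical stand-in for the mmapped trie segment; names mirror the snippet above:]

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class TriePairLayout {
    // Each trie node is an 8-byte record: int base, int check.
    public static int base(ByteBuffer trie, long offset) {
        return trie.getInt((int) (offset << 3));
    }

    public static int check(ByteBuffer trie, long offset) {
        return trie.getInt((int) (4L + (offset << 3)));
    }

    public static void main(String[] args) {
        // Two nodes: node 0 = (base=7, check=-1), node 1 = (base=0, check=0).
        ByteBuffer trie = ByteBuffer.allocateDirect(16).order(ByteOrder.nativeOrder());
        trie.putInt(0, 7).putInt(4, -1).putInt(8, 0).putInt(12, 0);

        System.out.println(base(trie, 0));   // 7
        System.out.println(check(trie, 0));  // -1
        System.out.println(check(trie, 1));  // 0
    }
}
```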
>     >
>     > And the lookup method is as follows
>     >
>     > // finds the value associated with the key data.slice(pos, end - pos)
>     > long lookup(MemorySegment trie, MemorySegment data, int pos, int end) {
>     >   var from = 0L;
>     >   var to = 0L;
>     >
>     >   while (pos < end) {
>     >     to = ((long) base(trie, from)) ^ (data.get(JAVA_BYTE, pos) & 0xFF);
>     >     if (check(trie, to) != from) {
>     >       return ABSENT; // 1L << 33
>     >     }
>     >     from = to;
>     >     pos++;
>     >   }
>     >
>     >   to = base(trie, from);
>     >   var check = check(trie, to);
>     >
>     >   if (check != i32(from)) {
>     >     return NO_VALUE; // 1L << 32
>     >   }
>     >   return base(trie, to);
>     > }
>     >
>     > Running this code on a file with ~10 million lines against a fully
>     > populated trie (so that every lookup succeeds), I measured a
>     > best-case cost of ~391ns/query.
>     >
>     > This lookup code suffers from bounds checks (which are unnecessary
>     > given how the trie constrains its 'jumps' from a previous to a
>     > current offset, but the JVM can't possibly know this). Using Unsafe
>     > directly gives a pretty decent boost, but it still can't match the
>     > C++ counterpart.
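[For reference, a minimal sketch of what the Unsafe variant of the two accessors can look like - reading through a raw address removes the bounds checks entirely. This is an illustration, not the poster's code: sun.misc.Unsafe (from the jdk.unsupported module) is obtained reflectively, and addr is assumed to be the raw base address of the mmapped trie:]

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class UnsafeTrieAccess {
    private static final Unsafe U;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            U = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // addr is the raw base address of the trie; no bounds checks are performed.
    public static int base(long addr, long offset) {
        return U.getInt(addr + (offset << 3));
    }

    public static int check(long addr, long offset) {
        return U.getInt(addr + 4L + (offset << 3));
    }

    public static void main(String[] args) {
        long addr = U.allocateMemory(16);
        U.putInt(addr, 42);
        U.putInt(addr + 4, 7);
        System.out.println(base(addr, 0));   // 42
        System.out.println(check(addr, 0));  // 7
        U.freeMemory(addr);
    }
}
```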
>     >
>     > Based on the lookup code, I created an assembly version that
>     receives the
>     > segment's raw addresses and the length of the string, compiled
>     it with
>     > nasm, and created a shared library with ld:
>     >
>     > nasm -felf64 trie.asm -o trie.o
>     > ld -z noexecstack -shared trie.o -o libtrie.so
>     >
>     > Then generated the stubs with jextract:
>     >
>     > jextract -t trie.asm --source -ltrie trie.h
>     >
>     > trie.h declares the lookup function and a noop (xor rax, rax; ret),
>     > which I used as a baseline to measure the 3-arg call overhead; in my
>     > environment it is about ~18ns/call.
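[The per-call overhead estimate can be reproduced with a plain nanoTime loop; the sketch below times a Java no-op through a MethodHandle as a hypothetical stand-in for the jextract-generated downcall handle - the real measurement would bind the native noop symbol instead:]

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class CallOverhead {
    // Stand-in no-op with a 3-long signature, mirroring the native baseline.
    public static long noop(long a, long b, long c) { return 0L; }

    // Average cost of one call through the handle, in nanoseconds.
    public static double nsPerCall(MethodHandle mh, int iterations) throws Throwable {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += (long) mh.invokeExact(1L, 2L, 3L);
        }
        long elapsed = System.nanoTime() - start;
        if (sink != 0) throw new AssertionError(); // keep the loop live
        return (double) elapsed / iterations;
    }

    public static void main(String[] args) throws Throwable {
        MethodHandle mh = MethodHandles.lookup().findStatic(
                CallOverhead.class, "noop",
                MethodType.methodType(long.class, long.class, long.class, long.class));
        nsPerCall(mh, 100_000); // warm-up
        System.out.printf("~%.1f ns/call%n", nsPerCall(mh, 1_000_000));
    }
}
```

(A proper measurement would use JMH with a Blackhole rather than a hand-rolled loop; this only illustrates the shape of the baseline.)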
>     >
>     > C++ code compiled and linked against libtrie.so manages to achieve
>     > ~210ns/query.
>     >
>     > My naive expectation was that on the Java side I would get,
>     roughly, C
>     > throughput + 20ns ≃ 230ns/call, but so far I have the following:
>     >
>     > (best out of 10 loops)
>     > jdk-17 branch
>     > -native - 288 ns/query
>     > -unsafe - 282 ns/query
>     >
>     > jdk-19 branch
>     > -native - 372 ns/query
>     > -memory-segments - 391 ns/query
>     > -unsafe - 317 ns/query
>     >
>     > I did not use JMH, but the tests were executed with EpsilonGC
>     and the hot
>     > paths are allocation-free.
>     >
>     > So I wonder, on the JVM side, is there any difference in the
>     > intrinsic call overhead when invoking a noop vs. a real function
>     > (both with the same argument types)? If not, how can I profile the
>     > culprit? I tried async-profiler, but the flame graph seems dominated
>     > by the assembly labels of libtrie.so (especially the inc3).
>     >
>     > Is the ~30% degradation of jdk-19 when compared to jdk-17 expected?
>     >
>     > Cheers!
>


More information about the panama-dev mailing list