[foreign] Measuring downcall stub performance

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Tue Feb 8 16:11:15 UTC 2022


Hi,
I played with the benchmark some more, and I ended up converting it into 
a single JMH benchmark, extending it and adding other variants (e.g. 
byte buffer, memory address).

The benchmark can be found here:

http://cr.openjdk.java.net/~mcimadamore/panama/TrieBenchmark.java

The numbers I get out of JMH are similar to those generated by your 
harness. They are below (tested against latest Java 19):

Benchmark                      Mode  Cnt     Score    Error  Units
TrieBenchmark.address_seek     avgt   30  2561.813 ± 24.697  ms/op
TrieBenchmark.buffer_seek      avgt   30  3402.664 ± 32.409  ms/op
TrieBenchmark.native_seek      avgt   30  3251.644 ± 48.709  ms/op
TrieBenchmark.native_seek_alt  avgt   30  2405.684 ± 14.005  ms/op
TrieBenchmark.segment_seek     avgt   30  3390.148 ± 15.195  ms/op
TrieBenchmark.unsafe_seek      avgt   30  2537.085 ± 20.125  ms/op


(note that each "op" in JMH is a full round of 10M seeks, so to find 
the average time per query in nanoseconds you can just divide the 
above numbers by 10).
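Spelled out for the native_seek row (plain arithmetic, no JMH involved):

```java
// Each JMH "op" is 10,000,000 seeks, and scores are reported in ms/op:
// ns/query = (ms/op) * 1_000_000 ns/ms / 10_000_000 queries = (ms/op) / 10
double msPerOp = 3251.644;          // native_seek score from the table above
double nsPerQuery = msPerOp / 10.0; // ~325 ns per seek
System.out.println(nsPerQuery);
```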

So, ByteBuffer and MemorySegment are the slowest. This is, as you 
posited, because of the bounds checks. There's no way C2 can optimize 
those checks away, as the pattern of access is essentially inscrutable 
to C2 (e.g. we read an address from a segment and use that dynamically 
read address to access another location).

Unsafe and MemoryAddress, by dropping bounds checks, receive a boost, and 
end up with similar numbers.
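The problematic access shape, sketched here with plain arrays instead of the actual segment API (names are illustrative, not from the benchmark):

```java
// Each index is computed from data just loaded, so C2 cannot prove a
// static bound on 'next' and must re-check the bound on every iteration.
static int chase(int[] table, int start, int steps) {
    int next = start;
    for (int i = 0; i < steps; i++) {
        next = table[next]; // the dynamically read value becomes the next index
    }
    return next;
}
```

Contrast this with a linear scan (`table[i]`), where the bound is a simple function of the loop variable and the check can be hoisted out of the loop.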

The first native reader, which is basically a copy of the code you 
included, is indeed slower than expected - it seems as slow as using 
buffers or segments. (Note that running the included C++ benchmark I get 
~190ns/op).

I verified there's no GC allocation (-prof gc) and also tried to profile 
(-prof perfasm), but I could only see (as you did) that most time was 
spent in the assembly. So I thought: what if the native call is not the 
issue? What if the problem is in the surrounding Java code, the one that 
looks for a terminator char?
So I looked at the C++ code more closely and realized that the logic of 
the loop was written in a slightly different way; basically there are two 
loops, an outer loop and an inner loop which only focuses on 
finding the terminator. This makes the logic a bit more linear. So I 
tried to copy that logic into `native_seek_alt` and got a massive boost.
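The restructured loop has roughly this shape (an illustrative sketch, not the actual benchmark code; the commented-out downcall stands in for the real native seek):

```java
// Outer loop walks records; the inner loop only scans for the '\n'
// terminator, keeping that tight scan free of the native call.
static int countTokens(byte[] data) {
    int count = 0;
    int pos = 0;
    while (pos < data.length) {
        int start = pos;
        while (pos < data.length && data[pos] != '\n') {
            pos++;                  // tight, easily optimized scan
        }
        // here the real benchmark would do the downcall, e.g.:
        // seek(trieAddr, dataAddr + start, pos - start);
        count++;
        pos++;                      // skip the terminator
    }
    return count;
}
```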

Initially I thought it might have been an issue with the rewrite, but 
then I put together a test which ran both the old and the new versions 
and ensured that the native function was called (a) the same number of 
times and (b) with the same parameters (throwing an exception 
otherwise). And, lo and behold, no exceptions were thrown.
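Such a cross-check can be sketched as follows (hypothetical helper, not the actual test code; the `Call` record and names are illustrative):

```java
import java.util.List;
import java.util.Objects;

// Record every (address, length) pair a version would pass to the native
// function, then demand that both versions produce identical traces.
record Call(long address, long length) {}

static void verifySameCalls(List<Call> oldVersion, List<Call> newVersion) {
    if (oldVersion.size() != newVersion.size()) {
        throw new IllegalStateException("call counts differ");
    }
    for (int i = 0; i < oldVersion.size(); i++) {
        if (!Objects.equals(oldVersion.get(i), newVersion.get(i))) {
            throw new IllegalStateException("arguments differ at call " + i);
        }
    }
}
```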

This seems to get close to the results you were looking for. I don't 
have a really good explanation as to why changing the outer loop affects 
things so much. I imagine that the Java version of the "findSeparator" 
routine is more easily optimized by C2 (since it's a tight loop which 
accesses consecutive positions). I wonder if the presence of the native 
call in the middle of the loop in the base version adds some barriers 
which then make the sequential scan for the separator less optimal. If I 
replace the call in the "seek" function with your "noop" function, I go 
from 22ns/op to 10ns/op, a ~2x boost. But if I just call "noop" in a 
loop 10M times, I get down to ~7ns/op, which seems to indicate that 
7ns/op is (on my machine) the overhead of a native call (compatible with 
our other microbenchmarks) and that, in the original version, 12ns were 
spent _outside_ the native call (which, in turn, explains why changing 
the loop shape affects things so much).

I also played a bit with removing native thread transitions (which Java 
19 allows, by tweaking some code). With that, the overhead per
call goes down to 1.5ns/op and the overall result for native_seek_alt is 
~220ns/op. So, transitions make some difference here (expected, given 
that we do native calls back to back), but the overall impact of 
transitions on the throughput remains low.

I hope this helps.

Maurizio


On 05/02/2022 17:01, Cleber Muramoto wrote:
> Hello, I'm trying to understand the overhead of calling a downcall stub
> generated by jextract.
>
> The source I'm using can be found on
> https://github.com/cmuramoto/test-jextract
>
> Basically, I have the following piece of code
>
> MemorySegment data = loadFile(); // mmaps a text file with keywords, 1 per
> line (\n = separator)
> MemorySegment trie = loadTrie(); // mmaps a trie data structure that maps
> string->int
>
> The trie is a contiguous array of pairs (int, int), which are fetched using
>
> int base(MemorySegment ms, long offset) {
>    return ms.get(JAVA_INT, offset << 3);
> }
>
> int check(MemorySegment ms, long offset) {
>    return ms.get(JAVA_INT, 4L + (offset << 3));
> }
>
> And the lookup method is as follows
>
> // finds the value associated with the key data.slice(pos, end - pos)
> long lookup(MemorySegment trie, MemorySegment data, int pos, int end) {
>    var from = 0L;
>    var to = 0L;
>
>    while (pos < end) {
>      to = ((long) base(trie, from)) ^ (data.get(JAVA_BYTE, pos) & 0xFF);
>      if (check(trie, to) != from) {
>        return ABSENT; // 1L << 33
>      }
>      from = to;
>      pos++;
>    }
>
>    to = base(trie, from);
>    var check = check(trie, to);
>
>    if (check != i32(from)) {
>      return NO_VALUE; // 1L << 32
>    }
>    return base(trie, to);
> }
>
> Running this code on a file with ~10 million lines with a fully populated
> trie (so that every lookup succeeds), I measured a peak throughput of
> 391ns/query.
>
> This lookup code suffers from bounds checks (which are unnecessary given
> how the trie constrains its 'jumps' from a previous to a current offset,
> but the JVM can't possibly know this), and using Unsafe directly gives
> a pretty decent boost, but still can't match the C++ counterpart.
>
> Based on the lookup code, I created an assembly version that receives the
> segment's raw addresses and the length of the string, compiled it with
> nasm, and created a shared library with ld:
>
> nasm -felf64 trie.asm -o trie.o
> ld -z noexecstack -shared trie.o -o libtrie.so
>
> Then generated the stubs with jextract:
>
> jextract -t trie.asm --source -ltrie trie.h
>
> trie.h declares the lookup function and a noop (xor rax, rax; ret) which I
> used as a baseline to measure the 3-arg call overhead, which in my
> environment is about ~18ns/call.
>
> The compiled C++ code, linked against libtrie.so, manages to achieve
> ~210ns/query.
>
> My naive expectation was that on the Java side I would get, roughly, C
> throughput + 20ns ≃ 230ns/call, but so far I have the following:
>
> (best out of 10 loops)
> jdk-17 branch
> -native - 288 ns/query
> -unsafe - 282 ns/query
>
> jdk-19 branch
> -native - 372 ns/query
> -memory-segments - 391 ns/query
> -unsafe - 317 ns/query
>
> I did not use JMH, but the tests were executed with EpsilonGC and the hot
> paths are allocation-free.
>
> So I wonder, on the JVM side, is there any difference in the intrinsic call
> overhead when invoking a noop vs a real function (both with the same type
> of arguments)? If not, how can I profile the culprit? I tried
> async-profiler but the flame graph seems dominated by the assembly labels
> of libtrie.so (especially the inc3)
>
> Is the ~30% degradation of jdk-19 when compared to jdk-17 expected?
>
> Cheers!


More information about the panama-dev mailing list