[foreign] Measuring downcall stub performance
Cleber Muramoto
cleber.muramoto at gmail.com
Sat Feb 5 17:01:07 UTC 2022
Hello, I'm trying to understand the overhead of calling a downcall stub
generated by jextract.
The source I'm using can be found at
https://github.com/cmuramoto/test-jextract
Basically, I have the following piece of code:
MemorySegment data = loadFile(); // mmaps a text file with keywords, 1 per line ('\n' separator)
MemorySegment trie = loadTrie(); // mmaps a trie data structure that maps string->int
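(loadFile/loadTrie just mmap the files into native segments; a minimal sketch of that step, written against the JDK 19 preview API; the jdk-17 branch does the equivalent with jdk.incubator.foreign:)

import java.io.IOException;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.MemorySession;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// maps the whole file as a read-only native segment
static MemorySegment mapReadOnly(Path path, MemorySession session) throws IOException {
    try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
        return ch.map(FileChannel.MapMode.READ_ONLY, 0L, ch.size(), session);
    }
}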
The trie is a contiguous array of (int, int) pairs, which are fetched using

int base(MemorySegment ms, long offset) {
    return ms.get(JAVA_INT, offset << 3);
}

int check(MemorySegment ms, long offset) {
    return ms.get(JAVA_INT, 4L + (offset << 3));
}
And the lookup method is as follows:

// finds the value associated with the key data.slice(pos, end - pos)
long lookup(MemorySegment trie, MemorySegment data, int pos, int end) {
    var from = 0L;
    var to = 0L;
    while (pos < end) {
        to = ((long) base(trie, from)) ^ (data.get(JAVA_BYTE, pos) & 0xFF);
        if (check(trie, to) != from) {
            return ABSENT; // 1L << 33
        }
        from = to;
        pos++;
    }
    to = base(trie, from);
    var check = check(trie, to);
    if (check != i32(from)) {
        return NO_VALUE; // 1L << 32
    }
    return base(trie, to);
}
Running this code on a file with ~10 million lines and a fully populated trie (so that every lookup succeeds), I measured a best case of ~391 ns/query.
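The measurement loop is essentially the following (simplified sketch, not the exact driver from the repo):

long sum = 0;                      // accumulate results so the loop can't be dead-code eliminated
int pos = 0;
int size = (int) data.byteSize();
for (int i = 0; i < size; i++) {
    if (data.get(JAVA_BYTE, i) == '\n') {
        sum += lookup(trie, data, pos, i); // key is data[pos, i)
        pos = i + 1;
    }
}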
This lookup code suffers from bounds checks (unnecessary given how the trie constrains its 'jumps' from the previous offset to the current one, but the JVM can't possibly know that). Using Unsafe directly gives a pretty decent boost, but it still can't match the C++ counterpart.
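By 'using Unsafe directly' I mean reading straight off the raw address instead of going through the segment accessors, along these lines (sketch; trieAddr stands for the trie segment's base address):

import sun.misc.Unsafe;
import java.lang.reflect.Field;

static final Unsafe U;
static {
    try {
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        U = (Unsafe) f.get(null);
    } catch (ReflectiveOperationException e) {
        throw new ExceptionInInitializerError(e);
    }
}

// raw-address flavour of base/check: no bounds or liveness checks
static int base(long trieAddr, long offset) {
    return U.getInt(trieAddr + (offset << 3));
}

static int check(long trieAddr, long offset) {
    return U.getInt(trieAddr + 4L + (offset << 3));
}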
Based on the lookup code, I created an assembly version that receives the segments' raw addresses and the length of the string, compiled it with nasm, and created a shared library with ld:
nasm -felf64 trie.asm -o trie.o
ld -z noexecstack -shared trie.o -o libtrie.so
Then I generated the stubs with jextract:
jextract -t trie.asm --source -ltrie trie.h
trie.h declares the lookup function and a noop (xor rax, rax; ret), which I used as a baseline to measure the overhead of a 3-argument call; in my environment it is about ~18 ns/call.
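The baseline is essentially a tight loop over the generated wrapper; in the sketch below, trie_h.noop(...) and the argument values are placeholders for whatever jextract emitted from trie.h:

// warm up, then time N back-to-back calls to the generated noop stub
final int N = 50_000_000;
long a = trieAddr, b = dataAddr, c = 0L; // dummy arguments, just to exercise the 3-arg path
for (int i = 0; i < N / 10; i++) {
    trie_h.noop(a, b, c);
}
long t0 = System.nanoTime();
for (int i = 0; i < N; i++) {
    trie_h.noop(a, b, c);
}
System.out.printf("noop: %.1f ns/call%n", (System.nanoTime() - t0) / (double) N);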
A compiled C++ program linked against libtrie.so manages to achieve ~210 ns/query.
My naive expectation was that on the Java side I would get roughly C++ throughput + ~20 ns ≃ 230 ns/call, but so far I have the following:
(best out of 10 loops)

jdk-17 branch
  - native:          288 ns/query
  - unsafe:          282 ns/query

jdk-19 branch
  - native:          372 ns/query
  - memory-segments: 391 ns/query
  - unsafe:          317 ns/query
I did not use JMH, but the tests were executed with EpsilonGC and the hot
paths are allocation-free.
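(EpsilonGC here means running with -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC, so there is no GC activity at all during the measured runs.)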
So I wonder: on the JVM side, is there any difference in the intrinsic call overhead when invoking a noop vs. a real function (both with the same argument types)? If not, how can I profile the culprit? I tried async-profiler, but the flame graph seems dominated by the assembly labels of libtrie.so (especially the inc3).
Is the ~30% degradation of jdk-19 compared to jdk-17 expected?
Cheers!