[foreign] Measuring downcall stub performance
Cleber Muramoto
cleber.muramoto at gmail.com
Sat Feb 5 17:01:07 UTC 2022
Hello, I'm trying to understand the overhead of calling a downcall stub
generated by jextract.
The source I'm using can be found at
https://github.com/cmuramoto/test-jextract
Basically, I have the following piece of code:
MemorySegment data = loadFile(); // mmaps a text file with keywords, 1 per line ('\n' separator)
MemorySegment trie = loadTrie(); // mmaps a trie data structure that maps string->int
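(loadFile/loadTrie just mmap the files into native segments; a minimal sketch of that step, written against the JDK 19 preview API; the jdk-17 branch does the equivalent with jdk.incubator.foreign:)

import java.io.IOException;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.MemorySession;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// maps the whole file as a read-only native segment
static MemorySegment mapReadOnly(Path path, MemorySession session) throws IOException {
    try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
        return ch.map(FileChannel.MapMode.READ_ONLY, 0L, ch.size(), session);
    }
}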
The trie is a contiguous array of (int, int) pairs, which are fetched using

int base(MemorySegment ms, long offset) {
    return ms.get(JAVA_INT, offset << 3);
}

int check(MemorySegment ms, long offset) {
    return ms.get(JAVA_INT, 4L + (offset << 3));
}
And the lookup method is as follows:

// finds the value associated with the key data.slice(pos, end - pos)
long lookup(MemorySegment trie, MemorySegment data, int pos, int end) {
    var from = 0L;
    var to = 0L;
    while (pos < end) {
        to = ((long) base(trie, from)) ^ (data.get(JAVA_BYTE, pos) & 0xFF);
        if (check(trie, to) != from) {
            return ABSENT; // 1L << 33
        }
        from = to;
        pos++;
    }
    to = base(trie, from);
    var check = check(trie, to);
    if (check != i32(from)) {
        return NO_VALUE; // 1L << 32
    }
    return base(trie, to);
}
Running this code on a file with ~10 million lines and a fully populated trie (so that every lookup succeeds), I measured a best case of ~391 ns/query.
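The measurement loop is essentially the following (simplified sketch, not the exact driver from the repo):

long sum = 0;                      // accumulate results so the loop can't be dead-code eliminated
int pos = 0;
int size = (int) data.byteSize();
for (int i = 0; i < size; i++) {
    if (data.get(JAVA_BYTE, i) == '\n') {
        sum += lookup(trie, data, pos, i); // key is data[pos, i)
        pos = i + 1;
    }
}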
This lookup code suffers from bounds checks (unnecessary given how the trie constrains its 'jumps' from the previous offset to the current one, but the JVM can't possibly know that). Using Unsafe directly gives a pretty decent boost, but it still can't match the C++ counterpart.
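By 'using Unsafe directly' I mean reading straight off the raw address instead of going through the segment accessors, along these lines (sketch; trieAddr stands for the trie segment's base address):

import sun.misc.Unsafe;
import java.lang.reflect.Field;

static final Unsafe U;
static {
    try {
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        U = (Unsafe) f.get(null);
    } catch (ReflectiveOperationException e) {
        throw new ExceptionInInitializerError(e);
    }
}

// raw-address flavour of base/check: no bounds or liveness checks
static int base(long trieAddr, long offset) {
    return U.getInt(trieAddr + (offset << 3));
}

static int check(long trieAddr, long offset) {
    return U.getInt(trieAddr + 4L + (offset << 3));
}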
Based on the lookup code, I created an assembly version that receives the segments' raw addresses and the length of the string, compiled it with nasm, and created a shared library with ld:
nasm -felf64 trie.asm -o trie.o
ld -z noexecstack -shared trie.o -o libtrie.so
Then I generated the stubs with jextract:
jextract -t trie.asm --source -ltrie trie.h
trie.h declares the lookup function and a noop (xor rax, rax; ret), which I used as a baseline to measure the overhead of a 3-argument call; in my environment it is about ~18 ns/call.
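The baseline is essentially a tight loop over the generated wrapper; in the sketch below, trie_h.noop(...) and the argument values are placeholders for whatever jextract emitted from trie.h:

// warm up, then time N back-to-back calls to the generated noop stub
final int N = 50_000_000;
long a = trieAddr, b = dataAddr, c = 0L; // dummy arguments, just to exercise the 3-arg path
for (int i = 0; i < N / 10; i++) {
    trie_h.noop(a, b, c);
}
long t0 = System.nanoTime();
for (int i = 0; i < N; i++) {
    trie_h.noop(a, b, c);
}
System.out.printf("noop: %.1f ns/call%n", (System.nanoTime() - t0) / (double) N);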
A compiled C++ program linked against libtrie.so manages to achieve ~210 ns/query.
My naive expectation was that on the Java side I would get roughly C++ throughput + ~20 ns ≃ 230 ns/call, but so far I have the following:
(best out of 10 loops)

jdk-17 branch
  - native:          288 ns/query
  - unsafe:          282 ns/query

jdk-19 branch
  - native:          372 ns/query
  - memory-segments: 391 ns/query
  - unsafe:          317 ns/query
I did not use JMH, but the tests were executed with EpsilonGC and the hot
paths are allocation-free.
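(EpsilonGC here means running with -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC, so there is no GC activity at all during the measured runs.)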
So I wonder: on the JVM side, is there any difference in the intrinsic call overhead when invoking a noop vs. a real function (both with the same argument types)? If not, how can I profile the culprit? I tried async-profiler, but the flame graph seems dominated by the assembly labels of libtrie.so (especially the inc3).
Is the ~30% degradation of jdk-19 compared to jdk-17 expected?
Cheers!