[foreign] Measuring downcall stub performance
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Mon Feb 7 11:16:14 UTC 2022
Hi,
your estimate of 18ns/call seems accurate. We have a JMH benchmark
which measures no-op calls, and the results are similar.
We'd need to investigate re. bounds checks (there is some work going
on in that area, see [1]) and, more importantly, the performance
regression w.r.t. 17.
There are no VM changes from 17 to 18 - the only thing that changed is
that we now also check by-reference parameters in downcalls for liveness
- but in this case you're using MemoryAddress parameters, so your code
is not impacted by that.
It's also interesting that the Unsafe path got slower as well in the
transition from 17 to 19, which is odd.
It would also help if the native benchmark had corresponding C/C++ code,
for portability.
Thanks
Maurizio
On 05/02/2022 17:01, Cleber Muramoto wrote:
> Hello, I'm trying to understand the overhead of calling a downcall stub
> generated by jextract.
>
> The source I'm using can be found on
> https://github.com/cmuramoto/test-jextract
>
> Basically, I have the following piece of code
>
> MemorySegment data = loadFile(); // mmaps a text file with keywords, 1 per
> line (\n = separator)
> MemorySegment trie = loadTrie(); // mmaps a trie data structure that maps
> string->int
>
> The trie is a contiguous array of pairs (int, int), which are fetched using
>
> int base(MemorySegment ms, long offset) {
>   return ms.get(JAVA_INT, offset << 3);
> }
>
> int check(MemorySegment ms, long offset) {
>   return ms.get(JAVA_INT, 4L + (offset << 3));
> }
>
> And the lookup method is as follows
>
> // finds the value associated with the key data.slice(pos, end - pos)
> long lookup(MemorySegment trie, MemorySegment data, int pos, int end){
> var from = 0L;
> var to = 0L;
>
> while (pos < end) {
> to = ((long)(base(trie, from))) ^ (data.get(JAVA_BYTE,pos) & 0xFF);
> if(check(trie, to) != from) {
> return ABSENT; // 1L<<33
> }
> from = to;
> pos++;
> }
>
> to = base(trie, from);
> var check = check(trie, to);
>
> if (check != i32(from)) {
> return NO_VALUE; // 1L << 32
> }
> return base(trie, to);
> }
>
> Running this code on a file with ~10 million lines with a fully populated
> trie (so that every lookup succeeds), I measured a peak throughput of
> 391ns/query.
>
> This lookup code suffers from bounds checking (which are unnecessary given
> how the trie constrains its 'jumps' from a previous to a current offset,
> but the JVM can't possibly know about this) and using Unsafe directly gives
> a pretty decent boost, but still can't match a C++ counterpart.
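
As an aside, the "Unsafe directly" variant mentioned above can be sketched
like this (hypothetical class and method names; sun.misc.Unsafe obtained via
the usual reflection trick). Raw address arithmetic sidesteps the per-access
bounds checks that MemorySegment performs:

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Hypothetical sketch, not the poster's actual code: the same (base, check)
// pair layout accessed through raw addresses instead of a MemorySegment.
public class UnsafeTrieAccess {
    static final Unsafe U;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            U = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // base/check over a raw address: no bounds check, no liveness check
    static int base(long addr, long offset)  { return U.getInt(addr + (offset << 3)); }
    static int check(long addr, long offset) { return U.getInt(addr + 4L + (offset << 3)); }

    public static void main(String[] args) {
        long addr = U.allocateMemory(16);  // room for two (base, check) pairs
        U.putInt(addr, 42);                // base[0]
        U.putInt(addr + 4L, 7);            // check[0]
        System.out.println(base(addr, 0)); // prints 42
        System.out.println(check(addr, 0));// prints 7
        U.freeMemory(addr);
    }
}
```

The obvious trade-off is that every safety net is gone: an out-of-range
offset is undefined behavior rather than an IndexOutOfBoundsException.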
>
> Based on the lookup code, I created an assembly version that receives the
> segment's raw addresses and the length of the string, compiled it with
> nasm, and created a shared library with ld:
>
> nasm -felf64 trie.asm -o trie.o
> ld -z noexecstack -shared trie.o -o libtrie.so
>
> Then generated the stubs with jextract:
>
> jextract -t trie.asm --source -ltrie trie.h
>
> trie.h declares the lookup function and a noop (xor rax, rax; ret) which I
> used as a baseline to measure the 3-arg call overhead, which in my
> environment is about ~18ns/call.
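
As a side note on methodology: without JMH, a per-call figure like the
~18ns/call above is typically obtained by timing a warmed-up tight loop and
dividing by the iteration count. A minimal self-contained sketch (the noop
below is a hypothetical pure-Java stand-in for the jextract-generated stub):

```java
// Illustrative harness only, not the poster's benchmark: times a 3-arg
// no-op call in a tight loop and reports the average nanoseconds per call.
public class CallOverhead {
    // stand-in for the native noop stub; a real run would invoke the
    // jextract-generated binding here instead
    static long noop(long a, long b, long c) { return 0; }

    public static void main(String[] args) {
        final int iters = 10_000_000;
        long sink = 0;
        for (int i = 0; i < iters; i++) sink += noop(i, i, i); // warmup, trigger JIT
        long t0 = System.nanoTime();
        for (int i = 0; i < iters; i++) sink += noop(i, i, i);
        long t1 = System.nanoTime();
        // print sink so the JIT cannot prove the loop result unused
        System.out.println("ns/call ~= " + (double) (t1 - t0) / iters + " (sink=" + sink + ")");
    }
}
```

For a pure-Java noop the JIT may still fold much of the loop away, which is
exactly the kind of distortion JMH's Blackhole exists to prevent; for a real
downcall stub the call itself is opaque, so the average is more trustworthy.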
>
> A compiled C++ code linked against libtrie.so manages to achieve
> ~210ns/query.
>
> My naive expectation was that on the Java side I would get, roughly, C
> throughput + 20ns ≃ 230ns/call, but so far I have the following:
>
> (best out of 10 loops)
> jdk-17 branch
> -native - 288 ns/query
> -unsafe - 282 ns/query
>
> jdk-19 branch
> -native - 372 ns/query
> -memory-segments - 391 ns/query
> -unsafe - 317 ns/query
>
> I did not use JMH, but the tests were executed with EpsilonGC and the hot
> paths are allocation-free.
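
For reference, a typical EpsilonGC invocation looks like the following
(flag names as in mainline HotSpot; heap sizing and the BenchMain class name
are illustrative):

```shell
# Epsilon is a no-op collector: the heap is never reclaimed, so the
# measured code must be allocation-free or fit entirely within -Xmx.
java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC \
     -Xms4g -Xmx4g -XX:+AlwaysPreTouch \
     -cp . BenchMain
```

AlwaysPreTouch commits the heap up front, keeping page faults out of the
measured region.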
>
> So I wonder, on the JVM side, is there any difference in the intrinsic call
> overhead when invoking a noop vs a real function (both with the same type
> of arguments)? If not, how can I profile the culprit? I tried
> async-profiler but the flame graph seems dominated by the assembly labels
> of libtrie.so (especially the inc3)
>
> Is the ~30% degradation of jdk-19 when compared to jdk-17 expected?
>
> Cheers!
More information about the panama-dev mailing list