[foreign] Measuring downcall stub performance
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Mon Feb 7 12:18:19 UTC 2022
On 07/02/2022 12:12, Radosław Smogura wrote:
> Hi,
>
> This is just a wild guess.
>
> I wonder if the issues can be related to the usage of shared scopes,
> which use the volatile, loops with volatile may be hard to optimize.
>
> Did you consider using non-shared scopes (or derive somehow confined
> scopes from the shared scope in case of true multithreaded application)?
Not sure - when accessing memory, there's no volatile access anywhere.
You get volatile semantics only when closing - but AFAIK, the benchmark
creates a single global scope which is then "leaked". So I don't think
scope can be an issue here.
(hard to say though, given that I can't manage to get the bits running
:-) ).
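
For reference, a minimal sketch of the scope flavours involved (JDK 18
incubator API; the segment size is made up):

    import jdk.incubator.foreign.MemorySegment;
    import jdk.incubator.foreign.ResourceScope;
    import jdk.incubator.foreign.ValueLayout;

    // confined: single-thread access, deterministic close, no handshake
    try (var confined = ResourceScope.newConfinedScope()) {
        var seg = MemorySegment.allocateNative(64, confined);
        seg.set(ValueLayout.JAVA_INT, 0, 42); // plain, non-volatile access
    }

    // shared: any thread may access it; closing needs a handshake, but
    // ordinary reads and writes are still plain accesses
    var shared = ResourceScope.newSharedScope();

    // global: never closed, so a liveness check can never fail - this is
    // what the benchmark's "leaked" scope amounts to
    var global = ResourceScope.globalScope();
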
Maurizio
>
> Best regards,
> Rado
> ------------------------------------------------------------------------
> *From:* panama-dev <panama-dev-retn at openjdk.java.net> on behalf of
> Maurizio Cimadamore <maurizio.cimadamore at oracle.com>
> *Sent:* Monday, February 7, 2022 12:56
> *To:* Cleber Muramoto <cleber.muramoto at gmail.com>;
> panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
> *Subject:* Re: [foreign] Measuring downcall stub performance
> Forgot to include a link:
>
> [1] -
> https://mail.openjdk.java.net/pipermail/panama-dev/2021-December/015915.html
>
> Btw, I tried your benchmark, but it crashes left and right on my machine
> (Linux, x64).
>
> Seems like some out-of-bounds memory access - I get an exception in the
> MemorySegment test and a plain crash when using Unsafe. It seems like
> calls to base() end up returning very big offsets which are then out of
> bounds in the segments where they are used (the data segment is 133
> bytes, while the trie file is 134 bytes). Not sure if something got
> truncated.
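>
> (That difference in failure mode is expected: segment accesses are
> bounds-checked, raw Unsafe reads are not. A sketch with made-up sizes
> and a hypothetical 'scope' and 'unsafe':
>
> var seg = MemorySegment.allocateNative(133, scope);
> seg.get(ValueLayout.JAVA_INT, 200); // throws IndexOutOfBoundsException
> // unsafe.getInt(rawAddress + 200) just reads past the allocation:
> // garbage at best, a VM crash at worst
> )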
>
> Maurizio
>
> On 07/02/2022 11:16, Maurizio Cimadamore wrote:
> > Hi,
> > your estimate of 18ns/call seems accurate. We have a JMH benchmark
> > which measures no-op calls, and the results are similar.
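> >
> > The shape of that benchmark, roughly (a sketch against the JDK 18
> > incubator API, assuming a no-op function "noop" exported by an
> > already-loaded library):
> >
> > // imports from jdk.incubator.foreign.*
> > static final MethodHandle NOOP = CLinker.systemCLinker().downcallHandle(
> >     SymbolLookup.loaderLookup().lookup("noop").get(),
> >     FunctionDescriptor.of(ValueLayout.JAVA_LONG));
> >
> > @Benchmark
> > public long noopCall() throws Throwable {
> >     return (long) NOOP.invokeExact();
> > }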
> >
> > We'd need to investigate re. bounds checks (but there is some work
> > going on in that area, see [1]) and, more importantly, the performance
> > regression w.r.t. 17.
> >
> > There are no VM changes from 17 to 18 - the only thing that changed
> > is that we now also check by-reference parameters in downcalls for
> > liveness - but in this case you're using MemoryAddress parameters, so
> > your code is not impacted by that.
> >
> > It's also interesting that the Unsafe path got slower as well in the
> > transition from 17 to 19, which is odd.
> >
> > It would also help if the native benchmark had corresponding C/C++
> > code, for portability.
> >
> > Thanks
> > Maurizio
> >
> > On 05/02/2022 17:01, Cleber Muramoto wrote:
> >> Hello, I'm trying to understand the overhead of calling a downcall stub
> >> generated by jextract.
> >>
> >> The source I'm using can be found on
> >> https://github.com/cmuramoto/test-jextract
> >>
> >> Basically, I have the following piece of code
> >>
> >> MemorySegment data = loadFile(); // mmaps a text file with keywords,
> >>                                  // one per line ('\n' = separator)
> >> MemorySegment trie = loadTrie(); // mmaps a trie data structure that
> >>                                  // maps string -> int
> >>
> >> The trie is a contiguous array of (int, int) pairs, which are fetched
> >> using:
> >>
> >> int base(MemorySegment ms, long offset) {
> >>   return ms.get(JAVA_INT, offset << 3);        // first int of the pair
> >> }
> >>
> >> int check(MemorySegment ms, long offset) {
> >>   return ms.get(JAVA_INT, 4L + (offset << 3)); // second int of the pair
> >> }
> >>
> >> And the lookup method is as follows:
> >>
> >> // finds the value associated with the key data.slice(pos, end - pos)
> >> long lookup(MemorySegment trie, MemorySegment data, int pos, int end) {
> >>   var from = 0L;
> >>   var to = 0L;
> >>
> >>   while (pos < end) {
> >>     // transition: to = base[from] ^ next input byte
> >>     to = ((long) base(trie, from)) ^ (data.get(JAVA_BYTE, pos) & 0xFF);
> >>     if (check(trie, to) != from) { // not a valid transition
> >>       return ABSENT; // 1L << 33
> >>     }
> >>     from = to;
> >>     pos++;
> >>   }
> >>
> >>   // key fully consumed; follow the leaf slot to read the value
> >>   to = base(trie, from);
> >>   var check = check(trie, to);
> >>
> >>   if (check != i32(from)) { // i32: truncating cast to int
> >>     return NO_VALUE; // 1L << 32
> >>   }
> >>   return base(trie, to);
> >> }
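> >>
> >> (Since base() returns an int, callers can tell the two sentinels
> >> apart from real hits by magnitude - a hypothetical usage sketch:
> >>
> >> long r = lookup(trie, data, pos, end);
> >> if (r != ABSENT && r != NO_VALUE) {
> >>   int value = (int) r; // the payload always fits in an int
> >> }
> >> )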
> >>
> >> Running this code on a file with ~10 million lines against a fully
> >> populated trie (so that every lookup succeeds), I measured a peak
> >> throughput of 391ns/query.
> >>
> >> This lookup code suffers from bounds checks (which are unnecessary,
> >> given how the trie constrains its 'jumps' from a previous to a current
> >> offset, but the JVM can't possibly know that). Using Unsafe directly
> >> gives a pretty decent boost, but still can't match the C++ counterpart.
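> >>
> >> The Unsafe variant is sketched below (hypothetical helpers; the raw
> >> address is captured once via the JDK 18 incubator API, after which no
> >> bounds checks apply):
> >>
> >> static final Unsafe U;
> >> static {
> >>   try {
> >>     var f = Unsafe.class.getDeclaredField("theUnsafe");
> >>     f.setAccessible(true);
> >>     U = (Unsafe) f.get(null);
> >>   } catch (ReflectiveOperationException e) {
> >>     throw new ExceptionInInitializerError(e);
> >>   }
> >> }
> >>
> >> long trieAddr = trie.address().toRawLongValue();
> >>
> >> int base(long trieAddr, long offset) {
> >>   return U.getInt(trieAddr + (offset << 3));
> >> }
> >>
> >> int check(long trieAddr, long offset) {
> >>   return U.getInt(trieAddr + 4L + (offset << 3));
> >> }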
> >>
> >> Based on the lookup code, I created an assembly version that receives
> >> the segments' raw addresses and the length of the string, compiled it
> >> with nasm, and created a shared library with ld:
> >>
> >> nasm -felf64 trie.asm -o trie.o
> >> ld -z noexecstack -shared trie.o -o libtrie.so
> >>
> >> Then generated the stubs with jextract:
> >>
> >> jextract -t trie.asm --source -ltrie trie.h
> >>
> >> trie.h declares the lookup function and a noop (xor rax, rax; ret),
> >> which I used as a baseline to measure the 3-arg call overhead; in my
> >> environment it is about ~18ns/call.
> >>
> >> C++ code compiled and linked against libtrie.so manages to achieve
> >> ~210ns/query.
> >>
> >> My naive expectation was that on the Java side I would get, roughly, C
> >> throughput + 20ns ≃ 230ns/call, but so far I have the following:
> >>
> >> (best out of 10 loops)
> >> jdk-17 branch
> >> -native - 288 ns/query
> >> -unsafe - 282 ns/query
> >>
> >> jdk-19 branch
> >> -native - 372 ns/query
> >> -memory-segments - 391 ns/query
> >> -unsafe - 317 ns/query
> >>
> >> I did not use JMH, but the tests were executed with EpsilonGC and the
> >> hot paths are allocation-free.
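> >>
> >> (For reference, EpsilonGC is enabled with
> >>
> >> java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC ...
> >>
> >> and since Epsilon never reclaims memory, an allocating hot path would
> >> eventually exhaust the heap - so the runs double as an allocation
> >> check.)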
> >>
> >> So I wonder: on the JVM side, is there any difference in the intrinsic
> >> call overhead when invoking a noop vs a real function (both with the
> >> same types of arguments)? If not, how can I profile the culprit? I
> >> tried async-profiler, but the flame graph seems dominated by the
> >> assembly labels of libtrie.so (especially the inc3).
> >>
> >> Is the ~30% degradation of jdk-19 when compared to jdk-17 expected?
> >>
> >> Cheers!