[foreign] Measuring downcall stub performance

Cleber Muramoto cleber.muramoto at gmail.com
Mon Feb 7 13:36:35 UTC 2022


Hi Maurizio.

Probably git LFS messed up. The keys and trie files should be 95MB and
280MB, respectively.

You'll find the files in this drive link:

https://drive.google.com/drive/folders/1x-imW6MLZNdcJSolmG2B0B06OCYKLU8i

The C++ benchmark code can be found under the native folder (bench.cpp);
just run

./compile.bench.sh shared
./run.shared.sh

I found it odd too that Unsafe got slower in comparison to jdk-17. I'll
create a branch with the corresponding code.

Regards

On Mon, Feb 7, 2022 at 8:16 AM Maurizio Cimadamore <
maurizio.cimadamore at oracle.com> wrote:

> Hi,
> your estimate of 18ns/call seems accurate. We do have a JMH benchmark
> which measures no-op calls, and the results are similar.
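>
> Roughly speaking, that benchmark just spins on a no-op downcall handle.
> A minimal sketch (not the actual benchmark source; this uses the 19
> preview API, so names differ slightly on 17/18) looks like:
>
> import java.lang.foreign.*;
> import java.lang.invoke.MethodHandle;
> import org.openjdk.jmh.annotations.*;
>
> @State(Scope.Thread)
> public class NoopDowncall {
>     static { System.loadLibrary("trie"); }
>
>     // 3-arg no-op: (address, address, long) -> long
>     static final MethodHandle NOOP = Linker.nativeLinker().downcallHandle(
>         SymbolLookup.loaderLookup().lookup("noop").orElseThrow(),
>         FunctionDescriptor.of(ValueLayout.JAVA_LONG,
>             ValueLayout.ADDRESS, ValueLayout.ADDRESS, ValueLayout.JAVA_LONG));
>
>     @Benchmark
>     public long noop() throws Throwable {
>         return (long) NOOP.invoke(MemoryAddress.NULL, MemoryAddress.NULL, 0L);
>     }
> }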
>
> We'd need to investigate re. bounds checks (but there is some work going
> on in that area, see [1]) and, more importantly, the performance regression
> w.r.t. 17.
>
> There are no VM changes from 17 to 18 - the only thing that changed was
> that we now also check by-reference parameters in downcalls for liveness
> - but in this case you're using MemoryAddress parameters, so your code
> is not affected by that.
>
> It's also interesting that the Unsafe path got slower in the
> transition from 17 to 19, which is odd.
>
> It would also help if the native benchmark had corresponding C/C++ code,
> for portability.
>
> Thanks
> Maurizio
>
> On 05/02/2022 17:01, Cleber Muramoto wrote:
> > Hello, I'm trying to understand the overhead of calling a downcall stub
> > generated by jextract.
> >
> > The source I'm using can be found on
> > https://github.com/cmuramoto/test-jextract
> >
> > Basically, I have the following piece of code
> >
> > MemorySegment data = loadFile(); // mmaps a text file with keywords, 1 per
> > line (\n = separator)
> > MemorySegment trie = loadTrie(); // mmaps a trie data structure that maps
> > string->int
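> >
> > loadFile/loadTrie are just read-only mmaps; on the jdk-19 branch they
> > would look roughly like the sketch below (the 17 incubator API spells
> > this as MemorySegment.mapFile instead):
> >
> > import java.io.IOException;
> > import java.lang.foreign.*;
> > import java.nio.channels.FileChannel;
> > import java.nio.file.*;
> >
> > // maps the whole file as a read-only native segment
> > static MemorySegment mapReadOnly(Path path) throws IOException {
> >   try (var ch = FileChannel.open(path, StandardOpenOption.READ)) {
> >     return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(),
> >                   MemorySession.global());
> >   }
> > }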
> >
> > The trie is a contiguous array of pairs (int, int), which are
> > fetched using
> >
> > // each (base, check) pair occupies 8 bytes, so offset << 3 is the pair's
> > // byte offset within the segment
> > int base(MemorySegment ms, long offset) {
> >    return ms.get(JAVA_INT, offset << 3);
> > }
> >
> > int check(MemorySegment ms, long offset) {
> >    return ms.get(JAVA_INT, 4L + (offset << 3));
> > }
> >
> > And the lookup method is as follows
> >
> > // finds the value associated with the key data.slice(pos, end - pos)
> > long lookup(MemorySegment trie, MemorySegment data, int pos, int end) {
> >    var from = 0L;
> >    var to = 0L;
> >
> >    while (pos < end) {
> >      to = ((long) base(trie, from)) ^ (data.get(JAVA_BYTE, pos) & 0xFF);
> >      if (check(trie, to) != from) {
> >        return ABSENT; // 1L << 33
> >      }
> >      from = to;
> >      pos++;
> >    }
> >
> >    to = base(trie, from);
> >    var check = check(trie, to);
> >
> >    if (check != i32(from)) { // i32 = narrowing (int) cast
> >      return NO_VALUE; // 1L << 32
> >    }
> >    return base(trie, to); // the value lives in the base slot of the final node
> > }
> >
> > Running this code on a file with ~10 million lines with a fully populated
> > trie (so that every lookup succeeds), I measured a peak throughput of
> > 391ns/query.
> >
> > This lookup code suffers from bounds checks (which are unnecessary given
> > how the trie constrains its 'jumps' from a previous to a current offset,
> > but the JVM can't possibly know about this). Using Unsafe directly gives
> > a pretty decent boost, but still can't match the C++ counterpart.
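> >
> > The unsafe branch basically replaces the accessors with raw reads off
> > the segments' base addresses - roughly this, modulo naming:
> >
> > import java.lang.reflect.Field;
> > import sun.misc.Unsafe;
> >
> > static final Unsafe U;
> > static {
> >   try {
> >     Field f = Unsafe.class.getDeclaredField("theUnsafe");
> >     f.setAccessible(true);
> >     U = (Unsafe) f.get(null);
> >   } catch (ReflectiveOperationException e) {
> >     throw new ExceptionInInitializerError(e);
> >   }
> > }
> >
> > // trieAddr = trie.address().toRawLongValue(), cached once
> > static int base(long trieAddr, long offset) {
> >   return U.getInt(trieAddr + (offset << 3));
> > }
> >
> > static int check(long trieAddr, long offset) {
> >   return U.getInt(trieAddr + 4L + (offset << 3));
> > }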
> >
> > Based on the lookup code, I created an assembly version that receives the
> > segments' raw addresses and the length of the string, compiled it with
> > nasm, and created a shared library with ld:
> >
> > nasm -felf64 trie.asm -o trie.o
> > ld -z noexecstack -shared trie.o -o libtrie.so
> >
> > Then generated the stubs with jextract:
> >
> > jextract -t trie.asm --source -ltrie trie.h
> >
> > trie.h declares the lookup function and a noop (xor rax, rax; ret), which I
> > used as a baseline to measure the 3-arg call overhead, which in my
> > environment is about 18ns/call.
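> >
> > Without jextract, the hand-rolled binding on the jdk-19 branch would look
> > roughly like this (a sketch; trie, data and len as above - the generated
> > stub is what the benchmark actually calls):
> >
> > System.loadLibrary("trie");
> > var linker = Linker.nativeLinker();
> > MethodHandle lookupFn = linker.downcallHandle(
> >     SymbolLookup.loaderLookup().lookup("lookup").orElseThrow(),
> >     FunctionDescriptor.of(ValueLayout.JAVA_LONG,
> >         ValueLayout.ADDRESS, ValueLayout.ADDRESS, ValueLayout.JAVA_LONG));
> >
> > long v = (long) lookupFn.invoke(trie.address(), data.address(), (long) len);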
> >
> > A compiled C++ program linked against libtrie.so manages to achieve
> > ~210ns/query.
> >
> > My naive expectation was that on the Java side I would get, roughly, C
> > throughput + 20ns ≃ 230ns/call, but so far I have the following:
> >
> > (best out of 10 loops)
> > jdk-17 branch
> > -native - 288 ns/query
> > -unsafe - 282 ns/query
> >
> > jdk-19 branch
> > -native - 372 ns/query
> > -memory-segments - 391 ns/query
> > -unsafe - 317 ns/query
> >
> > I did not use JMH, but the tests were executed with EpsilonGC and the hot
> > paths are allocation-free.
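> > (That is, the JVM was presumably started with something like
> > -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC.)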
> >
> > So I wonder, on the JVM side, is there any difference in the intrinsic call
> > overhead when invoking a noop vs a real function (both with the same type
> > of arguments)? If not, how can I profile the culprit? I tried
> > async-profiler but the flame graph seems dominated by the assembly labels
> > of libtrie.so (especially the inc3)
> >
> > Is the ~30% degradation of jdk-19 when compared to jdk-17 expected?
> >
> > Cheers!
>

