[foreign] Measuring downcall stub performance

Cleber Muramoto cleber.muramoto at gmail.com
Tue Feb 8 17:47:09 UTC 2022


Thank you Maurizio, nice catch on the loop shape logic!

How did you remove the native thread transition? Does it require modifying
and recompiling the JDK?

I put together some JMH benchmarks as well, using
@BenchmarkMode(Mode.AverageTime), and created a jdk-17 branch. The results
are very similar; I was probably messing something up in my previous comparison.

# VM version: JDK 17-panama, OpenJDK 64-Bit Server VM, 17-panama+3-167
Benchmark                 Mode  Cnt    Score    Error  Units
EstimateCallOverhead.run  avgt    3    9.202 ±  0.145  ns/op
Reader.run                avgt    3  353.020 ± 20.584  ns/op
ReaderNative.run          avgt    3  281.826 ±  3.419  ns/op
ReaderUnsafe.run          avgt    3  298.737 ± 21.465  ns/op

# VM version: JDK 19-panama, OpenJDK 64-Bit Server VM, 19-panama+1-13
Benchmark                 Mode  Cnt    Score    Error  Units
EstimateCallOverhead.run  avgt    3    9.778 ±  0.348  ns/op
Reader.run                avgt    3  356.468 ± 22.297  ns/op
ReaderNative.run          avgt    3  289.500 ±  8.928  ns/op
ReaderUnsafe.run          avgt    3  293.697 ±  3.299  ns/op
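
For reference, the call-overhead benchmark is shaped roughly like this (a
minimal sketch, not the exact code in the branch; the trie_h binding name
and the 3-long-arg noop signature are assumptions based on the jextract
setup described further below):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class EstimateCallOverhead {

  @Benchmark
  public long run() {
    // noop is just 'xor rax, rax; ret' in libtrie.so, so this measures
    // little beyond the downcall stub itself
    return trie_h.noop(0L, 0L, 0L);
  }
}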


Regards

On Tue, Feb 8, 2022 at 1:11 PM Maurizio Cimadamore <
maurizio.cimadamore at oracle.com> wrote:

> Hi,
> I played with the benchmark some more, and I ended up converting it into a
> single JMH benchmark, extending it and adding other variants (e.g. byte
> buffer, memory address).
>
> The benchmark can be found here:
>
> http://cr.openjdk.java.net/~mcimadamore/panama/TrieBenchmark.java
>
> The numbers I get out of JMH are similar to those generated by your
> harness; my results are below (tested against latest Java 19):
>
> Benchmark                      Mode  Cnt     Score    Error  Units
> TrieBenchmark.address_seek     avgt   30  2561.813 ± 24.697  ms/op
> TrieBenchmark.buffer_seek      avgt   30  3402.664 ± 32.409  ms/op
> TrieBenchmark.native_seek      avgt   30  3251.644 ± 48.709  ms/op
> TrieBenchmark.native_seek_alt  avgt   30  2405.684 ± 14.005  ms/op
> TrieBenchmark.segment_seek     avgt   30  3390.148 ± 15.195  ms/op
> TrieBenchmark.unsafe_seek      avgt   30  2537.085 ± 20.125  ms/op
>
>
> (note that each "op" in JMH is a full round of 10M seeks, so to get the
> average time per query in nanoseconds you can just divide the numbers
> above by 10; e.g. 2561.813 ms/op ≈ 256 ns/query).
>
> So, both ByteBuffer and MemorySegment are the slowest. This is, as you
> posited, because of the bound checks. There's no way C2 can optimize those
> checks away, as the pattern of access is essentially inscrutable to C2
> (e.g. we read an address from a segment and then use that dynamically read
> address to access another location).
>
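> To illustrate with a made-up sketch (these are not the benchmark methods):
> in a counted loop C2 can hoist and eliminate the bound check, but here
> every offset is itself loaded from the segment, so each access has to be
> checked individually:
>
> // hoistable: C2 proves i stays in [0, n) and can drop per-element checks
> long sum = 0;
> for (long i = 0; i < n; i++) {
>     sum += segment.get(JAVA_INT, i * 4);
> }
>
> // not hoistable: each offset is loaded from the segment itself, so C2
> // cannot prove anything about its range and must check every access
> long off = start;
> while (off >= 0) {
>     sum += segment.get(JAVA_INT, off + 4);
>     off = segment.get(JAVA_INT, off); // dynamically read address
> }
>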
> Unsafe and MemoryAddress, by dropping bound checks, receive a boost, and
> end up with similar numbers.
>
> The first native reader, which is basically a copy of the code you
> included, is indeed slower than expected: it seems as slow as using
> buffers or segments. (Note that running the included C++ benchmark I get
> ~190ns/op.)
>
> I verified there's no GC allocation (-prof gc) and also tried to profile
> (-prof perfasm), but I could only see (as you did) that most time was spent
> in the assembly. So, I thought, what if the native call is not the issue?
> What if the problem is in the surrounding Java code, the one that looks for
> a terminator char?
> So I looked at the C++ code more closely and realized that the logic of
> the loop was written in a slightly different way; basically there are two
> loops, an outer loop and an inner loop which only focuses on finding the
> terminator. This makes the logic a bit more linear. So I copied that logic
> into `native_seek_alt` and got a massive boost.
>
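> In sketch form (illustrative names, not the actual benchmark code), the
> alt version hoists the terminator scan into its own tight method and only
> then makes the native call, once per line:
>
> static long findSeparator(MemorySegment data, long pos, long end) {
>     // tight, sequential scan over consecutive positions
>     while (pos < end && data.get(JAVA_BYTE, pos) != '\n') pos++;
>     return pos;
> }
>
> long pos = 0;
> while (pos < end) {
>     long sep = findSeparator(data, pos, end);
>     nativeLookup(dataAddr + pos, sep - pos); // hypothetical stub call
>     pos = sep + 1;
> }
>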
> Initially I thought it might have been an issue with the rewrite, but
> then I put together a test which ran both the old and the new versions and
> ensured that the native function was called (a) the same number of times
> and (b) with the same parameters (and, if not, throw an exception). And, lo
> and behold, no exceptions were thrown.
>
> This seems to get close to the results you were looking for. I don't have
> a really good explanation as to why changing the outer loop affects things
> so much. I imagine that the Java version of the "findSeparator" routine is
> more easily optimized by C2 (since it's a tight loop which accesses
> consecutive positions). I wonder if the presence of the native call in the
> middle of the loop in the base version adds some barriers which then make
> the sequential scan for the separator less optimal. If I replace the call
> in the "seek" function with your "noop" function, I go from 22ns/op to
> 10ns/op, which is ~2x boost. But, if I just do a loop where I call "noop"
> in a loop 10M times, I get down to ~7ns/op. Which seems to indicate that
> 7ns/op is (on my machine) the overhead for a native call (which is
> compatible with our other microbenchmarks), and, in the original version,
> 12ns were spent _outside_ the native call (which, in turn, explains why
> changing the loop shape affects things so much).
>
> I also played a bit with removing native thread transitions (as Java 19
> allows doing, by tweaking some code). With that, the overhead per call
> goes down to 1.5ns/op and the overall result for native_seek_alt is
> ~220ns/op. So, transitions make some difference here (expected, given that
> we do native calls back to back), but the overall impact of transitions on
> the throughput remains low.
>
> I hope this helps.
>
> Maurizio
>
>
> On 05/02/2022 17:01, Cleber Muramoto wrote:
>
> Hello, I'm trying to understand the overhead of calling a downcall stub
> generated by jextract.
>
> The source I'm using can be found on https://github.com/cmuramoto/test-jextract
>
> Basically, I have the following piece of code
>
> MemorySegment data = loadFile(); // mmaps a text file with keywords,
>                                  // one per line ('\n' = separator)
> MemorySegment trie = loadTrie(); // mmaps a trie data structure that
>                                  // maps string -> int
>
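> loadFile is essentially an mmap; a minimal sketch using the 17-era
> incubator API (the mapping method moved to FileChannel.map in later
> builds):
>
> import java.nio.channels.FileChannel;
> import java.nio.file.*;
> import jdk.incubator.foreign.*;
>
> static MemorySegment loadFile(Path path) throws java.io.IOException {
>   return MemorySegment.mapFile(path, 0, Files.size(path),
>       FileChannel.MapMode.READ_ONLY, ResourceScope.globalScope());
> }
>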
> The trie is a contiguous array of (base, check) int pairs, which are
> fetched using:
>
> int base(MemorySegment ms, long offset) {
>   // each pair occupies 8 bytes; 'base' is the first int of the pair
>   return ms.get(JAVA_INT, offset << 3);
> }
>
> int check(MemorySegment ms, long offset) {
>   // 'check' is the second int of the pair
>   return ms.get(JAVA_INT, 4L + (offset << 3));
> }
>
> And the lookup method is as follows:
>
> // finds the value associated with the key data.slice(pos, end - pos)
> long lookup(MemorySegment trie, MemorySegment data, int pos, int end) {
>   var from = 0L;
>   var to = 0L;
>
>   while (pos < end) {
>     to = ((long) base(trie, from)) ^ (data.get(JAVA_BYTE, pos) & 0xFF);
>     if (check(trie, to) != from) {
>       return ABSENT; // 1L << 33
>     }
>     from = to;
>     pos++;
>   }
>
>   to = base(trie, from);
>   var check = check(trie, to);
>
>   if (check != i32(from)) { // i32 truncates the long to an int
>     return NO_VALUE; // 1L << 32
>   }
>   return base(trie, to);
> }
>
> Running this code on a file with ~10 million lines against a fully
> populated trie (so that every lookup succeeds), the best I measured was
> 391ns/query.
>
> This lookup code suffers from bounds checks (which are unnecessary, given
> how the trie constrains its 'jumps' from a previous to a current offset,
> but the JVM can't possibly know this). Using Unsafe directly gives a
> pretty decent boost, but still can't match the C++ counterpart.
>
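> The Unsafe variant replaces the segment accessors with raw reads at the
> mapped base address; a sketch (the usual reflective access to
> sun.misc.Unsafe, illustrative names):
>
> import java.lang.reflect.Field;
> import sun.misc.Unsafe;
>
> final class RawTrie {
>   static final Unsafe U;
>   static {
>     try {
>       Field f = Unsafe.class.getDeclaredField("theUnsafe");
>       f.setAccessible(true);
>       U = (Unsafe) f.get(null);
>     } catch (ReflectiveOperationException e) {
>       throw new ExceptionInInitializerError(e);
>     }
>   }
>
>   // raw loads at trieAddr + byte offset: no bound or liveness checks
>   static int base(long trieAddr, long offset) {
>     return U.getInt(trieAddr + (offset << 3));
>   }
>
>   static int check(long trieAddr, long offset) {
>     return U.getInt(trieAddr + 4L + (offset << 3));
>   }
> }
>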
> Based on the lookup code, I created an assembly version that receives the
> segments' raw addresses and the length of the string, compiled it with
> nasm, and created a shared library with ld:
>
> nasm -felf64 trie.asm -o trie.o
> ld -z noexecstack -shared trie.o -o libtrie.so
>
> Then I generated the stubs with jextract:
>
> jextract -t trie.asm --source -ltrie trie.h
>
> trie.h declares the lookup function and a noop (xor rax, rax; ret), which
> I used as a baseline to measure the overhead of a 3-arg call; in my
> environment it is about ~18ns/call.
>
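> From Java, the generated bindings are then invoked with raw addresses,
> roughly like this (a sketch against the 17-era incubator API; trie_h is
> the class jextract derives from trie.h):
>
> long trieAddr = trie.address().toRawLongValue();
> long dataAddr = data.address().toRawLongValue();
>
> // key = data.slice(pos, end - pos), passed as (address, length)
> long r = trie_h.lookup(trieAddr, dataAddr + pos, end - pos);
>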
> A compiled C++ code linked against libtrie.so manages to achieve
> ~210ns/query.
>
> My naive expectation was that on the Java side I would get, roughly, C
> throughput + 20ns ≃ 230ns/call, but so far I have the following:
>
> (best out of 10 loops)
> jdk-17 branch
> -native - 288 ns/query
> -unsafe - 282 ns/query
>
> jdk-19 branch
> -native - 372 ns/query
> -memory-segments - 391 ns/query
> -unsafe - 317 ns/query
>
> I did not use JMH, but the tests were executed with EpsilonGC and the hot
> paths are allocation-free.
>
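> (For reference, EpsilonGC is enabled with -XX:+UnlockExperimentalVMOptions
> -XX:+UseEpsilonGC, which takes GC pauses out of the picture entirely.)
>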
> So I wonder, on the JVM side, is there any difference in the intrinsic call
> overhead when invoking a noop vs a real function (both with the same type
> of arguments)? If not, how can I profile the culprit? I tried
> async-profiler, but the flame graph seems dominated by the assembly labels
> of libtrie.so (especially the inc3).
>
> Is the ~30% degradation of jdk-19 when compared to jdk-17 expected?
>
> Cheers!
>
>

