[foreign] Measuring downcall stub performance

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Tue Feb 8 18:22:16 UTC 2022


Hi,
this is the crude patch I used to disable state transitions.

diff --git 
a/src/java.base/share/classes/jdk/internal/invoke/NativeEntryPoint.java 
b/src/java.base/share/classes/jdk/internal/invoke/NativeEntryPoint.java
index 12e832a582d..2b027d5186e 100644
--- a/src/java.base/share/classes/jdk/internal/invoke/NativeEntryPoint.java
+++ b/src/java.base/share/classes/jdk/internal/invoke/NativeEntryPoint.java
@@ -52,7 +52,7 @@ public class NativeEntryPoint {
          this.shadowSpace = shadowSpace;
          this.argMoves = Objects.requireNonNull(argMoves);
          this.returnMoves = Objects.requireNonNull(returnMoves);
-        this.needTransition = needTransition;
+        this.needTransition = false;
          this.methodType = methodType;
          this.name = name;
      }

The C linker machinery is able to omit state transitions - but we don't 
currently expose a way for developers to select that behavior (we might 
in the future). In general, unless you are doing pretty specific stuff, 
the cost of the transition (~5ns) is negligible compared to everything 
else going on in the application.
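(For readers on newer JDKs: a knob for this did eventually surface as `Linker.Option.critical` in JDK 22's finalized FFM API. A minimal hedged sketch, calling libc's `strlen` with the transition-skipping hint; the option and `allocateFrom` are JDK 22+ APIs, not the API discussed in this 2022 thread:)

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

public class CriticalCall {
    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();
        MemorySegment strlenAddr =
                linker.defaultLookup().find("strlen").orElseThrow();
        // Linker.Option.critical marks the function as short-running,
        // letting the linker elide the usual thread-state transition.
        MethodHandle strlen = linker.downcallHandle(
                strlenAddr,
                FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS),
                Linker.Option.critical(false));
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment s = arena.allocateFrom("hello"); // NUL-terminated
            long len = (long) strlen.invokeExact(s);
            System.out.println(len); // 5
        }
    }
}
```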

But of course, if you want to compare Java code with C++, that cost has 
to be factored into the equation as well (which is why I tried, just for 
fun, to see how close we could get).

Maurizio


On 08/02/2022 17:47, Cleber Muramoto wrote:
> Thank you Maurizio, nice catch on the loop shape logic!
>
> How did you remove the native thread transition? Does it require 
> modifying and recompiling the JDK?
>
> I put together some JMH benchmarks as well, using 
> @BenchmarkMode(Mode.AverageTime), and created a jdk-17 branch. The 
> results are very similar; I was probably messing something up in my 
> previous comparison.
>
> # VM version: JDK 17-panama, OpenJDK 64-Bit Server VM, 17-panama+3-167
> Benchmark                 Mode  Cnt    Score    Error  Units
> EstimateCallOverhead.run  avgt    3    9.202 ±  0.145  ns/op
> Reader.run                avgt    3  353.020 ± 20.584  ns/op
> ReaderNative.run          avgt    3  281.826 ±  3.419  ns/op
> ReaderUnsafe.run          avgt    3  298.737 ± 21.465  ns/op
>
> # VM version: JDK 19-panama, OpenJDK 64-Bit Server VM, 19-panama+1-13
> Benchmark                 Mode  Cnt    Score    Error  Units
> EstimateCallOverhead.run  avgt    3    9.778 ±  0.348  ns/op
> Reader.run                avgt    3  356.468 ± 22.297  ns/op
> ReaderNative.run          avgt    3  289.500 ±  8.928  ns/op
> ReaderUnsafe.run          avgt    3  293.697 ±  3.299  ns/op
>
>
> Regards
>
> On Tue, Feb 8, 2022 at 1:11 PM Maurizio Cimadamore 
> <maurizio.cimadamore at oracle.com> wrote:
>
>     Hi,
>     I played with the benchmark some more, and I ended up converting
>     it into a single JMH benchmark, extending it and adding other
>     variants (e.g. byte buffer, memory address).
>
>     The benchmark can be found here:
>
>     http://cr.openjdk.java.net/~mcimadamore/panama/TrieBenchmark.java
>
>     The numbers I get out of JMH are similar to those generated by
>     your harness; they are reported below (tested against latest
>     Java 19):
>
>     Benchmark                      Mode  Cnt     Score    Error  Units
>     TrieBenchmark.address_seek     avgt   30  2561.813 ± 24.697  ms/op
>     TrieBenchmark.buffer_seek      avgt   30  3402.664 ± 32.409  ms/op
>     TrieBenchmark.native_seek      avgt   30  3251.644 ± 48.709  ms/op
>     TrieBenchmark.native_seek_alt  avgt   30  2405.684 ± 14.005  ms/op
>     TrieBenchmark.segment_seek     avgt   30  3390.148 ± 15.195  ms/op
>     TrieBenchmark.unsafe_seek      avgt   30  2537.085 ± 20.125  ms/op
>
>
>     (note that each "op" in JMH is a full round of 10M seeks, so to
>     find the average time per query in nanoseconds, you can just
>     divide the numbers above by 10).
>
>     So, both ByteBuffer and MemorySegment are the slowest. This is, as
>     you posited, because of the bounds checks. There's no way C2 can
>     optimize those checks away, as the pattern of access is essentially
>     inscrutable to C2 (e.g. we read an address from a segment and we
>     use that dynamically read address to access another location).
>
>     Unsafe and MemoryAddress, by dropping bounds checks, receive a
>     boost, and end up with similar numbers.
>
>     The first native reader, which is basically a copy of the code you
>     included, is indeed slower than expected - it seems as slow as
>     using buffers or segments. (Note that running the included C++
>     benchmark I get ~190ns/op).
>
>     I verified there's no GC allocation (-prof gc) and also tried to
>     profile (-prof perfasm), but I could only see (as you did) that
>     most time was spent in the assembly. So, I thought, what if the
>     native call is not the issue? What if the problem is in the
>     surrounding Java code, the one that looks for a terminator char?
>     So I looked at the C++ code more closely and realized that the
>     logic of the loop was written in a slightly different way:
>     basically there are two loops, an outer loop and an inner loop
>     which only focuses on finding the terminator. This makes the logic
>     a bit more linear. So I tried to copy that logic into
>     `native_seek_alt` and got a massive boost.
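A hypothetical sketch of the "alt" loop shape described above: a tight inner loop that only scans for the '\n' terminator, with the (native) call hoisted into the outer loop. The names (`seekAlt`, `lookupStub`) are illustrative, and `lookupStub` stands in for the actual downcall; this is not the benchmark's code.

```java
import java.lang.foreign.*;

public class LoopShape {
    // Stand-in for the native lookup downcall; here it just returns
    // the line length so the example is self-contained and checkable.
    static long lookupStub(long start, long end) {
        return end - start;
    }

    // "alt" shape: the inner loop does nothing but find the terminator,
    // so C2 sees a tight sequential scan with no native call inside it.
    static long seekAlt(MemorySegment data, long pos, long end) {
        long sum = 0;
        while (pos < end) {
            long start = pos;
            while (pos < end && data.get(ValueLayout.JAVA_BYTE, pos) != '\n') {
                pos++; // sequential scan, easily optimized
            }
            sum += lookupStub(start, pos); // one call per line, outside the scan
            pos++; // skip the terminator
        }
        return sum;
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment data = arena.allocateFrom("foo\nquux\n");
            // allocateFrom appends a NUL byte; scan only the string bytes
            long total = seekAlt(data, 0, data.byteSize() - 1);
            System.out.println(total); // 3 + 4 = 7
        }
    }
}
```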
>
>     Initially I thought it might have been an issue with the rewrite,
>     but then I put together a test which ran both the old and the new
>     versions and ensured that the native function was called (a) the
>     same number of times and (b) with the same parameters (throwing an
>     exception otherwise). And, lo and behold, no exceptions were
>     thrown.
>
>     This seems to get close to the results you were looking for. I
>     don't have a really good explanation as to why changing the outer
>     loop affects things so much. I imagine that the Java version of
>     the "findSeparator" routine is more easily optimized by C2 (since
>     it's a tight loop which accesses consecutive positions). I wonder
>     if the presence of the native call in the middle of the loop in
>     the base version adds some barriers which then make the sequential
>     scan for the separator less optimal. If I replace the call in the
>     "seek" function with your "noop" function, I go from 22ns/op to
>     10ns/op, a ~2x boost. But if I just call "noop" in a loop 10M
>     times, I get down to ~7ns/op. This seems to indicate that 7ns/op
>     is (on my machine) the overhead of a native call (which is
>     consistent with our other microbenchmarks), and that, in the
>     original version, 12ns were spent _outside_ the native call
>     (which, in turn, explains why changing the loop shape affects
>     things so much).
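The "call noop in a loop" experiment can be approximated with a plain timing loop. A hedged sketch: libc's `getpid` stands in for the hand-written noop from libtrie.so, and a crude `System.nanoTime` loop stands in for JMH, so the figure is only a rough per-call estimate.

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

public class CallOverhead {
    public static void main(String[] args) throws Throwable {
        Linker linker = Linker.nativeLinker();
        // getpid(): a trivial libc function used as a noop stand-in
        MethodHandle getpid = linker.downcallHandle(
                linker.defaultLookup().find("getpid").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_INT));
        int iters = 10_000_000;
        // warm up so the loop is JIT-compiled before measuring
        for (int i = 0; i < iters; i++) {
            int ignored = (int) getpid.invokeExact();
        }
        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) {
            int ignored = (int) getpid.invokeExact();
        }
        long perCall = (System.nanoTime() - start) / iters;
        System.out.println(perCall + " ns/call");
    }
}
```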
>
>     I also played a bit with removing the native thread transition (as
>     Java 19 allows, by tweaking some code). With that, the overhead
>     per call goes down to 1.5ns/op and the overall result for
>     native_seek_alt is ~220ns/op. So transitions make some difference
>     here (expected, given that we do native calls back to back), but
>     the overall impact of transitions on throughput remains low.
>
>     I hope this helps.
>
>     Maurizio
>
>
>     On 05/02/2022 17:01, Cleber Muramoto wrote:
>>     Hello, I'm trying to understand the overhead of calling a downcall stub
>>     generated by jextract.
>>
>>     The source I'm using can be found on
>>     https://github.com/cmuramoto/test-jextract
>>
>>     Basically, I have the following piece of code
>>
>>     MemorySegment data = loadFile(); // mmaps a text file with keywords, 1 per
>>     line (\n = separator)
>>     MemorySegment trie = loadTrie(); // mmaps a trie data structure that maps
>>     string->int
>>
>>     The trie is a contiguous array of pairs (int, int), which are fetched using
>>
>>     int base(MemorySegment ms, long offset) {
>>        return ms.get(JAVA_INT, offset << 3);
>>     }
>>
>>     int check(MemorySegment ms, long offset) {
>>        return ms.get(JAVA_INT, 4L + (offset << 3));
>>     }
>>
>>     And the lookup method is as follows
>>
>>     // finds the value associated with the key data.slice(pos, end - pos)
>>     long lookup(MemorySegment trie, MemorySegment data, int pos, int end) {
>>        var from = 0L;
>>        var to = 0L;
>>
>>        while (pos < end) {
>>          to = ((long) base(trie, from)) ^ (data.get(JAVA_BYTE, pos) & 0xFF);
>>          if (check(trie, to) != from) {
>>            return ABSENT; // 1L << 33
>>          }
>>          from = to;
>>          pos++;
>>        }
>>
>>        to = base(trie, from);
>>        var check = check(trie, to);
>>
>>        if (check != i32(from)) {
>>          return NO_VALUE; // 1L << 32
>>        }
>>        return base(trie, to);
>>     }
>>
>>     Running this code on a file with ~10 million lines with a fully populated
>>     trie (so that every lookup succeeds), I measured a peak throughput of
>>     391ns/query.
>>
>>     This lookup code suffers from bounds checks (which are unnecessary given
>>     how the trie constrains its 'jumps' from a previous to a current offset,
>>     but the JVM can't possibly know this), and using Unsafe directly gives
>>     a pretty decent boost, but still can't match the C++ counterpart.
>>
>>     Based on the lookup code, I created an assembly version that receives the
>>     segment's raw addresses and the length of the string, compiled it with
>>     nasm, and created a shared library with ld:
>>
>>     nasm -felf64 trie.asm -o trie.o
>>     ld -z noexecstack -shared trie.o -o libtrie.so
>>
>>     Then generated the stubs with jextract:
>>
>>     jextract -t trie.asm --source -ltrie trie.h
>>
>>     trie.h declares the lookup function and a noop (xor rax, rax; ret) which I
>>     used as a baseline to measure the 3-arg call overhead, which in my
>>     environment is about ~18ns/call.
>>
>>     A compiled C++ code linked against libtrie.so manages to achieve
>>     ~210ns/query.
>>
>>     My naive expectation was that on the Java side I would get, roughly, C
>>     throughput + 20ns ≃ 230ns/call, but so far I have the following:
>>
>>     (best out of 10 loops)
>>     jdk-17 branch
>>     -native - 288 ns/query
>>     -unsafe - 282 ns/query
>>
>>     jdk-19 branch
>>     -native - 372 ns/query
>>     -memory-segments - 391 ns/query
>>     -unsafe - 317 ns/query
>>
>>     I did not use JMH, but the tests were executed with EpsilonGC and the hot
>>     paths are allocation-free.
>>
>>     So I wonder, on the JVM side, is there any difference in the intrinsic call
>>     overhead when invoking a noop vs a real function (both with the same type
>>     of arguments)? If not, how can I profile the culprit? I tried
>>     async-profiler but the flame graph seems dominated by the assembly labels
>>     of libtrie.so (especially the inc3)
>>
>>     Is the ~30% degradation of jdk-19 when compared to jdk-17 expected?
>>
>>     Cheers!
>


More information about the panama-dev mailing list