[foreign] Measuring downcall stub performance
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Tue Feb 8 16:11:15 UTC 2022
Hi,
I played with the benchmark some more, and I ended up converting it into
a single JMH benchmark, extending it and adding other variants (e.g.
byte buffer, memory address).
The benchmark can be found here:
http://cr.openjdk.java.net/~mcimadamore/panama/TrieBenchmark.java
The numbers I get out of JMH are similar to those generated by your
harness. The numbers I got are below (tested against latest Java 19):
Benchmark                       Mode  Cnt     Score    Error  Units
TrieBenchmark.address_seek      avgt   30  2561.813 ± 24.697  ms/op
TrieBenchmark.buffer_seek       avgt   30  3402.664 ± 32.409  ms/op
TrieBenchmark.native_seek       avgt   30  3251.644 ± 48.709  ms/op
TrieBenchmark.native_seek_alt   avgt   30  2405.684 ± 14.005  ms/op
TrieBenchmark.segment_seek      avgt   30  3390.148 ± 15.195  ms/op
TrieBenchmark.unsafe_seek       avgt   30  2537.085 ± 20.125  ms/op
(note that each "op" in JMH is a full round of 10M seeks, so if you want
to find the avg time per query in nanoseconds, you can just divide the
above number by 10).
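As a worked example of that conversion (a hypothetical helper, not part of the benchmark): 1 ms is 10^6 ns, and each op covers 10^7 queries, so ms/op divided by 10 gives ns/query.

```java
// Hypothetical helper illustrating the ms/op -> ns/query conversion
// described above: 1 ms = 1e6 ns, and one JMH op = 10M queries,
// so the net effect is a division by 10.
public class PerQueryTime {
    static double nsPerQuery(double msPerOp, long queriesPerOp) {
        return msPerOp * 1_000_000.0 / queriesPerOp;
    }

    public static void main(String[] args) {
        // address_seek: 2561.813 ms/op over 10M queries -> ~256 ns/query
        System.out.println(nsPerQuery(2561.813, 10_000_000));
    }
}
```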
So, both ByteBuffer and MemorySegment are the slowest. This is, as you
posited, because of the bounds checks. There's no way C2 can optimize
those checks away, as the access pattern is essentially inscrutable to C2
(e.g. we read an address from a segment and use that dynamically read
address to access another location).
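The problematic shape can be sketched with plain arrays (a hypothetical simplification, loosely mirroring the base/check accessors from your mail): each index is computed from a value that was itself just loaded, so C2 cannot prove any range for it and must keep a check on every access.

```java
// Hypothetical sketch of the data-dependent access pattern: the index
// 'to' is derived from a value loaded out of the trie itself, so the
// JIT cannot bound it statically and cannot hoist or eliminate the
// per-access bounds checks.
public class DependentAccess {
    // base/check are the two int columns of the trie, split into
    // separate arrays for simplicity.
    static long walk(int[] base, int[] check, byte[] key) {
        long from = 0;
        for (byte b : key) {
            long to = base[(int) from] ^ (b & 0xFF); // index depends on a load
            if (check[(int) to] != from) {
                return -1; // absent
            }
            from = to;
        }
        return from;
    }

    public static void main(String[] args) {
        // Tiny hand-built table: byte 1 moves state 0 -> 1; byte 2 fails.
        int[] base = {0, 0, 0, 0};
        int[] check = {9, 0, 9, 9};
        System.out.println(walk(base, check, new byte[] {1}));
        System.out.println(walk(base, check, new byte[] {2}));
    }
}
```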
Unsafe and MemoryAddress, by dropping bounds checks, receive a boost and
end up with similar numbers.
The first native reader, which is basically a copy of the code you
included, is indeed slower than expected - it seems as slow as using
buffers or segments. (Note that running the included C++ benchmark I get
~190ns/op.)
I verified there's no GC allocation (-prof gc) and also tried to profile
(-prof perfasm), but I could only see (as you did) that most time was
spent in the assembly. So, I thought, what if the native call is not the
issue? What if the problem is in the surrounding Java code, the one that
looks for a terminator char?
So I looked at the C++ more closely and realized that the logic of the
loop was written in a slightly different way: there are two loops, an
outer loop and an inner loop which focuses only on finding the
terminator. This makes the logic a bit more linear. So I tried to copy
that logic into `native_seek_alt` and got a massive boost.
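A hypothetical sketch of that reshaped loop (names like findSeparator are placeholders; the real code is in the linked benchmark, and the native lookup is stubbed out here): the inner loop only scans consecutive positions for the terminator, which is easy for C2 to optimize, and the per-token work happens once per line in the outer loop.

```java
// Hypothetical sketch of the native_seek_alt loop shape: a tight inner
// scan for '\n', with the per-token work (the native lookup downcall in
// the real benchmark, stubbed as a counter here) in the outer loop.
public class TwoLoopSeek {
    // Inner loop: scan forward for the separator, touching only
    // consecutive positions.
    static int findSeparator(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return pos;
    }

    // Outer loop: one unit of work per token. The real code calls the
    // native lookup(trie, data, pos, end) here; we just count tokens.
    static int seekAll(byte[] data) {
        int tokens = 0;
        int pos = 0;
        while (pos < data.length) {
            int end = findSeparator(data, pos);
            tokens++; // stand-in for the native lookup call
            pos = end + 1; // skip the separator
        }
        return tokens;
    }

    public static void main(String[] args) {
        byte[] data = "foo\nbar\nbaz\n".getBytes();
        System.out.println(seekAll(data)); // 3 tokens
    }
}
```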
Initially I thought it might have been an issue with the rewrite, but
then I put together a test which ran both the old and the new versions
and ensured that the native function was called (a) the same number of
times and (b) with the same parameters (and threw an exception
otherwise). And, lo and behold, no exceptions were thrown.
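The cross-check can be sketched roughly like this (hypothetical; the real harness drives the actual benchmark code): both loop shapes record the (pos, end) ranges they would hand to the native lookup, and the two sequences must match exactly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the equivalence check: record the argument
// ranges each loop shape would pass to the native call, then compare.
public class CrossCheck {
    // Original shape: one loop, with the separator test inlined.
    static List<int[]> callsV1(byte[] data) {
        List<int[]> calls = new ArrayList<>();
        int pos = 0;
        for (int i = 0; i <= data.length; i++) {
            if (i == data.length || data[i] == '\n') {
                calls.add(new int[] {pos, i}); // stand-in for the native call
                pos = i + 1;
            }
        }
        return calls;
    }

    // Reshaped version: an inner loop finds the terminator first.
    static List<int[]> callsV2(byte[] data) {
        List<int[]> calls = new ArrayList<>();
        int pos = 0;
        while (pos <= data.length) {
            int end = pos;
            while (end < data.length && data[end] != '\n') end++;
            calls.add(new int[] {pos, end}); // stand-in for the native call
            pos = end + 1;
        }
        return calls;
    }
}
```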
This seems to get close to the results you were looking for. I don't
have a really good explanation as to why changing the outer loop affects
things so much. I imagine that the Java version of the "findSeparator"
routine is more easily optimized by C2 (since it's a tight loop which
accesses consecutive positions). I wonder if the presence of the native
call in the middle of the loop in the base version adds some barriers
which then make the sequential scan for the separator less optimal. If I
replace the call in the "seek" function with your "noop" function, I go
from 22ns/op to 10ns/op, which is a ~2x boost. But if I just call "noop"
in a loop 10M times, I get down to ~7ns/op, which seems to indicate that
7ns/op is (on my machine) the overhead of a native call (consistent with
our other microbenchmarks), and that, in the original version, 12ns were
spent _outside_ the native call (which, in turn, explains why changing
the loop shape affects things so much).
I also played a bit with removing the native thread transitions (which
Java 19 allows by tweaking some internal code). With that, the overhead
per call goes down to 1.5ns/op and the overall result for
native_seek_alt is ~220ns/op. So transitions make some difference here
(expected, given that we do native calls back to back), but their
overall impact on throughput remains low.
I hope this helps.
Maurizio
On 05/02/2022 17:01, Cleber Muramoto wrote:
> Hello, I'm trying to understand the overhead of calling a downcall stub
> generated by jextract.
>
> The source I'm using can be found on
> https://github.com/cmuramoto/test-jextract
>
> Basically, I have the following piece of code
>
> MemorySegment data = loadFile(); // mmaps a text file with keywords, 1 per
> line (\n = separator)
> MemorySegment trie = loadTrie(); // mmaps a trie data structure that maps
> string->int
>
> The trie is a contiguous array of pairs (int, int), which are fetched using
>
> int base(MemorySegment ms, long offset) {
>     return ms.get(JAVA_INT, offset << 3);
> }
>
> int check(MemorySegment ms, long offset) {
>     return ms.get(JAVA_INT, 4L + (offset << 3));
> }
>
> And the lookup method is as follows
>
> // finds the value associated with the key data.slice(pos, end - pos)
> long lookup(MemorySegment trie, MemorySegment data, int pos, int end) {
>     var from = 0L;
>     var to = 0L;
>
>     while (pos < end) {
>         to = ((long) base(trie, from)) ^ (data.get(JAVA_BYTE, pos) & 0xFF);
>         if (check(trie, to) != from) {
>             return ABSENT; // 1L << 33
>         }
>         from = to;
>         pos++;
>     }
>
>     to = base(trie, from);
>     var check = check(trie, to);
>
>     if (check != i32(from)) {
>         return NO_VALUE; // 1L << 32
>     }
>     return base(trie, to);
> }
>
> Running this code on a file with ~10 million lines with a fully populated
> trie (so that every lookup succeeds), I measured a peak throughput of
> 391ns/query.
>
> This lookup code suffers from bounds checking (which is unnecessary given
> how the trie constrains its 'jumps' from a previous to a current offset,
> but the JVM can't possibly know this), and using Unsafe directly gives
> a pretty decent boost, but still can't match the C++ counterpart.
>
> Based on the lookup code, I created an assembly version that receives the
> segment's raw addresses and the length of the string, compiled it with
> nasm, and created a shared library with ld:
>
> nasm -felf64 trie.asm -o trie.o
> ld -z noexecstack -shared trie.o -o libtrie.so
>
> Then generated the stubs with jextract:
>
> jextract -t trie.asm --source -ltrie trie.h
>
> trie.h declares the lookup function and a noop (xor rax, rax; ret), which I
> used as a baseline to measure the 3-arg call overhead, which in my
> environment is about ~18ns/call.
>
> A compiled C++ code linked against libtrie.so manages to achieve
> ~210ns/query.
>
> My naive expectation was that on the Java side I would get, roughly, C
> throughput + 20ns ≃ 230ns/call, but so far I have the following:
>
> (best out of 10 loops)
> jdk-17 branch
> -native - 288 ns/query
> -unsafe - 282 ns/query
>
> jdk-19 branch
> -native - 372 ns/query
> -memory-segments - 391 ns/query
> -unsafe - 317 ns/query
>
> I did not use JMH, but the tests were executed with EpsilonGC and the hot
> paths are allocation-free.
>
> So I wonder: on the JVM side, is there any difference in the intrinsic call
> overhead when invoking a noop vs. a real function (both with the same types
> of arguments)? If not, how can I profile the culprit? I tried
> async-profiler, but the flame graph seems dominated by the assembly labels
> of libtrie.so (especially the inc3).
>
> Is the ~30% degradation of jdk-19 when compared to jdk-17 expected?
>
> Cheers!