[foreign] Measuring downcall stub performance
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Tue Feb 8 18:22:16 UTC 2022
Hi,
this is the crude patch I used to disable state transitions.
diff --git
a/src/java.base/share/classes/jdk/internal/invoke/NativeEntryPoint.java
b/src/java.base/share/classes/jdk/internal/invoke/NativeEntryPoint.java
index 12e832a582d..2b027d5186e 100644
--- a/src/java.base/share/classes/jdk/internal/invoke/NativeEntryPoint.java
+++ b/src/java.base/share/classes/jdk/internal/invoke/NativeEntryPoint.java
@@ -52,7 +52,7 @@ public class NativeEntryPoint {
this.shadowSpace = shadowSpace;
this.argMoves = Objects.requireNonNull(argMoves);
this.returnMoves = Objects.requireNonNull(returnMoves);
- this.needTransition = needTransition;
+ this.needTransition = false;
this.methodType = methodType;
this.name = name;
}
The C linker machinery is able to omit state transitions - but we don't
currently expose a way for developers to select that behavior (we might
in the future). In general, unless you are doing pretty specific stuff,
the cost of the transition (~5ns) is negligible compared to everything
else going on in the application.
But of course if you want to compare Java code with C++, that cost has
to be taken into the equation as well (which is why I tried, just for
fun, to see how close we could get).
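For reference, this is how the per-call overhead can be isolated with
JMH. A minimal sketch, assuming the java.lang.foreign names from recent
19 builds (earlier panama builds use jdk.incubator.foreign) and assuming
the "noop" symbol and 3-arg signature from the libtrie.so benchmark
discussed below:

import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class EstimateCallOverhead {

    static final MethodHandle NOOP;

    static {
        // bind the 3-arg noop from libtrie.so; its body is just
        // "xor rax, rax; ret", so the benchmark measures little
        // more than the downcall itself
        Linker linker = Linker.nativeLinker();
        SymbolLookup lib = SymbolLookup.libraryLookup("libtrie.so", MemorySession.openImplicit());
        NOOP = linker.downcallHandle(
                lib.lookup("noop").orElseThrow(),
                FunctionDescriptor.of(ValueLayout.JAVA_LONG,
                        ValueLayout.ADDRESS, ValueLayout.ADDRESS, ValueLayout.JAVA_INT));
    }

    @Benchmark
    public long run() throws Throwable {
        return (long) NOOP.invoke(MemoryAddress.NULL, MemoryAddress.NULL, 0);
    }
}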
Maurizio
On 08/02/2022 17:47, Cleber Muramoto wrote:
> Thank you Maurizio, nice catch on the loop shape logic!
>
> How did you remove the native thread transition? Does it require
> modifying and recompiling the JDK?
>
> I put together some JMH benchmarks as well using
> @BenchmarkMode(Mode.AverageTime) and created a jdk-17 branch. Results
> are very similar; I was probably messing something up in my previous
> comparison.
>
> # VM version: JDK 17-panama, OpenJDK 64-Bit Server VM, 17-panama+3-167
> Benchmark Mode Cnt Score Error Units
> EstimateCallOverhead.run avgt 3 9.202 ± 0.145 ns/op
> Reader.run avgt 3 353.020 ± 20.584 ns/op
> ReaderNative.run avgt 3 281.826 ± 3.419 ns/op
> ReaderUnsafe.run avgt 3 298.737 ± 21.465 ns/op
>
> # VM version: JDK 19-panama, OpenJDK 64-Bit Server VM, 19-panama+1-13
> Benchmark Mode Cnt Score Error Units
> EstimateCallOverhead.run avgt 3 9.778 ± 0.348 ns/op
> Reader.run avgt 3 356.468 ± 22.297 ns/op
> ReaderNative.run avgt 3 289.500 ± 8.928 ns/op
> ReaderUnsafe.run avgt 3 293.697 ± 3.299 ns/op
>
>
> Regards
>
> On Tue, Feb 8, 2022 at 1:11 PM Maurizio Cimadamore
> <maurizio.cimadamore at oracle.com> wrote:
>
> Hi,
> I played with the benchmark some more, and I ended up converting
> it into a single JMH benchmark, extending it and adding other
> variants (e.g. byte buffer, memory address).
>
> The benchmark can be found here:
>
> http://cr.openjdk.java.net/~mcimadamore/panama/TrieBenchmark.java
>
> The numbers I get out of JMH are similar to those generated by
> your harness; my results are below (tested against the latest
> Java 19):
>
> Benchmark Mode Cnt Score Error Units
> TrieBenchmark.address_seek avgt 30 2561.813 ± 24.697 ms/op
> TrieBenchmark.buffer_seek avgt 30 3402.664 ± 32.409 ms/op
> TrieBenchmark.native_seek avgt 30 3251.644 ± 48.709 ms/op
> TrieBenchmark.native_seek_alt avgt 30 2405.684 ± 14.005 ms/op
> TrieBenchmark.segment_seek avgt 30 3390.148 ± 15.195 ms/op
> TrieBenchmark.unsafe_seek avgt 30 2537.085 ± 20.125 ms/op
>
>
> (Note that each "op" in JMH is a full round of 10M seeks, so to
> get the average time per query in nanoseconds, you can just
> divide the above numbers by 10.)
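>
> For example, address_seek works out to 2561.813 ms/op / 10M seeks
> ≈ 256 ns per query.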
>
> So, ByteBuffer and MemorySegment are both the slowest. This is, as
> you posited, because of the bounds checks. There's no way C2 can
> optimize those checks away, as the access pattern is essentially
> inscrutable to C2 (e.g. we read an address from a segment and then
> use that dynamically read address to access another location).
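>
> To illustrate with a sketch (more() and nextByte() are
> hypothetical stand-ins, not benchmark code): the offset of each
> read is computed from the value loaded by the previous read, so
> C2 cannot prove any bound on it ahead of time:
>
> long offset = 0;
> while (more()) {
>     // the next offset depends on the int just loaded, so every
>     // get() must re-check that the offset is within ms.byteSize()
>     offset = ms.get(JAVA_INT, offset << 3) ^ nextByte();
> }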
>
> Unsafe and MemoryAddress, by dropping bounds checks, receive a
> boost, and end up with similar numbers.
>
> The first native reader, which is basically a copy of the code
> you included, is indeed slower than expected - it seems as slow
> as using buffers or segments. (Note that running the included C++
> benchmark I get ~190ns/op.)
>
> I verified there's no GC allocation (-prof gc) and also tried to
> profile (-prof perfasm), but I could only see (as you did) that
> most time was spent in the assembly. So, I thought, what if the
> native call is not the issue? What if the problem is in the
> surrounding Java code, the one that looks for a terminator char?
> So I looked at the C++ code more closely and realized that the
> logic of the loop was written in a slightly different way:
> basically there are two loops, an outer loop and an inner loop
> which focuses solely on finding the terminator. This makes the
> logic a bit more linear. So I copied that shape into
> `native_seek_alt` and got a massive boost.
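>
> The reshaped logic looks roughly like this (a sketch of the loop
> shape, not the exact benchmark code):
>
> int pos = 0;
> while (pos < limit) {                    // outer loop: one key per iteration
>     int end = pos;
>     while (data.get(JAVA_BYTE, end) != '\n') {   // inner loop: find terminator
>         end++;
>     }
>     lookup(trie, data, pos, end);        // one native call per key
>     pos = end + 1;
> }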
>
> Initially I thought it might have been an issue with the rewrite,
> but then I put together a test which ran both the old and the new
> versions and ensured that the native function was called (a) the
> same number of times and (b) with the same parameters (throwing
> an exception otherwise). And, lo and behold, no exceptions were
> thrown.
>
> This seems to get close to the results you were looking for. I
> don't have a really good explanation as to why changing the outer
> loop affects things so much. I imagine that the Java version of
> the "findSeparator" routine is more easily optimized by C2 (since
> it's a tight loop which accesses consecutive positions). I wonder
> if the presence of the native call in the middle of the loop in
> the base version adds some barriers which then make the sequential
> scan for the separator less optimal. If I replace the call in the
> "seek" function with your "noop" function, I go from 22ns/op to
> 10ns/op, a ~2x boost. But if I just call "noop" in a loop 10M
> times, I get down to ~7ns/op. This seems to indicate that 7ns/op
> is (on my machine) the overhead of a native call (which is
> consistent with our other microbenchmarks), and that, in the
> original version, 12ns were spent _outside_ the native call
> (which, in turn, explains why changing the loop shape affects
> things so much).
>
> I also played a bit with removing the native thread transition
> (as Java 19 allows, by tweaking some code). With that, the
> overhead per call goes down to 1.5ns/op and the overall result
> for native_seek_alt is ~220ns/op. So transitions make some
> difference here (expected, given that we do native calls back to
> back), but their overall impact on throughput remains low.
>
> I hope this helps.
>
> Maurizio
>
>
> On 05/02/2022 17:01, Cleber Muramoto wrote:
>> Hello, I'm trying to understand the overhead of calling a downcall stub
>> generated by jextract.
>>
>> The source I'm using can be found on
>> https://github.com/cmuramoto/test-jextract
>>
>> Basically, I have the following piece of code
>>
>> MemorySegment data = loadFile(); // mmaps a text file with keywords, 1 per
>> line (\n = separator)
>> MemorySegment trie = loadTrie(); // mmaps a trie data structure that maps
>> string->int
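>>
>> For reference, loadFile boils down to an mmap. A minimal sketch,
>> assuming the jdk-19 preview API (earlier builds spell this
>> MemorySegment.mapFile) and a hypothetical file name:
>>
>> // requires java.nio.channels.FileChannel, java.nio.file.*, java.lang.foreign.*
>> MemorySegment loadFile() throws IOException {
>>     try (var ch = FileChannel.open(Path.of("keywords.txt"), StandardOpenOption.READ)) {
>>         // map the whole file read-only into an implicitly-managed session
>>         return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), MemorySession.openImplicit());
>>     }
>> }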
>>
>> The trie is a contiguous array of (int, int) pairs, which are fetched using
>>
>> // each trie node is an (int, int) pair stored at byte offset (offset << 3)
>> int base(MemorySegment ms, long offset) {
>>     return ms.get(JAVA_INT, offset << 3);        // first int of the pair
>> }
>>
>> int check(MemorySegment ms, long offset) {
>>     return ms.get(JAVA_INT, 4L + (offset << 3)); // second int of the pair
>> }
>>
>> And the lookup method is as follows
>>
>> // finds the value associated with the key data.slice(pos, end - pos)
>> long lookup(MemorySegment trie, MemorySegment data, int pos, int end) {
>>     var from = 0L;
>>     var to = 0L;
>>
>>     while (pos < end) {
>>         // follow the edge labelled by the next key byte
>>         to = ((long) base(trie, from)) ^ (data.get(JAVA_BYTE, pos) & 0xFF);
>>         if (check(trie, to) != from) {
>>             return ABSENT; // 1L << 33
>>         }
>>         from = to;
>>         pos++;
>>     }
>>
>>     to = base(trie, from);
>>     var check = check(trie, to);
>>
>>     if (check != i32(from)) { // i32 narrows the long to an int
>>         return NO_VALUE; // 1L << 32
>>     }
>>     return base(trie, to);
>> }
>>
>> Running this code on a file with ~10 million lines with a fully populated
>> trie (so that every lookup succeeds), I measured a peak throughput of
>> 391ns/query.
>>
>> This lookup code suffers from bounds checks (which are
>> unnecessary given how the trie constrains its 'jumps' from a
>> previous to a current offset, but the JVM can't possibly know
>> this), and using Unsafe directly gives a pretty decent boost,
>> but still can't match its C++ counterpart.
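>>
>> The Unsafe variant reads straight from a raw address, with no
>> bounds checks. A sketch, assuming 'address' is the segment's base
>> address obtained once up front and UNSAFE is the usual
>> sun.misc.Unsafe instance:
>>
>> int base(long address, long offset) {
>>     return UNSAFE.getInt(address + (offset << 3));      // no bounds check
>> }
>>
>> int check(long address, long offset) {
>>     return UNSAFE.getInt(address + 4L + (offset << 3)); // no bounds check
>> }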
>>
>> Based on the lookup code, I created an assembly version that
>> receives the segments' raw addresses and the length of the
>> string, compiled it with nasm, and created a shared library with ld:
>>
>> nasm -felf64 trie.asm -o trie.o
>> ld -z noexecstack -shared trie.o -o libtrie.so
>>
>> Then generated the stubs with jextract:
>>
>> jextract -t trie.asm --source -ltrie trie.h
>>
>> trie.h declares the lookup function and a noop (xor rax, rax; ret),
>> which I used as a baseline to measure the 3-arg call overhead, which
>> in my environment is about ~18ns/call.
>>
>> A C++ program linked against libtrie.so manages to achieve
>> ~210ns/query.
>>
>> My naive expectation was that on the Java side I would get, roughly, C
>> throughput + 20ns ≃ 230ns/call, but so far I have the following:
>>
>> (best out of 10 loops)
>> jdk-17 branch
>> -native - 288 ns/query
>> -unsafe - 282 ns/query
>>
>> jdk-19 branch
>> -native - 372 ns/query
>> -memory-segments - 391 ns/query
>> -unsafe - 317 ns/query
>>
>> I did not use JMH, but the tests were executed with EpsilonGC and the hot
>> paths are allocation-free.
>>
>> So I wonder, on the JVM side, is there any difference in the intrinsic call
>> overhead when invoking a noop vs a real function (both with the same type
>> of arguments)? If not, how can I profile the culprit? I tried
>> async-profiler but the flame graph seems dominated by the assembly labels
>> of libtrie.so (especially the inc3)
>>
>> Is the ~30% degradation of jdk-19 when compared to jdk-17 expected?
>>
>> Cheers!
>