[foreign] Measuring downcall stub performance
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Mon Feb 7 13:55:34 UTC 2022
On 07/02/2022 13:36, Cleber Muramoto wrote:
> Hi Maurizio.
>
> Probably git LFS messed up. The keys and trie files should be 95MB and
> 280MB, respectively.
I downloaded the file directly from github and it now works.
Here's what I get:
unsafe:
ns/q: 335.05
native:
ns/q: 339.13
segment:
ns/q: 329.15
(for each, I only listed the best result - not sure how to parse all the
numbers the benchmark generates, so I only used the very first one :-)).
In general, using the latest Panama EA, I couldn't really tell the three
benchmarks apart. I also noticed that the numbers can vary quite a bit
from run to run.
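Since the numbers vary between runs, one way to compare the three variants is to collect every per-run ns/q figure and look at best, median and worst together rather than a single value. What follows is only a minimal sketch: the class name `RunSummary`, the method `summarize` and the sample values in `main` are made up for illustration and are not part of the benchmark in the repository.

```java
import java.util.Arrays;

class RunSummary {
    // Returns {best, median, worst} for a set of per-run ns/query samples.
    static double[] summarize(double[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double[] s = samples.clone();   // do not reorder the caller's array
        Arrays.sort(s);
        int n = s.length;
        double median = (n % 2 == 1)
                ? s[n / 2]
                : (s[n / 2 - 1] + s[n / 2]) / 2.0;
        return new double[] { s[0], median, s[n - 1] };
    }

    public static void main(String[] args) {
        // Hypothetical samples, NOT measured results.
        double[] runs = { 340.0, 352.5, 336.0, 361.0, 345.5 };
        double[] r = summarize(runs);
        System.out.printf("best=%.2f median=%.2f worst=%.2f%n", r[0], r[1], r[2]);
    }
}
```

If the gap between best and worst is larger than the gap between two variants, the variants cannot really be told apart from those runs alone.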
Maurizio
>
> You'll find the files in this drive link:
>
> https://drive.google.com/drive/folders/1x-imW6MLZNdcJSolmG2B0B06OCYKLU8i
>
> The C++ benchmark code can be found under the native folder
> (bench.cpp); just run
>
> ./compile.bench.sh shared
> ./run.shared.sh
>
> I also found it odd that Unsafe got slower compared to jdk-17.
> I'll create a branch with the corresponding code.
>
> Regards
>
> On Mon, Feb 7, 2022 at 8:16 AM Maurizio Cimadamore
> <maurizio.cimadamore at oracle.com> wrote:
>
> Hi,
> your estimate of 18ns/call seems accurate. We do have a JMH benchmark
> which measures no-op calls, and the results are similar.
>
> We'd need to investigate re. bounds checks (but there is some work going
> on in the area, see [1]) and, more importantly, the performance regression
> w.r.t. 17.
>
> There are no VM changes from 17 to 18 - the only thing that changed was
> that now we also check by-reference parameters in downcalls for liveness
> - but in this case you're using MemoryAddress parameters, so your code
> is not impacted by that.
>
> It's also interesting that the Unsafe path got slower as well in the
> transition from 17 to 19, which is odd.
>
> It would also help if the native benchmark had corresponding C/C++ code,
> for portability.
>
> Thanks
> Maurizio
>
> On 05/02/2022 17:01, Cleber Muramoto wrote:
> > Hello, I'm trying to understand the overhead of calling a downcall stub
> > generated by jextract.
> >
> > The source I'm using can be found on
> > https://github.com/cmuramoto/test-jextract
> >
> > Basically, I have the following piece of code
> >
> > MemorySegment data = loadFile(); // mmaps a text file with keywords, 1 per line (\n = separator)
> > MemorySegment trie = loadTrie(); // mmaps a trie data structure that maps string->int
> >
> > The trie is a contiguous array of pairs (int, int), which are fetched using
> >
> > int base(MemorySegment ms,long offset) {
> > return ms.get(JAVA_INT, offset<<3);
> > }
> >
> > int check(MemorySegment ms,long offset) {
> > return ms.get(JAVA_INT, 4L + ( offset<<3));
> > }
> >
> > And the lookup method is as follows
> >
> > // finds the value associated with the key data.slice(pos, end - pos)
> > long lookup(MemorySegment trie, MemorySegment data, int pos, int end){
> > var from = 0L;
> > var to = 0L;
> >
> > while (pos < end) {
> > to = ((long)(base(trie, from))) ^ (data.get(JAVA_BYTE,pos) & 0xFF);
> > if(check(trie, to) != from) {
> > return ABSENT; // 1L<<33
> > }
> > from = to;
> > pos++;
> > }
> >
> > to = base(trie, from);
> > var check = check(trie, to);
> >
> > if (check != i32(from)) {
> > return NO_VALUE; // 1L << 32
> > }
> > return base(trie, to);
> > }
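The pair layout and the lookup loop quoted above can be tried out in isolation. The sketch below is not the code from the repository: it swaps MemorySegment for a plain `java.nio.ByteBuffer` so that it runs on any JDK. The class name `TrieLayoutSketch`, the helper `put`, and the hand-built three-slot trie in `tinyTrie` (mapping the single key "a" to 42) are assumptions made for illustration. Unused slots get check = -1 so that a zero-filled slot is not mistaken for a child of the root.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

class TrieLayoutSketch {
    static final long ABSENT = 1L << 33;    // sentinel values as given in the post
    static final long NO_VALUE = 1L << 32;

    // Each slot is an (int base, int check) pair, 8 bytes wide.
    static int base(ByteBuffer trie, long offset) {
        return trie.getInt((int) (offset << 3));
    }

    static int check(ByteBuffer trie, long offset) {
        return trie.getInt((int) (4L + (offset << 3)));
    }

    // Hypothetical helper: writes one (base, check) pair.
    static void put(ByteBuffer trie, int slot, int base, int check) {
        trie.putInt(slot << 3, base);
        trie.putInt(4 + (slot << 3), check);
    }

    // Same control flow as the quoted lookup, over a byte[] key.
    static long lookup(ByteBuffer trie, byte[] key) {
        long from = 0L;
        long to;
        for (byte b : key) {
            to = ((long) base(trie, from)) ^ (b & 0xFF);
            if (check(trie, to) != from) {
                return ABSENT;
            }
            from = to;
        }
        to = base(trie, from);              // terminal transition
        if (check(trie, to) != (int) from) {
            return NO_VALUE;
        }
        return base(trie, to);              // value is stored in base of the terminal slot
    }

    // Hand-built trie with 256 slots holding the single key "a" -> 42.
    static ByteBuffer tinyTrie() {
        ByteBuffer trie = ByteBuffer.allocate(256 * 8).order(ByteOrder.nativeOrder());
        for (int slot = 0; slot < 256; slot++) {
            put(trie, slot, 0, -1);         // mark every slot as unused
        }
        put(trie, 97, 1, 0);                // 'a' (0x61): child of root, base = 1
        put(trie, 1, 42, 97);               // terminal slot of "a", carries the value
        return trie;
    }

    public static void main(String[] args) {
        ByteBuffer trie = tinyTrie();
        System.out.println(lookup(trie, "a".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(lookup(trie, "b".getBytes(StandardCharsets.US_ASCII)) == ABSENT);
    }
}
```

Every `getInt` here is bounds-checked by the buffer, which is the same kind of per-access cost the segment-based version pays.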
> >
> > Running this code on a file with ~10 million lines with a fully populated
> > trie (so that every lookup succeeds), I measured a peak throughput of
> > 391ns/query.
> >
> > This lookup code suffers from bounds checks (which are unnecessary given
> > how the trie constrains its 'jumps' from a previous to a current offset,
> > but the JVM can't possibly know about this). Using Unsafe directly gives
> > a pretty decent boost, but still can't match a C++ counterpart.
> >
> > Based on the lookup code, I created an assembly version that receives the
> > segments' raw addresses and the length of the string, compiled it with
> > nasm, and created a shared library with ld:
> >
> > nasm -felf64 trie.asm -o trie.o
> > ld -z noexecstack -shared trie.o -o libtrie.so
> >
> > Then generated the stubs with jextract:
> >
> > jextract -t trie.asm --source -ltrie trie.h
> >
> > trie.h declares the lookup function and a noop (xor rax, rax; ret), which I
> > used as a baseline to measure the overhead of a 3-arg call, which in my
> > environment is about 18ns/call.
> >
> > A compiled C++ code linked against libtrie.so manages to achieve
> > ~210ns/query.
> >
> > My naive expectation was that on the Java side I would get, roughly, C
> > throughput + 20ns ≃ 230ns/call, but so far I have the following:
> >
> > (best out of 10 loops)
> > jdk-17 branch
> > -native - 288 ns/query
> > -unsafe - 282 ns/query
> >
> > jdk-19 branch
> > -native - 372 ns/query
> > -memory-segments - 391 ns/query
> > -unsafe - 317 ns/query
> >
> > I did not use JMH, but the tests were executed with EpsilonGC and the hot
> > paths are allocation-free.
> >
> > So I wonder, on the JVM side, is there any difference in the intrinsic call
> > overhead when invoking a noop vs a real function (both with the same type
> > of arguments)? If not, how can I profile the culprit? I tried
> > async-profiler, but the flame graph seems dominated by the assembly labels
> > of libtrie.so (especially the inc3).
> >
> > Is the ~30% degradation of jdk-19 when compared to jdk-17 expected?
> >
> > Cheers!
>