[foreign] Measuring downcall stub performance
Radosław Smogura
mail at smogura.eu
Mon Feb 7 12:12:50 UTC 2022
Hi,
This is just a wild guess.
I wonder if the issue could be related to the use of shared scopes, which rely on volatile reads for their liveness checks; loops over volatile reads may be hard to optimize.
Did you consider using non-shared scopes (or somehow deriving confined scopes from the shared scope, in the case of a truly multithreaded application)?
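
For illustration, something along these lines (a sketch only, assuming the
JDK 17-era jdk.incubator.foreign API; the file name is a placeholder):

import jdk.incubator.foreign.MemorySegment;
import jdk.incubator.foreign.ResourceScope;

import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class ConfinedMap {
    public static void main(String[] args) throws Exception {
        Path file = Path.of("keywords.txt"); // placeholder path
        // A confined scope is single-threaded, so its liveness checks can
        // avoid the volatile reads that a shared scope needs.
        try (ResourceScope scope = ResourceScope.newConfinedScope()) {
            MemorySegment data = MemorySegment.mapFile(
                file, 0, Files.size(file), FileChannel.MapMode.READ_ONLY, scope);
            // ... run the benchmark loop on this thread only ...
        } // the mapping is released when the scope closes
    }
}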
Best regards,
Rado
________________________________
From: panama-dev <panama-dev-retn at openjdk.java.net> on behalf of Maurizio Cimadamore <maurizio.cimadamore at oracle.com>
Sent: Monday, February 7, 2022 12:56
To: Cleber Muramoto <cleber.muramoto at gmail.com>; panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
Subject: Re: [foreign] Measuring downcall stub performance
Forgot to include a link:
[1] -
https://mail.openjdk.java.net/pipermail/panama-dev/2021-December/015915.html
Btw, I tried your benchmark, but it crashes left and right on my machine
(Linux, x64).
Seems like some out-of-bounds memory access - I get an exception in the
MemorySegment test and a plain crash when using Unsafe. It seems like
calls to base() end up returning very big offsets which are then out of
bounds in the segments where they are used (the data segment is 133
bytes, while the trie file is 134 bytes). Not sure if something got
truncated.
Maurizio
On 07/02/2022 11:16, Maurizio Cimadamore wrote:
> Hi,
> your estimate of 18ns/call seems accurate. We do have a JMH benchmark
> that measures no-op calls, and the results are similar.
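>
> In sketch form, that benchmark boils down to something like this
> (simplified, using the JDK 17 incubator API; a zero-arg handle is shown
> here, while your baseline used a 3-arg signature):
>
> import jdk.incubator.foreign.CLinker;
> import jdk.incubator.foreign.FunctionDescriptor;
> import jdk.incubator.foreign.SymbolLookup;
> import org.openjdk.jmh.annotations.Benchmark;
> import java.lang.invoke.MethodHandle;
> import java.lang.invoke.MethodType;
>
> public class NoopBench {
>     static final MethodHandle NOOP;
>     static {
>         System.loadLibrary("trie");
>         NOOP = CLinker.getInstance().downcallHandle(
>             SymbolLookup.loaderLookup().lookup("noop").orElseThrow(),
>             MethodType.methodType(void.class),
>             FunctionDescriptor.ofVoid());
>     }
>
>     @Benchmark
>     public void callNoop() throws Throwable {
>         NOOP.invokeExact(); // pure downcall overhead, no real work
>     }
> }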
>
> We'd need to investigate re. bounds checks (but there is some work
> going on in that area, see [1]) and, more importantly, the performance
> regression w.r.t. 17.
>
> There are no VM changes from 17 to 18 - the only thing that changed
> is that we now also check by-reference parameters in downcalls for
> liveness - but in this case you're using MemoryAddress parameters, so
> your code is not impacted by that.
>
> It's also interesting that the Unsafe path got slower as well in the
> transition from 17 to 19, which is odd.
>
> It would also help if the native benchmark had corresponding C/C++
> code, for portability.
>
> Thanks
> Maurizio
>
> On 05/02/2022 17:01, Cleber Muramoto wrote:
>> Hello, I'm trying to understand the overhead of calling a downcall stub
>> generated by jextract.
>>
>> The source I'm using can be found on
>> https://github.com/cmuramoto/test-jextract
>>
>> Basically, I have the following piece of code
>>
>> // mmaps a text file with keywords, 1 per line ('\n' = separator)
>> MemorySegment data = loadFile();
>>
>> // mmaps a trie data structure that maps string -> int
>> MemorySegment trie = loadTrie();
>>
>> The trie is a contiguous array of (base, check) pairs of ints, which
>> are fetched using
>>
>> int base(MemorySegment ms, long offset) {
>>     return ms.get(JAVA_INT, offset << 3);        // entry i starts at byte i*8
>> }
>>
>> int check(MemorySegment ms, long offset) {
>>     return ms.get(JAVA_INT, 4L + (offset << 3)); // second int of the pair
>> }
>>
>> And the lookup method is as follows
>>
>> // finds the value associated with the key data.slice(pos, end - pos)
>> long lookup(MemorySegment trie, MemorySegment data, int pos, int end) {
>>     var from = 0L;
>>     var to = 0L;
>>
>>     while (pos < end) {
>>         // jump to the child indexed by the next key byte
>>         to = ((long) base(trie, from)) ^ (data.get(JAVA_BYTE, pos) & 0xFF);
>>         if (check(trie, to) != from) {
>>             return ABSENT; // 1L << 33
>>         }
>>         from = to;
>>         pos++;
>>     }
>>
>>     to = base(trie, from);
>>     var check = check(trie, to);
>>
>>     if (check != i32(from)) {
>>         return NO_VALUE; // 1L << 32
>>     }
>>     return base(trie, to);
>> }
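>>
>> The driver loop is, roughly (a simplified sketch, not the exact repo
>> code): scan the mmap'd data segment and issue one lookup per
>> '\n'-terminated line.
>>
>> long size = data.byteSize();
>> int pos = 0;
>> for (int i = 0; i < size; i++) {
>>     if (data.get(JAVA_BYTE, i) == '\n') {
>>         long value = lookup(trie, data, pos, i); // key = bytes [pos, i)
>>         pos = i + 1;
>>     }
>> }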
>>
>> Running this code on a file with ~10 million lines against a fully
>> populated trie (so that every lookup succeeds), I measured a best
>> case of 391ns/query.
>>
>> This lookup code suffers from bounds checking (which is unnecessary
>> given how the trie constrains its 'jumps' from a previous to a current
>> offset, but the JVM can't possibly know about this). Using Unsafe
>> directly gives a pretty decent boost, but still can't match the C++
>> counterpart.
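>>
>> In sketch form, the Unsafe variant replaces the accessors with
>> something like this (simplified; the raw address is read once up
>> front via trie.address().toRawLongValue()):
>>
>> static final sun.misc.Unsafe U;
>> static {
>>     try {
>>         var f = sun.misc.Unsafe.class.getDeclaredField("theUnsafe");
>>         f.setAccessible(true);
>>         U = (sun.misc.Unsafe) f.get(null);
>>     } catch (ReflectiveOperationException e) {
>>         throw new ExceptionInInitializerError(e);
>>     }
>> }
>>
>> int base(long trieAddr, long offset) {
>>     return U.getInt(trieAddr + (offset << 3));      // no bounds check
>> }
>>
>> int check(long trieAddr, long offset) {
>>     return U.getInt(trieAddr + 4L + (offset << 3));
>> }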
>>
>> Based on the lookup code, I created an assembly version that receives
>> the segments' raw addresses and the length of the string, compiled it
>> with nasm, and created a shared library with ld:
>>
>> nasm -felf64 trie.asm -o trie.o
>> ld -z noexecstack -shared trie.o -o libtrie.so
>>
>> Then generated the stubs with jextract:
>>
>> jextract -t trie.asm --source -ltrie trie.h
>>
>> trie.h declares the lookup function and a noop (xor rax, rax; ret),
>> which I used as a baseline to measure the 3-arg call overhead; in my
>> environment it is about ~18ns/call.
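>>
>> The generated stubs are then invoked roughly like this (simplified;
>> trie_h is the class name jextract derives from trie.h, and the native
>> functions are assumed to take two pointers plus a length):
>>
>> import jdk.incubator.foreign.MemoryAddress;
>>
>> MemoryAddress trieAddr = trie.address();
>> MemoryAddress dataAddr = data.address();
>>
>> long r = trie_h.lookup(trieAddr, dataAddr, len); // real work
>> long b = trie_h.noop(trieAddr, dataAddr, len);   // ~18ns/call baseline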
>>
>> Compiled C++ code linked against libtrie.so manages to achieve
>> ~210ns/query.
>>
>> My naive expectation was that on the Java side I would get, roughly, C
>> throughput + 20ns ≃ 230ns/call, but so far I have the following:
>>
>> (best out of 10 loops)
>> jdk-17 branch
>> -native - 288 ns/query
>> -unsafe - 282 ns/query
>>
>> jdk-19 branch
>> -native - 372 ns/query
>> -memory-segments - 391 ns/query
>> -unsafe - 317 ns/query
>>
>> I did not use JMH, but the tests were executed with EpsilonGC and the
>> hot paths are allocation-free.
>>
>> So I wonder, on the JVM side, is there any difference in the intrinsic
>> call overhead when invoking a noop vs a real function (both with the
>> same types of arguments)? If not, how can I profile the culprit? I
>> tried async-profiler, but the flame graph seems dominated by the
>> assembly labels of libtrie.so (especially inc3).
>>
>> Is the ~30% degradation of jdk-19 when compared to jdk-17 expected?
>>
>> Cheers!