<div dir="ltr"><div>> I also assume that the case when "there's nothing to read" is
common enough to make a difference?<br><br></div>Yes, I'd say "nothing on the wire" is at least a three-nines scenario, but even with data in the NIC's RX ring the call completes in low tens of nanoseconds anyway, so the overhead of the JNI call matters in both cases.<br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Jul 4, 2022 at 2:27 PM Maurizio Cimadamore <<a href="mailto:maurizio.cimadamore@oracle.com">maurizio.cimadamore@oracle.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Thanks for the clarification, this is very helpful.</p>
<p>I also assume that the case when "there's nothing to read" is
common enough to make a difference?</p>
<p>Maurizio<br>
</p>
<p><br>
</p>
<div>On 04/07/2022 13:50, Wojciech Kudla
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>
<div>
<div>
<div>Hi Maurizio,<br>
<br>
</div>
You are correct that under normal circumstances syscalls
that are not supported by vDSO are very heavy, but when we
call recvmsg/sendmsg we don't perform a syscall at
all. High frequency trading shops employ kernel bypass for
all network flows pretty much by default. The most popular
solution here is OpenOnload used with Xilinx products.
For the case when there's nothing to read from the RX ring, a
JavaCritical JNI call to recvmsg completes in ~11ns vs 23ns
for a standard JNI call with full transition.<br>
</div>
Sorry, I've been in this for so long I kind of assumed it's
implied.<br>
<br>
</div>
Thanks,<br>
</div>
W.<br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Mon, Jul 4, 2022 at 12:59
PM Maurizio Cimadamore <<a href="mailto:maurizio.cimadamore@oracle.com" target="_blank">maurizio.cimadamore@oracle.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Hi,<br>
while I'm not an expert with some of the IO calls you
mention (some of my colleagues are more knowledgeable in
this area, so I'm sure they will have more info), my
general sense is that, as with getrusage, if there is a
system call involved, you already pay a hefty price for
the user-to-kernel transition. On my machine this seems to
cost around 200ns. In these cases, using JNI critical to
shave off a dozen nanoseconds (at best!) seems just not
worth it.</p>
<p>So, of the functions in your list, the ones in which I
*believe* dropping transitions would have the most effect
are (if we exclude getpid, for which another approach is
possible) clock_gettime and getcpu, I believe, as they
might use vdso [1], which typically brings the performance
of these calls closer to calls to shared lib functions.<br>
</p>
<p>If you have examples e.g. where performance of recvmsg
(or related calls) varies significantly between base JNI
and critical JNI, please send them our way; I'm sure some
of my colleagues would be interested to take a look.<br>
</p>
<p>Popping back a couple of levels, I think it would be
helpful to also define what's an acceptable regression in
this context. Of course, in an ideal world, we'd like to
see no performance regression at all. But JNI critical is
an unsupported interface, which might misbehave with
modern garbage collectors (e.g. ZGC) and that requires
quite a bit of internal complexity which might, in the
medium/long run, hinder the evolution of the Java platform
(all these things have _some_ cost, even if the cost is
not directly material to developers). In this vein, I
think calls like clock_gettime tend to be more
problematic: as they complete very quickly, you see the
cost of transitions a lot more. In other cases, where
syscalls are involved, the cost associated with transitions
is more likely to be "in the noise". Of course if we look
at absolute numbers, dropping transitions would always
yield "faster" code; but at the same time, going from
250ns to 245ns is very unlikely to result in visible
performance difference when considering an application as
a whole, so I think it's critical here to decide _which_
use cases to prioritize.<br>
</p>
<p>I think a good outcome of this discussion would be if we
could come to some shared understanding of which native
calls are truly problematic (e.g. clock_gettime-like), and
then for the JDK to provide better (and more maintainable)
alternatives for those (which might even be faster than
using critical JNI).<br>
</p>
<p>Thanks<br>
Maurizio<br>
</p>
<p>[1] - <a href="https://man7.org/linux/man-pages/man7/vdso.7.html" target="_blank">https://man7.org/linux/man-pages/man7/vdso.7.html</a><br>
</p>
<div>On 04/07/2022 12:23, Wojciech Kudla wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>
<div>Thanks Maurizio,<br>
<br>
</div>
I raised this case mainly about clock_gettime and
recvmsg/sendmsg; I think we're focusing on the wrong
things here. Feel free to drop the two syscalls from
the discussion entirely, but the main usecases I have
been presenting throughout this thread definitely
stand.<br>
<br>
</div>
<div>Thanks<br>
</div>
<br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Mon, Jul 4, 2022 at
10:54 AM Maurizio Cimadamore <<a href="mailto:maurizio.cimadamore@oracle.com" target="_blank">maurizio.cimadamore@oracle.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Hi Wojtek,<br>
thanks for sharing this list, I think this is a
good starting point to understand more about your
use case.</p>
<p>Last week I was looking at "getrusage" (as
you mentioned it in an earlier email), and I was
surprised to see that the call took a pointer to a
(fairly big) struct which then needed to be
initialized with some thread-local state:</p>
<p><a href="https://man7.org/linux/man-pages/man2/getrusage.2.html" target="_blank">https://man7.org/linux/man-pages/man2/getrusage.2.html</a></p>
<p>I've looked at the implementation, and it seems
to be doing a memset on the user-provided struct
pointer, plus all the field assignments.
Eyeballing the implementation, this does not seem
to me like a "classic" use case where dropping
transition would help much. I mean, surely
dropping transitions would help shaving some
nanoseconds off the call, but it doesn't seem to
me that the call would be shortlived enough to
make a difference. Do you have some benchmarks on
this one? I did some [1] and the call overhead
seemed to come up at 260ns/op - w/o transition you
might perhaps be able to get to 250ns, but that's
in the noise?<br>
</p>
<p>As for getpid, note that you can do (since Java
9):<br>
<br>
ProcessHandle.current().pid();<br>
<br>
I believe the impl caches the result, so it
shouldn't even make the native call.<br>
</p>
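<p>A minimal sketch of the ProcessHandle approach mentioned above, assuming Java 9 or later. The class and method names (PidExample, currentPid) are made up for illustration, and whether the implementation caches the pid internally is not verified here.</p>

```java
public class PidExample {
    // Hypothetical helper: process id via the standard API, with no user-written JNI.
    static long currentPid() {
        return ProcessHandle.current().pid();
    }

    public static void main(String[] args) {
        System.out.println("pid = " + currentPid());
    }
}
```

<p>If the value is needed on a hot path, it can also be read once and kept in a static final field, since the pid does not change for the lifetime of the process.</p>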
<p>Maurizio</p>
<p>[1] - <a href="http://cr.openjdk.java.net/~mcimadamore/panama/GetrusageTest.java" target="_blank">http://cr.openjdk.java.net/~mcimadamore/panama/GetrusageTest.java</a><br>
</p>
<div>On 02/07/2022 07:42, Wojciech Kudla wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>
<div>Hi Maurizio,<br>
<br>
</div>
Thanks for staying on this.<br>
<br>
> Could you please provide a rough list of
the native calls you make where you believe
critical JNI is having a real impact in the
performance of your application?<br>
</div>
<div><br>
Off the top of my head:<br>
</div>
<div>clock_gettime<br>
</div>
<div>recvmsg<br>
</div>
<div>recvmmsg</div>
<div>sendmsg<br>
</div>
<div>sendmmsg</div>
<div>select<br>
</div>
<div>getpid</div>
<div>getcpu<br>
</div>
<div>getrusage<br>
</div>
<div><br>
</div>
<div>> Also, could you please tell us whether
any of these calls need to interact with Java
arrays?<br>
</div>
<div>No arrays or objects of any type involved.
Everything happens by the means of passing raw
pointers as longs and using other primitive
types as function arguments.<br>
</div>
<div><br>
> In other words, do you use critical JNI
to remove the cost associated with thread
transitions, or are you also taking advantage
of accessing on-heap memory _directly_ from
native code?<br>
</div>
<div>Critical JNI natives are used solely to
remove the cost of transitions. We don't get
anywhere near the Java heap in native code.<br>
<br>
</div>
<div>In general I think it makes a lot of sense
for Java as a language/platform to have some
guards around unsafe code, but on the other
hand the popularity of libraries employing
Unsafe and their success in more
performance-oriented corners of software
engineering is a clear indicator there is a
need for the JVM to provide access to more
low-level primitives and mechanisms. <br>
</div>
<div>I think it's entirely fair to tell
developers that all bets are off when they get
into some non-idiomatic scenarios but please
don't take away a feature that greatly
contributed to Java's success.<br>
<br>
</div>
<div>Kind regards,<br>
</div>
<div>Wojtek<br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Wed, Jun
29, 2022 at 5:20 PM Maurizio Cimadamore <<a href="mailto:maurizio.cimadamore@oracle.com" target="_blank">maurizio.cimadamore@oracle.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Hi Wojciech,<br>
picking up this thread again. After some
internal discussion, we realize that we
don't know enough about your use case.
While re-enabling JNI critical would
obviously provide a quick fix, we're
afraid that (a) developers might end up
depending on JNI critical when they don't
need to (perhaps also unaware of the
consequences of depending on it) and (b)
that there might actually be _better_ (as
in: much faster) solutions than using
critical native calls to address at least
some of your use cases (that seemed to be
the case with the clock_gettime example
you mentioned). Could you please provide a
rough list of the native calls you make
where you believe critical JNI is having a
real impact in the performance of your
application? Also, could you please tell
us whether any of these calls need to
interact with Java arrays? In other words,
do you use critical JNI to remove the cost
associated with thread transitions, or are
you also taking advantage of accessing
on-heap memory _directly_ from native
code?</p>
<p>Regards<br>
Maurizio<br>
</p>
<div>On 13/06/2022 21:38, Wojciech Kudla
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>
<div>
<div>
<div>
<div>Hi Mark,<br>
<br>
</div>
Thanks for your input and
apologies for the delayed
response.<br>
<br>
> If the platform included,
say, an intrinsified
System.nanoRealTime()<br>
method that returned
clock_gettime(CLOCK_REALTIME),
how much would<br>
that help developers in your
unnamed industry?<br>
<br>
</div>
Exposing a realtime clock with
nanosecond granularity in the JDK
would be a great step forward. I
should have made it clear that I
represent the fintech corner
(investment banking to be exact)
but the issues my message touches
upon span areas such as HPC, audio
processing, gaming, and defense
industry so it's not like we have
an isolated case.<br>
<br>
> In a similar vein, if people
are finding it necessary to
“replace parts<br>
of NIO with hand-crafted native
code” then it would be interesting
to<br>
understand what their requirements
are<br>
<br>
</div>
As for the other example I provided
with making very short lived
syscalls such as recvmsg/recvmmsg
the premise is getting access to
hardware timestamps on the ingress
and egress ends as well as enabling
batch receive with a single syscall
and otherwise exploiting features
unavailable from the JDK (like
access to CMSG interface,
scatter/gather, etc).<br>
</div>
<div>There are also other examples of
calls that we'd love to make often
and at the lowest possible cost (e.g.
getrusage), but I'm not sure if
there's a strong case for some of
these ideas; that's why it might be
worth looking into more generic
approach for performance sensitive
code.<br>
</div>
<div>Hope this does a better job at
explaining where we're coming from
than my previous messages.<br>
</div>
<div><br>
</div>
Thanks,<br>
</div>
W<br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On
Tue, Jun 7, 2022 at 6:31 PM <<a href="mailto:mark.reinhold@oracle.com" target="_blank">mark.reinhold@oracle.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">2022/6/6
0:24:17 -0700, <a href="mailto:wkudla.kernel@gmail.com" target="_blank">wkudla.kernel@gmail.com</a>:<br>
>> Yes for System.nanoTime(),
but System.currentTimeMillis() reports<br>
>> CLOCK_REALTIME.<br>
> <br>
> Unfortunately
System.currentTimeMillis() offers only
millisecond<br>
> granularity which is the reason
why our industry has to resort to<br>
> clock_gettime.<br>
<br>
If the platform included, say, an
intrinsified System.nanoRealTime()<br>
method that returned
clock_gettime(CLOCK_REALTIME), how
much would<br>
that help developers in your unnamed
industry?<br>
<br>
In a similar vein, if people are
finding it necessary to “replace parts<br>
of NIO with hand-crafted native code”
then it would be interesting to<br>
understand what their requirements
are. Some simple enhancements to<br>
the NIO API would be much less costly
to design and implement than a<br>
generalized user-level native-call
intrinsification mechanism.<br>
<br>
- Mark<br>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote></div>