<div><br></div><div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Jul 4, 2022 at 1:13 PM Wojciech Kudla <<a href="mailto:wkudla.kernel@gmail.com">wkudla.kernel@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;padding-left:1ex;border-left-color:rgb(204,204,204)"><div dir="ltr"><div><div><div>Thanks for your input, Vitaly. I'd be interested to find out more about the nature of the HW noise you observed in your benchmarks, as our results were very consistent and it was pretty straightforward to pinpoint the culprit as JNI call overhead. Maybe it was just easier for us because we disallow C- and P-state transitions and put a lot of effort into eliminating platform jitter in general. Were you maybe running on a CPU model that doesn't support constant TSC? I would also suggest retrying with LAPIC interrupts suppressed (with cli/sti) to see whether it's the kernel and not the hardware.</div></div></div></div></blockquote><div dir="auto">This was on a Broadwell Xeon chipset with constant TSC. All the typical jitter sources were reduced: C/P states disabled in BIOS, max turbo enabled, IRQs steered away, core isolated, etc. By the way, by noise I don’t mean the results themselves were noisy - they were constant run to run. I just meant the delta between normal vs critical JNI entrypoints was very minimal - ie “in the noise”, particularly with rdtsc.</div><div dir="auto"><br></div><div dir="auto">I can try to remeasure on newer Intel but see below …</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;padding-left:1ex;border-left-color:rgb(204,204,204)"><div dir="ltr"><div><div><div dir="auto"><br><br></div>100% agree on rdtsc(p) and snippets. There are some narrow use cases where one can get substantial speed-ups with direct access to prefetch or by abusing misprediction to keep the icache hot. 
These scenarios are sadly only available with inline assembly. I know of a few shops that go to the lengths of forking Graal, etc., to achieve that, but I'm quite convinced such capabilities would be welcome and utilized by many more groups if they were easily accessible from Java.</div></div></div></blockquote><div dir="auto">I’m of the firm (and perhaps controversial for some :)) opinion these days that Java is simply the wrong platform/tool for low-latency cases that warrant this level of control. There are very strong headwinds even outside of JNI costs. And the “real” problem with JNI, besides transition costs, is the lack of inlining into the native calls. So even if JVM transition costs are fully eliminated, there’s still an optimization fence due to lost inlining (not unlike native code calling native fns via shared libs).</div><div dir="auto"><br></div><div dir="auto">That’s not to say that perf regressions are welcome - nobody likes those :).</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;padding-left:1ex;border-left-color:rgb(204,204,204)"><div dir="ltr"><div><div dir="auto"><br><br></div>Thanks,<br></div>W.<br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Jul 4, 2022 at 5:51 PM Vitaly Davidovich <<a href="mailto:vitalyd@gmail.com" target="_blank">vitalyd@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;padding-left:1ex;border-left-color:rgb(204,204,204)"><div dir="auto">I’d add rdtsc(p) wrapper functions to the list. These are usually either inline asm or a compiler intrinsic in the JNI entrypoint. In addition, any native libs exposed via JNI that have “trivial” functions are also candidates for faster calling conventions. 
There are sometimes ways to mitigate the call overhead (e.g. batching), but it’s not always feasible.</div><div dir="auto"><br></div><div dir="auto">I’ll add that the last time I tried to measure the improvement of Java criticals for clock_gettime (and rdtsc), it looked to be in the noise on the hardware I was testing on. It got to the point where I had to instrument the critical and normal JNI entrypoints to confirm the critical was being hit. The critical calling convention isn’t significantly different *if* basic primitives (or no args at all) are passed as args. JNIEnv*, IIRC, is loaded from a register so that’s minor. jclass (for static calls, which is what’s relevant here) should be a compiled constant. A critical call still has a GCLocker check. So I’m not actually sure what the significant difference is for “lightweight” (ie few primitive or no args, primitive return types) calls.</div><div dir="auto"><br></div><div dir="auto">In general, I do think it’d be nice if there were a faster native call sequence, even if it comes with a caveat emptor and/or special requirements on the callee (not unlike the requirements for criticals). I think Vladimir Ivanov was working on “snippets” that allowed dynamic construction of a native call, possibly including assembly. Not sure where that exploration is these days, but that would be a welcome capability.</div><div dir="auto"><br></div><div dir="auto">My $.02. Happy 4th of July to those celebrating!</div><div dir="auto"><br></div><div dir="auto">Vitaly</div><div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Jul 4, 2022 at 12:04 PM Maurizio Cimadamore <<a href="mailto:maurizio.cimadamore@oracle.com" target="_blank">maurizio.cimadamore@oracle.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;padding-left:1ex;border-left-color:rgb(204,204,204)">
<div>
<p>Hi,<br>
while I'm not an expert with some of the IO calls you mention
(some of my colleagues are more knowledgeable in this area, so I'm
sure they will have more info), my general sense is that, as with
getrusage, if there is a system call involved, you already pay a
hefty price for the user-to-kernel transition. On my machine this
seems to cost around 200ns. In these cases, using JNI critical to
shave off a dozen nanoseconds (at best!) seems just not worth
it.</p>
<p>So, of the functions in your list, the ones in which I *believe*
dropping transitions would have the most effect are (if we exclude
getpid, for which another approach is possible) clock_gettime and
getcpu, as they might use vdso [1], which typically
brings the performance of these calls closer to that of
shared-library functions.<br>
</p>
<p>If you have examples e.g. where performance of recvmsg (or
related calls) varies significantly between base JNI and critical
JNI, please send them our way; I'm sure some of my colleagues
would be interested to take a look.<br>
</p>
<p>Popping back a couple of levels, I think it would be helpful to
also define what's an acceptable regression in this context. Of
course, in an ideal world, we'd like to see no performance
regression at all. But JNI critical is an unsupported interface,
which might misbehave with modern garbage collectors (e.g. ZGC)
and that requires quite a bit of internal complexity which might,
in the medium/long run, hinder the evolution of the Java platform
(all these things have _some_ cost, even if the cost is not
directly material to developers). In this vein, I think calls like
clock_gettime tend to be more problematic: as they complete very
quickly, you see the cost of transitions a lot more. In other
cases, where syscalls are involved, the costs associated with
transitions are more likely to be "in the noise". Of course, if we
look at absolute numbers, dropping transitions would always yield
"faster" code; but at the same time, going from 250ns to 245ns is
very unlikely to result in visible performance difference when
considering an application as a whole, so I think it's critical
here to decide _which_ use cases to prioritize.<br>
</p>
<p>I think a good outcome of this discussion would be if we could
come to some shared understanding of which native calls are truly
problematic (e.g. clock_gettime-like), and then for the JDK to
provide better (and more maintainable) alternatives for those
(which might even be faster than using critical JNI).<br>
</p>
<p>Thanks<br>
Maurizio<br>
</p>
<p>[1] - <a href="https://man7.org/linux/man-pages/man7/vdso.7.html" target="_blank">https://man7.org/linux/man-pages/man7/vdso.7.html</a><br>
</p></div><div>
<div>On 04/07/2022 12:23, Wojciech Kudla
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>
<div>Thanks Maurizio,<br>
<br>
</div>
I raised this case mainly about clock_gettime and
recvmsg/sendmsg; I think we're focusing on the wrong things
here. Feel free to drop the two syscalls from the discussion
entirely, but the main use cases I have been presenting
throughout this thread definitely stand.<br>
<br>
</div>
<div>Thanks<br>
</div>
<br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Mon, Jul 4, 2022 at 10:54
AM Maurizio Cimadamore <<a href="mailto:maurizio.cimadamore@oracle.com" target="_blank">maurizio.cimadamore@oracle.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;padding-left:1ex;border-left-color:rgb(204,204,204)">
<div>
<p>Hi Wojtek,<br>
thanks for sharing this list, I think this is a good
starting point to understand more about your use case.</p>
<p>Last week I've been looking at "getrusage" (as you
mentioned it in an earlier email), and I was surprised to
see that the call took a pointer to a (fairly big) struct
which then needed to be initialized with some thread-local
state:</p>
<p><a href="https://man7.org/linux/man-pages/man2/getrusage.2.html" target="_blank">https://man7.org/linux/man-pages/man2/getrusage.2.html</a></p>
<p>I've looked at the implementation, and it seems to be
doing memset on the user-provided struct pointer, plus all
the fields assignment. Eyeballing the implementation, this
does not seem to me like a "classic" use case where
dropping transitions would help much. I mean, surely
dropping transitions would shave some nanoseconds
off the call, but it doesn't seem to me that the call
would be short-lived enough to make a difference. Do you
have some benchmarks on this one? I did some [1] and the
call overhead seemed to come up at 260ns/op - w/o
transition you might perhaps be able to get to 250ns, but
that's in the noise?<br>
</p>
<p>As for getpid, note that you can do (since Java 9):<br>
<br>
ProcessHandle.current().pid();<br>
<br>
I believe the impl caches the result, so it shouldn't even
make the native call.<br>
</p>
<p>Maurizio</p>
<p>[1] - <a href="http://cr.openjdk.java.net/~mcimadamore/panama/GetrusageTest.java" target="_blank">http://cr.openjdk.java.net/~mcimadamore/panama/GetrusageTest.java</a><br>
</p>
<div>On 02/07/2022 07:42, Wojciech Kudla wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>
<div>Hi Maurizio,<br>
<br>
</div>
Thanks for staying on this.<br>
<br>
> Could you please provide a rough list of the
native calls you make where you believe critical JNI
is having a real impact in the performance of your
application?<br>
</div>
<div><br>
Off the top of my head:<br>
</div>
<div>clock_gettime<br>
</div>
<div>recvmsg<br>
</div>
<div>recvmmsg</div>
<div>sendmsg<br>
</div>
<div>sendmmsg</div>
<div>select<br>
</div>
<div>getpid</div>
<div>getcpu<br>
</div>
<div>getrusage<br>
</div>
<div><br>
</div>
<div>> Also, could you please tell us whether any of
these calls need to interact with Java arrays?<br>
</div>
<div>No arrays or objects of any type involved.
Everything happens by means of passing raw
pointers as longs and using other primitive types as
function arguments.<br>
</div>
<div><br>
> In other words, do you use critical JNI to remove
the cost associated with thread transitions, or are
you also taking advantage of accessing on-heap memory
_directly_ from native code?<br>
</div>
<div>Critical JNI natives are used solely to remove the
cost of transitions. We don't get anywhere near java
heap in native code.<br>
<br>
</div>
<div>In general I think it makes a lot of sense for Java
as a language/platform to have some guards around
unsafe code, but on the other hand the popularity of
libraries employing Unsafe and their success in more
performance-oriented corners of software engineering
are a clear indicator that there is a need for the JVM to
provide access to more low-level primitives and
mechanisms. <br>
</div>
<div>I think it's entirely fair to tell developers that
all bets are off when they get into some non-idiomatic
scenarios but please don't take away a feature that
greatly contributed to Java's success.<br>
<br>
</div>
<div>Kind regards,<br>
</div>
<div>Wojtek<br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Wed, Jun 29, 2022
at 5:20 PM Maurizio Cimadamore <<a href="mailto:maurizio.cimadamore@oracle.com" target="_blank">maurizio.cimadamore@oracle.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;padding-left:1ex;border-left-color:rgb(204,204,204)">
<div>
<p>Hi Wojciech,<br>
picking up this thread again. After some internal
discussion, we realize that we don't know enough
about your use case. While re-enabling JNI
critical would obviously provide a quick fix,
we're afraid that (a) developers might end up
depending on JNI critical when they don't need to
(perhaps also unaware of the consequences of
depending on it) and (b) that there might actually
be _better_ (as in: much faster) solutions than
using critical native calls to address at least
some of your use cases (that seemed to be the case
with the clock_gettime example you mentioned).
Could you please provide a rough list of the
native calls you make where you believe critical
JNI is having a real impact in the performance of
your application? Also, could you please tell us
whether any of these calls need to interact with
Java arrays? In other words, do you use critical
JNI to remove the cost associated with thread
transitions, or are you also taking advantage of
accessing on-heap memory _directly_ from native
code?</p>
<p>Regards<br>
Maurizio<br>
</p>
<div>On 13/06/2022 21:38, Wojciech Kudla wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>
<div>
<div>
<div>
<div>Hi Mark,<br>
<br>
</div>
Thanks for your input and apologies for
the delayed response.<br>
<br>
> If the platform included, say, an
intrinsified System.nanoRealTime()<br>
method that returned
clock_gettime(CLOCK_REALTIME), how much
would<br>
that help developers in your unnamed
industry?<br>
<br>
</div>
Exposing a realtime clock with nanosecond
granularity in the JDK would be a great
step forward. I should have made it clear
that I represent fintech corner
(investment banking to be exact) but the
issues my message touches upon span areas
such as HPC, audio processing, gaming, and
the defense industry, so it's not like we have
an isolated case.<br>
<br>
> In a similar vein, if people are
finding it necessary to “replace parts<br>
of NIO with hand-crafted native code” then
it would be interesting to<br>
understand what their requirements are<br>
<br>
</div>
As for the other example I provided with
making very short-lived syscalls such as
recvmsg/recvmmsg, the premise is getting
access to hardware timestamps on the ingress
and egress ends as well as enabling batch
receive with a single syscall and otherwise
exploiting features unavailable from the JDK
(like access to the CMSG interface,
scatter/gather, etc).<br>
</div>
<div>There are also other examples of calls
that we'd love to make often and at lowest
possible cost (ie. getrusage) but I'm not
sure if there's a strong case for some of
these ideas, that's why it might be worth
looking into more generic approach for
performance sensitive code.<br>
</div>
<div>Hope this does a better job of explaining
where we're coming from than my previous
messages.<br>
</div>
<div><br>
</div>
Thanks,<br>
</div>
W<br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Tue, Jun 7,
2022 at 6:31 PM <<a href="mailto:mark.reinhold@oracle.com" target="_blank">mark.reinhold@oracle.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;padding-left:1ex;border-left-color:rgb(204,204,204)">2022/6/6
0:24:17 -0700, <a href="mailto:wkudla.kernel@gmail.com" target="_blank">wkudla.kernel@gmail.com</a>:<br>
>> Yes for System.nanoTime(), but
System.currentTimeMillis() reports<br>
>> CLOCK_REALTIME.<br>
> <br>
> Unfortunately System.currentTimeMillis()
offers only millisecond<br>
> granularity which is the reason why our
industry has to resort to<br>
> clock_gettime.<br>
<br>
If the platform included, say, an intrinsified
System.nanoRealTime()<br>
method that returned
clock_gettime(CLOCK_REALTIME), how much would<br>
that help developers in your unnamed industry?<br>
<br>
In a similar vein, if people are finding it
necessary to “replace parts<br>
of NIO with hand-crafted native code” then it
would be interesting to<br>
understand what their requirements are. Some
simple enhancements to<br>
the NIO API would be much less costly to
design and implement than a<br>
generalized user-level native-call
intrinsification mechanism.<br>
<br>
- Mark<br>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote></div></div>-- <br><div dir="ltr">Sent from my phone</div>
</blockquote></div>
</blockquote></div></div>-- <br><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature">Sent from my phone</div>