<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <p>Thanks for the clarification, this is very helpful.</p>
    <p>I also assume that the case when "there's nothing to read" is
      common enough to make a difference?</p>
    <p>Maurizio<br>
    </p>
    <p><br>
    </p>
    <div class="moz-cite-prefix">On 04/07/2022 13:50, Wojciech Kudla
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:CADV2yP=UX6fZYY08D-Ls2G43mTRaV4c8qz1sQgUVgKZF1ZAnHw@mail.gmail.com">
      
      <div dir="ltr">
        <div>
          <div>
            <div>
              <div>Hi Maurizio,<br>
                <br>
              </div>
              You are correct that, under normal circumstances, syscalls
              that are not supported by vDSO are very heavy, but when we
              call recvmsg/sendmsg we don't perform a syscall at all.
              High-frequency trading shops employ kernel bypass for
              pretty much all network flows by default. The most popular
              solution here is OpenOnload used with Xilinx products.
              When there's nothing to read from the RX ring, a
              JavaCritical JNI call to recvmsg completes in ~11ns vs 23ns
              for a standard JNI call with full transition.<br>
            </div>
            Sorry, I've been in this for so long I kind of assumed it's
            implied.<br>
            <br>
          </div>
          Thanks,<br>
        </div>
        W.<br>
      </div>
      <br>
      <div class="gmail_quote">
        <div dir="ltr" class="gmail_attr">On Mon, Jul 4, 2022 at 12:59
          PM Maurizio Cimadamore &lt;<a href="mailto:maurizio.cimadamore@oracle.com" moz-do-not-send="true" class="moz-txt-link-freetext">maurizio.cimadamore@oracle.com</a>&gt;
          wrote:<br>
        </div>
        <blockquote class="gmail_quote" style="margin:0px 0px 0px
          0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
          <div>
            <p>Hi,<br>
              while I'm not an expert with some of the IO calls you
              mention (some of my colleagues are more knowledgeable in
              this area, so I'm sure they will have more info), my
              general sense is that, as with getrusage, if there is a
              system call involved, you already pay a hefty price for
              the user to kernel transition. On my machine this seems to
              cost around 200ns. In these cases, using JNI critical to
              shave off a dozen nanoseconds (at best!) seems just not
              worth it.</p>
            <p>So, of the functions in your list, the ones for which I
              *believe* dropping transitions would have the most effect
              are (if we exclude getpid, for which another approach is
              possible) clock_gettime and getcpu, as they might use
              vDSO [1], which typically brings the performance of these
              calls closer to that of calls to shared-library functions.<br>
            </p>
            <p>If you have examples, e.g. where the performance of
              recvmsg (or related calls) varies significantly between
              base JNI and critical JNI, please send them our way; I'm
              sure some of my colleagues would be interested in taking a
              look.<br>
            </p>
            <p>Popping back a couple of levels, I think it would be
              helpful to also define what's an acceptable regression in
              this context. Of course, in an ideal world,  we'd like to
              see no performance regression at all. But JNI critical is
              an unsupported interface, which might misbehave with
              modern garbage collectors (e.g. ZGC) and that requires
              quite a bit of internal complexity which might, in the
              medium/long run, hinder the evolution of the Java platform
              (all these things have _some_ cost, even if the cost is
              not directly material to developers). In this vein, I
              think calls like clock_gettime tend to be more
              problematic: as they complete very quickly, you see the
              cost of transitions a lot more. In other cases, where
              syscalls are involved, the cost associated with transitions
              is more likely to be "in the noise". Of course if we look
              at absolute numbers, dropping transitions would always
              yield "faster" code; but at the same time, going from
              250ns to 245ns is very unlikely to result in visible
              performance difference when considering an application as
              a whole, so I think it's critical here to decide _which_
              use cases to prioritize.<br>
            </p>
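            <p>To put the figures quoted in this thread side by side,
              here is a minimal sketch that computes the share of a
              call's latency removed by dropping transitions. The class
              and method names (TransitionShare, savedPercent) are
              illustrative assumptions; the inputs are the 250ns to
              245ns estimate above and the 23ns to ~11ns recvmsg figure
              reported elsewhere in the thread for the kernel-bypass
              case:</p>
            <pre><code>public class TransitionShare {
    // Percentage of a call's latency removed when it drops
    // from beforeNs to afterNs.
    static double savedPercent(double beforeNs, double afterNs) {
        return 100.0 * (beforeNs - afterNs) / beforeNs;
    }

    public static void main(String[] args) {
        // Syscall-bound call: the transition cost is a small share.
        System.out.printf("250ns to 245ns: %.1f%% saved%n",
                savedPercent(250, 245));
        // Kernel-bypass recvmsg with an empty RX ring: the
        // transition cost is a large share of the call.
        System.out.printf("23ns to 11ns: %.1f%% saved%n",
                savedPercent(23, 11));
    }
}</code></pre>
            <p>The first case works out to about 2% and the second to
              about 52%, which is why very short calls are where the
              cost of transitions shows up.</p>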
            <p>I think a good outcome of this discussion would be if we
              could come to some shared understanding of which native
              calls are truly problematic (e.g. clock_gettime-like), and
              then for the JDK to provide better (and more maintainable)
              alternatives for those (which might even be faster than
              using critical JNI).<br>
            </p>
            <p>Thanks<br>
              Maurizio<br>
            </p>
            <p>[1] - <a href="https://man7.org/linux/man-pages/man7/vdso.7.html" target="_blank" moz-do-not-send="true">https://man7.org/linux/man-pages/man7/vdso.7.html</a><br>
            </p>
            <div>On 04/07/2022 12:23, Wojciech Kudla wrote:<br>
            </div>
            <blockquote type="cite">
              <div dir="ltr">
                <div>
                  <div>Thanks Maurizio,<br>
                    <br>
                  </div>
                  I raised this case mainly about clock_gettime and
                  recvmsg/sendmsg; I think we're focusing on the wrong
                  things here. Feel free to drop the two syscalls from
                  the discussion entirely, but the main use cases I have
                  been presenting throughout this thread definitely
                  stand.<br>
                  <br>
                </div>
                <div>Thanks<br>
                </div>
                <br>
              </div>
              <br>
              <div class="gmail_quote">
                <div dir="ltr" class="gmail_attr">On Mon, Jul 4, 2022 at
                  10:54 AM Maurizio Cimadamore &lt;<a href="mailto:maurizio.cimadamore@oracle.com" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">maurizio.cimadamore@oracle.com</a>&gt;
                  wrote:<br>
                </div>
                <blockquote class="gmail_quote" style="margin:0px 0px
                  0px 0.8ex;border-left:1px solid
                  rgb(204,204,204);padding-left:1ex">
                  <div>
                    <p>Hi Wojtek,<br>
                      thanks for sharing this list, I think this is a
                      good starting point to understand more about your
                      use case.</p>
                    <p>Last week I was looking at "getrusage" (as
                      you mentioned it in an earlier email), and I was
                      surprised to see that the call takes a pointer to a
                      (fairly big) struct which then needs to be
                      initialized with some thread-local state:</p>
                    <p><a href="https://man7.org/linux/man-pages/man2/getrusage.2.html" target="_blank" moz-do-not-send="true">https://man7.org/linux/man-pages/man2/getrusage.2.html</a></p>
                    <p>I've looked at the implementation, and it seems
                      to do a memset on the user-provided struct
                      pointer, plus all the field assignments.
                      Eyeballing the implementation, this does not seem
                      to me like a "classic" use case where dropping
                      transitions would help much. I mean, surely
                      dropping transitions would help shave some
                      nanoseconds off the call, but it doesn't seem to
                      me that the call would be short-lived enough to
                      make a difference. Do you have some benchmarks on
                      this one? I did some [1] and the call overhead
                      seemed to come in at 260ns/op - w/o transitions you
                      might perhaps be able to get to 250ns, but that's
                      in the noise?<br>
                    </p>
                    <p>As for getpid, note that you can do (since Java
                      9):<br>
                      <br>
                      ProcessHandle.current().pid();<br>
                      <br>
                      I believe the impl caches the result, so it
                      shouldn't even make the native call.<br>
                    </p>
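                    <p>A minimal sketch of the getpid alternative
                      mentioned above, using only the standard
                      ProcessHandle API (Java 9 or later). The class and
                      method names (PidExample, currentPid) are
                      illustrative assumptions, and whether the value is
                      cached is an implementation detail that this
                      snippet does not verify:</p>
                    <pre><code>public class PidExample {
    // Returns the current process id through the standard API,
    // without any hand-written JNI code.
    static long currentPid() {
        return ProcessHandle.current().pid();
    }

    public static void main(String[] args) {
        long first = currentPid();
        long second = currentPid();
        // The pid does not change during the life of the process.
        System.out.println("pid=" + first
                + ", stable=" + (first == second));
    }
}</code></pre>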
                    <p>Maurizio</p>
                    <p>[1] - <a href="http://cr.openjdk.java.net/~mcimadamore/panama/GetrusageTest.java" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">http://cr.openjdk.java.net/~mcimadamore/panama/GetrusageTest.java</a><br>
                    </p>
                    <div>On 02/07/2022 07:42, Wojciech Kudla wrote:<br>
                    </div>
                    <blockquote type="cite">
                      <div dir="ltr">
                        <div>
                          <div>Hi Maurizio,<br>
                            <br>
                          </div>
                          Thanks for staying on this.<br>
                          <br>
                          > Could you please provide a rough list of
                          the native calls you make where you believe
                          critical JNI is having a real impact in the
                          performance of your application?<br>
                        </div>
                        <div><br>
                          Off the top of my head:<br>
                        </div>
                        <div>clock_gettime<br>
                        </div>
                        <div>recvmsg<br>
                        </div>
                        <div>recvmmsg</div>
                        <div>sendmsg<br>
                        </div>
                        <div>sendmmsg</div>
                        <div>select<br>
                        </div>
                        <div>getpid</div>
                        <div>getcpu<br>
                        </div>
                        <div>getrusage<br>
                        </div>
                        <div><br>
                        </div>
                        <div>> Also, could you please tell us whether
                          any of these calls need to interact with Java
                          arrays?<br>
                        </div>
                        <div>No arrays or objects of any type involved.
                          Everything happens by the means of passing raw
                          pointers as longs and using other primitive
                          types as function arguments.<br>
                        </div>
                        <div><br>
                          > In other words, do you use critical JNI
                          to remove the cost associated with thread
                          transitions, or are you also taking advantage
                          of accessing on-heap memory _directly_ from
                          native code?<br>
                        </div>
                        <div>Critical JNI natives are used solely to
                          remove the cost of transitions. We don't get
                          anywhere near the Java heap in native code.<br>
                          <br>
                        </div>
                        <div>In general I think it makes a lot of sense
                          for Java as a language/platform to have some
                          guards around unsafe code, but on the other
                          hand the popularity of libraries employing
                          Unsafe and their success in more
                          performance-oriented corners of software
                          engineering is a clear indicator there is a
                          need for the JVM to provide access to more
                          low-level primitives and mechanisms. <br>
                        </div>
                        <div>I think it's entirely fair to tell
                          developers that all bets are off when they get
                          into some non-idiomatic scenarios but please
                          don't take away a feature that greatly
                          contributed to Java's success.<br>
                          <br>
                        </div>
                        <div>Kind regards,<br>
                        </div>
                        <div>Wojtek<br>
                        </div>
                      </div>
                      <br>
                      <div class="gmail_quote">
                        <div dir="ltr" class="gmail_attr">On Wed, Jun
                          29, 2022 at 5:20 PM Maurizio Cimadamore &lt;<a href="mailto:maurizio.cimadamore@oracle.com" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">maurizio.cimadamore@oracle.com</a>&gt;
                          wrote:<br>
                        </div>
                        <blockquote class="gmail_quote" style="margin:0px 0px 0px
                          0.8ex;border-left:1px solid
                          rgb(204,204,204);padding-left:1ex">
                          <div>
                            <p>Hi Wojciech,<br>
                              picking up this thread again. After some
                              internal discussion, we realize that we
                              don't know enough about your use case.
                              While re-enabling JNI critical would
                              obviously provide a quick fix, we're
                              afraid that (a) developers might end up
                              depending on JNI critical when they don't
                              need to (perhaps also unaware of the
                              consequences of depending on it) and (b)
                              that there might actually be _better_ (as
                              in: much faster) solutions than using
                              critical native calls to address at least
                              some of your use cases (that seemed to be
                              the case with the clock_gettime example
                              you mentioned). Could you please provide a
                              rough list of the native calls you make
                              where you believe critical JNI is having a
                              real impact in the performance of your
                              application? Also, could you please tell
                              us whether any of these calls need to
                              interact with Java arrays? In other words,
                              do you use critical JNI to remove the cost
                              associated with thread transitions, or are
                              you also taking advantage of accessing
                              on-heap memory _directly_ from native
                              code?</p>
                            <p>Regards<br>
                              Maurizio<br>
                            </p>
                            <div>On 13/06/2022 21:38, Wojciech Kudla
                              wrote:<br>
                            </div>
                            <blockquote type="cite">
                              <div dir="ltr">
                                <div>
                                  <div>
                                    <div>
                                      <div>
                                        <div>Hi Mark,<br>
                                          <br>
                                        </div>
                                        Thanks for your input and
                                        apologies for the delayed
                                        response.<br>
                                        <br>
                                        > If the platform included,
                                        say, an intrinsified
                                        System.nanoRealTime()<br>
                                        method that returned
                                        clock_gettime(CLOCK_REALTIME),
                                        how much would<br>
                                        that help developers in your
                                        unnamed industry?<br>
                                        <br>
                                      </div>
                                      Exposing a realtime clock with
                                      nanosecond granularity in the JDK
                                      would be a great step forward. I
                                      should have made it clear that I
                                      represent the fintech corner
                                      (investment banking to be exact),
                                      but the issues my message touches
                                      upon span areas such as HPC, audio
                                      processing, gaming, and the defense
                                      industry, so it's not like we have
                                      an isolated case.<br>
                                      <br>
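                                      <p>For comparison, a hedged sketch
                                        of the closest thing the standard
                                        API offers today: Instant.now()
                                        exposes nanosecond fields, and
                                        since Java 9 the system clock may
                                        report sub-millisecond values, but
                                        the actual resolution depends on
                                        the JDK and the OS, and the call
                                        is not guaranteed to be as cheap
                                        as a direct clock_gettime. The
                                        names WallClockNanos and
                                        epochNanos are illustrative, not
                                        part of any JDK API:</p>
                                      <pre><code>import java.time.Instant;

public class WallClockNanos {
    // Wall-clock time as nanoseconds since the Unix epoch.
    // The real resolution depends on the JDK and the OS clock.
    static long epochNanos() {
        Instant now = Instant.now();
        long seconds = Math.multiplyExact(
                now.getEpochSecond(), 1_000_000_000L);
        return Math.addExact(seconds, now.getNano());
    }

    public static void main(String[] args) {
        System.out.println("epoch nanos: " + epochNanos());
        System.out.println("epoch millis: "
                + System.currentTimeMillis());
    }
}</code></pre>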
                                      > In a similar vein, if people
                                      are finding it necessary to
                                      “replace parts<br>
                                      of NIO with hand-crafted native
                                      code” then it would be interesting
                                      to<br>
                                      understand what their requirements
                                      are<br>
                                      <br>
                                    </div>
                                    As for the other example I provided,
                                    making very short-lived
                                    syscalls such as recvmsg/recvmmsg,
                                    the premise is getting access to
                                    hardware timestamps on the ingress
                                    and egress ends, as well as enabling
                                    batch receive with a single syscall
                                    and otherwise exploiting features
                                    unavailable from the JDK (like
                                    access to the CMSG interface,
                                    scatter/gather, etc.).<br>
                                  </div>
                                  <div>There are also other examples of
                                    calls that we'd love to make often
                                    and at the lowest possible cost (e.g.
                                    getrusage), but I'm not sure if
                                    there's a strong case for some of
                                    these ideas; that's why it might be
                                    worth looking into a more generic
                                    approach for performance-sensitive
                                    code.<br>
                                  </div>
                                  <div>Hope this does a better job of
                                    explaining where we're coming from
                                    than my previous messages.<br>
                                  </div>
                                  <div><br>
                                  </div>
                                  Thanks,<br>
                                </div>
                                W<br>
                              </div>
                              <br>
                              <div class="gmail_quote">
                                <div dir="ltr" class="gmail_attr">On
                                  Tue, Jun 7, 2022 at 6:31 PM &lt;<a href="mailto:mark.reinhold@oracle.com" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">mark.reinhold@oracle.com</a>&gt;
                                  wrote:<br>
                                </div>
                                <blockquote class="gmail_quote" style="margin:0px 0px 0px
                                  0.8ex;border-left:1px solid
                                  rgb(204,204,204);padding-left:1ex">2022/6/6
                                  0:24:17 -0700, <a href="mailto:wkudla.kernel@gmail.com" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">wkudla.kernel@gmail.com</a>:<br>
                                  >> Yes for System.nanoTime(),
                                  but System.currentTimeMillis() reports<br>
                                  >> CLOCK_REALTIME.<br>
                                  > <br>
                                  > Unfortunately
                                  System.currentTimeMillis() offers only
                                  millisecond<br>
                                  > granularity which is the reason
                                  why our industry has to resort to<br>
                                  > clock_gettime.<br>
                                  <br>
                                  If the platform included, say, an
                                  intrinsified System.nanoRealTime()<br>
                                  method that returned
                                  clock_gettime(CLOCK_REALTIME), how
                                  much would<br>
                                  that help developers in your unnamed
                                  industry?<br>
                                  <br>
                                  In a similar vein, if people are
                                  finding it necessary to “replace parts<br>
                                  of NIO with hand-crafted native code”
                                  then it would be interesting to<br>
                                  understand what their requirements
                                  are.  Some simple enhancements to<br>
                                  the NIO API would be much less costly
                                  to design and implement than a<br>
                                  generalized user-level native-call
                                  intrinsification mechanism.<br>
                                  <br>
                                  - Mark<br>
                                </blockquote>
                              </div>
                            </blockquote>
                          </div>
                        </blockquote>
                      </div>
                    </blockquote>
                  </div>
                </blockquote>
              </div>
            </blockquote>
          </div>
        </blockquote>
      </div>
    </blockquote>
  </body>
</html>