<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <p>Hi,<br>
      while I'm not an expert with some of the IO calls you mention
      (some of my colleagues are more knowledgeable in this area, so I'm
      sure they will have more info), my general sense is that, as with
      getrusage, if there is a system call involved, you already pay a
      hefty price for the user-to-kernel transition. On my machine this
      seems to cost around 200ns. In these cases, using JNI critical to
      shave off a dozen nanoseconds (at best!) just doesn't seem worth
      it.</p>
    <p>So, of the functions in your list, the ones where I *believe*
      dropping transitions would have the most effect are (if we exclude
      getpid, for which another approach is possible) clock_gettime and
      getcpu, as they might use the vDSO [1], which typically
      brings the performance of these calls closer to that of calls to
      shared library functions.<br>
    </p>
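    <p>A rough way to get a feel for how cheap such a clock read can be
      from plain Java is sketched below. It assumes that, on Linux,
      System.nanoTime() ends up in clock_gettime(CLOCK_MONOTONIC); the
      class and method names are made up for illustration, and a
      hand-rolled loop like this is only indicative (JMH would be the
      proper tool).</p>
    <pre>
```java
public class ClockReadCost {
    // Average cost, in nanoseconds, of one System.nanoTime() call
    // measured over the given number of iterations.
    static double averageNanosPerCall(int iterations) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i != iterations; i++) {
            sink += System.nanoTime();
        }
        long elapsed = System.nanoTime() - start;
        // Consume the accumulated value so the JIT cannot discard the loop body.
        if (sink == 42) {
            System.out.println(sink);
        }
        return (double) elapsed / iterations;
    }

    public static void main(String[] args) {
        averageNanosPerCall(1_000_000); // warm-up pass, result ignored
        System.out.printf("%.1f ns per System.nanoTime() call%n",
                averageNanosPerCall(10_000_000));
    }
}
```
    </pre>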
    <p>If you have examples, e.g. where the performance of recvmsg (or
      related calls) varies significantly between base JNI and critical
      JNI, please send them our way; I'm sure some of my colleagues
      would be interested in taking a look.<br>
    </p>
    <p>Popping back a couple of levels, I think it would be helpful to
      also define what's an acceptable regression in this context. Of
      course, in an ideal world,  we'd like to see no performance
      regression at all. But JNI critical is an unsupported interface,
      which might misbehave with modern garbage collectors (e.g. ZGC)
      and that requires quite a bit of internal complexity which might,
      in the medium/long run, hinder the evolution of the Java platform
      (all these things have _some_ cost, even if the cost is not
      directly material to developers). In this vein, I think calls like
      clock_gettime tend to be more problematic: as they complete very
      quickly, you see the cost of transitions a lot more. In other
      cases, where syscalls are involved, the cost associated with
      transitions is more likely to be "in the noise". Of course, if we
      look at absolute numbers, dropping transitions would always yield
      "faster" code; but at the same time, going from 250ns to 245ns is
      very unlikely to result in a visible performance difference when
      considering an application as a whole, so I think it's critical
      here to decide _which_ use cases to prioritize.<br>
    </p>
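    <p>To put the "in the noise" argument into numbers, here is a
      minimal sketch that computes the fraction of call time saved. The
      250ns and 245ns figures are the ones mentioned above; the 25ns and
      20ns figures for a short-lived call are assumed purely for
      illustration, as are the class and method names.</p>
    <pre>
```java
public class TransitionSaving {
    // Fraction of the call time saved when going from beforeNs to afterNs.
    static double relativeSaving(double beforeNs, double afterNs) {
        return (beforeNs - afterNs) / beforeNs;
    }

    public static void main(String[] args) {
        // Syscall-bound call, using the figures from the discussion.
        System.out.printf("250ns to 245ns: %.0f%% saved%n",
                100 * relativeSaving(250, 245));
        // Hypothetical short-lived call (assumed numbers, not measured).
        System.out.printf("25ns to 20ns: %.0f%% saved%n",
                100 * relativeSaving(25, 20));
    }
}
```
    </pre>
    <p>The same absolute saving of 5ns is about 2% of the syscall-bound
      call, but 20% of the hypothetical short-lived one.</p>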
    <p>I think a good outcome of this discussion would be if we could
      come to some shared understanding of which native calls are truly
      problematic (e.g. clock_gettime-like), and then for the JDK to
      provide better (and more maintainable) alternatives for those
      (which might even be faster than using critical JNI).<br>
    </p>
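    <p>For the wall-clock case specifically, something along these lines
      may already get part of the way there. This is only a sketch: as
      far as I know, Instant.now() has offered better than millisecond
      resolution since Java 9, but the actual resolution depends on the
      JDK version and the OS, and the Instant allocation may matter for
      latency-sensitive code. The class and method names are
      hypothetical.</p>
    <pre>
```java
import java.time.Instant;

public class WallClockNanos {
    // Nanoseconds since the Unix epoch, derived from Instant.now().
    // The effective resolution is whatever the platform clock provides.
    static long epochNanos() {
        Instant now = Instant.now();
        return Math.addExact(
                Math.multiplyExact(now.getEpochSecond(), 1_000_000_000L),
                now.getNano());
    }

    public static void main(String[] args) {
        System.out.println("wall clock (ns since epoch): " + epochNanos());
    }
}
```
    </pre>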
    <p>Thanks<br>
      Maurizio<br>
    </p>
    <p>[1] - <a class="moz-txt-link-freetext" href="https://man7.org/linux/man-pages/man7/vdso.7.html">https://man7.org/linux/man-pages/man7/vdso.7.html</a><br>
    </p>
    <div class="moz-cite-prefix">On 04/07/2022 12:23, Wojciech Kudla
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:CADV2yPmzSD7WRvj8RHj5y+DmQudjubdN7mom0HRmQUBOtub3iw@mail.gmail.com">
      
      <div dir="ltr">
        <div>
          <div>Thanks Maurizio,<br>
            <br>
          </div>
          I raised this case mainly about clock_gettime and
          recvmsg/sendmsg, I think we're focusing on the wrong things
          here. Feel free to drop the two syscalls from the discussion
          entirely, but the main usecases I have been presenting
          throughout this thread definitely stand.<br>
          <br>
        </div>
        <div>Thanks<br>
        </div>
        <br>
      </div>
      <br>
      <div class="gmail_quote">
        <div dir="ltr" class="gmail_attr">On Mon, Jul 4, 2022 at 10:54
          AM Maurizio Cimadamore <<a href="mailto:maurizio.cimadamore@oracle.com" moz-do-not-send="true" class="moz-txt-link-freetext">maurizio.cimadamore@oracle.com</a>>
          wrote:<br>
        </div>
        <blockquote class="gmail_quote" style="margin:0px 0px 0px
          0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
          <div>
            <p>Hi Wojtek,<br>
              thanks for sharing this list, I think this is a good
              starting point to understand more about your use case.</p>
            <p>Last week I looked at "getrusage" (as you
              mentioned it in an earlier email), and I was surprised to
              see that the call takes a pointer to a (fairly big) struct
              which then needs to be initialized with some thread-local
              state:</p>
            <p><a href="https://man7.org/linux/man-pages/man2/getrusage.2.html" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://man7.org/linux/man-pages/man2/getrusage.2.html</a></p>
            <p>I've looked at the implementation, and it seems to be
              doing a memset on the user-provided struct pointer, plus all
              the field assignments. Eyeballing the implementation, this
              does not seem to me like a "classic" use case where
              dropping transitions would help much. I mean, surely
              dropping transitions would help shave some nanoseconds
              off the call, but it doesn't seem to me that the call
              would be short-lived enough to make a difference. Do you
              have some benchmarks on this one? I did some [1] and the
              call overhead seemed to come out at 260ns/op - w/o
              transitions you might perhaps be able to get to 250ns, but
              that's in the noise?<br>
            </p>
            <p>As for getpid, note that you can do (since Java 9):<br>
              <br>
              ProcessHandle.current().pid();<br>
              <br>
              I believe the impl caches the result, so it shouldn't even
              make the native call.<br>
            </p>
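            <p>As a self-contained version of the snippet above (the
              class and method names are made up for illustration):</p>
            <pre>
```java
public class CurrentPid {
    // Process id of the running JVM, obtained without any JNI code.
    static long pid() {
        return ProcessHandle.current().pid();
    }

    public static void main(String[] args) {
        System.out.println("pid = " + pid());
    }
}
```
            </pre>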
            <p>Maurizio</p>
            <p>[1] - <a href="http://cr.openjdk.java.net/~mcimadamore/panama/GetrusageTest.java" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">http://cr.openjdk.java.net/~mcimadamore/panama/GetrusageTest.java</a><br>
            </p>
            <div>On 02/07/2022 07:42, Wojciech Kudla wrote:<br>
            </div>
            <blockquote type="cite">
              <div dir="ltr">
                <div>
                  <div>Hi Maurizio,<br>
                    <br>
                  </div>
                  Thanks for staying on this.<br>
                  <br>
                  > Could you please provide a rough list of the
                  native calls you make where you believe critical JNI
                  is having a real impact in the performance of your
                  application?<br>
                </div>
                <div><br>
                  Off the top of my head:<br>
                </div>
                <div>clock_gettime<br>
                </div>
                <div>recvmsg<br>
                </div>
                <div>recvmmsg</div>
                <div>sendmsg<br>
                </div>
                <div>sendmmsg</div>
                <div>select<br>
                </div>
                <div>getpid</div>
                <div>getcpu<br>
                </div>
                <div>getrusage<br>
                </div>
                <div><br>
                </div>
                <div>> Also, could you please tell us whether any of
                  these calls need to interact with Java arrays?<br>
                </div>
                <div>No arrays or objects of any type involved.
                  Everything happens by the means of passing raw
                  pointers as longs and using other primitive types as
                  function arguments.<br>
                </div>
                <div><br>
                  > In other words, do you use critical JNI to remove
                  the cost associated with thread transitions, or are
                  you also taking advantage of accessing on-heap memory
                  _directly_ from native code?<br>
                </div>
                <div>Critical JNI natives are used solely to remove the
                  cost of transitions. We don't get anywhere near the Java
                  heap in native code.<br>
                  <br>
                </div>
                <div>In general I think it makes a lot of sense for Java
                  as a language/platform to have some guards around
                  unsafe code, but on the other hand the popularity of
                  libraries employing Unsafe and their success in more
                  performance-oriented corners of software engineering
                  is a clear indicator there is a need for the JVM to
                  provide access to more low-level primitives and
                  mechanisms. <br>
                </div>
                <div>I think it's entirely fair to tell developers that
                  all bets are off when they get into some non-idiomatic
                  scenarios but please don't take away a feature that
                  greatly contributed to Java's success.<br>
                  <br>
                </div>
                <div>Kind regards,<br>
                </div>
                <div>Wojtek<br>
                </div>
              </div>
              <br>
              <div class="gmail_quote">
                <div dir="ltr" class="gmail_attr">On Wed, Jun 29, 2022
                  at 5:20 PM Maurizio Cimadamore <<a href="mailto:maurizio.cimadamore@oracle.com" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">maurizio.cimadamore@oracle.com</a>>
                  wrote:<br>
                </div>
                <blockquote class="gmail_quote" style="margin:0px 0px
                  0px 0.8ex;border-left:1px solid
                  rgb(204,204,204);padding-left:1ex">
                  <div>
                    <p>Hi Wojciech,<br>
                      picking up this thread again. After some internal
                      discussion, we realize that we don't know enough
                      about your use case. While re-enabling JNI
                      critical would obviously provide a quick fix,
                      we're afraid that (a) developers might end up
                      depending on JNI critical when they don't need to
                      (perhaps also unaware of the consequences of
                      depending on it) and (b) that there might actually
                      be _better_ (as in: much faster) solutions than
                      using critical native calls to address at least
                      some of your use cases (that seemed to be the case
                      with the clock_gettime example you mentioned).
                      Could you please provide a rough list of the
                      native calls you make where you believe critical
                      JNI is having a real impact in the performance of
                      your application? Also, could you please tell us
                      whether any of these calls need to interact with
                      Java arrays? In other words, do you use critical
                      JNI to remove the cost associated with thread
                      transitions, or are you also taking advantage of
                      accessing on-heap memory _directly_ from native
                      code?</p>
                    <p>Regards<br>
                      Maurizio<br>
                    </p>
                    <div>On 13/06/2022 21:38, Wojciech Kudla wrote:<br>
                    </div>
                    <blockquote type="cite">
                      <div dir="ltr">
                        <div>
                          <div>
                            <div>
                              <div>
                                <div>Hi Mark,<br>
                                  <br>
                                </div>
                                Thanks for your input and apologies for
                                the delayed response.<br>
                                <br>
                                > If the platform included, say, an
                                intrinsified System.nanoRealTime()<br>
                                method that returned
                                clock_gettime(CLOCK_REALTIME), how much
                                would<br>
                                that help developers in your unnamed
                                industry?<br>
                                <br>
                              </div>
                              Exposing a realtime clock with nanosecond
                              granularity in the JDK would be a great
                              step forward. I should have made it clear
                              that I represent the fintech corner
                              (investment banking to be exact), but the
                              issues my message touches upon span areas
                              such as HPC, audio processing, gaming, and
                              the defense industry, so it's not like we
                              have an isolated case.<br>
                              <br>
                              > In a similar vein, if people are
                              finding it necessary to “replace parts<br>
                              of NIO with hand-crafted native code” then
                              it would be interesting to<br>
                              understand what their requirements are<br>
                              <br>
                            </div>
                            As for the other example I provided with
                            making very short lived syscalls such as
                            recvmsg/recvmmsg the premise is getting
                            access to hardware timestamps on the ingress
                            and egress ends as well as enabling batch
                            receive with a single syscall and otherwise
                            exploiting features unavailable from the JDK
                            (like access to CMSG interface,
                            scatter/gather, etc).<br>
                          </div>
                          <div>There are also other examples of calls
                            that we'd love to make often and at the lowest
                            possible cost (e.g. getrusage), but I'm not
                            sure if there's a strong case for some of
                            these ideas; that's why it might be worth
                            looking into a more generic approach for
                            performance-sensitive code.<br>
                          </div>
                          <div>Hope this does a better job of explaining
                            where we're coming from than my previous
                            messages.<br>
                          </div>
                          <div><br>
                          </div>
                          Thanks,<br>
                        </div>
                        W<br>
                      </div>
                      <br>
                      <div class="gmail_quote">
                        <div dir="ltr" class="gmail_attr">On Tue, Jun 7,
                          2022 at 6:31 PM <<a href="mailto:mark.reinhold@oracle.com" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">mark.reinhold@oracle.com</a>>
                          wrote:<br>
                        </div>
                        <blockquote class="gmail_quote" style="margin:0px 0px 0px
                          0.8ex;border-left:1px solid
                          rgb(204,204,204);padding-left:1ex">2022/6/6
                          0:24:17 -0700, <a href="mailto:wkudla.kernel@gmail.com" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">wkudla.kernel@gmail.com</a>:<br>
                          >> Yes for System.nanoTime(), but
                          System.currentTimeMillis() reports<br>
                          >> CLOCK_REALTIME.<br>
                          > <br>
                          > Unfortunately System.currentTimeMillis()
                          offers only millisecond<br>
                          > granularity which is the reason why our
                          industry has to resort to<br>
                          > clock_gettime.<br>
                          <br>
                          If the platform included, say, an intrinsified
                          System.nanoRealTime()<br>
                          method that returned
                          clock_gettime(CLOCK_REALTIME), how much would<br>
                          that help developers in your unnamed industry?<br>
                          <br>
                          In a similar vein, if people are finding it
                          necessary to “replace parts<br>
                          of NIO with hand-crafted native code” then it
                          would be interesting to<br>
                          understand what their requirements are.  Some
                          simple enhancements to<br>
                          the NIO API would be much less costly to
                          design and implement than a<br>
                          generalized user-level native-call
                          intrinsification mechanism.<br>
                          <br>
                          - Mark<br>
                        </blockquote>
                      </div>
                    </blockquote>
                  </div>
                </blockquote>
              </div>
            </blockquote>
          </div>
        </blockquote>
      </div>
    </blockquote>
  </body>
</html>