Obsoleting JavaCritical

Mon Jul 4 13:27:13 UTC 2022

Thanks for the clarification, this is very helpful.

I also assume that the case when "there's nothing to read" is common 
enough to make a difference?

Maurizio

On 04/07/2022 13:50, Wojciech Kudla wrote:
> Hi Maurizio,
>
> You are correct that under normal circumstances syscalls that are not 
> supported by vDSO are very heavy but when we call recvmsg/sendmsg we 
> don't even perform a syscall at all. High frequency trading shops 
> employ kernel bypass for all network flows pretty much by default. The 
> most popular solution here is OpenOnload used with Xilinx products. 
> For a case when there's nothing to read from the RX ring a JavaCritical 
> JNI call to recvmsg completes in ~11ns vs 23ns for a standard JNI call 
> with full transition.
> Sorry, I've been in this for so long I kind of assumed it's implied.
>
> Thanks,
> W.
>
> On Mon, Jul 4, 2022 at 12:59 PM Maurizio Cimadamore 
> <maurizio.cimadamore at oracle.com> wrote:
>
>     Hi,
>     while I'm not an expert with some of the IO calls you mention
>     (some of my colleagues are more knowledgeable in this area, so I'm
>     sure they will have more info), my general sense is that, as with
>     getrusage, if there is a system call involved, you already pay a
>     hefty price for the user to kernel transition. On my machine this
>     seems to cost around 200ns. In these cases, using JNI critical to
>     shave off a dozen nanoseconds (at best!) seems just not worth it.
>
>     So, of the functions in your list, the ones in which I *believe* 
>     dropping transitions would have the most effect are (if we exclude
>     getpid, for which another approach is possible) clock_gettime and
>     getcpu, I believe, as they might use vdso [1], which typically
>     brings the performance of these calls closer to calls to shared lib
>     functions.
>
>     If you have examples e.g. where performance of recvmsg (or related
>     calls) varies significantly between base JNI and critical JNI,
>     please send them our way; I'm sure some of my colleagues would be
>     interested to take a look.
>
>     Popping back a couple of levels, I think it would be helpful to
>     also define what's an acceptable regression in this context. Of
>     course, in an ideal world,  we'd like to see no performance
>     regression at all. But JNI critical is an unsupported interface,
>     which might misbehave with modern garbage collectors (e.g. ZGC)
>     and that requires quite a bit of internal complexity which might,
>     in the medium/long run, hinder the evolution of the Java platform
>     (all these things have _some_ cost, even if the cost is not
>     directly material to developers). In this vein, I think calls like
>     clock_gettime tend to be more problematic: as they complete very
>     quickly, you see the cost of transitions a lot more. In other
>     cases, where syscalls are involved, the costs associated with
>     transitions are more likely to be "in the noise". Of course if we
>     look at absolute numbers, dropping transitions would always yield
>     "faster" code; but at the same time, going from 250ns to 245ns is
>     very unlikely to result in visible performance difference when
>     considering an application as a whole, so I think it's critical
>     here to decide _which_ use cases to prioritize.
>
>     I think a good outcome of this discussion would be if we could
>     come to some shared understanding of which native calls are truly
>     problematic (e.g. clock_gettime-like), and then for the JDK to
>     provide better (and more maintainable) alternatives for those
>     (which might even be faster than using critical JNI).
>
>     Thanks
>     Maurizio
>
>     [1] - https://man7.org/linux/man-pages/man7/vdso.7.html
>
>     On 04/07/2022 12:23, Wojciech Kudla wrote:
>>     Thanks Maurizio,
>>
>>     I raised this case mainly about clock_gettime and
>>     recvmsg/sendmsg; I think we're focusing on the wrong things here.
>>     Feel free to drop the two syscalls from the discussion entirely,
>>     but the main use cases I have been presenting throughout this
>>     thread definitely stand.
>>
>>     Thanks
>>
>>
>>     On Mon, Jul 4, 2022 at 10:54 AM Maurizio Cimadamore
>>     <maurizio.cimadamore at oracle.com> wrote:
>>
>>         Hi Wojtek,
>>         thanks for sharing this list, I think this is a good starting
>>         point to understand more about your use case.
>>
>>         Last week I've been looking at "getrusage" (as you mentioned
>>         it in an earlier email), and I was surprised to see that the
>>         call took a pointer to a (fairly big) struct which then
>>         needed to be initialized with some thread-local state:
>>
>>         https://man7.org/linux/man-pages/man2/getrusage.2.html
>>
>>         I've looked at the implementation, and it seems to be doing
>>         memset on the user-provided struct pointer, plus all the
>>         field assignments. Eyeballing the implementation, this does
>>         not seem to me like a "classic" use case where dropping
>>         transitions would help much. I mean, surely dropping
>>         transitions would help shave some nanoseconds off the call,
>>         but it doesn't seem to me that the call would be short-lived
>>         enough to make a difference. Do you have some benchmarks on
>>         this one? I did some [1] and the call overhead seemed to come
>>         up at 260ns/op - w/o transition you might perhaps be able to
>>         get to 250ns, but that's in the noise?
>>
>>         As for getpid, note that you can do (since Java 9):
>>
>>         ProcessHandle.current().pid();
>>
>>         I believe the impl caches the result, so it shouldn't even
>>         make the native call.
>>
>>         Maurizio
>>
>>         [1] -
>>         http://cr.openjdk.java.net/~mcimadamore/panama/GetrusageTest.java
>>
>>         On 02/07/2022 07:42, Wojciech Kudla wrote:
>>>         Hi Maurizio,
>>>
>>>         Thanks for staying on this.
>>>
>>>         > Could you please provide a rough list of the native calls
>>>         you make where you believe critical JNI is having a real
>>>         impact in the performance of your application?
>>>
>>>         Off the top of my head:
>>>         clock_gettime
>>>         recvmsg
>>>         recvmmsg
>>>         sendmsg
>>>         sendmmsg
>>>         select
>>>         getpid
>>>         getcpu
>>>         getrusage
>>>
>>>         > Also, could you please tell us whether any of these calls
>>>         need to interact with Java arrays?
>>>         No arrays or objects of any type involved. Everything
>>>         happens by the means of passing raw pointers as longs and
>>>         using other primitive types as function arguments.
>>>
>>>         > In other words, do you use critical JNI to remove the cost
>>>         associated with thread transitions, or are you also taking
>>>         advantage of accessing on-heap memory _directly_ from native
>>>         code?
>>>         Critical JNI natives are used solely to remove the cost of
>>>         transitions. We don't get anywhere near the Java heap in native
>>>         code.
>>>
>>>         In general I think it makes a lot of sense for Java as a
>>>         language/platform to have some guards around unsafe code,
>>>         but on the other hand the popularity of libraries employing
>>>         Unsafe and their success in more performance-oriented
>>>         corners of software engineering is a clear indicator there
>>>         is a need for the JVM to provide access to more low-level
>>>         primitives and mechanisms.
>>>         I think it's entirely fair to tell developers that all bets
>>>         are off when they get into some non-idiomatic scenarios but
>>>         please don't take away a feature that greatly contributed to
>>>         Java's success.
>>>
>>>         Kind regards,
>>>         Wojtek
>>>
>>>         On Wed, Jun 29, 2022 at 5:20 PM Maurizio Cimadamore
>>>         <maurizio.cimadamore at oracle.com> wrote:
>>>
>>>             Hi Wojciech,
>>>             picking up this thread again. After some internal
>>>             discussion, we realize that we don't know enough about
>>>             your use case. While re-enabling JNI critical would
>>>             obviously provide a quick fix, we're afraid that (a)
>>>             developers might end up depending on JNI critical when
>>>             they don't need to (perhaps also unaware of the
>>>             consequences of depending on it) and (b) that there
>>>             might actually be _better_ (as in: much faster)
>>>             solutions than using critical native calls to address at
>>>             least some of your use cases (that seemed to be the case
>>>             with the clock_gettime example you mentioned). Could you
>>>             please provide a rough list of the native calls you make
>>>             where you believe critical JNI is having a real impact
>>>             in the performance of your application? Also, could you
>>>             please tell us whether any of these calls need to
>>>             interact with Java arrays? In other words, do you use
>>>             critical JNI to remove the cost associated with thread
>>>             transitions, or are you also taking advantage of
>>>             accessing on-heap memory _directly_ from native code?
>>>
>>>             Regards
>>>             Maurizio
>>>
>>>             On 13/06/2022 21:38, Wojciech Kudla wrote:
>>>>             Hi Mark,
>>>>
>>>>             Thanks for your input and apologies for the delayed
>>>>             response.
>>>>
>>>>             > If the platform included, say, an intrinsified
>>>>             System.nanoRealTime()
>>>>             method that returned clock_gettime(CLOCK_REALTIME), how
>>>>             much would
>>>>             that help developers in your unnamed industry?
>>>>
>>>>             Exposing a realtime clock with nanosecond granularity in
>>>>             the JDK would be a great step forward. I should have
>>>>             made it clear that I represent the fintech corner
>>>>             (investment banking to be exact) but the issues my
>>>>             message touches upon span areas such as HPC, audio
>>>>             processing, gaming, and the defense industry so it's not
>>>>             like we have an isolated case.
>>>>
>>>>             > In a similar vein, if people are finding it necessary
>>>>             to “replace parts
>>>>             of NIO with hand-crafted native code” then it would be
>>>>             interesting to
>>>>             understand what their requirements are
>>>>
>>>>             As for the other example I provided, making very
>>>>             short-lived syscalls such as recvmsg/recvmmsg, the
>>>>             premise is getting access to hardware timestamps on the
>>>>             ingress and egress ends as well as enabling batch
>>>>             receive with a single syscall and otherwise exploiting
>>>>             features unavailable from the JDK (like access to the CMSG
>>>>             interface, scatter/gather, etc).
>>>>             There are also other examples of calls that we'd love
>>>>             to make often and at the lowest possible cost (e.g.
>>>>             getrusage) but I'm not sure if there's a strong case
>>>>             for some of these ideas; that's why it might be worth
>>>>             looking into a more generic approach for
>>>>             performance-sensitive code.
>>>>             Hope this does a better job of explaining where we're
>>>>             coming from than my previous messages.
>>>>
>>>>             Thanks,
>>>>             W
>>>>
>>>>             On Tue, Jun 7, 2022 at 6:31 PM
>>>>             <mark.reinhold at oracle.com> wrote:
>>>>
>>>>                 2022/6/6 0:24:17 -0700, wkudla.kernel at gmail.com:
>>>>                 >> Yes for System.nanoTime(), but
>>>>                 System.currentTimeMillis() reports
>>>>                 >> CLOCK_REALTIME.
>>>>                 >
>>>>                 > Unfortunately System.currentTimeMillis() offers
>>>>                 only millisecond
>>>>                 > granularity which is the reason why our industry
>>>>                 has to resort to
>>>>                 > clock_gettime.
>>>>
>>>>                 If the platform included, say, an intrinsified
>>>>                 System.nanoRealTime()
>>>>                 method that returned clock_gettime(CLOCK_REALTIME),
>>>>                 how much would
>>>>                 that help developers in your unnamed industry?
>>>>
>>>>                 In a similar vein, if people are finding it
>>>>                 necessary to “replace parts
>>>>                 of NIO with hand-crafted native code” then it would
>>>>                 be interesting to
>>>>                 understand what their requirements are.  Some
>>>>                 simple enhancements to
>>>>                 the NIO API would be much less costly to design and
>>>>                 implement than a
>>>>                 generalized user-level native-call intrinsification
>>>>                 mechanism.
>>>>
>>>>                 - Mark
>>>>
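
On the question of a realtime clock with finer than millisecond granularity
raised in the thread: a minimal sketch of reading wall-clock time as epoch
nanoseconds with the existing java.time API, without JNI. RealtimeNanos and
epochNanos are hypothetical names. Assumptions: the resolution actually
delivered depends on the JDK version and platform, and nothing here shows
that the call is cheap enough for the use cases discussed above. It is not
the intrinsified System.nanoRealTime() hypothesized in the thread.

```java
import java.time.Instant;

public class RealtimeNanos {
    // Hypothetical helper: wall-clock time as nanoseconds since the Unix
    // epoch, built from the seconds and nanosecond-of-second fields of
    // Instant. The exact-arithmetic methods throw on overflow rather than
    // silently wrapping.
    static long epochNanos() {
        Instant now = Instant.now();
        long secondsAsNanos = Math.multiplyExact(now.getEpochSecond(), 1_000_000_000L);
        return Math.addExact(secondsAsNanos, (long) now.getNano());
    }

    public static void main(String[] args) {
        // The printed value depends on the current time.
        System.out.println("epoch nanos = " + epochNanos());
    }
}
```

Instant.now() allocates an object per call, so a latency-sensitive caller
would want to measure it against their current clock_gettime binding before
drawing conclusions.
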