Obsoleting JavaCritical

Vitaly Davidovich vitalyd at gmail.com
Mon Jul 4 20:29:41 UTC 2022


On Mon, Jul 4, 2022 at 4:13 PM Maurizio Cimadamore <
maurizio.cimadamore at oracle.com> wrote:

> Thanks for the clarification, this is very helpful.
>
> I also assume that the case when "there's nothing to read" is common
> enough to make a difference?
>
Kernel bypass networking is poll mode - you poll the NIC for events (rx
and/or tx completions) using a user-space driver; there are no interrupts
and no syscalls.  So yeah, when you poll for reads, you don't know a priori
whether there are frames to process - you only know that after doing the
poll.  A common scenario is polling UDP multicast flows.

Besides OpenOnload, there’s the lower level efvi stack (OO uses that
internally), which some folks use directly from Java (with some shims/light
abstractions in native code accessed via JNI).  Mellanox has a similar user
space driver, and of course there’s also DPDK.
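
To make the shape of those shims concrete - a rough sketch, with made-up
names (the real thing pairs each Java_ entry point with a JavaCritical_ one
on the native side, and passes pre-allocated off-heap pointers as plain
longs):

    public final class BypassPoller {
        static { System.loadLibrary("bypassshim"); }  // hypothetical JNI shim

        // msghdrPtr is a raw pointer (passed as a long) to a pre-allocated
        // off-heap struct msghdr; returns bytes received, or <= 0 when the
        // RX ring has nothing for us - the common case when busy-polling
        private static native int recvmsg0(int fd, long msghdrPtr, int flags);

        void pollLoop(int fd, long msghdrPtr) {
            for (;;) {
                int n = recvmsg0(fd, msghdrPtr, 0);
                if (n > 0) {
                    onFrame(msghdrPtr, n);   // application-specific handling
                }
            }
        }

        private void onFrame(long msghdrPtr, int len) { /* ... */ }
    }

The point is just that the "nothing to read" branch is taken millions of
times per second, which is why the per-call transition cost matters so much
here.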

> Maurizio
>
>
> On 04/07/2022 13:50, Wojciech Kudla wrote:
>
> Hi Maurizio,
>
> You are correct that under normal circumstances syscalls that are not
> supported by vDSO are very heavy, but when we call recvmsg/sendmsg we don't
> even perform a syscall at all. High-frequency trading shops employ kernel
> bypass for all network flows pretty much by default. The most popular
> solution here is OpenOnload used with Xilinx products. For the case where
> there's nothing to read from the RX ring, a JavaCritical JNI call to recvmsg
> completes in ~11ns vs ~23ns for a standard JNI call with a full transition.
> Sorry, I've been in this for so long I kind of assumed it was implied.
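>
> (For context, the kind of microbenchmark behind numbers like these is
> roughly the following sketch - names are made up; both natives bind the
> same recvmsg wrapper, one registered with a JavaCritical_ implementation
> and one without:)
>
>     import org.openjdk.jmh.annotations.*;
>
>     @State(Scope.Thread)
>     public class RecvmsgBench {
>         // hypothetical bindings to the same native recvmsg wrapper
>         static native int recvmsgCritical(int fd, long msghdrPtr, int flags);
>         static native int recvmsgStandard(int fd, long msghdrPtr, int flags);
>
>         int fd;          // non-blocking socket backed by the bypass stack
>         long msghdrPtr;  // off-heap struct msghdr, prepared in a @Setup method
>
>         @Benchmark
>         public int critical() { return recvmsgCritical(fd, msghdrPtr, 0); }
>
>         @Benchmark
>         public int standard() { return recvmsgStandard(fd, msghdrPtr, 0); }
>     }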
>
> Thanks,
> W.
>
> On Mon, Jul 4, 2022 at 12:59 PM Maurizio Cimadamore <
> maurizio.cimadamore at oracle.com> wrote:
>
>> Hi,
>> while I'm not an expert in some of the IO calls you mention (some of my
>> colleagues are more knowledgeable in this area, so I'm sure they will have
>> more info), my general sense is that, as with getrusage, if there is a
>> system call involved, you already pay a hefty price for the user-to-kernel
>> transition. On my machine this seems to cost around 200ns. In these cases,
>> using JNI critical to shave off a dozen nanoseconds (at best!) seems
>> just not worth it.
>>
>> So, of the functions in your list, the ones in which I *believe*
>> dropping transitions would have the most effect are (if we exclude getpid,
>> for which another approach is possible) clock_gettime and getcpu, as they
>> might use the vDSO [1], which typically brings the performance of these
>> calls closer to that of calls to shared-library functions.
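>>
>> (To illustrate the shape of that case - a sketch, with a hypothetical
>> binding that returns CLOCK_REALTIME as a single long so no struct has to
>> cross the JNI boundary; with a vDSO-backed clock_gettime the call body is
>> only a handful of nanoseconds, so the transition dominates:)
>>
>>     // native side calls clock_gettime(CLOCK_REALTIME, &ts) and returns
>>     // ts.tv_sec * 1_000_000_000L + ts.tv_nsec
>>     static native long realtimeNanos();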
>>
>> If you have examples, e.g. where the performance of recvmsg (or related calls)
>> varies significantly between base JNI and critical JNI, please send them
>> our way; I'm sure some of my colleagues would be interested to take a look.
>>
>> Popping back a couple of levels, I think it would be helpful to also
>> define what an acceptable regression is in this context. Of course, in an
>> ideal world, we'd like to see no performance regression at all. But JNI
>> critical is an unsupported interface, which might misbehave with modern
>> garbage collectors (e.g. ZGC) and which requires quite a bit of internal
>> complexity that might, in the medium/long run, hinder the evolution of the
>> Java platform (all these things have _some_ cost, even if the cost is not
>> directly material to developers). In this vein, I think calls like
>> clock_gettime tend to be more problematic: as they complete very quickly,
>> you see the cost of transitions a lot more. In other cases, where syscalls
>> are involved, the costs associated with transitions are more likely to be "in
>> the noise". Of course, if we look at absolute numbers, dropping transitions
>> will always yield "faster" code; but at the same time, going from 250ns to
>> 245ns is very unlikely to result in a visible performance difference when
>> considering an application as a whole, so I think it's critical here to
>> decide _which_ use cases to prioritize.
>>
>> I think a good outcome of this discussion would be if we could come to
>> some shared understanding of which native calls are truly problematic (e.g.
>> clock_gettime-like), and then for the JDK to provide better (and more
>> maintainable) alternatives for those (which might even be faster than using
>> critical JNI).
>>
>> Thanks
>> Maurizio
>>
>> [1] - https://man7.org/linux/man-pages/man7/vdso.7.html
>> On 04/07/2022 12:23, Wojciech Kudla wrote:
>>
>> Thanks Maurizio,
>>
>> I raised this case mainly about clock_gettime and recvmsg/sendmsg; I
>> think we're focusing on the wrong things here. Feel free to drop the two
>> syscalls from the discussion entirely, but the main use cases I have been
>> presenting throughout this thread definitely stand.
>>
>> Thanks
>>
>>
>> On Mon, Jul 4, 2022 at 10:54 AM Maurizio Cimadamore <
>> maurizio.cimadamore at oracle.com> wrote:
>>
>>> Hi Wojtek,
>>> thanks for sharing this list, I think this is a good starting point to
>>> understand more about your use case.
>>>
>>> Last week I was looking at "getrusage" (as you mentioned it in an
>>> earlier email), and I was surprised to see that the call takes a pointer to
>>> a (fairly big) struct which then needs to be initialized with some
>>> thread-local state:
>>>
>>> https://man7.org/linux/man-pages/man2/getrusage.2.html
>>>
>>> I've looked at the implementation, and it seems to be doing a memset on
>>> the user-provided struct pointer, plus all the field assignments.
>>> Eyeballing the implementation, this does not seem to me like a "classic"
>>> use case where dropping transitions would help much. I mean, surely dropping
>>> transitions would help shave some nanoseconds off the call, but it
>>> doesn't seem to me that the call would be short-lived enough to make a
>>> difference. Do you have some benchmarks on this one? I did some [1] and the
>>> call overhead seemed to come out at around 260ns/op - w/o transitions you
>>> might perhaps be able to get to 250ns, but that's in the noise?
>>>
>>> As for getpid, note that you can do (since Java 9):
>>>
>>> ProcessHandle.current().pid();
>>>
>>> I believe the impl caches the result, so it shouldn't even make the
>>> native call.
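>>>
>>> (And even if it didn't, caching it yourself once at class-init time would
>>> keep getpid off the hot path entirely, e.g.:)
>>>
>>>     private static final long PID = ProcessHandle.current().pid();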
>>>
>>> Maurizio
>>>
>>> [1] - http://cr.openjdk.java.net/~mcimadamore/panama/GetrusageTest.java
>>> On 02/07/2022 07:42, Wojciech Kudla wrote:
>>>
>>> Hi Maurizio,
>>>
>>> Thanks for staying on this.
>>>
>>> > Could you please provide a rough list of the native calls you make
>>> where you believe critical JNI is having a real impact on the performance
>>> of your application?
>>>
>>> Off the top of my head:
>>> clock_gettime
>>> recvmsg
>>> recvmmsg
>>> sendmsg
>>> sendmmsg
>>> select
>>> getpid
>>> getcpu
>>> getrusage
>>>
>>> > Also, could you please tell us whether any of these calls need to
>>> interact with Java arrays?
>>> No arrays or objects of any type involved. Everything happens by
>>> passing raw pointers as longs and using other primitive types as
>>> function arguments.
>>>
>>> > In other words, do you use critical JNI to remove the cost associated
>>> with thread transitions, or are you also taking advantage of accessing
>>> on-heap memory _directly_ from native code?
>>> Critical JNI natives are used solely to remove the cost of transitions.
>>> We don't get anywhere near the Java heap in native code.
>>>
>>> In general I think it makes a lot of sense for Java as a
>>> language/platform to have some guards around unsafe code, but on the other
>>> hand the popularity of libraries employing Unsafe, and their success in the
>>> more performance-oriented corners of software engineering, is a clear
>>> indicator that there is a need for the JVM to provide access to more
>>> low-level primitives and mechanisms.
>>> I think it's entirely fair to tell developers that all bets are off when
>>> they get into some non-idiomatic scenarios, but please don't take away a
>>> feature that greatly contributed to Java's success.
>>>
>>> Kind regards,
>>> Wojtek
>>>
>>> On Wed, Jun 29, 2022 at 5:20 PM Maurizio Cimadamore <
>>> maurizio.cimadamore at oracle.com> wrote:
>>>
>>>> Hi Wojciech,
>>>> picking up this thread again. After some internal discussion, we
>>>> realize that we don't know enough about your use case. While re-enabling
>>>> JNI critical would obviously provide a quick fix, we're afraid that (a)
>>>> developers might end up depending on JNI critical when they don't need to
>>>> (perhaps also unaware of the consequences of depending on it) and (b)
>>>> there might actually be _better_ (as in: much faster) solutions than using
>>>> critical native calls to address at least some of your use cases (that
>>>> seemed to be the case with the clock_gettime example you mentioned). Could
>>>> you please provide a rough list of the native calls you make where you
>>>> believe critical JNI is having a real impact on the performance of your
>>>> application? Also, could you please tell us whether any of these calls need
>>>> to interact with Java arrays? In other words, do you use critical JNI to
>>>> remove the cost associated with thread transitions, or are you also taking
>>>> advantage of accessing on-heap memory _directly_ from native code?
>>>>
>>>> Regards
>>>> Maurizio
>>>> On 13/06/2022 21:38, Wojciech Kudla wrote:
>>>>
>>>> Hi Mark,
>>>>
>>>> Thanks for your input and apologies for the delayed response.
>>>>
>>>> > If the platform included, say, an intrinsified System.nanoRealTime()
>>>> method that returned clock_gettime(CLOCK_REALTIME), how much would
>>>> that help developers in your unnamed industry?
>>>>
>>>> Exposing a realtime clock with nanosecond granularity in the JDK would be
>>>> a great step forward. I should have made it clear that I represent the
>>>> fintech corner (investment banking, to be exact), but the issues my message
>>>> touches upon span areas such as HPC, audio processing, gaming, and the
>>>> defense industry, so it's not like we have an isolated case.
>>>>
>>>> > In a similar vein, if people are finding it necessary to “replace
>>>> parts
>>>> of NIO with hand-crafted native code” then it would be interesting to
>>>> understand what their requirements are
>>>>
>>>> As for the other example I provided, of making very short-lived
>>>> syscalls such as recvmsg/recvmmsg, the premise is getting access to hardware
>>>> timestamps on the ingress and egress ends, as well as enabling batch receive
>>>> with a single syscall and otherwise exploiting features unavailable from
>>>> the JDK (like access to the CMSG interface, scatter/gather, etc.).
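>>>>
>>>> (To make that concrete - a rough sketch of the Java side of such a shim,
>>>> with hypothetical names; msgvecPtr is a raw pointer, passed as a long, to
>>>> a pre-allocated off-heap array of struct mmsghdr whose control buffers
>>>> would carry the timestamping cmsg data:)
>>>>
>>>>     static final int VLEN = 32;   // max datagrams drained per poll
>>>>     int fd;                       // socket backed by the bypass stack
>>>>     long msgvecPtr;               // off-heap struct mmsghdr[VLEN]
>>>>
>>>>     // returns the number of datagrams received; 0 when the ring is empty
>>>>     static native int recvmmsg0(int fd, long msgvecPtr, int vlen, int flags);
>>>>
>>>>     void pollOnce() {
>>>>         int n = recvmmsg0(fd, msgvecPtr, VLEN, 0);
>>>>         for (int i = 0; i < n; i++) {
>>>>             // pull payload length, hardware timestamp and other cmsg data
>>>>             // out of the i-th mmsghdr entry, then hand the frame on
>>>>             processFrame(msgvecPtr, i);
>>>>         }
>>>>     }
>>>>
>>>>     void processFrame(long msgvecPtr, int i) { /* ... */ }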
>>>> There are also other examples of calls that we'd love to make often and
>>>> at the lowest possible cost (e.g. getrusage), but I'm not sure if there's a
>>>> strong case for some of these ideas; that's why it might be worth looking
>>>> into a more generic approach for performance-sensitive code.
>>>> Hope this does a better job of explaining where we're coming from than my
>>>> previous messages.
>>>>
>>>> Thanks,
>>>> W
>>>>
>>>> On Tue, Jun 7, 2022 at 6:31 PM <mark.reinhold at oracle.com> wrote:
>>>>
>>>>> 2022/6/6 0:24:17 -0700, wkudla.kernel at gmail.com:
>>>>> >> Yes for System.nanoTime(), but System.currentTimeMillis() reports
>>>>> >> CLOCK_REALTIME.
>>>>> >
>>>>> > Unfortunately System.currentTimeMillis() offers only millisecond
>>>>> > granularity which is the reason why our industry has to resort to
>>>>> > clock_gettime.
>>>>>
>>>>> If the platform included, say, an intrinsified System.nanoRealTime()
>>>>> method that returned clock_gettime(CLOCK_REALTIME), how much would
>>>>> that help developers in your unnamed industry?
>>>>>
>>>>> In a similar vein, if people are finding it necessary to “replace parts
>>>>> of NIO with hand-crafted native code” then it would be interesting to
>>>>> understand what their requirements are.  Some simple enhancements to
>>>>> the NIO API would be much less costly to design and implement than a
>>>>> generalized user-level native-call intrinsification mechanism.
>>>>>
>>>>> - Mark
>>>>>
>>>> --
Sent from my phone