Obsoleting JavaCritical

Wojciech Kudla wkudla.kernel at gmail.com
Mon Jul 4 12:50:47 UTC 2022


Hi Maurizio,

You are correct that under normal circumstances syscalls that are not
supported by vDSO are very heavy, but when we call recvmsg/sendmsg we don't
even perform a syscall at all. High frequency trading shops employ kernel
bypass for all network flows pretty much by default. The most popular
solution here is OpenOnload used with Xilinx products. In the case where
there's nothing to read from the RX ring, a JavaCritical JNI call to recvmsg
completes in ~11ns vs 23ns for a standard JNI call with full transition.
Sorry, I've been in this for so long I kind of assumed it's implied.
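
To put the two sets of figures in this thread side by side, here is a
minimal sketch of the arithmetic. The class and method names are made up
for illustration and are not part of any JDK or JNI API; the inputs are the
~11ns/23ns numbers above and the 250ns/245ns example quoted below. With
those inputs the kernel-bypass case saves roughly half of the call time,
while the syscall-bound case saves 2%.

```java
// Hypothetical helper for illustration only; not a JDK or JNI API.
final class TransitionSavings {

    /**
     * Fraction of the original call time that disappears when a call
     * goes from slowNs (full JNI transition) to fastNs (no transition).
     */
    static double relativeSaving(double slowNs, double fastNs) {
        if (slowNs <= 0 || fastNs < 0 || fastNs > slowNs) {
            throw new IllegalArgumentException("expected 0 <= fastNs <= slowNs and slowNs > 0");
        }
        return (slowNs - fastNs) / slowNs;
    }

    public static void main(String[] args) {
        // recvmsg on an empty RX ring under kernel bypass (figures from this message)
        System.out.printf("kernel-bypass recvmsg: %.1f%%%n", 100 * relativeSaving(23, 11));
        // a call dominated by a real user-to-kernel transition (figures from the quoted reply)
        System.out.printf("syscall-bound call:    %.1f%%%n", 100 * relativeSaving(250, 245));
    }
}
```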

Thanks,
W.

On Mon, Jul 4, 2022 at 12:59 PM Maurizio Cimadamore <
maurizio.cimadamore at oracle.com> wrote:

> Hi,
> while I'm not an expert with some of the IO calls you mention (some of my
> colleagues are more knowledgeable in this area, so I'm sure they will have
> more info), my general sense is that, as with getrusage, if there is a
> system call involved, you already pay a hefty price for the user to kernel
> transition. On my machine this seems to cost around 200ns. In these cases,
> using JNI critical to shave off a dozen nanoseconds (at best!) seems
> just not worth it.
>
> So, of the functions in your list, the ones in which I *believe* dropping
> transitions would have the most effect are (if we exclude getpid, for which
> another approach is possible) clock_gettime and getcpu, as they
> might use vdso [1], which typically brings the performance of these calls
> closer to calls to shared lib functions.
>
> If you have examples e.g. where performance of recvmsg (or related calls)
> varies significantly between base JNI and critical JNI, please send them
> our way; I'm sure some of my colleagues would be interested to take a look.
>
> Popping back a couple of levels, I think it would be helpful to also
> define what's an acceptable regression in this context. Of course, in an
> ideal world, we'd like to see no performance regression at all. But JNI
> critical is an unsupported interface, which might misbehave with modern
> garbage collectors (e.g. ZGC) and that requires quite a bit of internal
> complexity which might, in the medium/long run, hinder the evolution of the
> Java platform (all these things have _some_ cost, even if the cost is not
> directly material to developers). In this vein, I think calls like
> clock_gettime tend to be more problematic: as they complete very quickly,
> you see the cost of transitions a lot more. In other cases, where syscalls
> are involved, the costs associated with transitions are more likely to be "in
> the noise". Of course if we look at absolute numbers, dropping transitions
> would always yield "faster" code; but at the same time, going from 250ns to
> 245ns is very unlikely to result in visible performance difference when
> considering an application as a whole, so I think it's critical here to
> decide _which_ use cases to prioritize.
>
> I think a good outcome of this discussion would be if we could come to
> some shared understanding of which native calls are truly problematic (e.g.
> clock_gettime-like), and then for the JDK to provide better (and more
> maintainable) alternatives for those (which might even be faster than using
> critical JNI).
>
> Thanks
> Maurizio
>
> [1] - https://man7.org/linux/man-pages/man7/vdso.7.html
> On 04/07/2022 12:23, Wojciech Kudla wrote:
>
> Thanks Maurizio,
>
> I raised this case mainly about clock_gettime and recvmsg/sendmsg; I think
> we're focusing on the wrong things here. Feel free to drop the two syscalls
> from the discussion entirely, but the main use cases I have been presenting
> throughout this thread definitely stand.
>
> Thanks
>
>
> On Mon, Jul 4, 2022 at 10:54 AM Maurizio Cimadamore <
> maurizio.cimadamore at oracle.com> wrote:
>
>> Hi Wojtek,
>> thanks for sharing this list, I think this is a good starting point to
>> understand more about your use case.
>>
>> Last week I was looking at "getrusage" (as you mentioned it in an
>> earlier email), and I was surprised to see that the call took a pointer to
>> a (fairly big) struct which then needed to be initialized with some
>> thread-local state:
>>
>> https://man7.org/linux/man-pages/man2/getrusage.2.html
>>
>> I've looked at the implementation, and it seems to be doing memset on the
>> user-provided struct pointer, plus all the field assignments. Eyeballing
>> the implementation, this does not seem to me like a "classic" use case
>> where dropping transitions would help much. I mean, surely dropping
>> transitions would help shave some nanoseconds off the call, but it
>> doesn't seem to me that the call would be short-lived enough to make a
>> difference. Do you have some benchmarks on this one? I did some [1] and the
>> call overhead seemed to come up at 260ns/op - w/o transition you might
>> perhaps be able to get to 250ns, but that's in the noise?
>>
>> As for getpid, note that you can do (since Java 9):
>>
>> ProcessHandle.current().pid();
>>
>> I believe the impl caches the result, so it shouldn't even make the
>> native call.
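
A minimal sketch of that suggestion, using only the standard
ProcessHandle API (Java 9+). The wrapper class name is made up for
illustration, and whether the pid is cached is an implementation detail
this sketch does not rely on.

```java
// Hypothetical wrapper class; ProcessHandle itself is a standard Java 9+ API.
final class CurrentPid {

    /** Returns the pid of the running JVM without any hand-written JNI code. */
    static long viaProcessHandle() {
        return ProcessHandle.current().pid();
    }

    public static void main(String[] args) {
        System.out.println("pid = " + viaProcessHandle());
    }
}
```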
>>
>> Maurizio
>>
>> [1] - http://cr.openjdk.java.net/~mcimadamore/panama/GetrusageTest.java
>> On 02/07/2022 07:42, Wojciech Kudla wrote:
>>
>> Hi Maurizio,
>>
>> Thanks for staying on this.
>>
>> > Could you please provide a rough list of the native calls you make
>> where you believe critical JNI is having a real impact in the performance
>> of your application?
>>
>> Off the top of my head:
>> clock_gettime
>> recvmsg
>> recvmmsg
>> sendmsg
>> sendmmsg
>> select
>> getpid
>> getcpu
>> getrusage
>>
>> > Also, could you please tell us whether any of these calls need to
>> interact with Java arrays?
>> No arrays or objects of any type involved. Everything happens by the
>> means of passing raw pointers as longs and using other primitive types as
>> function arguments.
>>
>> > In other words, do you use critical JNI to remove the cost associated
>> with thread transitions, or are you also taking advantage of accessing
>> on-heap memory _directly_ from native code?
>> Critical JNI natives are used solely to remove the cost of transitions.
>> We don't get anywhere near the Java heap in native code.
>>
>> In general I think it makes a lot of sense for Java as a
>> language/platform to have some guards around unsafe code, but on the other
>> hand the popularity of libraries employing Unsafe and their success in more
>> performance-oriented corners of software engineering is a clear indicator
>> there is a need for the JVM to provide access to more low-level primitives
>> and mechanisms.
>> I think it's entirely fair to tell developers that all bets are off when
>> they get into some non-idiomatic scenarios but please don't take away a
>> feature that greatly contributed to Java's success.
>>
>> Kind regards,
>> Wojtek
>>
>> On Wed, Jun 29, 2022 at 5:20 PM Maurizio Cimadamore <
>> maurizio.cimadamore at oracle.com> wrote:
>>
>>> Hi Wojciech,
>>> picking up this thread again. After some internal discussion, we realize
>>> that we don't know enough about your use case. While re-enabling JNI
>>> critical would obviously provide a quick fix, we're afraid that (a)
>>> developers might end up depending on JNI critical when they don't need to
>>> (perhaps also unaware of the consequences of depending on it) and (b) that
>>> there might actually be _better_ (as in: much faster) solutions than using
>>> critical native calls to address at least some of your use cases (that
>>> seemed to be the case with the clock_gettime example you mentioned). Could
>>> you please provide a rough list of the native calls you make where you
>>> believe critical JNI is having a real impact in the performance of your
>>> application? Also, could you please tell us whether any of these calls need
>>> to interact with Java arrays? In other words, do you use critical JNI to
>>> remove the cost associated with thread transitions, or are you also taking
>>> advantage of accessing on-heap memory _directly_ from native code?
>>>
>>> Regards
>>> Maurizio
>>> On 13/06/2022 21:38, Wojciech Kudla wrote:
>>>
>>> Hi Mark,
>>>
>>> Thanks for your input and apologies for the delayed response.
>>>
>>> > If the platform included, say, an intrinsified System.nanoRealTime()
>>> method that returned clock_gettime(CLOCK_REALTIME), how much would
>>> that help developers in your unnamed industry?
>>>
>>> Exposing a realtime clock with nanosecond granularity in the JDK would be
>>> a great step forward. I should have made it clear that I represent the fintech
>>> corner (investment banking to be exact), but the issues my message touches
>>> upon span areas such as HPC, audio processing, gaming, and the defense industry,
>>> so it's not like we have an isolated case.
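
For comparison, a minimal sketch of what the existing JDK API offers for
wall-clock time in nanosecond units. It relies only on java.time.Instant;
the helper class and method names are made up for illustration. The
precision actually delivered by Instant.now() depends on the JDK version
and the platform clock, and this is not claimed to match the cost of a
critical JNI call to clock_gettime.

```java
// Hypothetical helper; java.time.Instant is the only JDK type used.
final class WallClockNanos {

    /** Converts an Instant to nanoseconds since the Unix epoch, failing on overflow. */
    static long epochNanos(java.time.Instant t) {
        long wholeSeconds = Math.multiplyExact(t.getEpochSecond(), 1_000_000_000L);
        return Math.addExact(wholeSeconds, t.getNano());
    }

    public static void main(String[] args) {
        // Instant.now() reads the system UTC clock; the low-order digits only
        // carry information if the underlying clock is fine-grained enough.
        System.out.println("epoch nanos = " + epochNanos(java.time.Instant.now()));
    }
}
```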
>>>
>>> > In a similar vein, if people are finding it necessary to “replace parts
>>> of NIO with hand-crafted native code” then it would be interesting to
>>> understand what their requirements are
>>>
>>> As for the other example I provided, making very short-lived
>>> syscalls such as recvmsg/recvmmsg, the premise is getting access to hardware
>>> timestamps on the ingress and egress ends as well as enabling batch receive
>>> with a single syscall and otherwise exploiting features unavailable from
>>> the JDK (like access to the CMSG interface, scatter/gather, etc).
>>> There are also other examples of calls that we'd love to make often and
>>> at the lowest possible cost (e.g. getrusage), but I'm not sure if there's a
>>> strong case for some of these ideas; that's why it might be worth looking
>>> into a more generic approach for performance-sensitive code.
>>> Hope this does a better job of explaining where we're coming from than my
>>> previous messages.
>>>
>>> Thanks,
>>> W
>>>
>>> On Tue, Jun 7, 2022 at 6:31 PM <mark.reinhold at oracle.com> wrote:
>>>
>>>> 2022/6/6 0:24:17 -0700, wkudla.kernel at gmail.com:
>>>> >> Yes for System.nanoTime(), but System.currentTimeMillis() reports
>>>> >> CLOCK_REALTIME.
>>>> >
>>>> > Unfortunately System.currentTimeMillis() offers only millisecond
>>>> > granularity which is the reason why our industry has to resort to
>>>> > clock_gettime.
>>>>
>>>> If the platform included, say, an intrinsified System.nanoRealTime()
>>>> method that returned clock_gettime(CLOCK_REALTIME), how much would
>>>> that help developers in your unnamed industry?
>>>>
>>>> In a similar vein, if people are finding it necessary to “replace parts
>>>> of NIO with hand-crafted native code” then it would be interesting to
>>>> understand what their requirements are.  Some simple enhancements to
>>>> the NIO API would be much less costly to design and implement than a
>>>> generalized user-level native-call intrinsification mechanism.
>>>>
>>>> - Mark
>>>>
>>>


More information about the panama-dev mailing list