Obsoleting JavaCritical
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Mon Jul 4 13:27:13 UTC 2022
Thanks for the clarification, this is very helpful.
I also assume that the case when "there's nothing to read" is common
enough to make a difference?
Maurizio
On 04/07/2022 13:50, Wojciech Kudla wrote:
> Hi Maurizio,
>
> You are correct that under normal circumstances syscalls that are not
> supported by vDSO are very heavy but when we call recvmsg/sendmsg we
> don't even perform a syscall at all. High frequency trading shops
> employ kernel bypass for all network flows pretty much by default. The
> most popular solution here is OpenOnload used with Xilinx products.
> For a case when there's nothing to read from the RX ring a JavaCritical
> JNI call to recvmsg completes in ~11ns vs 23ns for a standard JNI call
> with full transition.
> Sorry, I've been in this for so long I kind of assumed it's implied.
>
> Thanks,
> W.
>
> On Mon, Jul 4, 2022 at 12:59 PM Maurizio Cimadamore
> <maurizio.cimadamore at oracle.com> wrote:
>
> Hi,
> while I'm not an expert with some of the IO calls you mention
> (some of my colleagues are more knowledgeable in this area, so I'm
> sure they will have more info), my general sense is that, as with
> getrusage, if there is a system call involved, you already pay a
> hefty price for the user to kernel transition. On my machine this
> seems to cost around 200ns. In these cases, using JNI critical to
> shave off a dozen nanoseconds (at best!) seems just not worth it.
>
> So, of the functions in your list, the ones in which I *believe*
> dropping transitions would have the most effect are (if we exclude
> getpid, for which another approach is possible) clock_gettime and
> getcpu, I believe, as they might use vdso [1], which typically
> brings the performance of these calls closer to calls to shared lib
> functions.
>
> If you have examples e.g. where performance of recvmsg (or related
> calls) varies significantly between base JNI and critical JNI,
> please send them our way; I'm sure some of my colleagues would be
> interested to take a look.
>
> Popping back a couple of levels, I think it would be helpful to
> also define what's an acceptable regression in this context. Of
> course, in an ideal world, we'd like to see no performance
> regression at all. But JNI critical is an unsupported interface,
> which might misbehave with modern garbage collectors (e.g. ZGC)
> and that requires quite a bit of internal complexity which might,
> in the medium/long run, hinder the evolution of the Java platform
> (all these things have _some_ cost, even if the cost is not
> directly material to developers). In this vein, I think calls like
> clock_gettime tend to be more problematic: as they complete very
> quickly, you see the cost of transitions a lot more. In other
> cases, where syscalls are involved, the costs associated with
> transitions are more likely to be "in the noise". Of course if we
> look at absolute numbers, dropping transitions would always yield
> "faster" code; but at the same time, going from 250ns to 245ns is
> very unlikely to result in visible performance difference when
> considering an application as a whole, so I think it's critical
> here to decide _which_ use cases to prioritize.
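
To make the "in the noise" argument concrete, here is a minimal sketch of the arithmetic, using only the approximate figures quoted in this thread (250ns vs 245ns for a syscall-bound call, 23ns vs 11ns for recvmsg under kernel bypass with an empty RX ring). The class and method names are made up for illustration. The same handful of nanoseconds is about 2% of the first call and a little over half of the second.

```java
// Illustrative arithmetic only; the latencies are the rough figures quoted in the thread.
final class TransitionShare {

    /** Percentage of the baseline latency removed by going from baselineNs to reducedNs. */
    static double savedPercent(double baselineNs, double reducedNs) {
        return 100.0 * (baselineNs - reducedNs) / baselineNs;
    }

    public static void main(String[] args) {
        // Syscall-bound call: 250ns with transitions, 245ns without.
        System.out.println("syscall-bound saving (%): " + savedPercent(250, 245));
        // recvmsg via kernel bypass, nothing to read: 23ns standard JNI, 11ns critical JNI.
        System.out.println("kernel-bypass saving (%): " + savedPercent(23, 11));
    }
}
```

Which of the two regimes a given native call falls into is what decides whether dropping transitions is visible at the application level.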
>
> I think a good outcome of this discussion would be if we could
> come to some shared understanding of which native calls are truly
> problematic (e.g. clock_gettime-like), and then for the JDK to
> provide better (and more maintainable) alternatives for those
> (which might even be faster than using critical JNI).
>
> Thanks
> Maurizio
>
> [1] - https://man7.org/linux/man-pages/man7/vdso.7.html
>
> On 04/07/2022 12:23, Wojciech Kudla wrote:
>> Thanks Maurizio,
>>
>> I raised this case mainly about clock_gettime and
>> recvmsg/sendmsg; I think we're focusing on the wrong things here.
>> Feel free to drop the two syscalls from the discussion entirely,
>> but the main usecases I have been presenting throughout this
>> thread definitely stand.
>>
>> Thanks
>>
>>
>> On Mon, Jul 4, 2022 at 10:54 AM Maurizio Cimadamore
>> <maurizio.cimadamore at oracle.com> wrote:
>>
>> Hi Wojtek,
>> thanks for sharing this list, I think this is a good starting
>> point to understand more about your use case.
>>
>> Last week I was looking at "getrusage" (as you mentioned
>> it in an earlier email), and I was surprised to see that the
>> call took a pointer to a (fairly big) struct which then
>> needed to be initialized with some thread-local state:
>>
>> https://man7.org/linux/man-pages/man2/getrusage.2.html
>>
>> I've looked at the implementation, and it seems to be doing
>> memset on the user-provided struct pointer, plus all the
>> field assignments. Eyeballing the implementation, this does
>> not seem to me like a "classic" use case where dropping
>> transition would help much. I mean, surely dropping
>> transitions would help shaving some nanoseconds off the call,
>> but it doesn't seem to me that the call would be short-lived
>> enough to make a difference. Do you have some benchmarks on
>> this one? I did some [1] and the call overhead seemed to come
>> up at 260ns/op - w/o transition you might perhaps be able to
>> get to 250ns, but that's in the noise?
>>
>> As for getpid, note that you can do (since Java 9):
>>
>> ProcessHandle.current().pid();
>>
>> I believe the impl caches the result, so it shouldn't even
>> make the native call.
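
A minimal sketch of the getpid alternative mentioned above, assuming Java 9 or later. The wrapper class and method name are illustrative; only ProcessHandle.current().pid() comes from the message.

```java
// Sketch: obtaining the process id without any user-written JNI (Java 9+).
final class PidExample {

    /** Returns the id of the current process via the standard ProcessHandle API. */
    static long currentPid() {
        return ProcessHandle.current().pid();
    }

    public static void main(String[] args) {
        System.out.println("pid = " + currentPid());
    }
}
```

Whether this is fast enough for a given hot path would still need to be measured on the target JDK.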
>>
>> Maurizio
>>
>> [1] -
>> http://cr.openjdk.java.net/~mcimadamore/panama/GetrusageTest.java
>>
>> On 02/07/2022 07:42, Wojciech Kudla wrote:
>>> Hi Maurizio,
>>>
>>> Thanks for staying on this.
>>>
>>> > Could you please provide a rough list of the native calls
>>> you make where you believe critical JNI is having a real
>>> impact in the performance of your application?
>>>
>>> Off the top of my head:
>>> clock_gettime
>>> recvmsg
>>> recvmmsg
>>> sendmsg
>>> sendmmsg
>>> select
>>> getpid
>>> getcpu
>>> getrusage
>>>
>>> > Also, could you please tell us whether any of these calls
>>> need to interact with Java arrays?
>>> No arrays or objects of any type involved. Everything
>>> happens by the means of passing raw pointers as longs and
>>> using other primitive types as function arguments.
>>>
>>> > In other words, do you use critical JNI to remove the cost
>>> associated with thread transitions, or are you also taking
>>> advantage of accessing on-heap memory _directly_ from native
>>> code?
>>> Critical JNI natives are used solely to remove the cost of
>>> transitions. We don't get anywhere near the Java heap in native
>>> code.
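
For readers unfamiliar with the pattern described above, here is a hedged sketch of a binding that passes raw pointers as longs. The class name, method name and parameter list are hypothetical and are not taken from the poster's code. The comments describe the native entry-point naming as I understand the (unsupported) HotSpot critical-native convention. The native library itself is not shown, so the sketch only inspects the declaration and never calls it.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Hypothetical binding, for illustration only.
final class RawPointerNatives {

    // Only primitives cross the boundary: msghdrAddress is the address of an
    // off-heap struct msghdr, passed as a long. No arrays or objects are involved.
    //
    // Assumed C entry points for this class in the default package:
    //   regular JNI:  Java_RawPointerNatives_recvmsg(JNIEnv*, jclass, jint, jlong, jint)
    //   critical JNI: JavaCritical_RawPointerNatives_recvmsg(jint, jlong, jint)
    static native int recvmsg(int fd, long msghdrAddress, int flags);

    /** True if this class declares a static native method with the given name. */
    static boolean isStaticNative(String name) {
        for (Method m : RawPointerNatives.class.getDeclaredMethods()) {
            int mods = m.getModifiers();
            if (m.getName().equals(name) && Modifier.isStatic(mods) && Modifier.isNative(mods)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The native method is never invoked here, so no library needs to be loaded.
        System.out.println("recvmsg is static native: " + isStaticNative("recvmsg"));
    }
}
```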
>>>
>>> In general I think it makes a lot of sense for Java as a
>>> language/platform to have some guards around unsafe code,
>>> but on the other hand the popularity of libraries employing
>>> Unsafe and their success in more performance-oriented
>>> corners of software engineering is a clear indicator there
>>> is a need for the JVM to provide access to more low-level
>>> primitives and mechanisms.
>>> I think it's entirely fair to tell developers that all bets
>>> are off when they get into some non-idiomatic scenarios but
>>> please don't take away a feature that greatly contributed to
>>> Java's success.
>>>
>>> Kind regards,
>>> Wojtek
>>>
>>> On Wed, Jun 29, 2022 at 5:20 PM Maurizio Cimadamore
>>> <maurizio.cimadamore at oracle.com> wrote:
>>>
>>> Hi Wojciech,
>>> picking up this thread again. After some internal
>>> discussion, we realize that we don't know enough about
>>> your use case. While re-enabling JNI critical would
>>> obviously provide a quick fix, we're afraid that (a)
>>> developers might end up depending on JNI critical when
>>> they don't need to (perhaps also unaware of the
>>> consequences of depending on it) and (b) that there
>>> might actually be _better_ (as in: much faster)
>>> solutions than using critical native calls to address at
>>> least some of your use cases (that seemed to be the case
>>> with the clock_gettime example you mentioned). Could you
>>> please provide a rough list of the native calls you make
>>> where you believe critical JNI is having a real impact
>>> in the performance of your application? Also, could you
>>> please tell us whether any of these calls need to
>>> interact with Java arrays? In other words, do you use
>>> critical JNI to remove the cost associated with thread
>>> transitions, or are you also taking advantage of
>>> accessing on-heap memory _directly_ from native code?
>>>
>>> Regards
>>> Maurizio
>>>
>>> On 13/06/2022 21:38, Wojciech Kudla wrote:
>>>> Hi Mark,
>>>>
>>>> Thanks for your input and apologies for the delayed
>>>> response.
>>>>
>>>> > If the platform included, say, an intrinsified
>>>> System.nanoRealTime()
>>>> method that returned clock_gettime(CLOCK_REALTIME), how
>>>> much would
>>>> that help developers in your unnamed industry?
>>>>
>>>> Exposing a realtime clock with nanosecond granularity in
>>>> the JDK would be a great step forward. I should have
>>>> made it clear that I represent the fintech corner
>>>> (investment banking to be exact) but the issues my
>>>> message touches upon span areas such as HPC, audio
>>>> processing, gaming, and defense industry so it's not
>>>> like we have an isolated case.
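
As a point of comparison, here is a minimal sketch of reading the wall clock as nanoseconds since the epoch using only the existing java.time API. This is not the intrinsified System.nanoRealTime() floated above, which is a hypothetical method. The helper name is made up, the actual clock resolution depends on the platform and JDK version, and the call allocates an Instant, so it is not assumed to meet the latency requirements discussed in this thread.

```java
import java.time.Instant;

// Sketch: wall-clock time as epoch nanoseconds via java.time (resolution is platform-dependent).
final class WallClockNanos {

    /** Converts an Instant to nanoseconds since the epoch, failing loudly on long overflow. */
    static long epochNanos(Instant t) {
        return Math.addExact(Math.multiplyExact(t.getEpochSecond(), 1_000_000_000L), t.getNano());
    }

    public static void main(String[] args) {
        System.out.println("epoch nanos: " + epochNanos(Instant.now()));
    }
}
```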
>>>>
>>>> > In a similar vein, if people are finding it necessary
>>>> to “replace parts
>>>> of NIO with hand-crafted native code” then it would be
>>>> interesting to
>>>> understand what their requirements are
>>>>
>>>> As for the other example I provided with making very
>>>> short-lived syscalls such as recvmsg/recvmmsg the
>>>> premise is getting access to hardware timestamps on the
>>>> ingress and egress ends as well as enabling batch
>>>> receive with a single syscall and otherwise exploiting
>>>> features unavailable from the JDK (like access to CMSG
>>>> interface, scatter/gather, etc).
>>>> There are also other examples of calls that we'd love
>>>> to make often and at the lowest possible cost (e.g.
>>>> getrusage) but I'm not sure if there's a strong case
>>>> for some of these ideas, which is why it might be worth
>>>> looking into a more generic approach for performance
>>>> sensitive code.
>>>> Hope this does a better job at explaining where we're
>>>> coming from than my previous messages.
>>>>
>>>> Thanks,
>>>> W
>>>>
>>>> On Tue, Jun 7, 2022 at 6:31 PM
>>>> <mark.reinhold at oracle.com> wrote:
>>>>
>>>> 2022/6/6 0:24:17 -0700, wkudla.kernel at gmail.com:
>>>> >> Yes for System.nanoTime(), but
>>>> System.currentTimeMillis() reports
>>>> >> CLOCK_REALTIME.
>>>> >
>>>> > Unfortunately System.currentTimeMillis() offers
>>>> only millisecond
>>>> > granularity which is the reason why our industry
>>>> has to resort to
>>>> > clock_gettime.
>>>>
>>>> If the platform included, say, an intrinsified
>>>> System.nanoRealTime()
>>>> method that returned clock_gettime(CLOCK_REALTIME),
>>>> how much would
>>>> that help developers in your unnamed industry?
>>>>
>>>> In a similar vein, if people are finding it
>>>> necessary to “replace parts
>>>> of NIO with hand-crafted native code” then it would
>>>> be interesting to
>>>> understand what their requirements are. Some
>>>> simple enhancements to
>>>> the NIO API would be much less costly to design and
>>>> implement than a
>>>> generalized user-level native-call intrinsification
>>>> mechanism.
>>>>
>>>> - Mark
>>>>
More information about the panama-dev
mailing list