Obsoleting JavaCritical

Vitaly Davidovich vitalyd at gmail.com
Mon Jul 4 17:39:52 UTC 2022


On Mon, Jul 4, 2022 at 1:38 PM Vitaly Davidovich <vitalyd at gmail.com> wrote:

> To not sidetrack this thread with my previous reply:
>
> Maurizio - are you saying Java criticals are *already* hindering ZGC
> and/or other planned HotSpot improvements? Or that theoretically they could
> and you’d like to remove/deprecate them now(ish)?
>
> If it’s the latter,
>

> perhaps it’s prudent to keep them around until a compelling case surfaces
> where they preclude or severely restrict evolution of the platform? If it’s
> the former, I would be curious what that is, but would also understand the
> rationale behind wanting to remove them.
>

> On Mon, Jul 4, 2022 at 1:26 PM Vitaly Davidovich <vitalyd at gmail.com>
> wrote:
>
>>
>>
>> On Mon, Jul 4, 2022 at 1:13 PM Wojciech Kudla <wkudla.kernel at gmail.com>
>> wrote:
>>
>>> Thanks for your input, Vitaly. I'd be interested to find out more about
>>> the nature of the HW noise you observed in your benchmarks, as our results
>>> were very consistent and it was pretty straightforward to pinpoint the
>>> culprit as JNI call overhead. Maybe it was just easier for us because we
>>> disallow C- and P-state transitions and put a lot of effort into
>>> eliminating platform jitter in general. Were you maybe running on a CPU
>>> model that doesn't support constant TSC? I would also suggest retrying
>>> with LAPIC interrupts suppressed (with cli/sti) to see whether it's the
>>> kernel rather than the hardware.
>>>
>> This was on a Broadwell Xeon chipset with constant TSC.  All the typical
>> jitter sources were reduced: C/P states disabled in BIOS, max turbo
>> enabled, IRQs steered away, core isolated, etc.  By the way, by noise I
>> don’t mean the results themselves were noisy - they were consistent run to
>> run.  I just meant the delta between the normal and critical JNI entrypoints
>> was very minimal - i.e. “in the noise”, particularly with rdtsc.
>>
>> I can try to remeasure on newer Intel but see below …
>>
>>>
>>>
>>> 100% agree on rdtsc(p) and snippets. There are some narrow use cases where
>>> one can get substantial speed-ups with direct access to prefetch or by
>>> abusing misprediction to keep the icache hot. These scenarios are sadly only
>>> available with inline assembly. I know of a few shops that go to the length
>>> of forking Graal, etc. to achieve that, but I am quite convinced such
>>> capabilities would be welcomed and utilized by many more groups if they were
>>> easily accessible from Java.
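>>>
>>> As a rough illustration (plain C, hypothetical helper names), these are the
>>> kinds of one-or-two-instruction wrappers in question - trivial enough that
>>> any fixed call overhead dominates:
>>>
>>>     #include <stdint.h>
>>>     #include <x86intrin.h>   /* __rdtscp, _mm_prefetch */
>>>
>>>     /* Read the time-stamp counter; rdtscp waits for prior
>>>      * instructions to execute before reading. */
>>>     static inline uint64_t read_tsc(void) {
>>>         unsigned int aux;
>>>         return (uint64_t)__rdtscp(&aux);
>>>     }
>>>
>>>     /* Hint the CPU to pull the cache line at p into L1. */
>>>     static inline void prefetch_line(const void *p) {
>>>         _mm_prefetch((const char *)p, _MM_HINT_T0);
>>>     }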
>>>
>> I’m of the firm (and perhaps controversial for some :)) opinion these
>> days that Java is simply the wrong platform/tool for low latency cases that
>> warrant this level of control.  There are very strong headwinds even outside
>> of JNI costs.  And the “real” problem with JNI, besides transition costs,
>> is the lack of inlining into native calls.  So even if the JVM transition
>> costs are fully eliminated, there’s still an optimization fence due to lost
>> inlining (not unlike native code calling native functions via shared libs).
>>
>> That’s not to say that perf regressions are welcome - nobody likes those :).
>>
>>>
>>>
>>> Thanks,
>>> W.
>>>
>>> On Mon, Jul 4, 2022 at 5:51 PM Vitaly Davidovich <vitalyd at gmail.com>
>>> wrote:
>>>
>>>> I’d add rdtsc(p) wrapper functions to the list.  These are usually
>>>> either inline asm or a compiler intrinsic in the JNI entrypoint.  In
>>>> addition, any native libs exposed via JNI that have “trivial” functions are
>>>> also candidates for faster calling conventions.  There are sometimes ways to
>>>> mitigate the call overhead (e.g. batching) but it’s not always feasible.
>>>>
>>>> I’ll add that the last time I tried to measure the improvement of Java
>>>> criticals for clock_gettime (and rdtsc) it looked to be in the noise on the
>>>> hardware I was testing on.  It got to the point where I had to instrument the
>>>> critical and normal JNI entrypoints to confirm the critical was being hit.
>>>> The critical calling convention isn’t significantly different *if* basic
>>>> primitives (or no args at all) are passed as args.  JNIEnv*, IIRC, is
>>>> loaded from a register so that’s minor.  jclass (for static calls, which is
>>>> what’s relevant here) should be a compiled constant.  A critical call still
>>>> has a GCLocker check.  So I’m not actually sure what the significant
>>>> difference is for “lightweight” (i.e. few primitive or no args, primitive
>>>> return types) calls.
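>>>>
>>>> For reference, here’s a minimal sketch (class and method names are
>>>> hypothetical) of the paired entrypoints under the legacy
>>>> CriticalJNINatives convention; the critical variant drops the JNIEnv*
>>>> and jclass parameters, which is why the win is small for primitive-only
>>>> signatures:
>>>>
>>>>     #include <jni.h>
>>>>     #include <x86intrin.h>
>>>>
>>>>     /* Standard JNI entrypoint: called via the full transition and
>>>>      * handed JNIEnv* and jclass. */
>>>>     JNIEXPORT jlong JNICALL
>>>>     Java_com_example_Clock_rdtsc(JNIEnv *env, jclass cls) {
>>>>         unsigned int aux;
>>>>         return (jlong)__rdtscp(&aux);
>>>>     }
>>>>
>>>>     /* Critical variant: same symbol with a JavaCritical_ prefix, and
>>>>      * the JNIEnv* and jclass parameters dropped.  HotSpot can invoke
>>>>      * it with a cheaper convention, though the GCLocker check remains. */
>>>>     JNIEXPORT jlong JNICALL
>>>>     JavaCritical_com_example_Clock_rdtsc(void) {
>>>>         unsigned int aux;
>>>>         return (jlong)__rdtscp(&aux);
>>>>     }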
>>>>
>>>> In general, I do think it’d be nice if there was a faster native call
>>>> sequence, even if it comes with a caveat emptor and/or special requirements
>>>> on the callee (not unlike the requirements for criticals).  I think
>>>> Vladimir Ivanov was working on “snippets” that allowed dynamic construction
>>>> of a native call, possibly including assembly.  Not sure where that
>>>> exploration is these days, but that would be a welcome capability.
>>>>
>>>> My $.02.  Happy 4th of July for those celebrating!
>>>>
>>>> Vitaly
>>>>
>>>> On Mon, Jul 4, 2022 at 12:04 PM Maurizio Cimadamore <
>>>> maurizio.cimadamore at oracle.com> wrote:
>>>>
>>>>> Hi,
>>>>> while I'm not an expert on some of the IO calls you mention (some of
>>>>> my colleagues are more knowledgeable in this area, so I'm sure they will
>>>>> have more info), my general sense is that, as with getrusage, if there is a
>>>>> system call involved, you already pay a hefty price for the user-to-kernel
>>>>> transition. On my machine this seems to cost around 200ns. In these cases,
>>>>> using JNI critical to shave off a dozen nanoseconds (at best!) seems
>>>>> just not worth it.
>>>>>
>>>>> So, of the functions in your list, the ones for which I *believe*
>>>>> dropping transitions would have the most effect (if we exclude getpid,
>>>>> for which another approach is possible) are clock_gettime and getcpu,
>>>>> as they might use the vDSO [1], which typically brings the performance
>>>>> of these calls closer to that of shared-library functions.
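>>>>>
>>>>> For illustration, a minimal C sketch (not JDK code): on Linux/x86-64 glibc
>>>>> typically dispatches this call to the vDSO, so there is no syscall
>>>>> instruction and no user-to-kernel transition at all - which is exactly why,
>>>>> when called from Java, the JNI transition becomes the dominant cost:
>>>>>
>>>>>     #include <stdio.h>
>>>>>     #include <time.h>
>>>>>
>>>>>     int main(void) {
>>>>>         struct timespec ts;
>>>>>         /* Usually routed through the vDSO: a read of kernel-exported
>>>>>          * clock data mapped into user space, no kernel entry. */
>>>>>         clock_gettime(CLOCK_REALTIME, &ts);
>>>>>         printf("%lld.%09ld\n", (long long)ts.tv_sec, ts.tv_nsec);
>>>>>         return 0;
>>>>>     }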
>>>>>
>>>>> If you have examples, e.g. where the performance of recvmsg (or related
>>>>> calls) varies significantly between base JNI and critical JNI, please send
>>>>> them our way; I'm sure some of my colleagues would be interested to take a
>>>>> look.
>>>>>
>>>>> Popping back a couple of levels, I think it would be helpful to also
>>>>> define what's an acceptable regression in this context. Of course, in an
>>>>> ideal world, we'd like to see no performance regression at all. But JNI
>>>>> critical is an unsupported interface, which might misbehave with modern
>>>>> garbage collectors (e.g. ZGC) and requires quite a bit of internal
>>>>> complexity which might, in the medium/long run, hinder the evolution of the
>>>>> Java platform (all these things have _some_ cost, even if the cost is not
>>>>> directly material to developers). In this vein, I think calls like
>>>>> clock_gettime tend to be more problematic: as they complete very quickly,
>>>>> you see the cost of transitions a lot more. In other cases, where syscalls
>>>>> are involved, the costs associated with transitions are more likely to be
>>>>> "in the noise". Of course, if we look at absolute numbers, dropping
>>>>> transitions would always yield "faster" code; but at the same time, going
>>>>> from 250ns to 245ns is very unlikely to result in a visible performance
>>>>> difference when considering an application as a whole, so I think it's
>>>>> critical here to decide _which_ use cases to prioritize.
>>>>>
>>>>> I think a good outcome of this discussion would be if we could come to
>>>>> some shared understanding of which native calls are truly problematic (e.g.
>>>>> clock_gettime-like), and then for the JDK to provide better (and more
>>>>> maintainable) alternatives for those (which might even be faster than using
>>>>> critical JNI).
>>>>>
>>>>> Thanks
>>>>> Maurizio
>>>>>
>>>>> [1] - https://man7.org/linux/man-pages/man7/vdso.7.html
>>>>> On 04/07/2022 12:23, Wojciech Kudla wrote:
>>>>>
>>>>> Thanks Maurizio,
>>>>>
>>>>> I raised this case mainly about clock_gettime and recvmsg/sendmsg, so I
>>>>> think we're focusing on the wrong things here. Feel free to drop the two
>>>>> syscalls from the discussion entirely, but the main use cases I have been
>>>>> presenting throughout this thread definitely stand.
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>> On Mon, Jul 4, 2022 at 10:54 AM Maurizio Cimadamore <
>>>>> maurizio.cimadamore at oracle.com> wrote:
>>>>>
>>>>>> Hi Wojtek,
>>>>>> thanks for sharing this list, I think this is a good starting point
>>>>>> to understand more about your use case.
>>>>>>
>>>>>> Last week I was looking at "getrusage" (as you mentioned it in an
>>>>>> earlier email), and I was surprised to see that the call takes a pointer to
>>>>>> a (fairly big) struct which then needs to be initialized with some
>>>>>> thread-local state:
>>>>>>
>>>>>> https://man7.org/linux/man-pages/man2/getrusage.2.html
>>>>>>
>>>>>> I've looked at the implementation, and it seems to be doing a memset on
>>>>>> the user-provided struct pointer, plus all the field assignments.
>>>>>> Eyeballing the implementation, this does not seem to me like a "classic"
>>>>>> use case where dropping transitions would help much. I mean, surely dropping
>>>>>> transitions would help shave some nanoseconds off the call, but it
>>>>>> doesn't seem to me that the call would be short-lived enough to make a
>>>>>> difference. Do you have some benchmarks on this one? I did some [1], and the
>>>>>> call overhead seemed to come out at 260ns/op - without transitions you might
>>>>>> perhaps be able to get to 250ns, but that's in the noise?
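>>>>>>
>>>>>> If you want to reproduce the ballpark outside the JVM, here is a rough
>>>>>> native-only loop (machine-dependent, error handling elided) that times
>>>>>> back-to-back getrusage calls - this measures the pure syscall cost that
>>>>>> any JNI variant would pay regardless of transitions:
>>>>>>
>>>>>>     #include <stdio.h>
>>>>>>     #include <time.h>
>>>>>>     #include <sys/resource.h>
>>>>>>
>>>>>>     int main(void) {
>>>>>>         enum { N = 1000000 };
>>>>>>         struct rusage ru;
>>>>>>         struct timespec t0, t1;
>>>>>>         clock_gettime(CLOCK_MONOTONIC, &t0);
>>>>>>         for (int i = 0; i < N; i++)
>>>>>>             getrusage(RUSAGE_SELF, &ru);   /* no JNI involved */
>>>>>>         clock_gettime(CLOCK_MONOTONIC, &t1);
>>>>>>         double ns = (t1.tv_sec - t0.tv_sec) * 1e9
>>>>>>                   + (t1.tv_nsec - t0.tv_nsec);
>>>>>>         printf("%.1f ns/call\n", ns / N);
>>>>>>         return 0;
>>>>>>     }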
>>>>>>
>>>>>> As for getpid, note that you can do (since Java 9):
>>>>>>
>>>>>> ProcessHandle.current().pid();
>>>>>>
>>>>>> I believe the impl caches the result, so it shouldn't even make the
>>>>>> native call.
>>>>>>
>>>>>> Maurizio
>>>>>>
>>>>>> [1] -
>>>>>> http://cr.openjdk.java.net/~mcimadamore/panama/GetrusageTest.java
>>>>>> On 02/07/2022 07:42, Wojciech Kudla wrote:
>>>>>>
>>>>>> Hi Maurizio,
>>>>>>
>>>>>> Thanks for staying on this.
>>>>>>
>>>>>> > Could you please provide a rough list of the native calls you make
>>>>>> where you believe critical JNI is having a real impact in the performance
>>>>>> of your application?
>>>>>>
>>>>>> Off the top of my head:
>>>>>> clock_gettime
>>>>>> recvmsg
>>>>>> recvmmsg
>>>>>> sendmsg
>>>>>> sendmmsg
>>>>>> select
>>>>>> getpid
>>>>>> getcpu
>>>>>> getrusage
>>>>>>
>>>>>> > Also, could you please tell us whether any of these calls need to
>>>>>> interact with Java arrays?
>>>>>> No arrays or objects of any type are involved. Everything happens by
>>>>>> means of passing raw pointers as longs and using other primitive types as
>>>>>> function arguments.
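>>>>>>
>>>>>> A minimal sketch of that pattern (class and method names are made up):
>>>>>> the native side casts the incoming jlong back to a pointer and never
>>>>>> touches the Java heap:
>>>>>>
>>>>>>     #include <jni.h>
>>>>>>     #include <stdint.h>
>>>>>>     #include <string.h>
>>>>>>
>>>>>>     /* Java side (hypothetical):
>>>>>>      *   static native int fill(long bufAddr, int len);
>>>>>>      * The caller passes an address of off-heap memory it owns. */
>>>>>>     JNIEXPORT jint JNICALL
>>>>>>     Java_com_example_NativeIo_fill(JNIEnv *env, jclass cls,
>>>>>>                                    jlong bufAddr, jint len) {
>>>>>>         uint8_t *buf = (uint8_t *)(intptr_t)bufAddr;
>>>>>>         memset(buf, 0, (size_t)len);
>>>>>>         return len;
>>>>>>     }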
>>>>>>
>>>>>> > In other words, do you use critical JNI to remove the cost
>>>>>> associated with thread transitions, or are you also taking advantage of
>>>>>> accessing on-heap memory _directly_ from native code?
>>>>>> Critical JNI natives are used solely to remove the cost of
>>>>>> transitions. We don't get anywhere near the Java heap in native code.
>>>>>>
>>>>>> In general I think it makes a lot of sense for Java as a
>>>>>> language/platform to have some guards around unsafe code. But on the other
>>>>>> hand, the popularity of libraries employing Unsafe, and their success in the
>>>>>> more performance-oriented corners of software engineering, is a clear
>>>>>> indicator that there is a need for the JVM to provide access to more
>>>>>> low-level primitives and mechanisms.
>>>>>> I think it's entirely fair to tell developers that all bets are off
>>>>>> when they get into non-idiomatic scenarios, but please don't take away
>>>>>> a feature that greatly contributed to Java's success.
>>>>>>
>>>>>> Kind regards,
>>>>>> Wojtek
>>>>>>
>>>>>> On Wed, Jun 29, 2022 at 5:20 PM Maurizio Cimadamore <
>>>>>> maurizio.cimadamore at oracle.com> wrote:
>>>>>>
>>>>>>> Hi Wojciech,
>>>>>>> picking up this thread again. After some internal discussion, we
>>>>>>> realize that we don't know enough about your use case. While re-enabling
>>>>>>> JNI critical would obviously provide a quick fix, we're afraid that (a)
>>>>>>> developers might end up depending on JNI critical when they don't need to
>>>>>>> (perhaps also unaware of the consequences of depending on it) and (b) that
>>>>>>> there might actually be _better_ (as in: much faster) solutions than using
>>>>>>> critical native calls to address at least some of your use cases (that
>>>>>>> seemed to be the case with the clock_gettime example you mentioned). Could
>>>>>>> you please provide a rough list of the native calls you make where you
>>>>>>> believe critical JNI is having a real impact in the performance of your
>>>>>>> application? Also, could you please tell us whether any of these calls need
>>>>>>> to interact with Java arrays? In other words, do you use critical JNI to
>>>>>>> remove the cost associated with thread transitions, or are you also taking
>>>>>>> advantage of accessing on-heap memory _directly_ from native code?
>>>>>>>
>>>>>>> Regards
>>>>>>> Maurizio
>>>>>>> On 13/06/2022 21:38, Wojciech Kudla wrote:
>>>>>>>
>>>>>>> Hi Mark,
>>>>>>>
>>>>>>> Thanks for your input and apologies for the delayed response.
>>>>>>>
>>>>>>> > If the platform included, say, an intrinsified System.nanoRealTime()
>>>>>>> > method that returned clock_gettime(CLOCK_REALTIME), how much would
>>>>>>> > that help developers in your unnamed industry?
>>>>>>>
>>>>>>> Exposing a realtime clock with nanosecond granularity in the JDK would
>>>>>>> be a great step forward. I should have made it clear that I represent the
>>>>>>> fintech corner (investment banking, to be exact), but the issues my message
>>>>>>> touches upon span areas such as HPC, audio processing, gaming, and the
>>>>>>> defense industry, so it's not as if we have an isolated case.
>>>>>>>
>>>>>>> > In a similar vein, if people are finding it necessary to “replace
>>>>>>> > parts of NIO with hand-crafted native code” then it would be
>>>>>>> > interesting to understand what their requirements are
>>>>>>>
>>>>>>> As for the other example I provided, making very short-lived
>>>>>>> syscalls such as recvmsg/recvmmsg: the premise is getting access to hardware
>>>>>>> timestamps on the ingress and egress ends, as well as enabling batch receive
>>>>>>> with a single syscall and otherwise exploiting features unavailable from
>>>>>>> the JDK (like access to the CMSG interface, scatter/gather, etc.).
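>>>>>>>
>>>>>>> To make that combination concrete, here is a rough sketch (socket
>>>>>>> setup, the SO_TIMESTAMPING option, and error handling elided) of batch
>>>>>>> receive plus reading timestamps from the CMSG data:
>>>>>>>
>>>>>>>     #define _GNU_SOURCE
>>>>>>>     #include <sys/socket.h>
>>>>>>>     #include <linux/errqueue.h>   /* struct scm_timestamping */
>>>>>>>     #include <stdio.h>
>>>>>>>     #include <string.h>
>>>>>>>
>>>>>>>     #define BATCH 8
>>>>>>>
>>>>>>>     /* Receive up to BATCH datagrams with one recvmmsg syscall and
>>>>>>>      * walk each message's control data for timestamp records. */
>>>>>>>     static void drain(int fd) {
>>>>>>>         struct mmsghdr msgs[BATCH];
>>>>>>>         struct iovec iovs[BATCH];
>>>>>>>         char bufs[BATCH][2048], ctrl[BATCH][512];
>>>>>>>         memset(msgs, 0, sizeof msgs);
>>>>>>>         for (int i = 0; i < BATCH; i++) {
>>>>>>>             iovs[i].iov_base = bufs[i];
>>>>>>>             iovs[i].iov_len  = sizeof bufs[i];
>>>>>>>             msgs[i].msg_hdr.msg_iov        = &iovs[i];
>>>>>>>             msgs[i].msg_hdr.msg_iovlen     = 1;
>>>>>>>             msgs[i].msg_hdr.msg_control    = ctrl[i];
>>>>>>>             msgs[i].msg_hdr.msg_controllen = sizeof ctrl[i];
>>>>>>>         }
>>>>>>>         int n = recvmmsg(fd, msgs, BATCH, MSG_DONTWAIT, NULL);
>>>>>>>         for (int i = 0; i < n; i++) {
>>>>>>>             struct cmsghdr *c;
>>>>>>>             for (c = CMSG_FIRSTHDR(&msgs[i].msg_hdr); c != NULL;
>>>>>>>                  c = CMSG_NXTHDR(&msgs[i].msg_hdr, c)) {
>>>>>>>                 if (c->cmsg_level == SOL_SOCKET &&
>>>>>>>                     c->cmsg_type == SCM_TIMESTAMPING) {
>>>>>>>                     struct scm_timestamping *ts =
>>>>>>>                         (struct scm_timestamping *)CMSG_DATA(c);
>>>>>>>                     /* ts->ts[0]: software, ts->ts[2]: raw HW stamp */
>>>>>>>                     printf("msg %d: hw %lld.%09ld\n", i,
>>>>>>>                            (long long)ts->ts[2].tv_sec,
>>>>>>>                            ts->ts[2].tv_nsec);
>>>>>>>                 }
>>>>>>>             }
>>>>>>>         }
>>>>>>>     }
>>>>>>>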
>>>>>>> There are also other examples of calls that we'd love to make often
>>>>>>> and at the lowest possible cost (e.g. getrusage), but I'm not sure there's a
>>>>>>> strong case for some of these ideas; that's why it might be worth looking
>>>>>>> into a more generic approach for performance-sensitive code.
>>>>>>> Hope this does a better job of explaining where we're coming from than
>>>>>>> my previous messages.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> W
>>>>>>>
>>>>>>> On Tue, Jun 7, 2022 at 6:31 PM <mark.reinhold at oracle.com> wrote:
>>>>>>>
>>>>>>>> 2022/6/6 0:24:17 -0700, wkudla.kernel at gmail.com:
>>>>>>>> >> Yes for System.nanoTime(), but System.currentTimeMillis() reports
>>>>>>>> >> CLOCK_REALTIME.
>>>>>>>> >
>>>>>>>> > Unfortunately System.currentTimeMillis() offers only millisecond
>>>>>>>> > granularity which is the reason why our industry has to resort to
>>>>>>>> > clock_gettime.
>>>>>>>>
>>>>>>>> If the platform included, say, an intrinsified System.nanoRealTime()
>>>>>>>> method that returned clock_gettime(CLOCK_REALTIME), how much would
>>>>>>>> that help developers in your unnamed industry?
>>>>>>>>
>>>>>>>> In a similar vein, if people are finding it necessary to “replace parts
>>>>>>>> of NIO with hand-crafted native code” then it would be interesting to
>>>>>>>> understand what their requirements are.  Some simple enhancements to
>>>>>>>> the NIO API would be much less costly to design and implement than a
>>>>>>>> generalized user-level native-call intrinsification mechanism.
>>>>>>>>
>>>>>>>> - Mark
>>>>>>>>
-- 
Sent from my phone