Obsoleting JavaCritical
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Thu Jul 7 14:36:17 UTC 2022
I'm dropping most of the direct recipients and going back to just
panama-dev and hotspot-dev, as it appears that our server is having
issues handling too many recipients (the message that got delivered
today was written a few days ago :-) ).
I suggest everybody do the same, and just use the mailing lists for
further replies to this thread.
Cheers
Maurizio
On 05/07/2022 12:33, Maurizio Cimadamore wrote:
>
> Hi,
> As Erik explained in his reply, what we call "critical JNI" comes in
> two pieces: one removes Java to native thread transitions (which is
> what Wojciech is referring to), while another part interacts with the
> GC locker (basically to allow critical JNI code to access Java arrays
> w/o copying). I think the latter part is the most problematic GC-wise.
>
> Then, regarding the former, I think there are still questions as to
> whether dropping transitions is the best way to get the required
> performance boost; for instance, yesterday I did some experiments
> with an experimental patch from Jorn (kudos) which re-enables an
> opt-in for "trivial" native calls in the Panama API. I used it to
> test clock_gettime and, while there's an improvement, the results I
> got were not as conclusive as one might have expected. This is what
> I get w/ state transitions:
>
> ```
> Benchmark                                 Mode  Cnt   Score   Error  Units
> ClockgettimeTest.panama_monotonic         avgt   30  27.814 ± 0.165  ns/op
> ClockgettimeTest.panama_monotonic_coarse  avgt   30  12.094 ± 0.103  ns/op
> ClockgettimeTest.panama_monotonic_raw     avgt   30  27.719 ± 0.393  ns/op
> ClockgettimeTest.panama_realtime          avgt   30  27.133 ± 0.280  ns/op
> ClockgettimeTest.panama_realtime_coarse   avgt   30  26.812 ± 0.384  ns/op
> ```
>
> And this is what I get with transitions removed:
>
> ```
> Benchmark                                 Mode  Cnt   Score   Error  Units
> ClockgettimeTest.panama_monotonic         avgt   30  22.383 ± 0.213  ns/op
> ClockgettimeTest.panama_monotonic_coarse  avgt   30   6.312 ± 0.117  ns/op
> ClockgettimeTest.panama_monotonic_raw     avgt   30  22.731 ± 0.279  ns/op
> ClockgettimeTest.panama_realtime          avgt   30  22.503 ± 0.292  ns/op
> ClockgettimeTest.panama_realtime_coarse   avgt   30  21.853 ± 0.100  ns/op
> ```
>
> Here we can see a gain of 4-5ns, obtained by dropping the transition.
> The only case where this makes a significant difference is the
> monotonic_coarse flavor. In the other cases there's a difference,
> yes, but not as pronounced, simply because the term we're comparing
> against is bigger: it's easy to see a 5ns gain if your function runs
> for 10ns in total - but such a gain starts to get lost in the "noise"
> when functions run for longer. And that's the main issue with
> removing Java->native transitions: the "window" in which this
> optimization yields a positive effect is extremely narrow (anything
> lasting longer than 30ns probably won't see much of a difference),
> but, as you can see from the PR in [1], the VM changes required to
> support it touch quite a bit of stuff!
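>
> To put that window in numbers, here is a quick back-of-the-envelope
> sketch (plain Java, no JMH; the durations are illustrative) of how
> the same ~5ns saving shrinks in relative terms as the call itself
> gets longer:
>
> ```java
> public class SpeedupExample {
>     public static void main(String[] args) {
>         double savedNs = 5.0; // approximate gain from dropping transitions
>         for (double callNs : new double[]{10, 30, 250}) {
>             // relative improvement for a native call of the given duration
>             System.out.printf("%.0f ns call: %.1f%% faster%n",
>                     callNs, 100 * savedNs / callNs);
>         }
>     }
> }
> ```
>
> A 10ns call improves by 50%, a 250ns syscall by only 2% - which is
> why the optimization window is so narrow.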
>
> Luckily, selectively disabling transitions from Panama is slightly
> more straightforward and, perhaps, for stuff like the recvmsg
> syscalls, there's not much else we can do: while one could imagine
> Panama special-casing calls to clock_gettime, as that's a known
> "leaf", the same cannot be done with recvmsg, which is in general a
> blocking call. Panama also has a "trusted mode" flag
> (--enable-native-access), so there is a way in the Panama API to
> distinguish between safe and unsafe API points, which also helps
> with this. The risk of course is for developers to see whatever
> mechanism is provided as some kind of "make my code go fast please"
> knob and apply it blindly, w/o fully understanding the consequences.
> What I said before about the "extremely narrow window" remains true:
> in the vast majority of cases (like 99%), dropping state transitions
> can result in very big downsides, while the corresponding upsides
> are not big enough to even be noticeable (the Q/A in [2] arrives at
> a very similar conclusion).
>
> All this said, selectively disabling state transitions for native
> calls made using the Panama foreign API seems the most
> straightforward way to offset the performance delta introduced by
> the removal of critical JNI. In part it's because the Panama API is
> more flexible, e.g. function descriptors allow us to model the
> distinction between a trivial and a non-trivial call; in part it's
> because, as stated above, Panama can already reason about calls that
> are "unsafe" and that require extra permissions. And, finally, it's
> also because, if we added back critical JNI, we'd probably add it
> back w/o its most problematic GC locker parts (that's what [1] does
> AFAIK) - which means it wouldn't be a complete reversal anyway. So,
> perhaps, coming up with a fresh mechanism to drop transitions (only)
> could also be less confusing for developers. Of course this would
> require developers such as Wojciech to rewrite some of their code to
> use Panama instead of JNI.
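>
> For the curious, here's a rough sketch of what such a call could
> look like with the Panama foreign API. Note this is written against
> the API shape that eventually shipped (the trivial-call opt-in
> landed as Linker.Option.critical in JDK 22); the names in the
> experimental patch differ, and the struct layout assumes 64-bit
> Linux:
>
> ```java
> import java.lang.foreign.*;
> import java.lang.invoke.MethodHandle;
>
> public class ClockGettimeSample {
>     static final int CLOCK_MONOTONIC = 1; // Linux value
>
>     public static void main(String[] args) throws Throwable {
>         Linker linker = Linker.nativeLinker();
>         // int clock_gettime(clockid_t clk_id, struct timespec *tp)
>         MethodHandle clockGettime = linker.downcallHandle(
>                 linker.defaultLookup().find("clock_gettime").orElseThrow(),
>                 FunctionDescriptor.of(ValueLayout.JAVA_INT,
>                         ValueLayout.JAVA_INT, ValueLayout.ADDRESS),
>                 Linker.Option.critical(false)); // drop transitions; no heap access
>
>         // struct timespec is two longs on 64-bit Linux
>         MemoryLayout timespec = MemoryLayout.structLayout(
>                 ValueLayout.JAVA_LONG.withName("tv_sec"),
>                 ValueLayout.JAVA_LONG.withName("tv_nsec"));
>
>         try (Arena arena = Arena.ofConfined()) {
>             MemorySegment ts = arena.allocate(timespec);
>             int rc = (int) clockGettime.invokeExact(CLOCK_MONOTONIC, ts);
>             long nanos = ts.get(ValueLayout.JAVA_LONG, 0) * 1_000_000_000L
>                     + ts.get(ValueLayout.JAVA_LONG, 8);
>             System.out.println(rc == 0 && nanos > 0);
>         }
>     }
> }
> ```
>
> (Requires JDK 22+ on Linux; run with --enable-native-access to
> silence the restricted-method warning.)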
>
> And, coming back to clock_gettime, my feeling is that with the right
> tools (e.g. some intrinsics), we can make that go a lot faster than
> what is shown above. Being able to quickly get a timestamp seems a
> widely-enough applicable use case to deserve some special treatment.
> So, perhaps, it's worth considering a _spectrum of solutions_ for
> improving the status quo, rather than investing solely in the
> removal of thread transitions.
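>
> As a point of reference for what's available today: Instant.now()
> already exposes the realtime clock with sub-millisecond precision
> (typically microseconds, depending on platform), although each call
> still goes through a regular native call:
>
> ```java
> import java.time.Instant;
>
> public class RealtimeNanos {
>     public static void main(String[] args) {
>         Instant now = Instant.now();
>         // epoch timestamp with whatever sub-second precision the platform offers
>         long epochNanos = now.getEpochSecond() * 1_000_000_000L + now.getNano();
>         System.out.println(epochNanos > 0);
>     }
> }
> ```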
>
> Maurizio
>
> [1] - https://github.com/openjdk/jdk19/pull/90/files
> [2] - https://youtu.be/LoyBTqkSkZk?t=742
>
>
> On 04/07/2022 18:38, Vitaly Davidovich wrote:
>> To not sidetrack this thread with my previous reply:
>>
>> Maurizio - are you saying java criticals are *already* hindering ZGC
>> and/or other planned Hotspot improvements? Or that theoretically they
>> could and you’d like to remove/deprecate them now(ish)?
>>
>> If it’s the latter, perhaps it’s prudent to keep them around until a
>> compelling case surfaces where they preclude or severely restrict
>> evolution of the platform? If it’s the former, I would be curious
>> what that is, but would also understand the rationale behind wanting
>> to remove them.
>>
>> On Mon, Jul 4, 2022 at 1:26 PM Vitaly Davidovich <vitalyd at gmail.com>
>> wrote:
>>
>>
>>
>> On Mon, Jul 4, 2022 at 1:13 PM Wojciech Kudla
>> <wkudla.kernel at gmail.com> wrote:
>>
>> Thanks for your input, Vitaly. I'd be interested to find out
>> more about the nature of the HW noise you observed in your
>> benchmarks, as our results were very consistent and it was
>> pretty straightforward to pinpoint the culprit as JNI call
>> overhead. Maybe it was just easier for us because we disallow
>> C- and P-state transitions and put a lot of effort into
>> eliminating platform jitter in general. Were you maybe running
>> on a CPU model that doesn't support constant TSC? I would
>> also suggest retrying with LAPIC interrupts suppressed (with
>> cli/sti) to see whether it's the kernel and not the hardware.
>>
>> This was on a Broadwell Xeon chipset with constant tsc. All the
>> typical jitter sources were reduced: C/P states disabled in bios,
>> max turbo enabled, IRQs steered away, core isolated, etc. By the
>> way, by noise I don’t mean the results themselves were noisy -
>> they were constant run to run. I just meant the delta between
>> normal vs critical JNI entrypoints was very minimal - ie “in the
>> noise”, particularly with rdtsc.
>>
>> I can try to remeasure on newer Intel but see below …
>>
>>
>>
>> 100% agree on rdtsc(p) and snippets. There are some narrow
>> use cases where one can get substantial speedups with
>> direct access to prefetch or by abusing misprediction to keep
>> the icache hot. These scenarios are sadly only available with
>> inline assembly. I know of a few shops that go to the length
>> of forking Graal, etc. to achieve that, but am quite convinced
>> such capabilities would be welcomed and utilized by many more
>> groups if they were easily accessible from Java.
>>
>> I’m of the firm (and perhaps controversial for some :)) opinion
>> these days that Java is simply the wrong platform/tool for low
>> latency cases that warrant this level of control. There’re very
>> strong headwinds even outside of JNI costs. And the “real”
>> problem with JNI, besides transition costs, is lack of inlining
>> into the native calls. So even if JVM transition costs are fully
>> eliminated, there’s still an optimization fence due to lost
>> inlining (not unlike native code calling native fns via shared libs).
>>
>> That’s not to say that perf regressions are welcome - nobody likes
>> those :).
>>
>>
>>
>> Thanks,
>> W.
>>
>> On Mon, Jul 4, 2022 at 5:51 PM Vitaly Davidovich
>> <vitalyd at gmail.com> wrote:
>>
>> I’d add rdtsc(p) wrapper functions to the list. These
>> are usually either inline asm or compiler intrinsic in
>> the JNI entrypoint. In addition, any native libs exposed
>> via JNI that have “trivial” functions are also candidates
>> for faster calling conventions. There are sometimes ways
>> to mitigate the call overhead (e.g. batching) but it’s not
>> always feasible.
>>
>> I’ll add that last time I tried to measure the
>> improvement of Java criticals for clock_gettime (and
>> rdtsc) it looked to be in the noise on the hardware I was
>> testing on. It got to the point where I had to instrument
>> the critical and normal JNI entrypoints to confirm the
>> critical was being hit. The critical calling convention
>> isn’t significantly different *if* basic primitives (or
>> no args at all) are passed as args. JNIEnv*, IIRC, is
>> loaded from a register so that’s minor. jclass (for
>> static calls, which is what’s relevant here) should be a
>> compiled constant. Critical call still has a GCLocker
>> check. So I’m not actually sure what the significant
>> difference is for “lightweight” (ie few primitive or no
>> args, primitive return types) calls.
>>
>> In general, I do think it’d be nice if there was a faster
>> native call sequence, even if it comes with a caveat
>> emptor and/or special requirements on the callee (not
>> unlike the requirements for criticals). I think Vladimir
>> Ivanov was working on “snippets” that allowed dynamic
>> construction of a native call, possibly including
>> assembly. Not sure where that exploration is these days,
>> but that would be a welcome capability.
>>
>> My $.02. Happy 4th of July for those celebrating!
>>
>> Vitaly
>>
>> On Mon, Jul 4, 2022 at 12:04 PM Maurizio Cimadamore
>> <maurizio.cimadamore at oracle.com> wrote:
>>
>> Hi,
>> while I'm not an expert with some of the IO calls you
>> mention (some of my colleagues are more knowledgeable
>> in this area, so I'm sure they will have more info),
>> my general sense is that, as with getrusage, if there
>> is a system call involved, you already pay a hefty
>> price for the user-to-kernel transition. On my
>> machine this seems to cost around 200ns. In these
>> cases, using critical JNI to shave off a dozen
>> nanoseconds (at best!) seems just not worth it.
>>
>> So, of the functions in your list, the ones in which
>> I *believe* dropping transitions would have the most
>> effect are (if we exclude getpid, for which another
>> approach is possible) clock_gettime and getcpu,
>> as they might use the vDSO [1], which typically
>> brings the performance of these calls closer to that
>> of shared library functions.
>>
>> If you have examples e.g. where performance of
>> recvmsg (or related calls) varies significantly
>> between base JNI and critical JNI, please send them
>> our way; I'm sure some of my colleagues would be
>> interested to take a look.
>>
>> Popping back a couple of levels, I think it would be
>> helpful to also define what's an acceptable
>> regression in this context. Of course, in an ideal
>> world, we'd like to see no performance regression at
>> all. But JNI critical is an unsupported interface,
>> which might misbehave with modern garbage collectors
>> (e.g. ZGC) and that requires quite a bit of internal
>> complexity which might, in the medium/long run,
>> hinder the evolution of the Java platform (all these
>> things have _some_ cost, even if the cost is not
>> directly material to developers). In this vein, I
>> think calls like clock_gettime tend to be more
>> problematic: as they complete very quickly, you see
>> the cost of transitions a lot more. In other cases,
>> where syscalls are involved, the costs associated with
>> transitions are more likely to be "in the noise". Of
>> course if we look at absolute numbers, dropping
>> transitions would always yield "faster" code; but at
>> the same time, going from 250ns to 245ns is very
>> unlikely to result in visible performance difference
>> when considering an application as a whole, so I
>> think it's critical here to decide _which_ use cases
>> to prioritize.
>>
>> I think a good outcome of this discussion would be if
>> we could come to some shared understanding of which
>> native calls are truly problematic (e.g.
>> clock_gettime-like), and then for the JDK to provide
>> better (and more maintainable) alternatives for those
>> (which might even be faster than using critical JNI).
>>
>> Thanks
>> Maurizio
>>
>> [1] - https://man7.org/linux/man-pages/man7/vdso.7.html
>>
>> On 04/07/2022 12:23, Wojciech Kudla wrote:
>>> Thanks Maurizio,
>>>
>>> I raised this case mainly about clock_gettime and
>>> recvmsg/sendmsg; I think we're focusing on the wrong
>>> things here. Feel free to drop the two syscalls from
>>> the discussion entirely, but the main use cases I
>>> have been presenting throughout this thread
>>> definitely stand.
>>>
>>> Thanks
>>>
>>>
>>> On Mon, Jul 4, 2022 at 10:54 AM Maurizio Cimadamore
>>> <maurizio.cimadamore at oracle.com> wrote:
>>>
>>> Hi Wojtek,
>>> thanks for sharing this list, I think this is a
>>> good starting point to understand more about
>>> your use case.
>>>
>>> Last week I've been looking at "getrusage" (as
>>> you mentioned it in an earlier email), and I was
>>> surprised to see that the call took a pointer to
>>> a (fairly big) struct which then needed to be
>>> initialized with some thread-local state:
>>>
>>> https://man7.org/linux/man-pages/man2/getrusage.2.html
>>>
>>> I've looked at the implementation, and it seems
>>> to be doing a memset on the user-provided struct
>>> pointer, plus all the field assignments.
>>> Eyeballing the implementation, this does not
>>> seem to me like a "classic" use case where
>>> dropping transitions would help much. I mean,
>>> surely dropping transitions would help shave
>>> some nanoseconds off the call, but it doesn't
>>> seem to me that the call would be short-lived
>>> enough to make a difference. Do you have some
>>> benchmarks on this one? I did some [1] and the
>>> call overhead seemed to come up at 260ns/op -
>>> w/o transition you might perhaps be able to get
>>> to 250ns, but that's in the noise?
>>>
>>> As for getpid, note that you can do (since Java 9):
>>>
>>> ProcessHandle.current().pid();
>>>
>>> I believe the impl caches the result, so it
>>> shouldn't even make the native call.
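>>>
>>> For instance, a quick check (runs on any JDK 9+):
>>>
>>> ```java
>>> public class PidSample {
>>>     public static void main(String[] args) {
>>>         // no repeated native call needed; the pid is cached by the JDK
>>>         long pid = ProcessHandle.current().pid();
>>>         System.out.println(pid > 0); // prints "true"
>>>     }
>>> }
>>> ```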
>>>
>>> Maurizio
>>>
>>> [1] -
>>> http://cr.openjdk.java.net/~mcimadamore/panama/GetrusageTest.java
>>>
>>> On 02/07/2022 07:42, Wojciech Kudla wrote:
>>>> Hi Maurizio,
>>>>
>>>> Thanks for staying on this.
>>>>
>>>> > Could you please provide a rough list of the
>>>> native calls you make where you believe
>>>> critical JNI is having a real impact in the
>>>> performance of your application?
>>>>
>>>> Off the top of my head:
>>>> clock_gettime
>>>> recvmsg
>>>> recvmmsg
>>>> sendmsg
>>>> sendmmsg
>>>> select
>>>> getpid
>>>> getcpu
>>>> getrusage
>>>>
>>>> > Also, could you please tell us whether any of
>>>> these calls need to interact with Java arrays?
>>>> No arrays or objects of any type involved.
>>>> Everything happens by means of passing raw
>>>> pointers as longs and using other primitive
>>>> types as function arguments.
>>>>
>>>> > In other words, do you use critical JNI to
>>>> remove the cost associated with thread
>>>> transitions, or are you also taking advantage
>>>> of accessing on-heap memory _directly_ from
>>>> native code?
>>>> Critical JNI natives are used solely to remove
>>>> the cost of transitions. We don't get anywhere
>>>> near java heap in native code.
>>>>
>>>> In general I think it makes a lot of sense for
>>>> Java as a language/platform to have some guards
>>>> around unsafe code, but on the other hand the
>>>> popularity of libraries employing Unsafe and
>>>> their success in more performance-oriented
>>>> corners of software engineering is a clear
>>>> indicator that there is a need for the JVM to
>>>> provide access to more low-level primitives and
>>>> mechanisms.
>>>> I think it's entirely fair to tell developers
>>>> that all bets are off when they get into some
>>>> non-idiomatic scenarios but please don't take
>>>> away a feature that greatly contributed to
>>>> Java's success.
>>>>
>>>> Kind regards,
>>>> Wojtek
>>>>
>>>> On Wed, Jun 29, 2022 at 5:20 PM Maurizio
>>>> Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>>>>
>>>> Hi Wojciech,
>>>> picking up this thread again. After some
>>>> internal discussion, we realize that we
>>>> don't know enough about your use case.
>>>> While re-enabling JNI critical would
>>>> obviously provide a quick fix, we're afraid
>>>> that (a) developers might end up depending
>>>> on JNI critical when they don't need to
>>>> (perhaps also unaware of the consequences
>>>> of depending on it) and (b) that there
>>>> might actually be _better_ (as in: much
>>>> faster) solutions than using critical
>>>> native calls to address at least some of
>>>> your use cases (that seemed to be the case
>>>> with the clock_gettime example you
>>>> mentioned). Could you please provide a
>>>> rough list of the native calls you make
>>>> where you believe critical JNI is having a
>>>> real impact in the performance of your
>>>> application? Also, could you please tell us
>>>> whether any of these calls need to interact
>>>> with Java arrays? In other words, do you
>>>> use critical JNI to remove the cost
>>>> associated with thread transitions, or are
>>>> you also taking advantage of accessing
>>>> on-heap memory _directly_ from native code?
>>>>
>>>> Regards
>>>> Maurizio
>>>>
>>>> On 13/06/2022 21:38, Wojciech Kudla wrote:
>>>>> Hi Mark,
>>>>>
>>>>> Thanks for your input and apologies for
>>>>> the delayed response.
>>>>>
>>>>> > If the platform included, say, an
>>>>> intrinsified System.nanoRealTime()
>>>>> method that returned
>>>>> clock_gettime(CLOCK_REALTIME), how much would
>>>>> that help developers in your unnamed industry?
>>>>>
>>>>> Exposing realtime clock with nanosecond
>>>>> granularity in the JDK would be a great
>>>>> step forward. I should have made it clear
>>>>> that I represent the fintech corner
>>>>> (investment banking, to be exact), but the
>>>>> issues my message touches upon span areas
>>>>> such as HPC, audio processing, gaming, and
>>>>> the defense industry, so it's not like we have
>>>>> an isolated case.
>>>>>
>>>>> > In a similar vein, if people are finding
>>>>> it necessary to “replace parts
>>>>> of NIO with hand-crafted native code” then
>>>>> it would be interesting to
>>>>> understand what their requirements are
>>>>>
>>>>> As for the other example I provided with
>>>>> making very short lived syscalls such as
>>>>> recvmsg/recvmmsg the premise is getting
>>>>> access to hardware timestamps on the
>>>>> ingress and egress ends as well as
>>>>> enabling batch receive with a single
>>>>> syscall and otherwise exploiting features
>>>>> unavailable from the JDK (like access to
>>>>> CMSG interface, scatter/gather, etc).
>>>>> There are also other examples of calls
>>>>> that we'd love to make often and at the lowest
>>>>> possible cost (e.g. getrusage), but I'm not
>>>>> sure if there's a strong case for some of
>>>>> these ideas; that's why it might be worth
>>>>> looking into a more generic approach for
>>>>> performance-sensitive code.
>>>>> Hope this does a better job of explaining
>>>>> where we're coming from than my previous
>>>>> messages.
>>>>>
>>>>> Thanks,
>>>>> W
>>>>>
>>>>> On Tue, Jun 7, 2022 at 6:31 PM
>>>>> <mark.reinhold at oracle.com> wrote:
>>>>>
>>>>> 2022/6/6 0:24:17 -0700,
>>>>> wkudla.kernel at gmail.com:
>>>>> >> Yes for System.nanoTime(), but
>>>>> System.currentTimeMillis() reports
>>>>> >> CLOCK_REALTIME.
>>>>> >
>>>>> > Unfortunately
>>>>> System.currentTimeMillis() offers only
>>>>> millisecond
>>>>> > granularity which is the reason why
>>>>> our industry has to resort to
>>>>> > clock_gettime.
>>>>>
>>>>> If the platform included, say, an
>>>>> intrinsified System.nanoRealTime()
>>>>> method that returned
>>>>> clock_gettime(CLOCK_REALTIME), how
>>>>> much would
>>>>> that help developers in your unnamed
>>>>> industry?
>>>>>
>>>>> In a similar vein, if people are
>>>>> finding it necessary to “replace parts
>>>>> of NIO with hand-crafted native code”
>>>>> then it would be interesting to
>>>>> understand what their requirements
>>>>> are. Some simple enhancements to
>>>>> the NIO API would be much less costly
>>>>> to design and implement than a
>>>>> generalized user-level native-call
>>>>> intrinsification mechanism.
>>>>>
>>>>> - Mark
>>>>>
>> --
>> Sent from my phone
>>