Obsoleting JavaCritical

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Thu Jul 7 14:36:17 UTC 2022


I'm dropping most of the direct recipients and going back to using just 
panama-dev and hotspot-dev, as it appears that our server is having 
issues handling too many recipients (the message that got delivered 
today was written a few days ago :-) ).

I suggest everybody do the same, and use just the mailing lists for 
further replies to this thread.

Cheers
Maurizio


On 05/07/2022 12:33, Maurizio Cimadamore wrote:
>
> Hi,
> As Erik explained in his reply, what we call "critical JNI" comes in 
> two pieces: one removes Java to native thread transitions (which is 
> what Wojciech is referring to), while another part interacts with the 
> GC locker (basically to allow critical JNI code to access Java arrays 
> w/o copying). I think the latter part is the most problematic GC-wise.
>
> Then, regarding the former, I think there are still questions as to 
> whether dropping transitions is the best way to get the required 
> performance boost; for instance, yesterday I did some experiments with 
> an experimental patch from Jorn (kudos) which re-enables an opt-in for 
> "trivial" native calls in the Panama API. I used it to test 
> clock_gettime and, while there's an improvement, the results I got 
> were not as conclusive as one might have expected. This is what I 
> get w/ state transitions:
>
> ```
> Benchmark                                 Mode  Cnt   Score Error  Units
> ClockgettimeTest.panama_monotonic         avgt   30  27.814 ± 0.165  ns/op
> ClockgettimeTest.panama_monotonic_coarse  avgt   30  12.094 ± 0.103  ns/op
> ClockgettimeTest.panama_monotonic_raw     avgt   30  27.719 ± 0.393  ns/op
> ClockgettimeTest.panama_realtime          avgt   30  27.133 ± 0.280  ns/op
> ClockgettimeTest.panama_realtime_coarse   avgt   30  26.812 ± 0.384  ns/op
> ```
>
> And this is what I get with transitions removed:
>
> ```
> Benchmark                                 Mode  Cnt   Score Error  Units
> ClockgettimeTest.panama_monotonic         avgt   30  22.383 ± 0.213  ns/op
> ClockgettimeTest.panama_monotonic_coarse  avgt   30   6.312 ± 0.117  ns/op
> ClockgettimeTest.panama_monotonic_raw     avgt   30  22.731 ± 0.279  ns/op
> ClockgettimeTest.panama_realtime          avgt   30  22.503 ± 0.292  ns/op
> ClockgettimeTest.panama_realtime_coarse   avgt   30  21.853 ± 0.100  ns/op
>
> ```
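> For reference, here is a minimal sketch of the kind of downcall being 
> benchmarked, written against the FFM API as it was later finalized in 
> JDK 22 (where Linker.Option.critical subsumed the experimental 
> "trivial call" opt-in used above); the timespec layout and the 
> CLOCK_MONOTONIC constant below assume Linux on a 64-bit platform:
>
> ```java
> import java.lang.foreign.*;
> import java.lang.invoke.MethodHandle;
> import static java.lang.foreign.ValueLayout.*;
>
> public class ClockGettime {
>     static final int CLOCK_MONOTONIC = 1; // Linux-specific clock id
>
>     // int clock_gettime(clockid_t clk_id, struct timespec *tp);
>     static final MethodHandle CLOCK_GETTIME = Linker.nativeLinker()
>             .downcallHandle(
>                 Linker.nativeLinker().defaultLookup()
>                       .find("clock_gettime").orElseThrow(),
>                 FunctionDescriptor.of(JAVA_INT, JAVA_INT, ADDRESS),
>                 // Drops the Java->native thread-state transition;
>                 // only appropriate for short, non-blocking leaf calls.
>                 Linker.Option.critical(false));
>
>     public static long monotonicNanos() throws Throwable {
>         try (Arena arena = Arena.ofConfined()) {
>             // struct timespec { time_t tv_sec; long tv_nsec; } on LP64
>             MemorySegment ts = arena.allocate(JAVA_LONG, 2);
>             int rc = (int) CLOCK_GETTIME.invokeExact(CLOCK_MONOTONIC, ts);
>             if (rc != 0) throw new IllegalStateException("clock_gettime failed");
>             return ts.getAtIndex(JAVA_LONG, 0) * 1_000_000_000L
>                  + ts.getAtIndex(JAVA_LONG, 1);
>         }
>     }
>
>     public static void main(String[] args) throws Throwable {
>         System.out.println(monotonicNanos());
>     }
> }
> ```
>
> (A real benchmark would allocate the timespec once, outside the timed 
> loop; run with --enable-native-access to silence the restricted-method 
> warning.)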
>
> Here we can see a gain of 4-5ns, obtained by dropping the transition. 
> The only case where this makes a significant difference is with the 
> monotonic_coarse flavor. In the other cases there's a difference, yes, 
> but not as pronounced, simply because the term we're comparing against 
> is bigger: it's easy to see a 5ns gain if your function runs for 10ns 
> in total - but such a gain starts to get lost in the "noise" when 
> functions run for longer. And that's the main issue with removing 
> Java->native transitions: the "window" in which this optimization 
> yields a positive effect is extremely narrow (anything lasting longer 
> than 30ns probably won't see much difference), but, as you can 
> see from the PR in [1], the VM changes required to support it touch 
> quite a bit of stuff!
>
> Luckily, selectively disabling transitions from Panama is slightly 
> more straightforward and, perhaps, for stuff like recvmsg syscalls 
> that are bypassed, there's not much else we can do: while one could 
> imagine Panama special-casing calls to clock_gettime, as that's a 
> known "leaf", the same cannot be done with recvmsg, which is in 
> general a blocking call. Panama also has a "trusted mode" flag 
> (--enable-native-access), so there is a way in the Panama API to 
> distinguish between safe and unsafe API points, which also helps with 
> this. The risk of course is for developers to see whatever mechanism 
> is provided as some kind of "make my code go fast please" and apply it 
> blindly, w/o fully understanding the consequences. What I said before 
> about the "extremely narrow window" remains true: in the vast majority 
> of cases (like 99%) dropping state transitions can result in very big 
> downsides, while the corresponding upsides are not big enough to even 
> be noticeable (the Q/A in [2] arrives at a very similar conclusion).
>
> All this said, selectively disabling state transitions for native 
> calls made using the Panama foreign API seems the most straightforward 
> way to offset the performance delta introduced by the removal of 
> critical JNI. In part it's because the Panama API is more flexible, 
> e.g. function descriptors allow us to model the distinction between a 
> trivial and a non-trivial call; in part it's because, as stated above, 
> Panama can already reason about calls that are "unsafe" and that 
> require extra permissions. And, finally, it's also because, if we 
> added back critical JNI, we'd probably add it back w/o its most 
> problematic GC locker parts (that's what [1] does AFAIK) - which means 
> it wouldn't be a complete code reversal. So, perhaps, coming up with a 
> fresh mechanism to drop transitions (only) could also be less 
> confusing for developers. Of course this would require developers such 
> as Wojciech to rewrite some of the code to use Panama instead of JNI.
>
> And, coming back to clock_gettime, my feeling is that with the right 
> tools (e.g. some intrinsics), we can make that go a lot faster than 
> what is shown above. Being able to quickly get a timestamp seems a 
> widely-enough applicable use case to deserve some special treatment. 
> So, perhaps, it's worth considering a _spectrum of solutions_ for 
> improving the status quo, rather than investing solely in the removal 
> of thread transitions.
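> As a data point on what's already available: since JDK 9, 
> java.time.Instant.now() reads the realtime clock at the best 
> resolution the OS offers (typically microseconds on Linux, via 
> clock_gettime under the hood), so it already sits between 
> currentTimeMillis() and a hypothetical intrinsified nanosecond API. A 
> small sketch:
>
> ```java
> import java.time.Instant;
>
> public class WallClock {
>     // Wall-clock time in nanoseconds since the epoch. The unit is
>     // nanoseconds, but the effective resolution is whatever the
>     // underlying OS clock provides (sub-millisecond on JDK 9+).
>     public static long epochNanos() {
>         Instant now = Instant.now();
>         return now.getEpochSecond() * 1_000_000_000L + now.getNano();
>     }
>
>     public static void main(String[] args) {
>         System.out.println(epochNanos());
>     }
> }
> ```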
>
> Maurizio
>
> [1] - https://github.com/openjdk/jdk19/pull/90/files
> [2] - https://youtu.be/LoyBTqkSkZk?t=742
>
>
> On 04/07/2022 18:38, Vitaly Davidovich wrote:
>> To not sidetrack this thread with my previous reply:
>>
>> Maurizio - are you saying java criticals are *already* hindering ZGC 
>> and/or other planned Hotspot improvements? Or that theoretically they 
>> could, and you’d like to remove/deprecate them now(ish)?
>>
>> If it’s the latter, perhaps it’s prudent to keep them around until a 
>> compelling case surfaces where they preclude or severely restrict 
>> evolution of the platform? If it’s the former, I would be curious what 
>> that is, but would also understand the rationale behind wanting to 
>> remove them.
>>
>> On Mon, Jul 4, 2022 at 1:26 PM Vitaly Davidovich <vitalyd at gmail.com> 
>> wrote:
>>
>>
>>
>>     On Mon, Jul 4, 2022 at 1:13 PM Wojciech Kudla
>>     <wkudla.kernel at gmail.com> wrote:
>>
>>         Thanks for your input, Vitaly. I'd be interested to find out
>>         more about the nature of the HW noise you observed in your
>>         benchmarks as our results were very consistent and it was
>>         pretty straightforward to pinpoint the culprit as JNI call
>>         overhead. Maybe it was just easier for us because we disallow
>>         C- and P-state transitions and put a lot of effort into
>>         eliminating platform jitter in general. Were you maybe running
>>         on a CPU model that doesn't support constant TSC? I would
>>         also suggest retrying with LAPIC interrupts suppressed (with:
>>         cli/sti) to maybe see if it's the kernel and not the hardware.
>>
>>     This was on a Broadwell Xeon chipset with constant tsc.  All the
>>     typical jitter sources were reduced: C/P states disabled in bios,
>>     max turbo enabled, IRQs steered away, core isolated, etc.  By the
>>     way, by noise I don’t mean the results themselves were noisy -
>>     they were constant run to run.  I just meant the delta between
>>     normal vs critical JNI entrypoints was very minimal - ie “in the
>>     noise”, particularly with rdtsc.
>>
>>     I can try to remeasure on newer Intel but see below …
>>
>>
>>
>>         100% agree on rdtsc(p) and snippets. There are some narrow
>>         use cases where one can get some substantial speed-ups with
>>         direct access to prefetch or by abusing misprediction to keep
>>         the icache hot. These scenarios are sadly only available with
>>         inline assembly. I know of a few shops that go to the length
>>         of forking Graal, etc. to achieve that, but am quite convinced
>>         such capabilities would be welcome and utilized by many more
>>         groups if they were easily accessible from Java.
>>
>>     I’m of the firm (and perhaps controversial for some :)) opinion
>>     these days that Java is simply the wrong platform/tool for low
>>     latency cases that warrant this level of control. There’re very
>>     strong headwinds even outside of JNI costs.  And the “real”
>>     problem with JNI, besides transition costs, is lack of inlining
>>     into the native calls.  So even if JVM transition costs are fully
>>     eliminated, there’s still an optimization fence due to lost
>>     inlining (not unlike native code calling native fns via shared libs).
>>
>>     That’s not to say that perf regressions are welcome - nobody
>>     likes those :).
>>
>>
>>
>>         Thanks,
>>         W.
>>
>>         On Mon, Jul 4, 2022 at 5:51 PM Vitaly Davidovich
>>         <vitalyd at gmail.com> wrote:
>>
>>             I’d add rdtsc(p) wrapper functions to the list.  These
>>             are usually either inline asm or compiler intrinsic in
>>             the JNI entrypoint.  In addition, any native libs exposed
>>             via JNI that have “trivial” functions are also candidates
>>             for faster calling conventions.  There are sometimes ways
>>             to mitigate the call overhead (eg batching) but it’s not
>>             always feasible.
>>
>>             I’ll add that the last time I tried to measure the
>>             improvement of Java criticals for clock_gettime (and
>>             rdtsc) it looked to be in the noise on the hardware I was
>>             testing on.  It got to the point where I had to instrument
>>             the critical and normal JNI entrypoints to confirm the
>>             critical was being hit.
>>             isn’t significantly different *if* basic primitives (or
>>             no args at all) are passed as args. JNIEnv*, IIRC, is
>>             loaded from a register so that’s minor.  jclass (for
>>             static calls, which is what’s relevant here) should be a
>>             compiled constant.  Critical call still has a GCLocker
>>             check.  So I’m not actually sure what the significant
>>             difference is for “lightweight” (ie few primitive or no
>>             args, primitive return types) calls.
>>
>>             In general, I do think it’d be nice if there was a faster
>>             native call sequence, even if it comes with a caveat
>>             emptor and/or special requirements on the callee (not
>>             unlike the requirements for criticals).  I think Vladimir
>>             Ivanov was working on “snippets” that allowed dynamic
>>             construction of a native call, possibly including
>>             assembly.  Not sure where that exploration is these days,
>>             but that would be a welcome capability.
>>
>>             My $.02.  Happy 4th of July for those celebrating!
>>
>>             Vitaly
>>
>>             On Mon, Jul 4, 2022 at 12:04 PM Maurizio Cimadamore
>>             <maurizio.cimadamore at oracle.com> wrote:
>>
>>                 Hi,
>>                 while I'm not an expert with some of the IO calls you
>>                 mention (some of my colleagues are more knowledgeable
>>                 in this area, so I'm sure they will have more info),
>>                 my general sense is that, as with getrusage, if there
>>                 is a system call involved, you already pay a hefty
>>                 price for the user-to-kernel transition. On my
>>                 machine this seems to cost around 200ns. In these
>>                 cases, using critical JNI to shave off a dozen
>>                 nanoseconds (at best!) seems just not worth it.
>>
>>                 So, of the functions in your list, the ones where
>>                 I *believe* dropping transitions would have the most
>>                 effect are (if we exclude getpid, for which another
>>                 approach is possible) clock_gettime and getcpu, as
>>                 they might use the vDSO [1], which typically brings
>>                 the performance of these calls closer to that of
>>                 calls to shared-library functions.
>>
>>                 If you have examples e.g. where the performance of
>>                 recvmsg (or related calls) varies significantly
>>                 between base JNI and critical JNI, please send them
>>                 our way; I'm sure some of my colleagues would be
>>                 interested to take a look.
>>
>>                 Popping back a couple of levels, I think it would be
>>                 helpful to also define what's an acceptable
>>                 regression in this context. Of course, in an ideal
>>                 world,  we'd like to see no performance regression at
>>                 all. But JNI critical is an unsupported interface,
>>                 which might misbehave with modern garbage collectors
>>                 (e.g. ZGC) and that requires quite a bit of internal
>>                 complexity which might, in the medium/long run,
>>                 hinder the evolution of the Java platform (all these
>>                 things have _some_ cost, even if the cost is not
>>                 directly material to developers). In this vein, I
>>                 think calls like clock_gettime tend to be more
>>                 problematic: as they complete very quickly, you see
>>                 the cost of transitions a lot more. In other cases,
>>                 where syscalls are involved, the costs associated with
>>                 transitions are more likely to be "in the noise". Of
>>                 course if we look at absolute numbers, dropping
>>                 transitions would always yield "faster" code; but at
>>                 the same time, going from 250ns to 245ns is very
>>                 unlikely to result in visible performance difference
>>                 when considering an application as a whole, so I
>>                 think it's critical here to decide _which_ use cases
>>                 to prioritize.
>>
>>                 I think a good outcome of this discussion would be if
>>                 we could come to some shared understanding of which
>>                 native calls are truly problematic (e.g.
>>                 clock_gettime-like), and then for the JDK to provide
>>                 better (and more maintainable) alternatives for those
>>                 (which might even be faster than using critical JNI).
>>
>>                 Thanks
>>                 Maurizio
>>
>>                 [1] - https://man7.org/linux/man-pages/man7/vdso.7.html
>>
>>                 On 04/07/2022 12:23, Wojciech Kudla wrote:
>>>                 Thanks Maurizio,
>>>
>>>                 I raised this case mainly about clock_gettime and
>>>                 recvmsg/sendmsg, I think we're focusing on the wrong
>>>                 things here. Feel free to drop the two syscalls from
>>>                 the discussion entirely, but the main usecases I
>>>                 have been presenting throughout this thread
>>>                 definitely stand.
>>>
>>>                 Thanks
>>>
>>>
>>>                 On Mon, Jul 4, 2022 at 10:54 AM Maurizio Cimadamore
>>>                 <maurizio.cimadamore at oracle.com> wrote:
>>>
>>>                     Hi Wojtek,
>>>                     thanks for sharing this list, I think this is a
>>>                     good starting point to understand more about
>>>                     your use case.
>>>
>>>                     Last week I've been looking at "getrusage" (as
>>>                     you mentioned it in an earlier email), and I was
>>>                     surprised to see that the call took a pointer to
>>>                     a (fairly big) struct which then needed to be
>>>                     initialized with some thread-local state:
>>>
>>>                     https://man7.org/linux/man-pages/man2/getrusage.2.html
>>>
>>>                     I've looked at the implementation, and it seems
>>>                     to be doing memset on the user-provided struct
>>>                     pointer, plus all the fields assignment.
>>>                     Eyeballing the implementation, this does not
>>>                     seem to me like a "classic" use case where
>>>                     dropping transition would help much. I mean,
>>>                     surely dropping transitions would help shaving
>>>                     some nanoseconds off the call, but it doesn't
>>>                     seem to me that the call would be short-lived
>>>                     enough to make a difference. Do you have some
>>>                     benchmarks on this one? I did some [1] and the
>>>                     call overhead seemed to come up at 260ns/op -
>>>                     w/o transition you might perhaps be able to get
>>>                     to 250ns, but that's in the noise?
>>>
>>>                     As for getpid, note that you can do (since Java 9):
>>>
>>>                     ProcessHandle.current().pid();
>>>
>>>                     I believe the impl caches the result, so it
>>>                     shouldn't even make the native call.
>>>
>>>                     Maurizio
>>>
>>>                     [1] -
>>>                     http://cr.openjdk.java.net/~mcimadamore/panama/GetrusageTest.java
>>>
>>>                     On 02/07/2022 07:42, Wojciech Kudla wrote:
>>>>                     Hi Maurizio,
>>>>
>>>>                     Thanks for staying on this.
>>>>
>>>>                     > Could you please provide a rough list of the
>>>>                     native calls you make where you believe
>>>>                     critical JNI is having a real impact in the
>>>>                     performance of your application?
>>>>
>>>>                     Off the top of my head:
>>>>                     clock_gettime
>>>>                     recvmsg
>>>>                     recvmmsg
>>>>                     sendmsg
>>>>                     sendmmsg
>>>>                     select
>>>>                     getpid
>>>>                     getcpu
>>>>                     getrusage
>>>>
>>>>                     > Also, could you please tell us whether any of
>>>>                     these calls need to interact with Java arrays?
>>>>                     No arrays or objects of any type involved.
>>>>                     Everything happens by the means of passing raw
>>>>                     pointers as longs and using other primitive
>>>>                     types as function arguments.
>>>>
>>>>                     > In other words, do you use critical JNI to
>>>>                     remove the cost associated with thread
>>>>                     transitions, or are you also taking advantage
>>>>                     of accessing on-heap memory _directly_ from
>>>>                     native code?
>>>>                     Critical JNI natives are used solely to remove
>>>>                     the cost of transitions. We don't get anywhere
>>>>                     near java heap in native code.
>>>>
>>>>                     In general I think it makes a lot of sense for
>>>>                     Java as a language/platform to have some guards
>>>>                     around unsafe code, but on the other hand the
>>>>                     popularity of libraries employing Unsafe and
>>>>                     their success in more performance-oriented
>>>>                     corners of software engineering is a clear
>>>>                     indicator that there is a need for the JVM to
>>>>                     provide access to more low-level primitives and
>>>>                     mechanisms.
>>>>                     I think it's entirely fair to tell developers
>>>>                     that all bets are off when they get into some
>>>>                     non-idiomatic scenarios but please don't take
>>>>                     away a feature that greatly contributed to
>>>>                     Java's success.
>>>>
>>>>                     Kind regards,
>>>>                     Wojtek
>>>>
>>>>                     On Wed, Jun 29, 2022 at 5:20 PM Maurizio
>>>>                     Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>>>>
>>>>                         Hi Wojciech,
>>>>                         picking up this thread again. After some
>>>>                         internal discussion, we realize that we
>>>>                         don't know enough about your use case.
>>>>                         While re-enabling JNI critical would
>>>>                         obviously provide a quick fix, we're afraid
>>>>                         that (a) developers might end up depending
>>>>                         on JNI critical when they don't need to
>>>>                         (perhaps also unaware of the consequences
>>>>                         of depending on it) and (b) that there
>>>>                         might actually be _better_ (as in: much
>>>>                         faster) solutions than using critical
>>>>                         native calls to address at least some of
>>>>                         your use cases (that seemed to be the case
>>>>                         with the clock_gettime example you
>>>>                         mentioned). Could you please provide a
>>>>                         rough list of the native calls you make
>>>>                         where you believe critical JNI is having a
>>>>                         real impact in the performance of your
>>>>                         application? Also, could you please tell us
>>>>                         whether any of these calls need to interact
>>>>                         with Java arrays? In other words, do you
>>>>                         use critical JNI to remove the cost
>>>>                         associated with thread transitions, or are
>>>>                         you also taking advantage of accessing
>>>>                         on-heap memory _directly_ from native code?
>>>>
>>>>                         Regards
>>>>                         Maurizio
>>>>
>>>>                         On 13/06/2022 21:38, Wojciech Kudla wrote:
>>>>>                         Hi Mark,
>>>>>
>>>>>                         Thanks for your input and apologies for
>>>>>                         the delayed response.
>>>>>
>>>>>                         > If the platform included, say, an
>>>>>                         intrinsified System.nanoRealTime()
>>>>>                         method that returned
>>>>>                         clock_gettime(CLOCK_REALTIME), how much would
>>>>>                         that help developers in your unnamed industry?
>>>>>
>>>>>                         Exposing realtime clock with nanosecond
>>>>>                         granularity in the JDK would be a great
>>>>>                         step forward. I should have made it clear
>>>>>                         that I represent the fintech corner
>>>>>                         (investment banking, to be exact) but the
>>>>>                         issues my message touches upon span areas
>>>>>                         such as HPC, audio processing, gaming, and
>>>>>                         defense industry so it's not like we have
>>>>>                         an isolated case.
>>>>>
>>>>>                         > In a similar vein, if people are finding
>>>>>                         it necessary to “replace parts
>>>>>                         of NIO with hand-crafted native code” then
>>>>>                         it would be interesting to
>>>>>                         understand what their requirements are
>>>>>
>>>>>                         As for the other example I provided with
>>>>>                         making very short lived syscalls such as
>>>>>                         recvmsg/recvmmsg the premise is getting
>>>>>                         access to hardware timestamps on the
>>>>>                         ingress and egress ends as well as
>>>>>                         enabling batch receive with a single
>>>>>                         syscall and otherwise exploiting features
>>>>>                         unavailable from the JDK (like access to
>>>>>                         CMSG interface, scatter/gather, etc).
>>>>>                         There are also other examples of calls
>>>>>                         that we'd love to make often and at the
>>>>>                         lowest possible cost (e.g. getrusage), but
>>>>>                         I'm not sure if there's a strong case for
>>>>>                         some of these ideas; that's why it might be
>>>>>                         worth looking into a more generic approach
>>>>>                         for performance-sensitive code.
>>>>>                         Hope this does a better job of explaining
>>>>>                         where we're coming from than my previous
>>>>>                         messages.
>>>>>
>>>>>                         Thanks,
>>>>>                         W
>>>>>
>>>>>                         On Tue, Jun 7, 2022 at 6:31 PM
>>>>>                         <mark.reinhold at oracle.com> wrote:
>>>>>
>>>>>                             2022/6/6 0:24:17 -0700,
>>>>>                             wkudla.kernel at gmail.com:
>>>>>                             >> Yes for System.nanoTime(), but
>>>>>                             System.currentTimeMillis() reports
>>>>>                             >> CLOCK_REALTIME.
>>>>>                             >
>>>>>                             > Unfortunately
>>>>>                             System.currentTimeMillis() offers only
>>>>>                             millisecond
>>>>>                             > granularity which is the reason why
>>>>>                             our industry has to resort to
>>>>>                             > clock_gettime.
>>>>>
>>>>>                             If the platform included, say, an
>>>>>                             intrinsified System.nanoRealTime()
>>>>>                             method that returned
>>>>>                             clock_gettime(CLOCK_REALTIME), how
>>>>>                             much would
>>>>>                             that help developers in your unnamed
>>>>>                             industry?
>>>>>
>>>>>                             In a similar vein, if people are
>>>>>                             finding it necessary to “replace parts
>>>>>                             of NIO with hand-crafted native code”
>>>>>                             then it would be interesting to
>>>>>                             understand what their requirements
>>>>>                             are.  Some simple enhancements to
>>>>>                             the NIO API would be much less costly
>>>>>                             to design and implement than a
>>>>>                             generalized user-level native-call
>>>>>                             intrinsification mechanism.
>>>>>
>>>>>                             - Mark
>>>>>
>>             -- 
>>             Sent from my phone
>>

