Obsoleting JavaCritical

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Tue Jul 5 11:33:02 UTC 2022


Hi,
As Erik explained in his reply, what we call "critical JNI" comes in two 
pieces: one removes Java to native thread transitions (which is what 
Wojciech is referring to), while another part interacts with the GC 
locker (basically to allow critical JNI code to access Java arrays w/o 
copying). I think the latter part is the most problematic GC-wise.

Then, regarding the former, I think there are still questions as to 
whether dropping transitions is the best way to get the performance 
boost required; for instance, yesterday I did some experiments with an 
experimental patch from Jorn (kudos) which re-enables an opt-in for 
"trivial" native calls in the Panama API. I used it to test 
clock_gettime, and, while there's an improvement, the results I got were 
not as conclusive as one might have expected. This is what I get w/ 
state transitions:

```
Benchmark                                 Mode  Cnt   Score   Error  Units
ClockgettimeTest.panama_monotonic         avgt   30  27.814 ± 0.165  ns/op
ClockgettimeTest.panama_monotonic_coarse  avgt   30  12.094 ± 0.103  ns/op
ClockgettimeTest.panama_monotonic_raw     avgt   30  27.719 ± 0.393  ns/op
ClockgettimeTest.panama_realtime          avgt   30  27.133 ± 0.280  ns/op
ClockgettimeTest.panama_realtime_coarse   avgt   30  26.812 ± 0.384  ns/op
```

And this is what I get with transitions removed:

```
Benchmark                                 Mode  Cnt   Score   Error  Units
ClockgettimeTest.panama_monotonic         avgt   30  22.383 ± 0.213  ns/op
ClockgettimeTest.panama_monotonic_coarse  avgt   30   6.312 ± 0.117  ns/op
ClockgettimeTest.panama_monotonic_raw     avgt   30  22.731 ± 0.279  ns/op
ClockgettimeTest.panama_realtime          avgt   30  22.503 ± 0.292  ns/op
ClockgettimeTest.panama_realtime_coarse   avgt   30  21.853 ± 0.100  ns/op
```

Here we can see a gain of 4-5ns, obtained by dropping the transition. 
The only case where this makes a significant difference is with the 
monotonic_coarse flavor. In the other cases there's a difference, yes, 
but not as pronounced, simply because the term we're comparing against 
is bigger: it's easy to see a 5ns gain if your function runs for 10ns in 
total - but such a gain starts to get lost in the "noise" when functions 
run for longer. And that's the main issue with removing Java->native 
transitions: the "window" in which this optimization yields a positive 
effect is extremely narrow (anything lasting longer than 30ns probably 
won't see much of a difference), but, as you can see from the PR in 
[1], the VM changes required to support it touch quite a bit of stuff!
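
For reference, this is roughly what such a clock_gettime binding looks like 
on the Panama side. This is a minimal sketch rather than the actual benchmark 
code, and it is written against the finalized java.lang.foreign API, so some 
names differ from the JDK 19 preview branch used for the numbers above:

```
// Minimal sketch of a Panama downcall binding for clock_gettime (not the actual
// benchmark). Written against the finalized java.lang.foreign API; the JDK 19
// preview used for the numbers above spells some of these names differently.
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;
import static java.lang.foreign.ValueLayout.*;

public class ClockGettimeSketch {
    static final int CLOCK_MONOTONIC = 1;   // clockid_t constant from <time.h> on Linux

    // struct timespec { time_t tv_sec; long tv_nsec; } on 64-bit Linux
    static final MemoryLayout TIMESPEC = MemoryLayout.structLayout(
            JAVA_LONG.withName("tv_sec"),
            JAVA_LONG.withName("tv_nsec"));

    // int clock_gettime(clockid_t clk_id, struct timespec *tp);
    static final MethodHandle CLOCK_GETTIME = Linker.nativeLinker().downcallHandle(
            Linker.nativeLinker().defaultLookup().find("clock_gettime").orElseThrow(),
            FunctionDescriptor.of(JAVA_INT, JAVA_INT, ADDRESS));

    public static void main(String[] args) throws Throwable {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment ts = arena.allocate(TIMESPEC);
            int rc = (int) CLOCK_GETTIME.invokeExact(CLOCK_MONOTONIC, ts);
            long nanos = ts.get(JAVA_LONG, 0) * 1_000_000_000L + ts.get(JAVA_LONG, 8);
            System.out.println(rc == 0 ? Long.toString(nanos) : "clock_gettime failed");
        }
    }
}
```

The other benchmark flavors above presumably just pass the other clockid_t 
constants (CLOCK_MONOTONIC_COARSE, CLOCK_MONOTONIC_RAW, CLOCK_REALTIME, and 
so on) to the same handle.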

Luckily, selectively disabling transitions from Panama is slightly more 
straightforward and, perhaps, for stuff like the recvmsg syscalls discussed 
here, there's not much else we can do: while one could imagine 
Panama special-casing calls to clock_gettime, as that's a known "leaf", 
the same cannot be done with recvmsg, which is in general a blocking 
call. Panama also has a "trusted mode" flag (--enable-native-access), so 
there is a way in the Panama API to distinguish between safe and unsafe 
API points, which also helps with this. The risk of course is for 
developers to see whatever mechanism is provided as some kind of "make 
my code go fast please" and apply it blindly, w/o fully understanding 
the consequences. What I said before about "extremely narrow window" 
remains true: in the vast majority of cases (like 99%) dropping state 
transitions can result in very big downsides, while the corresponding 
upsides are not big enough to even be noticeable (the Q/A in [2] arrives 
at a very similar conclusion).
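
To make the shape of such an opt-in concrete, here is a sketch of what it 
could look like at the use site. The option name below is hypothetical (Jorn's 
patch is experimental and not public API), so treat it purely as an 
illustration of marking an individual downcall as "trivial":

```
// Sketch only: Linker.Option.isTrivial() is a hypothetical spelling of the opt-in
// discussed above, not necessarily what the experimental patch uses.
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;
import static java.lang.foreign.ValueLayout.*;

class TrivialOptInSketch {
    static final Linker LINKER = Linker.nativeLinker();
    static final MemorySegment CLOCK_GETTIME =
            LINKER.defaultLookup().find("clock_gettime").orElseThrow();
    static final FunctionDescriptor DESC =
            FunctionDescriptor.of(JAVA_INT, JAVA_INT, ADDRESS);

    // Default binding: full Java->native thread-state transitions around every call.
    static final MethodHandle WITH_TRANSITIONS =
            LINKER.downcallHandle(CLOCK_GETTIME, DESC);

    // Opt-in binding: the caller asserts that the target is short-lived, never blocks
    // and never calls back into Java, so the transitions can be elided.
    static final MethodHandle NO_TRANSITIONS =
            LINKER.downcallHandle(CLOCK_GETTIME, DESC, Linker.Option.isTrivial());
}
```

Since downcallHandle is already a restricted method, gating an option like 
this behind --enable-native-access would at least make the opt-in an explicit, 
reviewable decision rather than something applied blindly.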

All this said, selectively disabling state transitions from native calls 
made using the Panama foreign API seems the most straightforward way to 
offset the performance delta introduced by the removal of critical JNI. 
In part it's because the Panama API is more flexible, e.g. function 
descriptors allow us to model the distinction between a trivial and 
non-trivial call; in part it's because, as stated above, Panama can 
already reason about calls that are "unsafe" and that require extra 
permissions. And, finally, it's also because, if we added back critical 
JNI, we'd probably add it back w/o its most problematic GC locker parts 
(that's what [1] does AFAIK) - which means it wouldn't be a complete code 
reversal. So, perhaps, coming up with a fresh mechanism to drop 
transitions (only) could also be less confusing for developers. Of 
course this would require developers such as Wojciech to rewrite some of 
the code to use Panama instead of JNI.

And, coming back to clock_gettime, my feeling is that with the right 
tools (e.g. some intrinsics), we can make that go a lot faster than what is 
shown above. Being able to quickly get a timestamp seems like a widely-enough 
applicable use case to deserve some special treatment. So, perhaps, 
it's worth considering a _spectrum of solutions_ for how to improve the 
status quo, rather than investing solely in the removal of thread 
transitions.
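
As a purely illustrative aside (the method name below is hypothetical), the 
monotonic clock already gets this kind of special treatment today, which is 
what makes an analogous intrinsic for the realtime clock plausible:

```
// Illustration only. On Linux, System.nanoTime() is already intrinsified and backed
// by clock_gettime(CLOCK_MONOTONIC); a hypothetical System.nanoRealTime(), as floated
// elsewhere in this thread, would give the realtime clock the same treatment.
class TimestampSketch {
    static void timestamps() {
        long mono   = System.nanoTime();          // existing intrinsic, monotonic clock
        long millis = System.currentTimeMillis(); // existing, but millisecond granularity only
        // long wall = System.nanoRealTime();     // hypothetical nanosecond wall-clock intrinsic
    }
}
```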

Maurizio

[1] - https://github.com/openjdk/jdk19/pull/90/files
[2] - https://youtu.be/LoyBTqkSkZk?t=742


On 04/07/2022 18:38, Vitaly Davidovich wrote:
> To not sidetrack this thread with my previous reply:
>
> Maurizio - are you saying java criticals are *already* hindering ZGC 
> and/or other planned Hotspot improvements? Or that theoretically they 
> could and you’d like to remove/deprecate them now(ish)?
>
> If it’s the latter, perhaps it’s prudent to keep them around until a 
> compelling case surfaces where they preclude or severely restrict 
> evolution of the platform? If it’s the former, would be curious what 
> that is but would also understand the rationale behind wanting to 
> remove it.
>
> On Mon, Jul 4, 2022 at 1:26 PM Vitaly Davidovich <vitalyd at gmail.com> 
> wrote:
>
>
>
>     On Mon, Jul 4, 2022 at 1:13 PM Wojciech Kudla
>     <wkudla.kernel at gmail.com> wrote:
>
>         Thanks for your input, Vitaly. I'd be interested to find out
>         more about the nature of the HW noise you observed in your
>         benchmarks as our results were very consistent and it was
>         pretty straightforward to pinpoint the culprit as JNI call
>         overhead. Maybe it was just easier for us because we disallow
>         C- and P-state transitions and put a lot of effort to
>         eliminate platform jitter in general. Were you maybe running
>         on a CPU model that doesn't support constant TSC? I would also
>         suggest retrying with LAPIC interrupts suppressed (with:
>         cli/sti) to maybe see if it's the kernel and not the hardware.
>
>     This was on a Broadwell Xeon chipset with constant tsc.  All the
>     typical jitter sources were reduced: C/P states disabled in bios,
>     max turbo enabled, IRQs steered away, core isolated, etc.  By the
>     way, by noise I don’t mean the results themselves were noisy -
>     they were constant run to run.  I just meant the delta between
>     normal vs critical JNI entrypoints was very minimal - ie “in the
>     noise”, particularly with rdtsc.
>
>     I can try to remeasure on newer Intel but see below …
>
>
>
>         100% agree on rdtsc(p) and snippets. There are some narrow
>         use cases where one can get some substantial speed-ups with
>         direct access to prefetch or by abusing misprediction to keep
>         icache hot. These scenarios are sadly only available with
>         inline assembly. I know of a few shops that go to the length
>         of forking Graal, etc to achieve that but am quite convinced
>         such capabilities would be welcome and utilized by many more
>         groups if they were easily accessible from java.
>
>     I’m of the firm (and perhaps controversial for some :)) opinion
>     these days that Java is simply the wrong platform/tool for low
>     latency cases that warrant this level of control.  There’re very
>     strong headwinds even outside of JNI costs.  And the “real”
>     problem with JNI, besides transition costs, is lack of inlining
>     into the native calls.  So even if JVM transition costs are fully
>     eliminated, there’s still an optimization fence due to lost
>     inlining (not unlike native code calling native fns via shared libs).
>
>     That’s not to say that perf regressions are welcomed - nobody likes
>     those :).
>
>
>
>         Thanks,
>         W.
>
>         On Mon, Jul 4, 2022 at 5:51 PM Vitaly Davidovich
>         <vitalyd at gmail.com> wrote:
>
>             I’d add rdtsc(p) wrapper functions to the list.  These are
>             usually either inline asm or compiler intrinsic in the JNI
>             entrypoint.  In addition, any native libs exposed via JNI
>             that have “trivial” functions are also candidates for
>             faster calling conventions.  There are sometimes ways to
>             mitigate the call overhead (eg batching) but it’s not
>             always feasible.
>
>             I’ll add that last time I tried to measure the improvement
>             of Java criticals for clock_gettime (and rdtsc) it looked
>             to be in the noise on the hardware I was testing on.  It
>             got to the point where I had to instrument the critical and
>             normal JNI entrypoints to confirm the critical was being
>             hit.  The critical calling convention isn’t significantly
>             different *if* basic primitives (or no args at all) are
>             passed as args.  JNIEnv*, IIRC, is loaded from a register
>             so that’s minor.  jclass (for static calls, which is
>             what’s relevant here) should be a compiled constant. 
>             Critical call still has a GCLocker check.  So I’m not
>             actually sure what the significant difference is for
>             “lightweight” (ie few primitive or no args, primitive
>             return types) calls.
>
>             In general, I do think it’d be nice if there was a faster
>             native call sequence, even if it comes with a caveat
>             emptor and/or special requirements on the callee (not
>             unlike the requirements for criticals).  I think Vladimir
>             Ivanov was working on “snippets” that allowed dynamic
>             construction of a native call, possibly including
>             assembly.  Not sure where that exploration is these days,
>             but that would be a welcome capability.
>
>             My $.02.  Happy 4th of July for those celebrating!
>
>             Vitaly
>
>             On Mon, Jul 4, 2022 at 12:04 PM Maurizio Cimadamore
>             <maurizio.cimadamore at oracle.com> wrote:
>
>                 Hi,
>                 while I'm not an expert with some of the IO calls you
>                 mention (some of my colleagues are more knowledgeable
>                 in this area, so I'm sure they will have more info),
>                 my general sense is that, as with getrusage, if there
>                 is a system call involved, you already pay a hefty
>                 price for the user to kernel transition. On my machine
>                 this seem to cost around 200ns. In these cases, using
>                 this seems to cost around 200ns. In these cases, using
>                 JNI critical to shave off a dozen nanoseconds (at
>
>                 So, of the functions in your list, the ones in which I
>                 *believe*  dropping transitions would have the most
>                 effect are (if we exclude getpid, for which another
>                 approach is possible) clock_gettime and getcpu,
>                 as they might use vdso [1], which typically
>                 brings the performance of these calls closer to calls
>                 to shared lib functions.
>
>                 If you have examples e.g. where performance of recvmsg
>                 (or related calls) varies significantly between base
>                 JNI and critical JNI, please send them our way; I'm
>                 sure some of my colleagues would be interested to take
>                 a look.
>
>                 Popping back a couple of levels, I think it would be
>                 helpful to also define what's an acceptable regression
>                 in this context. Of course, in an ideal world, we'd
>                 like to see no performance regression at all. But JNI
>                 critical is an unsupported interface, which might
>                 misbehave with modern garbage collectors (e.g. ZGC)
>                 and that requires quite a bit of internal complexity
>                 which might, in the medium/long run, hinder the
>                 evolution of the Java platform (all these things have
>                 _some_ cost, even if the cost is not directly material
>                 to developers). In this vein, I think calls like
>                 clock_gettime tend to be more problematic: as they
>                 complete very quickly, you see the cost of transitions
>                 a lot more. In other cases, where syscalls are
>                 involved, the cost associated with transitions is more
>                 likely to be "in the noise". Of course if we look at
>                 absolute numbers, dropping transitions would always
>                 yield "faster" code; but at the same time, going from
>                 250ns to 245ns is very unlikely to result in visible
>                 performance difference when considering an application
>                 as a whole, so I think it's critical here to decide
>                 _which_ use cases to prioritize.
>
>                 I think a good outcome of this discussion would be if
>                 we could come to some shared understanding of which
>                 native calls are truly problematic (e.g.
>                 clock_gettime-like), and then for the JDK to provide
>                 better (and more maintainable) alternatives for those
>                 (which might even be faster than using critical JNI).
>
>                 Thanks
>                 Maurizio
>
>                 [1] - https://man7.org/linux/man-pages/man7/vdso.7.html
>
>                 On 04/07/2022 12:23, Wojciech Kudla wrote:
>>                 Thanks Maurizio,
>>
>>                 I raised this case mainly about clock_gettime and
>>                 recvmsg/sendmsg, I think we're focusing on the wrong
>>                 things here. Feel free to drop the two syscalls from
>>                 the discussion entirely, but the main usecases I have
>>                 been presenting throughout this thread definitely stand.
>>
>>                 Thanks
>>
>>
>>                 On Mon, Jul 4, 2022 at 10:54 AM Maurizio Cimadamore
>>                 <maurizio.cimadamore at oracle.com> wrote:
>>
>>                     Hi Wojtek,
>>                     thanks for sharing this list, I think this is a
>>                     good starting point to understand more about your
>>                     use case.
>>
>>                     Last week I've been looking at "getrusage" (as
>>                     you mentioned it in an earlier email), and I was
>>                     surprised to see that the call took a pointer to
>>                     a (fairly big) struct which then needed to be
>>                     initialized with some thread-local state:
>>
>>                     https://man7.org/linux/man-pages/man2/getrusage.2.html
>>
>>                     I've looked at the implementation, and it seems
>>                     to be doing memset on the user-provided struct
>>                     pointer, plus all the fields assignment.
>>                     pointer, plus all the field assignments.
>>                     to me like a "classic" use case where dropping
>>                     transition would help much. I mean, surely
>>                     dropping transitions would help shaving some
>>                     me that the call would be short-lived enough to
>>                     me that the call would be shortlived enough to
>>                     make a difference. Do you have some benchmarks on
>>                     this one? I did some [1] and the call overhead
>>                     seemed to come up at 260ns/op - w/o transition
>>                     you might perhaps be able to get to 250ns, but
>>                     that's in the noise?
>>
>>                     As for getpid, note that you can do (since Java 9):
>>
>>                     ProcessHandle.current().pid();
>>
>>                     I believe the impl caches the result, so it
>>                     shouldn't even make the native call.
>>
>>                     Maurizio
>>
>>                     [1] -
>>                     http://cr.openjdk.java.net/~mcimadamore/panama/GetrusageTest.java
>>
>>                     On 02/07/2022 07:42, Wojciech Kudla wrote:
>>>                     Hi Maurizio,
>>>
>>>                     Thanks for staying on this.
>>>
>>>                     > Could you please provide a rough list of the
>>>                     native calls you make where you believe critical
>>>                     JNI is having a real impact in the performance
>>>                     of your application?
>>>
>>>                     Off the top of my head:
>>>                     clock_gettime
>>>                     recvmsg
>>>                     recvmmsg
>>>                     sendmsg
>>>                     sendmmsg
>>>                     select
>>>                     getpid
>>>                     getcpu
>>>                     getrusage
>>>
>>>                     > Also, could you please tell us whether any of
>>>                     these calls need to interact with Java arrays?
>>>                     No arrays or objects of any type involved.
>>>                     Everything happens by means of passing raw
>>>                     pointers as longs and using other primitive
>>>                     types as function arguments.
>>>
>>>                     > In other words, do you use critical JNI to
>>>                     remove the cost associated with thread
>>>                     transitions, or are you also taking advantage of
>>>                     accessing on-heap memory _directly_ from native
>>>                     code?
>>>                     Critical JNI natives are used solely to remove
>>>                     the cost of transitions. We don't get anywhere
>>>                     near java heap in native code.
>>>
>>>                     In general I think it makes a lot of sense for
>>>                     Java as a language/platform to have some guards
>>>                     around unsafe code, but on the other hand the
>>>                     popularity of libraries employing Unsafe and
>>>                     their success in more performance-oriented
>>>                     corners of software engineering is a clear
>>>                     indicator there is a need for the JVM to provide
>>>                     access to more low-level primitives and mechanisms.
>>>                     I think it's entirely fair to tell developers
>>>                     that all bets are off when they get into some
>>>                     non-idiomatic scenarios but please don't take
>>>                     away a feature that greatly contributed to
>>>                     Java's success.
>>>
>>>                     Kind regards,
>>>                     Wojtek
>>>
>>>                     On Wed, Jun 29, 2022 at 5:20 PM Maurizio
>>>                     Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>>>
>>>                         Hi Wojciech,
>>>                         picking up this thread again. After some
>>>                         internal discussion, we realize that we
>>>                         don't know enough about your use case. While
>>>                         re-enabling JNI critical would obviously
>>>                         provide a quick fix, we're afraid that (a)
>>>                         developers might end up depending on JNI
>>>                         critical when they don't need to (perhaps
>>>                         also unaware of the consequences of
>>>                         depending on it) and (b) that there might
>>>                         actually be _better_ (as in: much faster)
>>>                         solutions than using critical native calls
>>>                         to address at least some of your use cases
>>>                         (that seemed to be the case with the
>>>                         clock_gettime example you mentioned). Could
>>>                         you please provide a rough list of the
>>>                         native calls you make where you believe
>>>                         critical JNI is having a real impact in the
>>>                         performance of your application? Also, could
>>>                         you please tell us whether any of these
>>>                         calls need to interact with Java arrays? In
>>>                         other words, do you use critical JNI to
>>>                         remove the cost associated with thread
>>>                         transitions, or are you also taking
>>>                         advantage of accessing on-heap memory
>>>                         _directly_ from native code?
>>>
>>>                         Regards
>>>                         Maurizio
>>>
>>>                         On 13/06/2022 21:38, Wojciech Kudla wrote:
>>>>                         Hi Mark,
>>>>
>>>>                         Thanks for your input and apologies for the
>>>>                         delayed response.
>>>>
>>>>                         > If the platform included, say, an
>>>>                         intrinsified System.nanoRealTime()
>>>>                         method that returned
>>>>                         clock_gettime(CLOCK_REALTIME), how much would
>>>>                         that help developers in your unnamed industry?
>>>>
>>>>                         Exposing realtime clock with nanosecond
>>>>                         granularity in the JDK would be a great
>>>>                         step forward. I should have made it clear
>>>>                         that I represent the fintech corner (investment
>>>>                         banking to be exact) but the issues my
>>>>                         message touches upon span areas such as
>>>>                         HPC, audio processing, gaming, and defense
>>>>                         industry so it's not like we have an
>>>>                         isolated case.
>>>>
>>>>                         > In a similar vein, if people are finding
>>>>                         it necessary to “replace parts
>>>>                         of NIO with hand-crafted native code” then
>>>>                         it would be interesting to
>>>>                         understand what their requirements are
>>>>
>>>>                         As for the other example I provided with
>>>>                         making very short-lived syscalls such as
>>>>                         recvmsg/recvmmsg the premise is getting
>>>>                         access to hardware timestamps on the
>>>>                         ingress and egress ends as well as enabling
>>>>                         batch receive with a single syscall and
>>>>                         otherwise exploiting features unavailable
>>>>                         from the JDK (like access to CMSG
>>>>                         interface, scatter/gather, etc).
>>>>                         There are also other examples of calls that
>>>>                         we'd love to make often and at the lowest
>>>>                         possible cost (i.e. getrusage) but I'm not
>>>>                         sure if there's a strong case for some of
>>>>                         these ideas, that's why it might be worth
>>>>                         looking into a more generic approach for
>>>>                         performance-sensitive code.
>>>>                         Hope this does a better job of explaining
>>>>                         where we're coming from than my previous
>>>>                         messages.
>>>>
>>>>                         Thanks,
>>>>                         W
>>>>
>>>>                         On Tue, Jun 7, 2022 at 6:31 PM
>>>>                         <mark.reinhold at oracle.com> wrote:
>>>>
>>>>                             2022/6/6 0:24:17 -0700,
>>>>                             wkudla.kernel at gmail.com:
>>>>                             >> Yes for System.nanoTime(), but
>>>>                             System.currentTimeMillis() reports
>>>>                             >> CLOCK_REALTIME.
>>>>                             >
>>>>                             > Unfortunately
>>>>                             System.currentTimeMillis() offers only
>>>>                             millisecond
>>>>                             > granularity which is the reason why
>>>>                             our industry has to resort to
>>>>                             > clock_gettime.
>>>>
>>>>                             If the platform included, say, an
>>>>                             intrinsified System.nanoRealTime()
>>>>                             method that returned
>>>>                             clock_gettime(CLOCK_REALTIME), how much
>>>>                             would
>>>>                             that help developers in your unnamed
>>>>                             industry?
>>>>
>>>>                             In a similar vein, if people are
>>>>                             finding it necessary to “replace parts
>>>>                             of NIO with hand-crafted native code”
>>>>                             then it would be interesting to
>>>>                             understand what their requirements
>>>>                             are.  Some simple enhancements to
>>>>                             the NIO API would be much less costly
>>>>                             to design and implement than a
>>>>                             generalized user-level native-call
>>>>                             intrinsification mechanism.
>>>>
>>>>                             - Mark
>>>>
>             -- 
>             Sent from my phone
>
>     -- 
>     Sent from my phone
>
> -- 
> Sent from my phone