Obsoleting JavaCritical

Wed Jul 6 18:19:22 UTC 2022

Hey all,

Afaik, the Java->native transition is required because an upcall back
to Java during the JNI downcall would otherwise crash the JVM. Are
transitions required for any other reason? If not, would it be viable
to move the expensive part of the transition (or even all of it) to
the upcall? I.e. something like:

1. On downcall, do not transition but set a thread state flag.
2. If an upcall happens, check the flag and do the Java->native
transition, before proceeding as usual.
3. When the downcall returns, do the native->Java transition only if
#2 happened.

(assuming #1 and the no-transition #3 happy path can be implemented
cheaply, e.g. no expensive memory barriers)
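
To make the three steps concrete, here is a toy model in plain Java. It is only a sketch of the bookkeeping, not HotSpot code: `ThreadModel`, `inLazyDowncall`, `transitioned` and `fullTransitions` are names made up for illustration, the "transition" is just a counter, and nested downcalls made from inside an upcall are not modelled.

```java
import java.util.ArrayList;
import java.util.List;

public class LazyTransitionSketch {
    /** Toy stand-in for per-thread state; hypothetical, not a HotSpot structure. */
    static final class ThreadModel {
        boolean inLazyDowncall; // step 1: cheap flag set on every downcall
        boolean transitioned;   // set in step 2 if an upcall forced the transition
        int fullTransitions;    // how many times the expensive path actually ran
        final List<String> log = new ArrayList<>();

        /** Steps 1 and 3: set the flag, run the native body, undo only if needed. */
        void downcall(Runnable nativeBody) {
            inLazyDowncall = true;
            transitioned = false;
            try {
                nativeBody.run();
            } finally {
                if (transitioned) {
                    log.add("native->Java transition on return");
                } else {
                    log.add("returned without any transition");
                }
                inLazyDowncall = false;
                transitioned = false;
            }
        }

        /** Step 2: the first upcall inside a lazy downcall pays for the transition. */
        void upcall(Runnable javaBody) {
            if (inLazyDowncall && !transitioned) {
                transitioned = true;
                fullTransitions++;
                log.add("deferred Java->native transition performed in upcall");
            }
            javaBody.run();
        }
    }

    public static void main(String[] args) {
        ThreadModel t = new ThreadModel();
        t.downcall(() -> {});                 // happy path: flag only
        t.downcall(() -> t.upcall(() -> {})); // upcall forces the full transition
        t.log.forEach(System.out::println);
    }
}
```

The point of the model is only that the expensive path runs at most once per downcall, and not at all when no upcall happens.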

Note that, unlike Panama, JNI upcalls are already extremely expensive
compared to downcalls and every application out there should easily
cope with a bit more overhead. I would say that even in Panama it
would make sense to move overhead to upcalls, if it made downcalls
faster.

Btw, for those wondering how an upcall-during-a-downcall could
unintentionally happen in a real-world application, a good example is
OpenGL debug mode. Normally, an OpenGL application may be doing
hundreds, if not thousands, of JNI downcalls per frame. In a
well-designed rendering engine, 99% of those would be asynchronous
calls that simply pass some data to the driver and return immediately.
A perfect candidate to apply JNI CriticalNatives on. However, if the
application is started with a debug OpenGL context and it registers a
debug message callback, then suddenly almost every OpenGL call has the
potential to call back into Java to report an error. Similar
functionality exists in several other APIs (such as OpenCL, OpenXR and
Vulkan).

- Ioannis

On Wed, 6 Jul 2022 at 16:54, Erik Osterlund <erik.osterlund at oracle.com> wrote:
>
> For completeness, it should at least be considered that an alternative on the table is to make the JNI transitions fast using asymmetric Dekker synchronization.
>
> If I understood the problem domain, you are running on Linux, and not really using the async GC locking associated with exposing object addresses, but rather want the actual native call to be fast.
>
> In that context the arming side of handshakes/safepoints could use sys_membarrier where there is currently a StoreLoad fence. That way we could remove the StoreLoad fence on the back edge of the native transition, which is likely what actually costs something (last time I checked).
>
> In general, I’m not sure that this is a worthwhile tradeoff as the amortized cost of fencing has to sum up to the cost of the bigger hammer to be worth it. That would be a lot of native calls to pay for itself. But I suppose that alternative should at least be mentioned as it is a perfectly safe way of speeding up all native calls without resorting to cheating. The single thread handshake would be the most painful in this approach as we would use global synchronization to poke a single thread, unless we shot a signal or something instead for that use case.
>
> /Erik
>
> On 4 Jul 2022, at 23:47, Erik Osterlund <erik.osterlund at oracle.com> wrote:
>
> Hi,
>
> Here is a clarification on the ZGC interactions.
>
> The initial form of JNI critical native calls was implemented as an internal thing for SPARC crypto libraries, private to the JDK. JNI calls on SPARC involved flushing register windows, which was actually rather slow.
>
> This form came with a mechanism for lazily activating the GC locker for primitive arrays that the crypto code needed direct access to. This essentially deferred invoking the GC locker from the Java thread to the safepoint synchronizer.
>
> The problematic aspect for generational ZGC was the async GC locker interactions. Its implication is that each GC safepoint might fail, because the GC locker can’t be locked out before the safepoint is synchronized, so you end up instead trying to lock it inside GC safepoints, only to find that you couldn’t.
>
> The failed GC safepoints lead to GC operations instead being started asynchronously from the GC locker. That was easier to deal with for the mainline version of ZGC since there was only one type of GC: full GCs. So we coped.
>
> With generational ZGC, the asynchronous operation has to figure out if it should poke the minor (young) and/or major (young + old) GC drivers. That problem is not easy to solve. However, with JNI critical natives gone, the entire GC locker for ZGC is just a simple readers-writer lock, where critical native functions use the readers lock and the GC operations use the writer lock. The GC safepoints can’t fail.
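
The locking shape described above can be sketched with `java.util.concurrent`. This is only an illustration of the readers/writer idea, not ZGC's implementation; `GcLockerSketch`, `inCriticalRegion`, `runGcOperation` and `gcCouldStartNow` are hypothetical names.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

public class GcLockerSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    /** Critical code takes the read lock, so many critical regions can overlap. */
    <T> T inCriticalRegion(Supplier<T> body) {
        lock.readLock().lock();
        try {
            return body.get();
        } finally {
            lock.readLock().unlock();
        }
    }

    /** A GC operation takes the write lock and therefore waits for all readers. */
    void runGcOperation(Runnable gcWork) {
        lock.writeLock().lock();
        try {
            gcWork.run();
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Probe used only for the demo: true if the write lock is free right now. */
    boolean gcCouldStartNow() {
        if (lock.writeLock().tryLock()) {
            lock.writeLock().unlock();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        GcLockerSketch locker = new GcLockerSketch();
        System.out.println("GC can start outside critical region: " + locker.gcCouldStartNow());
        boolean inside = locker.inCriticalRegion(locker::gcCouldStartNow);
        System.out.println("GC can start inside critical region: " + inside);
        locker.runGcOperation(() -> System.out.println("GC operation ran"));
    }
}
```

In this shape the GC side blocks until readers leave instead of discovering a held lock inside a safepoint, which is the property the message relies on.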
>
> With the new implementation that avoids doing a transition to native at all, the mentioned problem no longer occurs, as the safepoint synchronizer won’t allow safepoints to creep in right in the middle of all this. So it would seem we are okay with that. So I think as long as we don’t go with the previous async GC locker solution, we can remove ZGC interactions from the equation.
>
> However, you obviously get a trust problem instead with this flavour of cheating the system. Anything that takes a longish time in a critical native function without a native transition is going to be a disaster and hang the entire JVM. That is typically something we do not take lightly and is indeed why we have native transitions.
>
> So I would be delighted if we didn’t resurrect ways of cheating the system anyway, unless this is absolutely… critical. It took a long time to get rid of the cheats.
>
> /Erik
>
> On 4 Jul 2022, at 18:07, Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>
>
> Hi,
> while I'm not an expert with some of the IO calls you mention (some of my colleagues are more knowledgeable in this area, so I'm sure they will have more info), my general sense is that, as with getrusage, if there is a system call involved, you already pay a hefty price for the user to kernel transition. On my machine this seems to cost around 200ns. In these cases, using JNI critical to shave off a dozen nanoseconds (at best!) seems just not worth it.
>
> So, of the functions in your list, the ones in which I *believe* dropping transitions would have the most effect are (if we exclude getpid, for which another approach is possible) clock_gettime and getcpu, as they might use vdso [1], which typically brings the performance of these calls closer to calls to shared lib functions.
>
> If you have examples e.g. where performance of recvmsg (or related calls) varies significantly between base JNI and critical JNI, please send them our way; I'm sure some of my colleagues would be interested in taking a look.
>
> Popping back a couple of levels, I think it would be helpful to also define what's an acceptable regression in this context. Of course, in an ideal world, we'd like to see no performance regression at all. But JNI critical is an unsupported interface, which might misbehave with modern garbage collectors (e.g. ZGC) and which requires quite a bit of internal complexity that might, in the medium/long run, hinder the evolution of the Java platform (all these things have _some_ cost, even if the cost is not directly material to developers). In this vein, I think calls like clock_gettime tend to be more problematic: as they complete very quickly, you see the cost of transitions a lot more. In other cases, where syscalls are involved, the cost associated with transitions is more likely to be "in the noise". Of course if we look at absolute numbers, dropping transitions would always yield "faster" code; but at the same time, going from 250ns to 245ns is very unlikely to result in a visible performance difference when considering an application as a whole, so I think it's critical here to decide _which_ use cases to prioritize.
>
> I think a good outcome of this discussion would be if we could come to some shared understanding of which native calls are truly problematic (e.g. clock_gettime-like), and then for the JDK to provide better (and more maintainable) alternatives for those (which might even be faster than using critical JNI).
>
> Thanks
> Maurizio
>
> [1] - https://man7.org/linux/man-pages/man7/vdso.7.html
>
> On 04/07/2022 12:23, Wojciech Kudla wrote:
>
> Thanks Maurizio,
>
> I raised this case mainly about clock_gettime and recvmsg/sendmsg; I think we're focusing on the wrong things here. Feel free to drop the two syscalls from the discussion entirely, but the main use cases I have been presenting throughout this thread definitely stand.
>
> Thanks
>
>
> On Mon, Jul 4, 2022 at 10:54 AM Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>>
>> Hi Wojtek,
>> thanks for sharing this list, I think this is a good starting point to understand more about your use case.
>>
>> Last week I was looking at "getrusage" (as you mentioned it in an earlier email), and I was surprised to see that the call took a pointer to a (fairly big) struct which then needed to be initialized with some thread-local state:
>>
>> https://man7.org/linux/man-pages/man2/getrusage.2.html
>>
>> I've looked at the implementation, and it seems to be doing memset on the user-provided struct pointer, plus all the field assignments. Eyeballing the implementation, this does not seem to me like a "classic" use case where dropping transitions would help much. I mean, surely dropping transitions would help shave some nanoseconds off the call, but it doesn't seem to me that the call would be short-lived enough to make a difference. Do you have some benchmarks on this one? I did some [1] and the call overhead seemed to come up at 260ns/op - w/o transition you might perhaps be able to get to 250ns, but that's in the noise?
>>
>> As for getpid, note that you can do (since Java 9):
>>
>> ProcessHandle.current().pid();
>>
>> I believe the impl caches the result, so it shouldn't even make the native call.
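
A minimal sketch of that suggestion, using only the standard `ProcessHandle` API available since Java 9. The class and method names (`PidExample`, `currentPid`) are made up for illustration, and whether the value is cached internally is an implementation detail this sketch does not rely on.

```java
public class PidExample {
    /** Returns the native process ID of the running JVM without any user-written JNI. */
    static long currentPid() {
        return ProcessHandle.current().pid();
    }

    public static void main(String[] args) {
        System.out.println("pid = " + currentPid());
    }
}
```

This covers the getpid use case from the list without a hand-written native method.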
>>
>> Maurizio
>>
>> [1] - http://cr.openjdk.java.net/~mcimadamore/panama/GetrusageTest.java
>>
>> On 02/07/2022 07:42, Wojciech Kudla wrote:
>>
>> Hi Maurizio,
>>
>> Thanks for staying on this.
>>
>> > Could you please provide a rough list of the native calls you make where you believe critical JNI is having a real impact in the performance of your application?
>>
>> Off the top of my head:
>> clock_gettime
>> recvmsg
>> recvmmsg
>> sendmsg
>> sendmmsg
>> select
>> getpid
>> getcpu
>> getrusage
>>
>> > Also, could you please tell us whether any of these calls need to interact with Java arrays?
>> No arrays or objects of any type involved. Everything happens by the means of passing raw pointers as longs and using other primitive types as function arguments.
>>
>> > In other words, do you use critical JNI to remove the cost associated with thread transitions, or are you also taking advantage of accessing on-heap memory _directly_ from native code?
>> Critical JNI natives are used solely to remove the cost of transitions. We don't get anywhere near the Java heap in native code.
>>
>> In general I think it makes a lot of sense for Java as a language/platform to have some guards around unsafe code, but on the other hand the popularity of libraries employing Unsafe and their success in more performance-oriented corners of software engineering is a clear indicator there is a need for the JVM to provide access to more low-level primitives and mechanisms.
>> I think it's entirely fair to tell developers that all bets are off when they get into some non-idiomatic scenarios but please don't take away a feature that greatly contributed to Java's success.
>>
>> Kind regards,
>> Wojtek
>>
>> On Wed, Jun 29, 2022 at 5:20 PM Maurizio Cimadamore <maurizio.cimadamore at oracle.com> wrote:
>>>
>>> Hi Wojciech,
>>> picking up this thread again. After some internal discussion, we realize that we don't know enough about your use case. While re-enabling JNI critical would obviously provide a quick fix, we're afraid that (a) developers might end up depending on JNI critical when they don't need to (perhaps also unaware of the consequences of depending on it) and (b) that there might actually be _better_ (as in: much faster) solutions than using critical native calls to address at least some of your use cases (that seemed to be the case with the clock_gettime example you mentioned). Could you please provide a rough list of the native calls you make where you believe critical JNI is having a real impact in the performance of your application? Also, could you please tell us whether any of these calls need to interact with Java arrays? In other words, do you use critical JNI to remove the cost associated with thread transitions, or are you also taking advantage of accessing on-heap memory _directly_ from native code?
>>>
>>> Regards
>>> Maurizio
>>>
>>> On 13/06/2022 21:38, Wojciech Kudla wrote:
>>>
>>> Hi Mark,
>>>
>>> Thanks for your input and apologies for the delayed response.
>>>
>>> > If the platform included, say, an intrinsified System.nanoRealTime()
>>> method that returned clock_gettime(CLOCK_REALTIME), how much would
>>> that help developers in your unnamed industry?
>>>
>>> Exposing a realtime clock with nanosecond granularity in the JDK would be a great step forward. I should have made it clear that I represent the fintech corner (investment banking to be exact), but the issues my message touches upon span areas such as HPC, audio processing, gaming, and the defense industry, so it's not like we have an isolated case.
>>>
>>> > In a similar vein, if people are finding it necessary to “replace parts
>>> of NIO with hand-crafted native code” then it would be interesting to
>>> understand what their requirements are
>>>
>>> As for the other example I provided, making very short-lived syscalls such as recvmsg/recvmmsg, the premise is getting access to hardware timestamps on the ingress and egress ends, as well as enabling batch receive with a single syscall and otherwise exploiting features unavailable from the JDK (like access to the CMSG interface, scatter/gather, etc).
>>> There are also other examples of calls that we'd love to make often and at the lowest possible cost (e.g. getrusage), but I'm not sure if there's a strong case for some of these ideas; that's why it might be worth looking into a more generic approach for performance-sensitive code.
>>> Hope this does a better job of explaining where we're coming from than my previous messages.
>>>
>>> Thanks,
>>> W
>>>
>>> On Tue, Jun 7, 2022 at 6:31 PM <mark.reinhold at oracle.com> wrote:
>>>>
>>>> 2022/6/6 0:24:17 -0700, wkudla.kernel at gmail.com:
>>>> >> Yes for System.nanoTime(), but System.currentTimeMillis() reports
>>>> >> CLOCK_REALTIME.
>>>> >
>>>> > Unfortunately System.currentTimeMillis() offers only millisecond
>>>> > granularity which is the reason why our industry has to resort to
>>>> > clock_gettime.
>>>>
>>>> If the platform included, say, an intrinsified System.nanoRealTime()
>>>> method that returned clock_gettime(CLOCK_REALTIME), how much would
>>>> that help developers in your unnamed industry?
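
For comparison, the closest thing expressible with the existing `java.time` API looks roughly like the sketch below. `System.nanoRealTime()` does not exist; `RealTimeNanos.epochNanos` is a hypothetical helper, and the resolution actually delivered by `Instant.now()` depends on the platform and JDK version, so this only has the same return shape as the proposed method, not its cost or precision guarantees.

```java
import java.time.Instant;

public class RealTimeNanos {
    /** Wall-clock time as nanoseconds since the epoch, built from Instant.now(). */
    static long epochNanos() {
        Instant now = Instant.now();
        // Combine whole seconds and the nanosecond-of-second part, failing on overflow.
        return Math.addExact(
                Math.multiplyExact(now.getEpochSecond(), 1_000_000_000L),
                now.getNano());
    }

    public static void main(String[] args) {
        System.out.println("epoch nanos  = " + epochNanos());
        System.out.println("epoch millis = " + System.currentTimeMillis());
    }
}
```

Unlike an intrinsified method, this allocates an `Instant` on every call unless the JIT eliminates it.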
>>>>
>>>> In a similar vein, if people are finding it necessary to “replace parts
>>>> of NIO with hand-crafted native code” then it would be interesting to
>>>> understand what their requirements are.  Some simple enhancements to
>>>> the NIO API would be much less costly to design and implement than a
>>>> generalized user-level native-call intrinsification mechanism.
>>>>
>>>> - Mark