Critical JNI and (Shenandoah) pinning questions

Ioannis Tsakpinis iotsakp at gmail.com
Wed Aug 21 12:42:42 UTC 2019


Hi David and Zhengyu,

I have prepared a benchmark similar to what I've used in my testing,
here:

https://github.com/Spasi/JDK-8229895

The README includes instructions and some performance results that I'm
seeing locally. The called native functions do nothing at all and they
are tested both with and without CriticalJNINatives. The slowdown when
going from JDK 8 to JDK 10+ is obvious on Windows & Linux, but I
was not able to reproduce it on macOS.
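
For reference, the called natives are no-ops along these lines (an
illustrative sketch with hypothetical class/method names, not the actual
benchmark code - see the repository for the real sources):

#include <jni.h>

extern "C" {

// Standard JNI entry point for a hypothetical
// "static native void noop(long value);" on class bench.Natives.
// JNIEnv* and jclass are received but never used.
JNIEXPORT void JNICALL Java_bench_Natives_noop(JNIEnv *env, jclass clazz, jlong value) {
}

// Critical entry point for the same method (CriticalJNINatives): the
// JNIEnv*/jclass parameters are dropped, the primitive argument stays.
JNIEXPORT void JNICALL JavaCritical_bench_Natives_noop(jlong value) {
}

} // extern "C"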

- Ioannis

Ioannis Tsakpinis
Backend Engineer | WebHotelier
https://www.webhotelier.net

email:  iotsakp at gmail.com
skype:  iotsakp
phone:  +30 6949 191 475


On Wed, 21 Aug 2019 at 14:43, David Holmes <david.holmes at oracle.com> wrote:
>
> Hi Zhengyu,
>
> I don't subscribe to shenandoah-dev so this reply will only go to
> yourself, Ioannis and Aleksey (initially anyway).
>
> On 21/08/2019 9:19 pm, Zhengyu Gu wrote:
> > Hi David,
> >
> > I did not run the benchmark, Ioannis (cc'd) provided the numbers.
>
> It would be good to have that information added to the bug - thanks.
>
> > What Ioannis proposed does make sense to me. The implementation is
> > rather trivial, e.g. for x86_64:
> >
> > diff -r 348f7933e2cc src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp
> > --- a/src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp      Tue Aug 20 15:40:49 2019 +0100
> > +++ b/src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp      Wed Aug 21 07:12:55 2019 -0400
> > @@ -2044,6 +2044,8 @@
> >
> >     // Now the space for the inbound oop handle area
> >     int total_save_slots = 6 * VMRegImpl::slots_per_word;  // 6 arguments passed in registers
> > +  bool has_array_params = false;
> > +
> >     if (is_critical_native) {
> >       // Critical natives may have to call out so they need a save area
> >       // for register arguments.
> > @@ -2059,6 +2061,7 @@
> >             case T_CHAR:
> >             case T_INT:  single_slots++; break;
> >             case T_ARRAY:  // specific to LP64 (7145024)
> > +            has_array_params = true;
> >             case T_LONG: double_slots++; break;
> >             default:  ShouldNotReachHere();
> >           }
> > @@ -2231,7 +2234,7 @@
> >
> >     const Register oop_handle_reg = r14;
> >
> > -  if (is_critical_native && !Universe::heap()->supports_object_pinning()) {
> > +  if (is_critical_native && has_array_params && !Universe::heap()->supports_object_pinning()) {
> >       check_needs_gc_for_critical_native(masm, stack_slots, total_c_args, total_in_args,
> >                                          oop_handle_offset, oop_maps, in_regs, in_sig_bt);
> >     }
> >
> >
> > What do you think?
> It seems like a simple proposition, but I'm not familiar enough with this
> aspect of critical natives to really comment at this stage.
>
> Thanks,
> David
>
> > Thanks,
> >
> > -Zhengyu
> >
> >
> >
> > On 8/19/19 3:52 PM, Zhengyu Gu wrote:
> >> Hi Ioannis,
> >>
> >> Ah, I got what you proposed now. I filed following bug, hopefully, to
> >> spark some discussions.
> >>
> >> https://bugs.openjdk.java.net/browse/JDK-8229895
> >>
> >> Thanks,
> >>
> >> -Zhengyu
> >>
> >> On 8/19/19 12:05 PM, Ioannis Tsakpinis wrote:
> >>> Hey Zhengyu,
> >>>
> >>>> If there are no array parameters, there is no point to use
> >>>> Get/SetXXXArrayCritical or native critical methods in the first place.
> >>>
> >>> It's true that CriticalJNINatives were added as an efficient way to
> >>> access Java arrays from JNI code. However, JNI call overhead affects
> >>> all methods, especially methods that accept or return only primitive
> >>> values and whose JNI code does nothing but pass the arguments on to
> >>> another native function.
> >>>
> >>> There are thousands of JNI functions in LWJGL and almost all of them
> >>> are like that: they simply cast the arguments to the appropriate types
> >>> and pass them to a target native function. Libraries like JNR and other
> >>> JNI binding generators look the same.
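> >>>
> >>> As an illustration (hypothetical names, not actual LWJGL code), such a
> >>> wrapper is nothing more than a cast-and-forward call:
> >>>
> >>> #include <jni.h>
> >>> #include <stdint.h>
> >>>
> >>> // Stand-in for a function exported by the bound native library.
> >>> extern "C" void libFoo_process(int32_t mode, int64_t size, const void *data);
> >>>
> >>> // Standard JNI wrapper: casts the arguments and forwards them.
> >>> // JNIEnv* and jclass are unused; an optimizing compiler reduces the
> >>> // body to an argument shuffle followed by a jump to libFoo_process.
> >>> extern "C" JNIEXPORT void JNICALL Java_org_example_Foo_nprocess(
> >>>     JNIEnv *env, jclass clazz, jint mode, jlong size, jlong data) {
> >>>   libFoo_process((int32_t)mode, (int64_t)size, (const void *)(intptr_t)data);
> >>> }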
> >>>
> >>> The major benefit of using CriticalJNINatives for such functions is the
> >>> removal of the first two standard JNI parameters: JNIEnv* and jclass.
> >>> Normally that would only mean less register pressure, which may help in
> >>> some cases. In practice though, native compilers are able to optimize
> >>> away any argument shuffling and convert everything to a simple
> >>> tail-call, i.e. a single jump instruction.
> >>>
> >>> We go from this for standard JNI:
> >>>
> >>> Java -> shuffle arguments -> JNI -> shuffle arguments -> native call
> >>>
> >>> to this for critical JNI:
> >>>
> >>> Java -> shuffle arguments -> JNI -> native call
> >>>
> >>> Example code and assembly output: https://godbolt.org/z/qZRIi1
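> >>>
> >>> Continuing the sketch above (same hypothetical wrapper), the critical
> >>> variant looks like this:
> >>>
> >>> // Critical entry point for the same method: no JNIEnv*/jclass, so on
> >>> // common 64-bit ABIs the incoming argument registers already match the
> >>> // target's calling convention and the compiler can emit a single
> >>> // tail-call jump.
> >>> extern "C" JNIEXPORT void JNICALL JavaCritical_org_example_Foo_nprocess(
> >>>     jint mode, jlong size, jlong data) {
> >>>   libFoo_process((int32_t)mode, (int64_t)size, (const void *)(intptr_t)data);
> >>> }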
> >>>
> >>> This has a measurable effect on JNI call overhead and becomes more
> >>> important the simpler the target native function is. With Project Panama
> >>> there is no JNI function and it should be possible to optimize the first
> >>> argument shuffling too. Until then, this is the best we can do, unless
> >>> there are opportunities to slim down the JNI wrapper even further for
> >>> critical native methods (e.g. remove the safepoint polling if it's safe
> >>> to do so).
> >>>
> >>> To sum up, the motivation is reduced JNI overhead. My argument is that
> >>> primitive-only functions could benefit from significant overhead
> >>> reduction with CriticalJNINatives. However, the GC locking effect is a
> >>> major and unnecessary disadvantage. Shenandoah does a perfect job here
> >>> because it supports region pinning and there's no actual locking
> >>> happening in primitive-only functions. Every other GC, though, will
> >>> choke hard on applications that make heavy use of critical natives
> >>> (such as typical LWJGL applications). So, two requests:
> >>>
> >>> - PRIMARY: Skip check_needs_gc_for_critical_native() in primitive-only
> >>> functions, regardless of GC algorithm and object-pinning support.
> >>>
> >>> - BONUS: JNI call overhead is significantly higher (3-4 ns) on Java 10+
> >>> compared to Java 8 (with or without critical natives). I went through
> >>> the change history of sharedRuntime_x86_64.cpp but couldn't spot anything
> >>> that would justify such a difference (thread-local handshakes maybe?). I
> >>> was wondering if this is a performance regression that needs to be
> >>> looked into.
> >>>
> >>> Thank you,
> >>>
> >>> - Ioannis
> >>>

