Critical JNI and (Shenandoah) pinning questions

Sat Aug 24 17:24:48 UTC 2019

Hi Florian,

On Fri, 23 Aug 2019 at 13:04, Florian Weimer <fweimer at redhat.com> wrote:
> Why isn't the VZEROUPPER performed after using AVX2 registers?  It's
> supposed to cost approximately zero in that context.

No idea.

> Have you tried replacing the VZEROUPPER with a NOP of equal length?
> Maybe it's just an instruction alignment issue.

No, and I'm not really qualified to do such low-level tuning. I'm only
trying to demonstrate that the current implementation is not ideal. A
JVM expert should review it, test across different hardware and
architectures, and decide what the best approach is.

After further testing today, I've also identified the other source of
overhead in JDK 10+ compared to JDK 8: JDK-8213436 [1]. This explains
why non-critical JNI with the patched JDK 14 is not as fast as JDK 8.
The reasoning behind switching to UseMembar by default is sound, but
I'm wondering whether the memory barrier mask is unnecessarily strict.
It currently looks like this (all bits are used):

__ membar(Assembler::Membar_mask_bits(
            Assembler::LoadLoad | Assembler::LoadStore |
            Assembler::StoreLoad | Assembler::StoreStore));
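
To make the question concrete, here is a minimal standalone model (not
HotSpot code) of the four ordering bits and of which of them would need
an explicit fence under a total-store-order memory model such as x86.
The enum values, kFullMask and needs_fence_on_tso() are assumptions made
for illustration; the real values and the real lowering live in the
platform assembler and should be checked there.

```cpp
#include <cstdint>

// Hypothetical stand-in for the assembler's ordering bits. The numeric
// values are an assumption for this sketch, not copied from HotSpot.
enum MembarMaskBits : std::uint32_t {
    LoadLoad   = 1u << 0,
    StoreLoad  = 1u << 1,
    LoadStore  = 1u << 2,
    StoreStore = 1u << 3
};

// The mask shown above: all four orderings requested.
constexpr std::uint32_t kFullMask =
    LoadLoad | LoadStore | StoreLoad | StoreStore;

// Under a total-store-order model the hardware may only let a later
// load pass an earlier store, so StoreLoad is the only ordering that
// requires an explicit fence instruction.
constexpr bool needs_fence_on_tso(std::uint32_t mask) {
    return (mask & StoreLoad) != 0;
}
```

In this model the full mask and a StoreLoad-only mask cost the same on
TSO hardware, so any gain from a narrower mask would be expected on
architectures with weaker memory models. Whether StoreLoad itself can be
dropped depends on the safepoint protocol, which this sketch does not
model.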

Benchmark results on a Coffee Lake Xeon (better single-core performance
than my Ryzen):

JDK 8
    Standard JNI: ~4.3ns
    Critical JNI: ~4.3ns
JDK 8 -XX:+UseMembar
    Standard JNI: ~8.0ns
    Critical JNI: ~7.7ns
JDK 12
    Standard JNI: ~8.4ns
    Critical JNI: ~8.1ns
Patched JDK 14 with VZEROUPPER
    Standard JNI: ~8.4ns
    Critical JNI: ~3.4ns (!)
Patched JDK 14 without VZEROUPPER
    Standard JNI: ~7.9ns
    Critical JNI: ~2.9ns (!!)

Note that the VZEROUPPER overhead is smaller here than on Ryzen (about
0.5ns per call in both the standard and critical cases), but it is
still measurable.
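
The deltas implied by the table can be spelled out with a little
arithmetic. The sketch below only restates the approximate numbers
listed above; the struct and function names are hypothetical, and the
results inherit the "~" rounding of the measurements.

```cpp
// Approximate per-call timings in nanoseconds, copied from the results
// listed above.
struct Result {
    const char* config;
    double standard_ns;
    double critical_ns;
};

const Result kJdk8            = {"JDK 8", 4.3, 4.3};
const Result kJdk8Membar      = {"JDK 8 -XX:+UseMembar", 8.0, 7.7};
const Result kPatchedWithVzu  = {"Patched JDK 14 with VZEROUPPER", 8.4, 3.4};
const Result kPatchedNoVzu    = {"Patched JDK 14 without VZEROUPPER", 7.9, 2.9};

// Extra cost of -XX:+UseMembar on JDK 8, standard JNI (about 3.7ns).
inline double membar_cost_standard() {
    return kJdk8Membar.standard_ns - kJdk8.standard_ns;
}

// Extra cost of VZEROUPPER on the patched JDK 14 (about 0.5ns each).
inline double vzeroupper_cost_standard() {
    return kPatchedWithVzu.standard_ns - kPatchedNoVzu.standard_ns;
}
inline double vzeroupper_cost_critical() {
    return kPatchedWithVzu.critical_ns - kPatchedNoVzu.critical_ns;
}
```

Read this way, the barrier accounts for most of the gap between JDK 8
and the newer builds for standard JNI, while VZEROUPPER adds a smaller
but consistent cost to both kinds of call.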

These findings suggest the following RFEs:

1. Skip check_needs_gc_for_critical_native() in primitive-only JNI
critical natives, regardless of GC algorithm and object-pinning
support. Without this change, CriticalJNINatives is completely useless
and actually dangerous.

2. Skip the switch to "native transition" and the safepoint polling in
primitive-only JNI critical natives.

3. Re-evaluate the use of VZEROUPPER instructions throughout the JNI
wrapper code. Could there be fewer of them? Could they be eliminated
entirely or emitted only for the specific CPU models that need them?
This would benefit both standard and critical JNI natives.

4. Re-evaluate the memory barrier emitted before the safepoint poll.
Could it be relaxed while preserving correctness? This would benefit
standard JNI natives; critical JNI natives skip the barrier with #2.

5. Backport any changes for #1-4 to JDK 8u & 11u. JDK 8 also needs
JDK-8167408 [2] and JDK-8167409 [3].
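
For readers less familiar with the feature, a "primitive-only" critical
native in the sense of #1 and #2 takes and returns only primitive
values. A minimal sketch of the two native entry points follows. It
assumes a hypothetical Java class Bench (default package) declaring
"static native int add(int a, int b)", and it replaces jint, JNIEnv and
jclass with stand-in types so that it compiles without <jni.h>.

```cpp
#include <cstdint>

// Stand-ins so the sketch compiles without <jni.h>. Real code would
// include <jni.h> and use its jint, JNIEnv and jclass definitions.
using jint = std::int32_t;
struct JNIEnv;
using jclass = void*;

extern "C" {

// Standard JNI entry point: receives the JNIEnv pointer and the class
// in addition to the Java-level arguments.
jint Java_Bench_add(JNIEnv* /*env*/, jclass /*cls*/, jint a, jint b) {
    return a + b;
}

// Critical JNI entry point for the same method: "JavaCritical_" prefix,
// no JNIEnv and no jclass, only the primitive arguments.
jint JavaCritical_Bench_add(jint a, jint b) {
    return a + b;
}

}  // extern "C"
```

Because such a function never touches the Java heap or the JNIEnv, it
is the case where the GC check and the native-transition bookkeeping
look avoidable.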

The "skip_native_trans" patch implements #1 and #2. I would also gladly
help with #5 if necessary (haven't signed the OCA yet, but I will).
Ideally, HotSpot engineers would take a look at #3 and #4 and test
everything thoroughly.

Thanks,

- Ioannis

[1] https://bugs.openjdk.java.net/browse/JDK-8213436
[2] https://bugs.openjdk.java.net/browse/JDK-8167408
[3] https://bugs.openjdk.java.net/browse/JDK-8167409