RFR: 8292591: Experimentally add back barrier-less Java thread transitions

Fri Sep 2 06:40:07 UTC 2022

On Thu, 1 Sep 2022 16:47:58 GMT, Robbin Ehn <rehn at openjdk.org> wrote:

> Please consider, only implemented on x64/aarch64 linux/windows.
> 
> On my box calling clock_gettime via JNI goes from 35ns to 28ns when enabled.
> 
> Passes t1-7 with option forced on, also passes t1-4 as is in this PR.

Yes, sorry. Since we had this code before, I forgot to add an explanation.
This gives back around 75% of the performance gained by transition-less JNI calls (in one benchmark on JDK 17, critical gives 8ns and this gave 6ns).
Note that this comparison is against an accidentally optimized JNI critical: the removal was done in steps, and before the final step it was faster than the original implementation, although it was never intended to make it faster.
So it should be even closer to the original pre-JDK 17 numbers.
But note that this applies to all JNI methods, not just some special ones.

For a safepoint poll, the Java thread does:
1: Store an unsafe thread state as indication that we are entering the VM.
2: Check if entrance into the VM can be performed safely.

The VM thread (or a handshaker) does:
1: Store the polling word.
2: Read the thread state.

On both sides, step 1 must happen before step 2, which requires a store-load barrier between the two steps:

store unsafe thread state
store_load_barrier
load poll

store poll
store_load_barrier
load thread state
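
As an illustration only, the two sequences above can be modelled with C++ atomics. This is a minimal sketch, not HotSpot code: `PolledThread`, `kInNative`, `kInTransition` and both function names are hypothetical stand-ins, and `std::atomic_thread_fence(std::memory_order_seq_cst)` plays the role of store_load_barrier.

```cpp
#include <atomic>
#include <cassert>

namespace fenced {

// Hypothetical stand-ins; the real HotSpot thread states and polling word differ.
enum ThreadState : int { kInNative, kInTransition };

struct PolledThread {
    std::atomic<int>  state{kInNative};   // written by the Java thread
    std::atomic<bool> poll_armed{false};  // written by the VM thread
};

// Java thread side: returns true if it may proceed into the VM.
inline bool try_enter_vm(PolledThread& t) {
    t.state.store(kInTransition, std::memory_order_relaxed);  // 1: store unsafe thread state
    std::atomic_thread_fence(std::memory_order_seq_cst);      // store_load_barrier
    return !t.poll_armed.load(std::memory_order_relaxed);     // 2: load poll
}

// VM thread side: returns true if the thread was observed in a safe state.
inline bool arm_poll_and_check_safe(PolledThread& t) {
    t.poll_armed.store(true, std::memory_order_relaxed);          // 1: store poll
    std::atomic_thread_fence(std::memory_order_seq_cst);          // store_load_barrier
    return t.state.load(std::memory_order_relaxed) == kInNative;  // 2: load thread state
}

}  // namespace fenced
```

With a full fence on each side, at least one of the two threads must observe the other's store, so it cannot happen that the Java thread proceeds while the VM thread also believes it is in a safe state.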

This patch moves the store_load_barrier to the read of the thread state by using a system memory barrier, which makes sure we get program order: "guarantee that all its running thread siblings have passed through a state where all memory accesses to user-space addresses match program order"

store unsafe thread state
compiler_barrier
load poll

store poll
system_memory_barrier
load thread state
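
A rough sketch of this asymmetric variant, assuming Linux and the `membarrier(2)` system call as the system memory barrier. The names (`Transitions`, `PolledThread`, the `kCmd...` constants) are hypothetical, the constants are assumed to mirror the values in `<linux/membarrier.h>`, and this is not the code in the PR. If the system call is unavailable, the sketch keeps the ordinary fence on the fast path, since a compiler barrier alone would not be safe without the system-wide barrier on the other side.

```cpp
#include <atomic>
#include <cassert>

#if defined(__linux__)
#include <sys/syscall.h>
#include <unistd.h>
#endif

namespace asym {

enum ThreadState : int { kInNative, kInTransition };  // hypothetical states

struct PolledThread {
    std::atomic<int>  state{kInNative};
    std::atomic<bool> poll_armed{false};
};

#if defined(__linux__) && defined(SYS_membarrier)
// Assumed to mirror MEMBARRIER_CMD_PRIVATE_EXPEDITED and
// MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED from <linux/membarrier.h>.
constexpr int kCmdPrivateExpedited         = 1 << 3;
constexpr int kCmdRegisterPrivateExpedited = 1 << 4;
#endif

struct Transitions {
    bool asymmetric = false;  // true only if the process registered for membarrier

    Transitions() {
#if defined(__linux__) && defined(SYS_membarrier)
        asymmetric = syscall(SYS_membarrier, kCmdRegisterPrivateExpedited, 0, 0) == 0;
#endif
    }

    // Java thread side (fast path): only a compiler barrier when asymmetric.
    bool try_enter_vm(PolledThread& t) const {
        t.state.store(kInTransition, std::memory_order_relaxed);  // store unsafe thread state
        if (asymmetric) {
            std::atomic_signal_fence(std::memory_order_seq_cst);  // compiler_barrier
        } else {
            std::atomic_thread_fence(std::memory_order_seq_cst);  // fallback: real fence
        }
        return !t.poll_armed.load(std::memory_order_relaxed);     // load poll
    }

    // VM thread side (slow path): pays for the process-wide barrier.
    bool arm_poll_and_check_safe(PolledThread& t) const {
        t.poll_armed.store(true, std::memory_order_relaxed);      // store poll
#if defined(__linux__) && defined(SYS_membarrier)
        if (asymmetric) {
            (void)syscall(SYS_membarrier, kCmdPrivateExpedited, 0, 0);  // system_memory_barrier
        }
#endif
        std::atomic_thread_fence(std::memory_order_seq_cst);      // local ordering for this thread
        return t.state.load(std::memory_order_relaxed) == kInNative;  // load thread state
    }
};

}  // namespace asym
```

The fast path's correctness here rests on the operating system's guarantee rather than on the C++ memory model, which is why the cost moves entirely to whoever reads the thread state.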

As you said, this big hammer has a downside, since it must always be emitted before reading the thread state.
This hurts cases such as:
* Using the JFR sampler with short periods, or sampling many threads.
* Workloads with many safepoints or handshakes per second.

This means your overall performance may suffer, and only a few special workloads should notice a difference at all.

I have not changed all transitions to elide the store_load barrier, since the others do not impact performance and this PR is focused on native transitions.
If you think all transitions should honor this flag I can do a follow-up.

-------------

PR: https://git.openjdk.org/jdk/pull/10123