RFR: 8295159: DSO created with -ffast-math breaks Java floating-point arithmetic [v7]

Tue Oct 25 13:33:56 UTC 2022

On Wed, 12 Oct 2022 17:00:15 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> A bug in GCC causes shared libraries linked with -ffast-math to disable denormal arithmetic. This breaks Java's floating-point semantics.
>> 
>> The bug is https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55522
>> 
>> One solution is to save and restore the floating-point control word around System.loadLibrary(). This isn't perfect, because some shared library might load another shared library at runtime, but it's a lot better than what we do now. 
>> 
>> However, this fix is not complete. `dlopen()` is called from many places in the JDK. I guess the best thing to do is find and wrap them all. I'd like to hear people's opinions.
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
> 
>   8295159: DSO created with -ffast-math breaks Java floating-point arithmetic
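
The save-and-restore idea from the quoted description can be sketched in C roughly as follows. This is an illustration under stated assumptions, not the code in the pull request: it is x86-only, `call_preserving_mxcsr` and `fake_fast_math_loader` are hypothetical names, and the loader callback stands in for a wrapper around `dlopen()`. The stand-in loader sets the FTZ and DAZ bits, which is what GCC's `-ffast-math` startup code does on x86.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* _mm_getcsr, _mm_setcsr (x86 SSE only) */

#define MXCSR_FTZ 0x8000u  /* flush-to-zero: subnormal results become 0 */
#define MXCSR_DAZ 0x0040u  /* denormals-are-zero: subnormal inputs read as 0 */

/* Hypothetical helper: call a library loader with MXCSR saved beforehand
   and restored afterwards, so that library initializers cannot change
   the caller's floating-point control bits. */
void *call_preserving_mxcsr(void *(*loader)(const char *), const char *path) {
    unsigned int saved = _mm_getcsr();
    void *handle = loader(path);      /* e.g. a wrapper around dlopen() */
    _mm_setcsr(saved);
    return handle;
}

/* Hypothetical stand-in for loading a DSO built with -ffast-math:
   it switches on FTZ and DAZ and returns no handle. */
void *fake_fast_math_loader(const char *path) {
    (void)path;
    _mm_setcsr(_mm_getcsr() | MXCSR_FTZ | MXCSR_DAZ);
    return NULL;
}
```

As the description notes, a wrapper like this only covers the load itself; a library that loads another library later, outside the wrapper, can still change the control bits.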

I now have some performance results. `java.lang.foreign.CallOverheadConstant` is the test that I used to measure JNI overhead.

At present, without `-XX:+RestoreMXCSROnJNICalls`, it looks like this:

Benchmark                          Mode  Cnt  Score   Error  Units
CallOverheadConstant.jni_blank     avgt   40  9.968 ± 0.037  ns/op
CallOverheadConstant.panama_blank  avgt   40  8.745 ± 0.012  ns/op

Enabling `-XX:+RestoreMXCSROnJNICalls` makes the overhead much worse:

Benchmark                          Mode  Cnt   Score   Error  Units
CallOverheadConstant.jni_blank     avgt   40  14.741 ± 0.031  ns/op
CallOverheadConstant.panama_blank  avgt   40  14.620 ± 0.022  ns/op

and with JMH perfasm we can see why:

                0x00007f9f43d5698d:   sub    rsp,0x8
   1.56%        0x00007f9f43d56991:   vstmxcsr DWORD PTR [rsp]
  25.01%        0x00007f9f43d56996:   mov    eax,DWORD PTR [rsp]
  11.09%        0x00007f9f43d56999:   and    eax,0xffc0
                0x00007f9f43d5699e:   cmp    eax,DWORD PTR [rip+0xe02d234]        # 0x00007f9f51d83bd8

That adds about 50% to the total JNI overhead and about 70% to the Panama overhead.
25% of the total elapsed time is spent reading MXCSR! Reading MXCSR is expensive, so let's not do that.
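
For readers who prefer C to assembly, the check in the listing above corresponds roughly to the sketch below. The listing only shows the read, the mask and the compare; the conditional restore, the function name and the expected value `0x1f80` (the architectural reset value of MXCSR) are assumptions made for illustration, not taken from HotSpot. The mask `0xffc0` drops the six exception status flags in bits 0 to 5 and keeps the control bits, including FTZ and DAZ.

```c
#include <xmmintrin.h>   /* _mm_getcsr, _mm_setcsr (x86 SSE only) */

#define MXCSR_CONTROL_MASK 0xffc0u  /* same mask as `and eax,0xffc0` */
#define MXCSR_EXPECTED     0x1f80u  /* assumption: architectural default */

/* Hypothetical helper: returns 1 if the control bits had been changed
   and were reset, 0 if they already had the expected value. */
int restore_mxcsr_if_changed(void) {
    unsigned int csr = _mm_getcsr();   /* the expensive stmxcsr read */
    if ((csr & MXCSR_CONTROL_MASK) == MXCSR_EXPECTED) {
        return 0;
    }
    _mm_setcsr(MXCSR_EXPECTED);
    return 1;
}
```

The cost measured above comes from the read on the first line of the function, which has to happen on every call even when nothing has changed.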

So, after a lot of head scratching, I've invented an instruction sequence that doesn't read MXCSR but does a little arithmetic. With it, `-XX:+RestoreMXCSROnJNICalls` gives:

Benchmark                          Mode  Cnt   Score   Error  Units
CallOverheadConstant.jni_blank     avgt   40  10.675 ± 0.100  ns/op
CallOverheadConstant.panama_blank  avgt   40  10.284 ± 0.018  ns/op

That is about 7% added overhead for JNI and about 18% for Panama. 1 ns is 3.5 machine cycles here: that's a bit less than the latency of a load from L1 cache.
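
The overhead percentages in this message can be rechecked from the Score columns with a few lines of C. The function names below are hypothetical, and the clock rate of 3.5 cycles per ns is taken from the statement that 1 ns is 3.5 machine cycles.

```c
/* Percentage by which `after` exceeds `before`; both are in ns/op. */
double added_overhead_percent(double before, double after) {
    return 100.0 * (after - before) / before;
}

/* Convert nanoseconds to machine cycles, given cycles per ns (GHz). */
double ns_to_cycles(double ns, double cycles_per_ns) {
    return ns * cycles_per_ns;
}
```

With the existing `RestoreMXCSROnJNICalls` code, `added_overhead_percent(9.968, 14.741)` comes to about 47.9 for JNI and `added_overhead_percent(8.745, 14.620)` to about 67.2 for Panama. With the new sequence the figures are about 7.1 and 17.6, and the 0.707 ns added to the JNI call is about 2.5 cycles at 3.5 cycles per ns.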

I'm wondering if I could get away with fixing `RestoreMXCSROnJNICalls` and turning it on by default.

-------------

PR: https://git.openjdk.org/jdk/pull/10661