RFR: 8295159: DSO created with -ffast-math breaks Java floating-point arithmetic [v7]
Andrew Haley
aph at openjdk.org
Tue Oct 25 13:33:56 UTC 2022
On Wed, 12 Oct 2022 17:00:15 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> A bug in GCC causes shared libraries linked with -ffast-math to disable denormal arithmetic. This breaks Java's floating-point semantics.
>>
>> The bug is https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55522
>>
>> One solution is to save and restore the floating-point control word around System.loadLibrary(). This isn't perfect, because some shared library might load another shared library at runtime, but it's a lot better than what we do now.
>>
>> However, this fix is not complete. `dlopen()` is called from many places in the JDK. I guess the best thing to do is find and wrap them all. I'd like to hear people's opinions.
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
>
> 8295159: DSO created with -ffast-math breaks Java floating-point arithmetic
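For reference, the save-and-restore idea in the quoted description can be sketched in C roughly as follows. This is a minimal sketch, not the code in the PR: it covers only the x86 MXCSR register (not the x87 control word or other architectures), and `loader_fn`, `load_preserving_mxcsr` and `fast_math_loader` are hypothetical names invented for illustration. `fast_math_loader` stands in for a `dlopen()` of a library whose startup code turns on flush-to-zero and denormals-are-zero.

```c
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr (x86 SSE only) */

/* Hypothetical callback type standing in for a library load such as dlopen(). */
typedef void *(*loader_fn)(const char *name);

/* Run the loader, then put MXCSR back to whatever the caller had before. */
void *load_preserving_mxcsr(loader_fn load, const char *name)
{
    unsigned int saved = _mm_getcsr();  /* save control/status register */
    void *handle = load(name);          /* may change MXCSR behind our back */
    _mm_setcsr(saved);                  /* restore the caller's settings */
    return handle;
}

/* Hypothetical misbehaving loader: sets FTZ (bit 15) and DAZ (bit 6),
   which is the effect described for libraries linked with -ffast-math. */
void *fast_math_loader(const char *name)
{
    (void)name;
    _mm_setcsr(_mm_getcsr() | 0x8040u);
    return (void *)0;
}
```

Calling `load_preserving_mxcsr(fast_math_loader, "libexample.so")` leaves the FTZ and DAZ bits as the caller had them, whereas calling `fast_math_loader` directly leaves them set. As the description says, this does not help if a library loads another library later.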
I now have some performance results. `java.lang.foreign.CallOverheadConstant` is the test that I used to measure JNI overhead.
At present, without `-XX:+RestoreMXCSROnJNICalls`, it looks like this:
Benchmark                          Mode  Cnt   Score   Error  Units
CallOverheadConstant.jni_blank     avgt   40   9.968 ± 0.037  ns/op
CallOverheadConstant.panama_blank  avgt   40   8.745 ± 0.012  ns/op
Enabling `-XX:+RestoreMXCSROnJNICalls` makes the overhead much worse:
Benchmark                          Mode  Cnt   Score   Error  Units
CallOverheadConstant.jni_blank     avgt   40  14.741 ± 0.031  ns/op
CallOverheadConstant.panama_blank  avgt   40  14.620 ± 0.022  ns/op
and with JMH perfasm we can see why:
          0x00007f9f43d5698d: sub      rsp,0x8
  1.56%   0x00007f9f43d56991: vstmxcsr DWORD PTR [rsp]
 25.01%   0x00007f9f43d56996: mov      eax,DWORD PTR [rsp]
 11.09%   0x00007f9f43d56999: and      eax,0xffc0
          0x00007f9f43d5699e: cmp      eax,DWORD PTR [rip+0xe02d234]  # 0x00007f9f51d83bd8
That adds roughly 50% to the total JNI overhead and roughly 70% to the Panama overhead.
25% of the total elapsed time is spent reading MXCSR! Reading MXCSR is expensive, so we shouldn't do that.
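To spell out what the listed sequence checks: it stores MXCSR, masks the value with `0xffc0` (which drops the six sticky exception-status flags in bits 0 to 5) and compares the result with a reference value held in memory. A minimal sketch of that comparison in C is below. The reference value is an assumption here: `0x1f80` is the x86 power-on default for MXCSR, but the listing only shows a load from memory, so the value the JVM actually compares against is not visible. The names `MXCSR_CONTROL_MASK`, `MXCSR_ASSUMED_STD` and `mxcsr_needs_restore` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask taken from the listing (and eax,0xffc0): keeps the control bits,
   drops the sticky exception flags in bits 0-5. */
#define MXCSR_CONTROL_MASK 0xffc0u

/* ASSUMPTION: the x86 power-on default, used here as the reference value. */
#define MXCSR_ASSUMED_STD 0x1f80u

/* True when the control bits differ from the assumed reference value,
   i.e. when something such as FTZ or DAZ has been switched on. */
bool mxcsr_needs_restore(uint32_t mxcsr)
{
    return (mxcsr & MXCSR_CONTROL_MASK) != MXCSR_ASSUMED_STD;
}
```

With these definitions, a value that differs from the default only in its exception flags does not trigger a restore, while a value with FTZ (bit 15) or DAZ (bit 6) set does. The cost measured above comes from obtaining the `mxcsr` argument in the first place, not from the mask and compare.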
So, after a lot of head scratching, I've invented an instruction sequence which doesn't read MXCSR but does a little arithmetic, and with `-XX:+RestoreMXCSROnJNICalls` the result is:
CallOverheadConstant.jni_blank     avgt   40  10.675 ± 0.100  ns/op
CallOverheadConstant.panama_blank  avgt   40  10.284 ± 0.018  ns/op
That is 7% added overhead for JNI and 17% for Panama. 1ns is 3.5 machine cycles, which is a bit less than the latency of a load from L1 cache.
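The percentages can be rechecked from the scores in the tables above. The following is only a small arithmetic sketch in C. The helper names are hypothetical, and the 3.5 GHz clock is inferred from the "1ns is 3.5 machine cycles" figure rather than stated anywhere.

```c
/* Added cost of `measured_ns` relative to `baseline_ns`, in percent. */
double added_overhead_percent(double baseline_ns, double measured_ns)
{
    return (measured_ns - baseline_ns) / baseline_ns * 100.0;
}

/* Convert a duration to machine cycles for a given clock frequency.
   ASSUMPTION: callers pass 3.5 (GHz), inferred from the text above. */
double ns_to_cycles(double ns, double clock_ghz)
{
    return ns * clock_ghz;
}
```

For example, `added_overhead_percent(9.968, 10.675)` comes to about 7.1 for JNI and `added_overhead_percent(8.745, 10.284)` to about 17.6 for Panama, while the MXCSR-reading version gives about 47.9 and 67.2 respectively.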
I'm wondering if I could get away with fixing `RestoreMXCSROnJNICalls` and turning it on by default.
-------------
PR: https://git.openjdk.org/jdk/pull/10661