RFR: 8295159: DSO created with -ffast-math breaks Java floating-point arithmetic [v7]
Vladimir Ivanov
vlivanov at openjdk.org
Tue Oct 18 18:34:27 UTC 2022
On Wed, 12 Oct 2022 17:00:15 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> A bug in GCC causes shared libraries linked with -ffast-math to disable denormal arithmetic. This breaks Java's floating-point semantics.
>>
>> The bug is https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55522
>>
>> One solution is to save and restore the floating-point control word around System.loadLibrary(). This isn't perfect, because some shared library might load another shared library at runtime, but it's a lot better than what we do now.
>>
>> However, this fix is not complete. `dlopen()` is called from many places in the JDK. I guess the best thing to do is find and wrap them all. I'd like to hear people's opinions.
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
>
> 8295159: DSO created with -ffast-math breaks Java floating-point arithmetic
So, IMO the discussion boils down to how we want a misbehaving native library to be handled by the JVM.
The ABI lists MXCSR as a callee-saved register, so there's nothing wrong on the JVM side from that perspective.
From a quality of implementation perspective though, the JVM could do a better job at catching broken libraries. Of course, there are numerous ways for native code to break the JVM, but in this particular case, it looks trivial to catch the problem. The question is how much overhead we can afford to introduce for that. The options are an opt-in solution (e.g., `-Xcheck:jni` or `-XX:+AlwaysRestoreFPU`/`-XX:+RestoreMXCSROnJNICalls`), an opt-out one (unconditionally recover or report an error when the FP environment is corrupted, optionally providing a way to turn it off), or a band-aid fix that only addresses the immediate problem with GCC's fast-math mode.
I'd like to dissuade us from going with just a band-aid fix (we already went through that multiple times with varying levels of success) and instead try to improve the overall experience the JVM provides. A band-aid feels like just pushing the problem further away, and it would be very unfortunate to repeat the very same exercise in the future.
My preferred solution would be to automatically detect the corruption and restore the MXCSR register across a JNI call. If that turns out to be too expensive, the JVM could instead check for MXCSR corruption after every JNI call and crash with a message giving diagnostic details about where the corruption happened (the library and entry point), suggesting `-XX:+AlwaysRestoreFPU`/`-XX:+RestoreMXCSROnJNICalls` as a stopgap. That would send users a clear signal that something is wrong with their code or environment, while still giving them a way to work around the problem until the underlying issue is fixed.
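To make the detect-and-restore idea concrete, here is a minimal C sketch for x86 with SSE. It is not HotSpot code and says nothing about how the JVM would actually implement this: `call_checked`, `well_behaved`, and `sets_ftz_daz` are hypothetical names, and the native call is modelled as a plain function pointer. The only real APIs used are the `_mm_getcsr`/`_mm_setcsr` intrinsics from `<xmmintrin.h>`.

```c
#include <stdio.h>
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr (x86 with SSE only) */

/* MXCSR bits 0-5 are sticky exception status flags. The other defined bits
   (DAZ, exception masks, rounding control, FTZ) are control bits. */
#define MXCSR_STATUS_MASK 0x3Fu

typedef void (*native_fn)(void);

/* Hypothetical wrapper: call fn, then compare the MXCSR control bits with
   their value before the call. Returns 0 if they are unchanged. Otherwise
   reports the change, restores the old control bits (keeping the current
   status flags), and returns 1. */
static int call_checked(native_fn fn, const char *name) {
    unsigned int before = _mm_getcsr();
    fn();
    unsigned int after = _mm_getcsr();

    if (((before ^ after) & ~MXCSR_STATUS_MASK) == 0) {
        return 0;  /* control bits preserved, nothing to do */
    }
    fprintf(stderr, "%s changed MXCSR control bits: 0x%04x -> 0x%04x\n",
            name, before, after);
    _mm_setcsr((after & MXCSR_STATUS_MASK) | (before & ~MXCSR_STATUS_MASK));
    return 1;
}

/* Stand-ins for native code (assumptions for illustration only). */
static void well_behaved(void) { }

static void sets_ftz_daz(void) {
    /* FTZ is bit 15 (0x8000), DAZ is bit 6 (0x0040). */
    _mm_setcsr(_mm_getcsr() | 0x8040u);
}
```

Calling `call_checked(sets_ftz_daz, "sets_ftz_daz")` reports the change and puts the control bits back, while `call_checked(well_behaved, "well_behaved")` costs only two MXCSR reads and a compare. The crash-with-diagnostics variant would replace the `_mm_setcsr` call with an abort after printing the message.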
That said, I'd like to stress that I'm perfectly fine with addressing the general issue of misbehaving native libraries separately (if we agree it's worth it), and I trust @dholmes-ora and @theRealAph to choose the most appropriate fix for this particular bug.
-------------
PR: https://git.openjdk.org/jdk/pull/10661