RFR: 8295159: DSO created with -ffast-math breaks Java floating-point arithmetic [v7]

Thu Oct 20 20:30:13 UTC 2022

On Wed, 12 Oct 2022 17:00:15 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> A bug in GCC causes shared libraries linked with -ffast-math to disable denormal arithmetic. This breaks Java's floating-point semantics.
>> 
>> The bug is https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55522
>> 
>> One solution is to save and restore the floating-point control word around System.loadLibrary(). This isn't perfect, because some shared library might load another shared library at runtime, but it's a lot better than what we do now. 
>> 
>> However, this fix is not complete. `dlopen()` is called from many places in the JDK. I guess the best thing to do is find and wrap them all. I'd like to hear people's opinions.
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
> 
>   8295159: DSO created with -ffast-math breaks Java floating-point arithmetic

That sounds like a very interesting idea. 

It would be very helpful to understand how much overhead `STMXCSR` plus a branch adds to the JNI stub, in order to decide whether it's worth optimizing.

The call stub already employs an optimization to save on writing to MXCSR:

    Label skip_ldmx;
    __ stmxcsr(mxcsr_save);
    __ movl(rax, mxcsr_save);
    __ andl(rax, 0xFFC0); // Mask out any pending exceptions (only check control and mask bits)
    ExternalAddress mxcsr_std(StubRoutines::x86::addr_mxcsr_std());
    __ cmp32(rax, mxcsr_std, rscratch1);
    __ jcc(Assembler::equal, skip_ldmx);
    __ ldmxcsr(mxcsr_std, rscratch1);
    __ bind(skip_ldmx);
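
For experimenting outside HotSpot, the stub's decision can be restated as plain integer logic. This is only a sketch, not JDK code: the names `kMxcsrControlMask`, `kMxcsrStd` and `needs_mxcsr_reload` are made up for illustration, and the constants are the `0xFFC0` mask from the stub and the `0x1F80` value of `_mxcsr_std`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names; the constants mirror the call stub.
constexpr std::uint32_t kMxcsrControlMask = 0xFFC0; // keeps bits 15:6, drops the exception status flags (bits 5:0)
constexpr std::uint32_t kMxcsrStd         = 0x1F80; // value the JVM expects (StubRoutines::x86::_mxcsr_std)

// True when the control and mask bits differ from the JVM standard,
// i.e. when the stub would fall through to LDMXCSR instead of skipping it.
constexpr bool needs_mxcsr_reload(std::uint32_t mxcsr) {
    return (mxcsr & kMxcsrControlMask) != kMxcsrStd;
}

// A few sample values, checked at run time.
inline void mxcsr_reload_examples() {
    assert(!needs_mxcsr_reload(0x1F80));          // standard value: skip the write
    assert(!needs_mxcsr_reload(0x1F80 | 0x003F)); // pending exception flags are ignored
    assert(needs_mxcsr_reload(0x1F80 | 0x8040));  // FTZ and DAZ set: reload needed
}
```

The point of the mask is that sticky exception flags alone never trigger the reload; only a change in the control or mask bits does.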

According to [uops.info](https://uops.info/html-instr/STMXCSR_M32.html), latencies for `STMXCSR` vary from 7-12 cycles on Intel to up to 20 on AMD. I haven't found any details about the actual implementations in silicon (I can't confirm whether it serializes execution), so I'm curious how much branch prediction can hide the latency in this particular case.

If it turns out to be worth optimizing `STMXCSR` away, I see other problematic cases:

    StubRoutines::x86::_mxcsr_std = 0x1F80;

    // MXCSR.b  10987654321098765432109876543210
    // 0xFFC0   00000000000000001111111111000000 // mask
    // 0x1F80   00000000000000000001111110000000 // MXCSR value used by JVM
    // 0x8040   00000000000000001000000001000000 // the bits -ffast-math mode unconditionally sets

    // MXCSR bits (value used by JVM):
    // 15    FTZ Flush to Zero                       0 (-ffast-math sets it to 1)
    // 14:13 RC  Rounding Control                   00
    // 12    PM  Precision Exception Mask            1
    // 11    UM  Underflow Exception Mask            1
    // 10    OM  Overflow Exception Mask             1
    // 9     ZM  Zero-Divide Exception Mask          1
    // 8     DM  Denormalized-Operand Exception Mask 1
    // 7     IM  Invalid-Operation Exception Mask    1
    // 6     DAZ Denormals Are Zeros                 0 (-ffast-math sets it to 1)

The GCC bug with `-ffast-math` only corrupts `FTZ` and `DAZ`.

But `RC` and the exception masks may be corrupted in the same way, and I believe the consequences would be similar (silent divergence in the results of FP computations).
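
To make the distinction between these cases concrete, here is a rough sketch that reports which control fields of an MXCSR value differ from `0x1F80`. It is not HotSpot code: `mxcsr_divergence` and the field table are hypothetical, and the bit positions are the ones listed in the table of MXCSR bits.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: names the MXCSR control fields that differ
// from the value the JVM uses (0x1F80).
std::vector<std::string> mxcsr_divergence(std::uint32_t mxcsr) {
    // Bits 31:16 of MXCSR are reserved and must be zero.
    assert((mxcsr >> 16) == 0);

    const std::uint32_t kStd = 0x1F80;
    struct Field { const char* name; std::uint32_t mask; };
    static const Field kFields[] = {
        {"FTZ",             1u << 15},   // bit 15
        {"RC",              3u << 13},   // bits 14:13
        {"exception masks", 0x3Fu << 7}, // bits 12:7
        {"DAZ",             1u << 6},    // bit 6
    };

    std::vector<std::string> diverged;
    for (const Field& f : kFields) {
        if ((mxcsr & f.mask) != (kStd & f.mask)) {
            diverged.push_back(f.name);
        }
    }
    return diverged;
}
```

For `0x1F80 | 0x8040` this reports `FTZ` and `DAZ`, the `-ffast-math` case. A value with a different rounding mode or an unmasked exception would be reported as `RC` or `exception masks`, which is the kind of corruption a fix limited to `FTZ`/`DAZ` would miss.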

-------------

PR: https://git.openjdk.org/jdk/pull/10661