RFR: 8345219: C2: Avoid bailing to interpreter stubs for signalling NaNs on x86_64
Aleksey Shipilev
shade at openjdk.org
Thu Nov 28 18:46:14 UTC 2024
On Thu, 28 Nov 2024 18:22:24 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:
> Found this while cleaning up x86_32 code for removal.
>
> In our current code there is a block added by [JDK-8076373](https://bugs.openjdk.org/browse/JDK-8076373):
> https://github.com/openjdk/jdk/blob/3b21a298c29d88720f6bfb2dc1f3305b6a3db307/src/hotspot/share/compiler/compileBroker.cpp#L1451-L1473
>
> Ostensibly, that block is for x86_32 handling of signalling NaNs -- the x87 FPU has a peculiarity with them. See other funky bugs we have seen with it: [JDK-8285985](https://bugs.openjdk.org/browse/JDK-8285985), [JDK-8293991](https://bugs.openjdk.org/browse/JDK-8293991).
>
> But the way the current block is coded, it is enabled for X86 wholesale, which also means x86_64! In fact, it is likely even worse on x86_64, because the related "fast" entries are generated only for x86_32:
> https://github.com/openjdk/jdk/blob/3b21a298c29d88720f6bfb2dc1f3305b6a3db307/src/hotspot/share/interpreter/templateInterpreterGenerator.cpp#L493-L502
>
> This can be solved by checking `IA32` instead of `X86`. This block would be gone completely once we remove the x86_32 port. Meanwhile, we can make it right by x86_64, and make the eventual x86_32 removal less confusing. This issue seems to affect only the compilation of native methods, while most of the hot code is riding on compiler intrinsics. I'll put performance data in comments.
>
> Additional testing:
> - [ ] Linux x86_64 server fastdebug, `all`
As expected, none of this matters when C2 intrinsics work:
Benchmark                                     Mode  Cnt   Score   Error  Units

# Baseline
DoubleBitConversion.doubleToLongBits_NaN      avgt    9   0.542 ± 0.001  ns/op
DoubleBitConversion.doubleToLongBits_one      avgt    9   0.542 ± 0.001  ns/op
DoubleBitConversion.doubleToLongBits_zero     avgt    9   0.542 ± 0.001  ns/op
DoubleBitConversion.doubleToRawLongBits_NaN   avgt    9   0.420 ± 0.041  ns/op
DoubleBitConversion.doubleToRawLongBits_one   avgt    9   0.413 ± 0.012  ns/op
DoubleBitConversion.doubleToRawLongBits_zero  avgt    9   0.412 ± 0.020  ns/op
DoubleBitConversion.longBitsToDouble_NaN      avgt    9   0.413 ± 0.007  ns/op
DoubleBitConversion.longBitsToDouble_one      avgt    9   0.409 ± 0.007  ns/op
DoubleBitConversion.longBitsToDouble_zero     avgt    9   0.414 ± 0.012  ns/op
FloatBitConversion.floatToIntBits_NaN         avgt    9   0.542 ± 0.001  ns/op
FloatBitConversion.floatToIntBits_one         avgt    9   0.542 ± 0.001  ns/op
FloatBitConversion.floatToIntBits_zero        avgt    9   0.542 ± 0.001  ns/op
FloatBitConversion.floatToRawIntBits_NaN      avgt    9   0.410 ± 0.005  ns/op
FloatBitConversion.floatToRawIntBits_one      avgt    9   0.412 ± 0.008  ns/op
FloatBitConversion.floatToRawIntBits_zero     avgt    9   0.413 ± 0.004  ns/op
FloatBitConversion.intBitsToFloat_NaN         avgt    9   0.412 ± 0.008  ns/op
FloatBitConversion.intBitsToFloat_one         avgt    9   0.413 ± 0.009  ns/op
FloatBitConversion.intBitsToFloat_zero        avgt    9   0.421 ± 0.022  ns/op

# Patched
DoubleBitConversion.doubleToLongBits_NaN      avgt    9   0.542 ± 0.001  ns/op
DoubleBitConversion.doubleToLongBits_one      avgt    9   0.542 ± 0.001  ns/op
DoubleBitConversion.doubleToLongBits_zero     avgt    9   0.542 ± 0.001  ns/op
DoubleBitConversion.doubleToRawLongBits_NaN   avgt    9   0.425 ± 0.036  ns/op
DoubleBitConversion.doubleToRawLongBits_one   avgt    9   0.418 ± 0.009  ns/op
DoubleBitConversion.doubleToRawLongBits_zero  avgt    9   0.416 ± 0.017  ns/op
DoubleBitConversion.longBitsToDouble_NaN      avgt    9   0.412 ± 0.004  ns/op
DoubleBitConversion.longBitsToDouble_one      avgt    9   0.412 ± 0.010  ns/op
DoubleBitConversion.longBitsToDouble_zero     avgt    9   0.414 ± 0.005  ns/op
FloatBitConversion.floatToIntBits_NaN         avgt    9   0.542 ± 0.001  ns/op
FloatBitConversion.floatToIntBits_one         avgt    9   0.542 ± 0.001  ns/op
FloatBitConversion.floatToIntBits_zero        avgt    9   0.542 ± 0.001  ns/op
FloatBitConversion.floatToRawIntBits_NaN      avgt    9   0.410 ± 0.005  ns/op
FloatBitConversion.floatToRawIntBits_one      avgt    9   0.408 ± 0.007  ns/op
FloatBitConversion.floatToRawIntBits_zero     avgt    9   0.413 ± 0.015  ns/op
FloatBitConversion.intBitsToFloat_NaN         avgt    9   0.411 ± 0.008  ns/op
FloatBitConversion.intBitsToFloat_one         avgt    9   0.409 ± 0.008  ns/op
FloatBitConversion.intBitsToFloat_zero        avgt    9   0.426 ± 0.011  ns/op

It matters a lot when the choice is between going through the interpreter native entry (slow) and the compiled native adapter (fast), roughly 97 ns/op versus 4.5 ns/op in the runs below:

# Baseline, -XX:-InlineMathNatives
DoubleBitConversion.doubleToLongBits_NaN      avgt    9   0.604 ± 0.015  ns/op
DoubleBitConversion.doubleToLongBits_one      avgt    9  97.382 ± 1.364  ns/op
DoubleBitConversion.doubleToLongBits_zero     avgt    9  97.636 ± 2.620  ns/op
DoubleBitConversion.doubleToRawLongBits_NaN   avgt    9  96.162 ± 0.513  ns/op
DoubleBitConversion.doubleToRawLongBits_one   avgt    9  98.678 ± 3.378  ns/op
DoubleBitConversion.doubleToRawLongBits_zero  avgt    9  97.374 ± 3.878  ns/op
DoubleBitConversion.longBitsToDouble_NaN      avgt    9  96.753 ± 3.659  ns/op
DoubleBitConversion.longBitsToDouble_one      avgt    9  97.173 ± 2.879  ns/op
DoubleBitConversion.longBitsToDouble_zero     avgt    9  96.375 ± 2.150  ns/op
FloatBitConversion.floatToIntBits_NaN         avgt    9   0.542 ± 0.001  ns/op
FloatBitConversion.floatToIntBits_one         avgt    9  95.868 ± 2.192  ns/op
FloatBitConversion.floatToIntBits_zero        avgt    9  97.377 ± 2.346  ns/op
FloatBitConversion.floatToRawIntBits_NaN      avgt    9  95.947 ± 2.211  ns/op
FloatBitConversion.floatToRawIntBits_one      avgt    9  97.705 ± 3.467  ns/op
FloatBitConversion.floatToRawIntBits_zero     avgt    9  96.052 ± 2.359  ns/op
FloatBitConversion.intBitsToFloat_NaN         avgt    9  98.793 ± 1.997  ns/op
FloatBitConversion.intBitsToFloat_one         avgt    9  97.201 ± 2.327  ns/op
FloatBitConversion.intBitsToFloat_zero        avgt    9  97.515 ± 1.939  ns/op

# Patched, -XX:-InlineMathNatives
DoubleBitConversion.doubleToLongBits_NaN      avgt    9   0.598 ± 0.025  ns/op
DoubleBitConversion.doubleToLongBits_one      avgt    9   4.508 ± 0.318  ns/op
DoubleBitConversion.doubleToLongBits_zero     avgt    9   4.370 ± 0.003  ns/op
DoubleBitConversion.doubleToRawLongBits_NaN   avgt    9   4.285 ± 0.295  ns/op
DoubleBitConversion.doubleToRawLongBits_one   avgt    9   4.281 ± 0.331  ns/op
DoubleBitConversion.doubleToRawLongBits_zero  avgt    9   4.155 ± 0.311  ns/op
DoubleBitConversion.longBitsToDouble_NaN      avgt    9   4.592 ± 0.362  ns/op
DoubleBitConversion.longBitsToDouble_one      avgt    9   4.815 ± 0.038  ns/op
DoubleBitConversion.longBitsToDouble_zero     avgt    9   4.800 ± 0.019  ns/op
FloatBitConversion.floatToIntBits_NaN         avgt    9   0.542 ± 0.001  ns/op
FloatBitConversion.floatToIntBits_one         avgt    9   4.510 ± 0.322  ns/op
FloatBitConversion.floatToIntBits_zero        avgt    9   4.501 ± 0.332  ns/op
FloatBitConversion.floatToRawIntBits_NaN      avgt    9   4.280 ± 0.336  ns/op
FloatBitConversion.floatToRawIntBits_one      avgt    9   4.278 ± 0.320  ns/op
FloatBitConversion.floatToRawIntBits_zero     avgt    9   4.144 ± 0.329  ns/op
FloatBitConversion.intBitsToFloat_NaN         avgt    9   4.551 ± 0.329  ns/op
FloatBitConversion.intBitsToFloat_one         avgt    9   4.549 ± 0.327  ns/op
FloatBitConversion.intBitsToFloat_zero        avgt    9   4.676 ± 0.328  ns/op
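
For readers less familiar with what these conversions do with signalling NaNs, here is a minimal Java sketch. It is not taken from the benchmark or from the patch: the class name, the helper `isQuietBitSet`, and the chosen payload `0x7FF0000000000001L` are assumptions made for illustration. It shows that `Double.doubleToLongBits` collapses every NaN to the canonical pattern, while the raw variant is the one that can expose the original bits. The `doubleToLongBits_NaN` and `floatToIntBits_NaN` rows staying fast even with `-XX:-InlineMathNatives` would be consistent with the canonicalization happening in Java code before any native call, though that reading is an assumption rather than something measured here.

```java
public class SignallingNaNDemo {
    // Illustrative signalling NaN for IEEE 754 binary64: exponent all ones,
    // quiet bit (bit 51) clear, non-zero payload. The payload is an arbitrary choice.
    static final long SNAN_BITS = 0x7FF0000000000001L;

    // Canonical NaN pattern that Double.doubleToLongBits returns for any NaN input.
    static final long CANONICAL_NAN_BITS = 0x7FF8000000000000L;

    // Hypothetical helper: reports whether the quiet bit is set in a binary64 pattern.
    static boolean isQuietBitSet(long bits) {
        return (bits & 0x0008000000000000L) != 0;
    }

    public static void main(String[] args) {
        double snan = Double.longBitsToDouble(SNAN_BITS);

        // Any NaN, signalling or quiet, satisfies isNaN.
        System.out.println("isNaN:   " + Double.isNaN(snan));

        // doubleToLongBits canonicalizes NaNs, so the payload is lost here.
        System.out.println("cooked:  0x" + Long.toHexString(Double.doubleToLongBits(snan)));

        // doubleToRawLongBits makes no such promise; whether the signalling
        // pattern survives the round trip is platform dependent, which is
        // the x87 peculiarity the guarded block was written for.
        System.out.println("raw:     0x" + Long.toHexString(Double.doubleToRawLongBits(snan)));
    }
}
```

The raw result is deliberately printed rather than asserted, since the `longBitsToDouble` specification allows some platforms to quiet a signalling NaN on the way through.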
-------------
PR Comment: https://git.openjdk.org/jdk/pull/22446#issuecomment-2506638455