RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v8]

Thu May 19 05:24:43 UTC 2022

On Thu, 19 May 2022 00:06:00 GMT, Srinivas Vamsi Parasa <duke at openjdk.java.net> wrote:

>> We develop optimized x86_64 intrinsics for the floating point class check methods `isNaN()`, `isFinite()` and `isInfinite()` for the Float and Double classes. JMH benchmarks show ~6x improvement for `isNaN()`, ~2x improvement for `isInfinite()` and 40% gain for `isFinite()` using `vfpclasss(s/d)` instructions.
>> 
>> 
>> JMH Benchmark (ns/op)              Baseline   This PR (WITH vfpclassss/sd)      Speedup
>> 
>> FloatClassCheck.testIsFinite       0.559      0.4                               1.4x
>> FloatClassCheck.testIsInfinite     0.828      0.386                             2.15x
>> FloatClassCheck.testIsNaN          2.589      0.387                             6.7x
>> DoubleClassCheck.testIsFinite      0.568      0.414                             1.37x
>> DoubleClassCheck.testIsInfinite    0.836      0.395                             2.11x
>> DoubleClassCheck.testIsNaN         2.592      0.393                             6.6x
>> 
>> JMH Benchmark (ns/op)              Baseline   This PR (WITHOUT vfpclassss/sd)   Speedup
>> 
>> FloatClassCheck.testIsFinite       0.561      0.468                             1.2x
>> FloatClassCheck.testIsInfinite     0.793      0.491                             1.61x
>> FloatClassCheck.testIsNaN          2.587      0.469                             5.5x
>> DoubleClassCheck.testIsFinite      0.561      0.592                             0.94x
>> DoubleClassCheck.testIsInfinite    0.828      0.592                             1.4x
>> DoubleClassCheck.testIsNaN         2.593      0.594                             4.4x
>
> Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits:
> 
>  - add comment for vfpclasss/d for isFinite()
>  - Merge branch 'master' of https://git.openjdk.java.net/jdk into float
>  - zero out the upper bits not written by setb
>  - use 0x1 to be simpler
>  - remove the redundant temp register
>  - Split the macros using predicate
>  - update jmh tests
>  - Merge branch 'master' into float
>  - 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite

That sounds strange; please show the asm generated for your patch with respect to the benchmark. Also, please try cmoving between other arbitrary values, such as 19 and 7, instead of `false` and `true`. The latter may be recognised as a simple boolean not operation, which removes the real comparison and defeats the purpose of Vladimir's suggestion.
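
To make the request concrete, the two variants I have in mind look roughly like this. The class and method names are hypothetical and are not taken from your JMH test; this is only a sketch of the kernels, without the JMH harness around them:

```java
// Hypothetical benchmark kernels; names are illustrative only.
final class CMoveKernels {
    // Selects between false and true. Because the result of isNaN is
    // already a boolean, the compiler may fold this into a boolean not.
    static boolean selectBoolean(float x) {
        return Float.isNaN(x) ? false : true;
    }

    // Selects between two arbitrary ints. This cannot be expressed as a
    // boolean not, so the comparison should have to be materialised.
    static int selectArbitrary(float x) {
        return Float.isNaN(x) ? 19 : 7;
    }
}
```

Comparing the generated code for the two kernels should show whether the comparison survives in the second one.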

Regarding vectorisation, `isNaN` is a simple comparison and can be easily auto-vectorised without help from intrinsics.
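
As a sketch of the kind of loop I mean (the names are hypothetical, and whether C2 actually vectorises it depends on the platform and the build):

```java
// Hypothetical example; not part of the patch or its tests.
final class NaNLoop {
    // Marks NaN elements using only the plain x != x comparison,
    // which is true exactly when x is NaN.
    static void markNaNs(float[] in, boolean[] out) {
        for (int i = 0; i < in.length; i++) {
            float x = in[i];
            out[i] = x != x;
        }
    }
}
```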

My speculation:

A native comparison such as `x != x` can be parsed directly by the compiler. As a result, the graph of the expression `if (x != x)` is simply

    CmpF
     |
    Bool
     |
     If

Your intrinsics, on the other hand, do not return their results in the flags, which leads to an extra comparison when they are used in conditions: `if (isNaN(x))` becomes

    IsNaN    0
        \   /
        CmpI
         |
        Bool
         |
         If

In your benchmark, however, this comparison is used to cmove between 0 and 1 (`false` and `true`), so the compiler recognises the pattern `x != 0 ? 0 : 1` with `x` having the type `TypeInt::BOOL`. As a result, it reduces the graph to

    IsNaN    1
        \   /
        XorI

Personally, I'm not into this implementation of the intrinsics. FYI, gcc and clang both use sequences similar to `x != x` for `std::isnan`, `Math.abs(x) <= MAX_VALUE` for `std::isfinite` and `Math.abs(x) > MAX_VALUE` for `std::isinf`. The first one reduces to a single instruction, `ucomiss x, x`, so there is no reason to optimise it further. The others compile down to 2 instructions each, `vandps t, x, [SIGN_ELIMINATE]; ucomiss t, [MAX_VALUE]`, so optimising these further requires careful assessment.
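
Written out in Java for the double case, those three formulations are roughly the following. The class name is hypothetical; the expressions are the ones listed above:

```java
// Hypothetical helper class; shows the plain-Java formulations only.
final class PlainClassChecks {
    // NaN is the only value that compares unequal to itself.
    static boolean isNaN(double x) {
        return x != x;
    }

    // False for both infinities, and for NaN because every ordered
    // comparison involving NaN is false.
    static boolean isFinite(double x) {
        return Math.abs(x) <= Double.MAX_VALUE;
    }

    // True only for the two infinities; NaN again compares false.
    static boolean isInfinite(double x) {
        return Math.abs(x) > Double.MAX_VALUE;
    }
}
```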

If you feel comfortable, I would suggest building the graph for these intrinsics as

     X
     |
    Bool   1    0
       \   |   /
         CMove

Then we can add ideal rules to `BoolNode` to recognise the patterns

     X
     |
    Bool   1    0
       \   |   /
         CMove     0
              \   /
               CmpI
                |
               Bool

And reduce them to

     X
     |
    Bool
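
To illustrate the shape of that rule, here is a toy model in Java. None of these classes are C2's real node classes, everything here is hypothetical, and I assume the CMove inputs read as (condition, value if true, value if false) with only eq/ne tests modelled:

```java
// Toy IR for illustration only; not HotSpot code.
final class BoolRewrite {
    enum Test {
        EQ, NE;
        Test negate() { return this == EQ ? NE : EQ; }
    }

    static abstract class Node {}

    // Stands for the test-producing node X (for example a CmpF).
    static final class Leaf extends Node {
        final String name;
        Leaf(String name) { this.name = name; }
    }

    static final class ConI extends Node {
        final int value;
        ConI(int value) { this.value = value; }
    }

    static final class Bool extends Node {
        final Node in;
        final Test test;
        Bool(Node in, Test test) { this.in = in; this.test = test; }
    }

    // Assumed input order: condition, value if true, value if false.
    static final class CMove extends Node {
        final Bool cond;
        final Node ifTrue;
        final Node ifFalse;
        CMove(Bool cond, Node ifTrue, Node ifFalse) {
            this.cond = cond; this.ifTrue = ifTrue; this.ifFalse = ifFalse;
        }
    }

    static final class CmpI extends Node {
        final Node left;
        final Node right;
        CmpI(Node left, Node right) { this.left = left; this.right = right; }
    }

    private static boolean isCon(Node n, int value) {
        return n instanceof ConI && ((ConI) n).value == value;
    }

    // Rewrites Bool(CmpI(CMove(b, 1, 0), 0)) to b (or to its negation).
    // Returns the input unchanged when the pattern does not match.
    static Bool simplify(Bool outer) {
        if (!(outer.in instanceof CmpI)) return outer;
        CmpI cmp = (CmpI) outer.in;
        if (!(cmp.left instanceof CMove) || !isCon(cmp.right, 0)) return outer;
        CMove cmove = (CMove) cmp.left;
        if (!isCon(cmove.ifTrue, 1) || !isCon(cmove.ifFalse, 0)) return outer;
        Bool inner = cmove.cond;
        // (b ? 1 : 0) != 0 is b itself; (b ? 1 : 0) == 0 is the negation of b.
        return outer.test == Test.NE ? inner : new Bool(inner.in, inner.test.negate());
    }
}
```

In this toy, an outer `ne` test hands back the inner Bool unchanged, while an outer `eq` test produces the inner Bool with its test negated. A real rule would also have to handle the other test kinds.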

With this, we can have the `Double::isInfinite` intrinsic compiled down to `vfpclass k, x; ktest k`, which is much preferable. For non-AVX512DQ, though, I would prefer implementing them in Java, similar to what is described above. Neither abs nor comparison nodes are hard to vectorise, so that would not be a problem.

Thanks a lot.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8459