RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v8]

Thu May 19 20:28:48 UTC 2022

On Thu, 19 May 2022 05:22:34 GMT, Quan Anh Mai <duke at openjdk.java.net> wrote:

> It sounds strange, please show the asm of your patch with respects to the benchmark. Also, please try cmoving with other arbitrary values such as 19 and 7 instead of `false` and `true`. The latter may be recognised as simple boolean not operation, remove the real comparison part, which defeats the purpose of Vladimir's suggestion.
> 
> Regarding vectorisation, `isNaN` is a simple comparison and can be easily auto-vectorised without help from intrinsics.
> 
> My speculation:
> 
> A native comparison such as `x != x` can be parsed directly by the compiler. As a result, the graph of the expression `if (x != x)` is simply
> 
> ```
> CmpF
>  |
> Bool
>  |
>  If
> ```
> 
> Your intrinsics, on the other hand, do not return the results on the flags, which leads to an extra comparison when using in conditions, `if(isNaN(x))` becomes
> 
> ```
> IsNaN    0
>     \   /
>     CmpI
>      |
>     Bool
>      |
>      If
> ```
> 
> In your benchmark, however, using this comparison to cmoving between 0 and 1 (`false` and `true`), the compiler recognised the pattern `x != 0 ? 0 : 1` with `x` having the type of `TypeInt::BOOL`. As a result, it reduces the graph into
> 
> ```
> IsNaN    1
>     \   /
>     XorI
> ```
> 
> Personally, I'm not into this implementation of intrinsics. FYI, gcc and clang both use sequences similar to `x != x` for `std::isnan`, `Math.abs(x) <= MAX_VALUE` for `std::isfinite` and `Math.abs(x) > MAX_VALUE` for `std::isinf`. The first one reduces to a single instruction `ucomiss x, x` so there is no reason to optimise further. The others are compiled down to 2 instructions each `vandpd t, x, [SIGN_ELIMINATE]; ucomiss t, [MAX_VALUE]`, so to optimise these further requires careful assessments.
> 
> If you feel comfortable I would suggest you build the graph for these intrinsics as
> 
> ```
>  X
>  |
> Bool   1    0
>    \   |   /
>      CMove
> ```
> 
> Then we can add ideal rules to `BoolNode` to recognise the patterns
> 
> ```
>  X
>  |
> Bool   1    0
>    \   |   /
>      CMove     0
>           \   /
>            CmpI
>             |
>            Bool
> ```
> 
> And reduce them to
> 
> ```
>  X
>  |
> Bool
> ```
> 
> With this, we can have the `Double::isInfinite` intrinsics compiled down to `vfpclass k, x; ktest k`, which is much more preferable. For non-AVX512DQ though I would prefer implementing them in Java similar to described above. Both abs and comparison nodes are not hard to be vectorised so it would not be a problem.
> 
> Thanks a lot.

As per the performance data, 
- for `isInfinite()`, the intrinsic is giving 60% speedup for `Float` (and 40% for `Double`) compared to the Java implementation.
- for` isFinite()`, the intrinsic is giving 20% speedup for `Float` (and 6% slowdown for `Double`) instead of the Java implementation (`return Math.abs(f) <= Float.MAX_VALUE`). 
- for `isNan()`, the flags fixup is skewing the results in favor of the intrinsic (5x for` Float `and 4x for `Double`) but you're saying that your patch will make this intrinsic approach unnecessary. 

I agree with your logic, but we also need to back it up with performance data. Could you please run my benchmark for `isNaN() `with your patch and provide the performance data? That way we can quantify the performance of your proposed optimization of `isNaN()`.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8459