RFR: 8306302: C2 Superword fix: use VectorMaskCmp and VectorBlend instead of CMoveVF/D

Tue May 9 06:20:24 UTC 2023

**Bug**
In `x86`, `CMoveVF/D` were not correctly implemented for the `eq` and `neq` case (leads to assert). And the `lt/le/gt/ge` cases did not all handle `NaN's` correctly (ordered vs unordered comparision, leads to wrong results).

The assert gets triggered in the code from this change: [JDK-8285973](https://bugs.openjdk.org/browse/JDK-8285973)
On this line: https://github.com/openjdk/jdk/commit/c1db70d827f7ac81aa6c6646e2431f672c71c8dc#diff-e5266a3774f26ac663dcc67e0be18608b1735f38c0576673ce36e0cd689bab4aR4309

The problematic line wants to find a Cmp above the Bool, and compare its inputs. But we have no Cmp there, just a constant, that we have set during matching:
https://github.com/openjdk/jdk/blob/af4d5600e37ec6d331e62c5d37491ee97cad5311/src/hotspot/share/opto/matcher.cpp#L2394

The wrong results with `NaN` are because of a bug in `x`:
https://github.com/openjdk/jdk/commit/0485593fbc4a3264b79969de192e8e7d36e5b590#diff-7070c036c7d88ba4a8467e404d8d88aee646b97bf7bacc8b73cbc93f3ef11d2dR2106
The cases `lt` and `le` include the `-1` case, which shoud return `true` if any comparison input is a `NaN`, just as defined for java bytecode `fcmpl/dcmpl`. But they were mapped to ordered comparison codes, not unordered ones. More [here](https://bugs.openjdk.org/browse/JDK-8306302?focusedCommentId=14579078&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14579078).

**Solution**
@fg1417 suggested that `CMoveVF/D` is perfectly composed of `VectorMaskCmp + VectorBlend`. So instead of fixing `CMoveVF/D`, I replaced it. Performance should be the same, as it goes down to the same assembly instructions.

This has a few benefits:
 - `VectorMaskCmp + VectorBlend` is more powerful:
   - `CMoveVF/D` required the same inputs to the compare than to the move itself.
   - `CMoveVF/D` on x86 was only implemented for 32 bytes. Any other size would simply fail to vectorize.
   - `VectorMaskCmp` and `VectorBlend` can have different compare inputs, and even different types. For now, the input types must have the same data-width (`float` and `int`, `double` and `long`).
   - We need less code (I completely removed all code for `CMoveVF/D`).

I also moved the whole `CMove` code in `SuperWord` into `SuperWord::output`, rather than the complex code `SuperWord::merge_packs_to_cmove / CMoveKit`.

As reported in [JDK-8306088](https://bugs.openjdk.org/browse/JDK-8306088) https://github.com/openjdk/jdk/pull/13354, the CMove code did not properly maintain the `packset` / `my_pack`. I now added some verification here, since I also just removed the problematic code, and the verification passes now with this patch. I was also able to remove the unwanted `UseVectorCmov` in an assert.

Most of the changes come from the regression tests in `TestVectorConditionalMove.java`: I generalized it from `aarch64` to all platforms, the IR rules only apply with `avx/asimd`. And I added many new tests to cover the newly implemented cases. Further, I modified the tests to include `NaN's` among the random numbers, to verify that the ordered/unordered comparisions are correct.

**Discussion / Context / Future Work**

1. From what I understand, we currently never introduce a `CMoveF/D`, unless asked for by `UseCMoveUnconditionally` (`C->use_c_move()`). If the flag is set, we attribute no cost to the CMove, else we take `Matcher::float_cmove_cost()`, which seems to be `ConditionalMoveLimit`, and so the Phi is never converted into a CMove.

   An then if one wants to convert these scalar-CMove into a vector-CMove, one needs to activate the flag `UseVectorCmov`. @vnkozlov did some research: the goal was always to have this be on by default eventually.

   I see 2 paths here: either we obsolete `UseVectorCmov`, and implicitly have it on. Or we keep it, but make it by default on. I can do some performance measurements in a follow-up **RFE**.

2. I also saw that `int` and `long` are also CMove'd in `PhaseIdealLoop::conditional_move`. Especially `int` can currently be CMove'd without the `UseCMoveUnconditionally` flag. It would be nice to allow them to be vectorized. This is a small fix, but I'd like to do the testing and performance analysis for it. So a separate **RFE**. A slightly more involved idea: also allow cmp / blend with types of different widths (eg. compare `int` but cmove `double`). That would require a cast on the vector-mask.

3. It is a shame that scalar-CMove is on its own usually not profitable. But together with `SuperWord` it would be profitable. But if it is not scalar-CMove'd first, we fail to vectorize, since the loop has control-flow. It is one of my dreams: allow `SuperWord` to handle control flow, and to the `If-conversion` (CMove) directly with vectorization. Let me know if you have any thoughts or ideas.

-------------

Commit messages:
 - remove UseVectorCmov condition in assert from JDK-8304720
 - Merge branch 'master' into JDK-8306088
 - fix type for VectorMaskCmpNode
 - whitespaces: replaced tabs with spaces
 - small refactoring / improved comments
 - Restrict Cmp and Bool for CMove
 - Add NaN test, and fix the NaN bug (it also existed with CMoveVF/D)
 - remove useless and buggy cmpOp_vcmppd
 - Merge branch 'master' into JDK-8306088
 - UseVectorCmov can block Cmp in SuperWord::implemented
 - ... and 14 more: https://git.openjdk.org/jdk/compare/bb3e44d8...81d4de72

Changes: https://git.openjdk.org/jdk/pull/13493/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13493&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8306302
  Stats: 1370 lines in 13 files changed: 787 ins; 550 del; 33 mod
  Patch: https://git.openjdk.org/jdk/pull/13493.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/13493/head:pull/13493

PR: https://git.openjdk.org/jdk/pull/13493