RFR: 8346236: Auto vectorization support for various Float16 operations
Jatin Bhateja
jbhateja at openjdk.org
Wed Feb 26 20:50:51 UTC 2025
This is a follow-up PR for https://github.com/openjdk/jdk/pull/22754
The patch adds support for vectorizing various Float16 scalar operations (add, subtract, divide, multiply, sqrt, and fma).
Summary of changes included in the patch:
1. New C2 vector IR node creation.
2. Auto-vectorization support.
3. x86 backend implementation.
4. New IR verification tests for each newly supported vector operation.
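For context, below is a minimal sketch of the kind of scalar half-precision loop these changes target. Note this is an illustration, not code from the patch: the actual benchmark uses the incubating Float16 class, while this sketch emulates FP16 semantics with java.lang.Float's floatToFloat16/float16ToFloat helpers (standard since JDK 20) so it is self-contained. With the patch, C2 can turn a per-element FP16 operation like this into packed vector FP16 instructions on supporting hardware:

```java
// Scalar half-precision add loop of the shape the auto-vectorizer
// recognizes. FP16 values are carried as raw short bit patterns.
public class Fp16AddSketch {
    static void addFp16(short[] dst, short[] a, short[] b) {
        for (int i = 0; i < dst.length; i++) {
            // Widen each FP16 lane to float, add, then round back to FP16.
            float sum = Float.float16ToFloat(a[i]) + Float.float16ToFloat(b[i]);
            dst[i] = Float.floatToFloat16(sum);
        }
    }

    public static void main(String[] args) {
        short one = Float.floatToFloat16(1.0f);
        short two = Float.floatToFloat16(2.0f);
        short[] a = {one, one}, b = {two, two}, dst = new short[2];
        addFp16(dst, a, b);
        System.out.println(Float.float16ToFloat(dst[0]));  // 3.0
    }
}
```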
The following are performance numbers for Float16OperationsBenchmark.
System: Intel(R) Xeon(R) processor, code-named Granite Rapids
Frequency fixed at 2.5 GHz
Baseline:
Benchmark (vectorDim) Mode Cnt Score Error Units
Float16OperationsBenchmark.absBenchmark 1024 thrpt 2 4191.787 ops/ms
Float16OperationsBenchmark.addBenchmark 1024 thrpt 2 1211.978 ops/ms
Float16OperationsBenchmark.cosineSimilarityDequantizedFP16 1024 thrpt 2 493.026 ops/ms
Float16OperationsBenchmark.cosineSimilarityDoubleRoundingFP16 1024 thrpt 2 612.430 ops/ms
Float16OperationsBenchmark.cosineSimilaritySingleRoundingFP16 1024 thrpt 2 616.012 ops/ms
Float16OperationsBenchmark.divBenchmark 1024 thrpt 2 604.882 ops/ms
Float16OperationsBenchmark.dotProductFP16 1024 thrpt 2 410.798 ops/ms
Float16OperationsBenchmark.euclideanDistanceDequantizedFP16 1024 thrpt 2 602.863 ops/ms
Float16OperationsBenchmark.euclideanDistanceFP16 1024 thrpt 2 640.348 ops/ms
Float16OperationsBenchmark.fmaBenchmark 1024 thrpt 2 809.175 ops/ms
Float16OperationsBenchmark.getExponentBenchmark 1024 thrpt 2 2682.764 ops/ms
Float16OperationsBenchmark.isFiniteBenchmark 1024 thrpt 2 3373.901 ops/ms
Float16OperationsBenchmark.isFiniteCMovBenchmark 1024 thrpt 2 1881.652 ops/ms
Float16OperationsBenchmark.isFiniteStoreBenchmark 1024 thrpt 2 2273.745 ops/ms
Float16OperationsBenchmark.isInfiniteBenchmark 1024 thrpt 2 2147.913 ops/ms
Float16OperationsBenchmark.isInfiniteCMovBenchmark 1024 thrpt 2 1962.579 ops/ms
Float16OperationsBenchmark.isInfiniteStoreBenchmark 1024 thrpt 2 1696.494 ops/ms
Float16OperationsBenchmark.isNaNBenchmark 1024 thrpt 2 2417.396 ops/ms
Float16OperationsBenchmark.isNaNCMovBenchmark 1024 thrpt 2 1708.585 ops/ms
Float16OperationsBenchmark.isNaNStoreBenchmark 1024 thrpt 2 2055.511 ops/ms
Float16OperationsBenchmark.maxBenchmark 1024 thrpt 2 1211.940 ops/ms
Float16OperationsBenchmark.minBenchmark 1024 thrpt 2 1212.063 ops/ms
Float16OperationsBenchmark.mulBenchmark 1024 thrpt 2 1211.955 ops/ms
Float16OperationsBenchmark.negateBenchmark 1024 thrpt 2 4215.922 ops/ms
Float16OperationsBenchmark.sqrtBenchmark 1024 thrpt 2 337.606 ops/ms
Float16OperationsBenchmark.subBenchmark 1024 thrpt 2 1212.467 ops/ms
With the patch:
Benchmark (vectorDim) Mode Cnt Score Error Units
Float16OperationsBenchmark.absBenchmark 1024 thrpt 2 28481.336 ops/ms
Float16OperationsBenchmark.addBenchmark 1024 thrpt 2 21311.633 ops/ms
Float16OperationsBenchmark.cosineSimilarityDequantizedFP16 1024 thrpt 2 489.324 ops/ms
Float16OperationsBenchmark.cosineSimilarityDoubleRoundingFP16 1024 thrpt 2 592.947 ops/ms
Float16OperationsBenchmark.cosineSimilaritySingleRoundingFP16 1024 thrpt 2 616.415 ops/ms
Float16OperationsBenchmark.divBenchmark 1024 thrpt 2 1991.958 ops/ms
Float16OperationsBenchmark.dotProductFP16 1024 thrpt 2 586.924 ops/ms
Float16OperationsBenchmark.euclideanDistanceDequantizedFP16 1024 thrpt 2 747.626 ops/ms
Float16OperationsBenchmark.euclideanDistanceFP16 1024 thrpt 2 635.823 ops/ms
Float16OperationsBenchmark.fmaBenchmark 1024 thrpt 2 15722.304 ops/ms
Float16OperationsBenchmark.getExponentBenchmark 1024 thrpt 2 2685.930 ops/ms
Float16OperationsBenchmark.isFiniteBenchmark 1024 thrpt 2 3455.726 ops/ms
Float16OperationsBenchmark.isFiniteCMovBenchmark 1024 thrpt 2 2026.590 ops/ms
Float16OperationsBenchmark.isFiniteStoreBenchmark 1024 thrpt 2 2265.065 ops/ms
Float16OperationsBenchmark.isInfiniteBenchmark 1024 thrpt 2 2140.280 ops/ms
Float16OperationsBenchmark.isInfiniteCMovBenchmark 1024 thrpt 2 2026.135 ops/ms
Float16OperationsBenchmark.isInfiniteStoreBenchmark 1024 thrpt 2 1340.694 ops/ms
Float16OperationsBenchmark.isNaNBenchmark 1024 thrpt 2 2432.249 ops/ms
Float16OperationsBenchmark.isNaNCMovBenchmark 1024 thrpt 2 1710.044 ops/ms
Float16OperationsBenchmark.isNaNStoreBenchmark 1024 thrpt 2 2055.544 ops/ms
Float16OperationsBenchmark.maxBenchmark 1024 thrpt 2 22170.178 ops/ms
Float16OperationsBenchmark.minBenchmark 1024 thrpt 2 21735.692 ops/ms
Float16OperationsBenchmark.mulBenchmark 1024 thrpt 2 22235.991 ops/ms
Float16OperationsBenchmark.negateBenchmark 1024 thrpt 2 27733.529 ops/ms
Float16OperationsBenchmark.sqrtBenchmark 1024 thrpt 2 1770.878 ops/ms
Float16OperationsBenchmark.subBenchmark 1024 thrpt 2 21800.058 ops/ms
The Java implementation of Float16.isNaN is not auto-vectorizer friendly: the presence of multiple conditional expressions prevents inferring a conditional-compare IR. Vectorization of the Java implementations of the Float16.isFinite and Float16.isInfinite APIs is possible by inferring a VectorBlend for a contiguous pack of CMoveI IR nodes in the presence of the -XX:+UseVectorCmov and -XX:+UseCMoveUnconditionally runtime flags. We plan to optimize these APIs through scalar intrinsification and subsequent auto-vectorization support in a follow-up patch.
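To illustrate why multiple conditionals hinder vectorization, the sketch below contrasts a branchy FP16 NaN check (two conditional expressions on the exponent and significand fields) with an equivalent single unsigned-compare bit test that maps directly to a vector compare. This is an illustrative sketch of the FP16 encoding, not the actual Float16.isNaN implementation:

```java
// FP16 layout: 1 sign bit, 5 exponent bits, 10 significand bits.
// A value is NaN iff the exponent is all ones AND the significand is
// non-zero, i.e. (bits & 0x7FFF) is strictly greater than 0x7C00.
public class Fp16NaNSketch {
    // Branchy form: two conditional expressions, the shape the
    // auto-vectorizer cannot fold into a single conditional-compare IR.
    static boolean isNaNBranchy(short bits) {
        int exp = (bits >> 10) & 0x1F;   // extract 5-bit exponent
        int sig = bits & 0x3FF;          // extract 10-bit significand
        return exp == 0x1F && sig != 0;
    }

    // Branchless form: one compare on the magnitude bits, which is
    // straightforward to express as a vector compare.
    static boolean isNaNBranchless(short bits) {
        return (bits & 0x7FFF) > 0x7C00;
    }

    public static void main(String[] args) {
        short nan = Float.floatToFloat16(Float.NaN);
        short inf = Float.floatToFloat16(Float.POSITIVE_INFINITY);
        System.out.println(isNaNBranchy(nan) + " " + isNaNBranchless(nan));  // true true
        System.out.println(isNaNBranchy(inf) + " " + isNaNBranchless(inf));  // false false
    }
}
```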
Kindly review and share your feedback.
Best Regards,
Jatin
-------------
Commit messages:
- Updating benchmark
- Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8346236
- Updating copyright
- Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8346236
- Add MinVHF/MaxVHF to commutative op list
- Auto Vectorization support for Float16 operations.
Changes: https://git.openjdk.org/jdk/pull/22755/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=22755&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8346236
Stats: 864 lines in 16 files changed: 801 ins; 10 del; 53 mod
Patch: https://git.openjdk.org/jdk/pull/22755.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/22755/head:pull/22755
PR: https://git.openjdk.org/jdk/pull/22755
More information about the hotspot-compiler-dev mailing list