RFR: 8346236: Auto vectorization support for various Float16 operations
Jatin Bhateja
jbhateja at openjdk.org
Wed Feb 26 20:50:51 UTC 2025
This is a follow-up PR for https://github.com/openjdk/jdk/pull/22754
The patch adds support for vectorizing various Float16 scalar operations (add, subtract, divide, multiply, sqrt, and fma).
Summary of changes included in the patch:
1. New C2 vector IR node creation.
2. Auto-vectorization support.
3. x86 backend implementation.
4. New IR verification tests for each newly supported vector operation.
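For context, below is a minimal sketch of the kind of scalar half-precision loop these changes target. Note this is an illustration, not code from the patch: the actual benchmark uses the incubating Float16 class, while this sketch emulates FP16 semantics with java.lang.Float's floatToFloat16/float16ToFloat helpers (standard since JDK 20) so it is self-contained. With the patch, C2 can turn a per-element FP16 operation like this into packed vector FP16 instructions on supporting hardware:

```java
// Scalar half-precision add loop of the shape the auto-vectorizer
// recognizes. FP16 values are carried as raw short bit patterns.
public class Fp16AddSketch {
    static void addFp16(short[] dst, short[] a, short[] b) {
        for (int i = 0; i < dst.length; i++) {
            // Widen each FP16 lane to float, add, then round back to FP16.
            float sum = Float.float16ToFloat(a[i]) + Float.float16ToFloat(b[i]);
            dst[i] = Float.floatToFloat16(sum);
        }
    }

    public static void main(String[] args) {
        short one = Float.floatToFloat16(1.0f);
        short two = Float.floatToFloat16(2.0f);
        short[] a = {one, one}, b = {two, two}, dst = new short[2];
        addFp16(dst, a, b);
        System.out.println(Float.float16ToFloat(dst[0]));  // 3.0
    }
}
```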
The following are performance numbers for Float16OperationsBenchmark.
System: Intel(R) Xeon(R) processor, code-named Granite Rapids
Frequency fixed at 2.5 GHz
Baseline:
Benchmark (vectorDim) Mode Cnt Score Error Units
Float16OperationsBenchmark.absBenchmark 1024 thrpt 2 4191.787 ops/ms
Float16OperationsBenchmark.addBenchmark 1024 thrpt 2 1211.978 ops/ms
Float16OperationsBenchmark.cosineSimilarityDequantizedFP16 1024 thrpt 2 493.026 ops/ms
Float16OperationsBenchmark.cosineSimilarityDoubleRoundingFP16 1024 thrpt 2 612.430 ops/ms
Float16OperationsBenchmark.cosineSimilaritySingleRoundingFP16 1024 thrpt 2 616.012 ops/ms
Float16OperationsBenchmark.divBenchmark 1024 thrpt 2 604.882 ops/ms
Float16OperationsBenchmark.dotProductFP16 1024 thrpt 2 410.798 ops/ms
Float16OperationsBenchmark.euclideanDistanceDequantizedFP16 1024 thrpt 2 602.863 ops/ms
Float16OperationsBenchmark.euclideanDistanceFP16 1024 thrpt 2 640.348 ops/ms
Float16OperationsBenchmark.fmaBenchmark 1024 thrpt 2 809.175 ops/ms
Float16OperationsBenchmark.getExponentBenchmark 1024 thrpt 2 2682.764 ops/ms
Float16OperationsBenchmark.isFiniteBenchmark 1024 thrpt 2 3373.901 ops/ms
Float16OperationsBenchmark.isFiniteCMovBenchmark 1024 thrpt 2 1881.652 ops/ms
Float16OperationsBenchmark.isFiniteStoreBenchmark 1024 thrpt 2 2273.745 ops/ms
Float16OperationsBenchmark.isInfiniteBenchmark 1024 thrpt 2 2147.913 ops/ms
Float16OperationsBenchmark.isInfiniteCMovBenchmark 1024 thrpt 2 1962.579 ops/ms
Float16OperationsBenchmark.isInfiniteStoreBenchmark 1024 thrpt 2 1696.494 ops/ms
Float16OperationsBenchmark.isNaNBenchmark 1024 thrpt 2 2417.396 ops/ms
Float16OperationsBenchmark.isNaNCMovBenchmark 1024 thrpt 2 1708.585 ops/ms
Float16OperationsBenchmark.isNaNStoreBenchmark 1024 thrpt 2 2055.511 ops/ms
Float16OperationsBenchmark.maxBenchmark 1024 thrpt 2 1211.940 ops/ms
Float16OperationsBenchmark.minBenchmark 1024 thrpt 2 1212.063 ops/ms
Float16OperationsBenchmark.mulBenchmark 1024 thrpt 2 1211.955 ops/ms
Float16OperationsBenchmark.negateBenchmark 1024 thrpt 2 4215.922 ops/ms
Float16OperationsBenchmark.sqrtBenchmark 1024 thrpt 2 337.606 ops/ms
Float16OperationsBenchmark.subBenchmark 1024 thrpt 2 1212.467 ops/ms
With the patch:
Benchmark (vectorDim) Mode Cnt Score Error Units
Float16OperationsBenchmark.absBenchmark 1024 thrpt 2 28481.336 ops/ms
Float16OperationsBenchmark.addBenchmark 1024 thrpt 2 21311.633 ops/ms
Float16OperationsBenchmark.cosineSimilarityDequantizedFP16 1024 thrpt 2 489.324 ops/ms
Float16OperationsBenchmark.cosineSimilarityDoubleRoundingFP16 1024 thrpt 2 592.947 ops/ms
Float16OperationsBenchmark.cosineSimilaritySingleRoundingFP16 1024 thrpt 2 616.415 ops/ms
Float16OperationsBenchmark.divBenchmark 1024 thrpt 2 1991.958 ops/ms
Float16OperationsBenchmark.dotProductFP16 1024 thrpt 2 586.924 ops/ms
Float16OperationsBenchmark.euclideanDistanceDequantizedFP16 1024 thrpt 2 747.626 ops/ms
Float16OperationsBenchmark.euclideanDistanceFP16 1024 thrpt 2 635.823 ops/ms
Float16OperationsBenchmark.fmaBenchmark 1024 thrpt 2 15722.304 ops/ms
Float16OperationsBenchmark.getExponentBenchmark 1024 thrpt 2 2685.930 ops/ms
Float16OperationsBenchmark.isFiniteBenchmark 1024 thrpt 2 3455.726 ops/ms
Float16OperationsBenchmark.isFiniteCMovBenchmark 1024 thrpt 2 2026.590 ops/ms
Float16OperationsBenchmark.isFiniteStoreBenchmark 1024 thrpt 2 2265.065 ops/ms
Float16OperationsBenchmark.isInfiniteBenchmark 1024 thrpt 2 2140.280 ops/ms
Float16OperationsBenchmark.isInfiniteCMovBenchmark 1024 thrpt 2 2026.135 ops/ms
Float16OperationsBenchmark.isInfiniteStoreBenchmark 1024 thrpt 2 1340.694 ops/ms
Float16OperationsBenchmark.isNaNBenchmark 1024 thrpt 2 2432.249 ops/ms
Float16OperationsBenchmark.isNaNCMovBenchmark 1024 thrpt 2 1710.044 ops/ms
Float16OperationsBenchmark.isNaNStoreBenchmark 1024 thrpt 2 2055.544 ops/ms
Float16OperationsBenchmark.maxBenchmark 1024 thrpt 2 22170.178 ops/ms
Float16OperationsBenchmark.minBenchmark 1024 thrpt 2 21735.692 ops/ms
Float16OperationsBenchmark.mulBenchmark 1024 thrpt 2 22235.991 ops/ms
Float16OperationsBenchmark.negateBenchmark 1024 thrpt 2 27733.529 ops/ms
Float16OperationsBenchmark.sqrtBenchmark 1024 thrpt 2 1770.878 ops/ms
Float16OperationsBenchmark.subBenchmark 1024 thrpt 2 21800.058 ops/ms
The Java implementation of Float16.isNaN is not auto-vectorizer friendly: the presence of multiple conditional expressions prevents inferring a conditional-compare IR. Vectorization of the Java implementations of the Float16.isFinite and Float16.isInfinite APIs is possible by inferring a VectorBlend for a contiguous pack of CMoveI IR nodes in the presence of the -XX:+UseVectorCmov and -XX:+UseCMoveUnconditionally runtime flags. We plan to optimize these APIs through scalar intrinsification and subsequent auto-vectorization support in a follow-up patch.
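To illustrate why multiple conditionals hinder vectorization, the sketch below contrasts a branchy FP16 NaN check (two conditional expressions on the exponent and significand fields) with an equivalent single unsigned-compare bit test that maps directly to a vector compare. This is an illustrative sketch of the FP16 encoding, not the actual Float16.isNaN implementation:

```java
// FP16 layout: 1 sign bit, 5 exponent bits, 10 significand bits.
// A value is NaN iff the exponent is all ones AND the significand is
// non-zero, i.e. (bits & 0x7FFF) is strictly greater than 0x7C00.
public class Fp16NaNSketch {
    // Branchy form: two conditional expressions, the shape the
    // auto-vectorizer cannot fold into a single conditional-compare IR.
    static boolean isNaNBranchy(short bits) {
        int exp = (bits >> 10) & 0x1F;   // extract 5-bit exponent
        int sig = bits & 0x3FF;          // extract 10-bit significand
        return exp == 0x1F && sig != 0;
    }

    // Branchless form: one compare on the magnitude bits, which is
    // straightforward to express as a vector compare.
    static boolean isNaNBranchless(short bits) {
        return (bits & 0x7FFF) > 0x7C00;
    }

    public static void main(String[] args) {
        short nan = Float.floatToFloat16(Float.NaN);
        short inf = Float.floatToFloat16(Float.POSITIVE_INFINITY);
        System.out.println(isNaNBranchy(nan) + " " + isNaNBranchless(nan));  // true true
        System.out.println(isNaNBranchy(inf) + " " + isNaNBranchless(inf));  // false false
    }
}
```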
Kindly review and share your feedback.
Best Regards,
Jatin
-------------
Commit messages:
- Updating benchmark
- Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8346236
- Updating copyright
- Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8346236
- Add MinVHF/MaxVHF to commutative op list
- Auto Vectorization support for Float16 operations.
Changes: https://git.openjdk.org/jdk/pull/22755/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=22755&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8346236
Stats: 864 lines in 16 files changed: 801 ins; 10 del; 53 mod
Patch: https://git.openjdk.org/jdk/pull/22755.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/22755/head:pull/22755
PR: https://git.openjdk.org/jdk/pull/22755
More information about the hotspot-compiler-dev mailing list