[lworld+fp16] RFR: 8341003: [lworld+fp16] Benchmarks for various Float16 operations [v2]

Jatin Bhateja jbhateja at openjdk.org
Fri Sep 27 07:53:52 UTC 2024


On Fri, 27 Sep 2024 07:06:18 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

> > Hi @jatin-bhateja, thanks for doing the micros. Can I ask why you are benchmarking/testing the cosine similarity cases specifically? Are there any real-world use cases similar to these for FP16, for which you have written these smaller benchmark kernels?
> > Also, regarding the performance results you posted for the Intel machine, have you compared them with anything else (like the default FP32 implementation for FP16, the case without the intrinsics, or the scalar FP16 version) so that we can better interpret the scores?
> 
> Hi @Bhavana-Kilambi, this patch adds **micro-benchmarks** for all the Float16 APIs optimized so far. The **macro-benchmarks** demonstrate a use case for low-precision semantic-search primitives.
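For context, here is a minimal sketch of the kind of cosine-similarity kernel the macro-benchmarks exercise. This is not the PR's code: the package import and the Float16 method names (valueOf, fma, multiply, sqrt, divide, floatValue) are assumed from the API under review, and the class and method names are made up for illustration.

import jdk.incubator.vector.Float16;   // package assumed; adjust to where Float16 lives in this branch

public final class Fp16CosineSketch {
    // Cosine similarity over two half-precision vectors, accumulating in Float16.
    // Using fma() keeps each multiply-add to a single rounding step.
    static float cosineSimilarity(Float16[] a, Float16[] b) {
        Float16 dot   = Float16.valueOf(0.0);
        Float16 normA = Float16.valueOf(0.0);
        Float16 normB = Float16.valueOf(0.0);
        for (int i = 0; i < a.length; i++) {
            dot   = Float16.fma(a[i], b[i], dot);     // dot   += a[i] * b[i]
            normA = Float16.fma(a[i], a[i], normA);   // normA += a[i] * a[i]
            normB = Float16.fma(b[i], b[i], normB);   // normB += b[i] * b[i]
        }
        Float16 denom = Float16.multiply(Float16.sqrt(normA), Float16.sqrt(normB));
        return Float16.divide(dot, denom).floatValue();
    }
}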

Hey, for the baseline we should not pass --enable-preview, since that would prohibit the following:
- Flat layout of Float16 arrays.
- Creation of the Valhalla-specific IR needed for intrinsification.

Here are the first baseline numbers, gathered without --enable-preview.


Benchmark                                               (vectorDim)   Mode  Cnt     Score   Error   Units
Float16OpsBenchmark.absBenchmark                               1024  thrpt    2    99.424          ops/ms
Float16OpsBenchmark.addBenchmark                               1024  thrpt    2    97.498          ops/ms
Float16OpsBenchmark.cosineSimilarityDequantizedFP16            1024  thrpt    2   525.360          ops/ms
Float16OpsBenchmark.cosineSimilarityDoubleRoundingFP16         1024  thrpt    2    51.132          ops/ms
Float16OpsBenchmark.cosineSimilaritySingleRoundingFP16         1024  thrpt    2    46.921          ops/ms
Float16OpsBenchmark.divBenchmark                               1024  thrpt    2    97.186          ops/ms
Float16OpsBenchmark.euclideanDistanceDequantizedFP16           1024  thrpt    2   583.051          ops/ms
Float16OpsBenchmark.euclideanDistanceFP16                      1024  thrpt    2    56.133          ops/ms
Float16OpsBenchmark.fmaBenchmark                               1024  thrpt    2    81.386          ops/ms
Float16OpsBenchmark.getExponentBenchmark                       1024  thrpt    2  2257.619          ops/ms
Float16OpsBenchmark.isFiniteBenchmark                          1024  thrpt    2  3086.476          ops/ms
Float16OpsBenchmark.isInfiniteBenchmark                        1024  thrpt    2  1718.411          ops/ms
Float16OpsBenchmark.isNaNBenchmark                             1024  thrpt    2  1685.557          ops/ms
Float16OpsBenchmark.maxBenchmark                               1024  thrpt    2    92.078          ops/ms
Float16OpsBenchmark.minBenchmark                               1024  thrpt    2    63.377          ops/ms
Float16OpsBenchmark.mulBenchmark                               1024  thrpt    2    98.202          ops/ms
Float16OpsBenchmark.negateBenchmark                            1024  thrpt    2    98.158          ops/ms
Float16OpsBenchmark.sqrtBenchmark                              1024  thrpt    2    83.760          ops/ms
Float16OpsBenchmark.subBenchmark                               1024  thrpt    2    98.200          ops/ms
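
To help interpret the scores: each invocation presumably walks a vectorDim-sized (here 1024-element) Float16 array, so one "op" in the table corresponds to one full pass over the vector. Roughly, an element-wise micro such as addBenchmark would have the following shape. This is a sketch only; the JMH wiring and the Float16.add/valueOf names are assumed, and the class name is made up.

import org.openjdk.jmh.annotations.*;
import jdk.incubator.vector.Float16;   // package assumed, as above

@State(Scope.Thread)
public class Float16AddSketch {
    @Param({"1024"})
    int vectorDim;

    Float16[] src1, src2, dst;

    @Setup
    public void setup() {
        src1 = new Float16[vectorDim];
        src2 = new Float16[vectorDim];
        dst  = new Float16[vectorDim];
        for (int i = 0; i < vectorDim; i++) {
            src1[i] = Float16.valueOf((double) i);
            src2[i] = Float16.valueOf((double) (vectorDim - i));
        }
    }

    @Benchmark
    public void addBenchmark() {
        // One reported "op" = one pass over all vectorDim elements.
        for (int i = 0; i < vectorDim; i++) {
            dst[i] = Float16.add(src1[i], src2[i]);
        }
    }
}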


The following are the numbers where we do allow the flat array layout but disable only the intrinsics (-XX:DisableIntrinsic=<intrinsic_id>).



Benchmark                                               (vectorDim)   Mode  Cnt      Score   Error   Units
Float16OpsBenchmark.absBenchmark                               1024  thrpt    2  25978.876          ops/ms
Float16OpsBenchmark.addBenchmark                               1024  thrpt    2   6406.685          ops/ms
Float16OpsBenchmark.cosineSimilarityDequantizedFP16            1024  thrpt    2    528.877          ops/ms
Float16OpsBenchmark.cosineSimilarityDoubleRoundingFP16         1024  thrpt    2     76.680          ops/ms
Float16OpsBenchmark.cosineSimilaritySingleRoundingFP16         1024  thrpt    2     53.692          ops/ms
Float16OpsBenchmark.divBenchmark                               1024  thrpt    2   3227.037          ops/ms
Float16OpsBenchmark.euclideanDistanceDequantizedFP16           1024  thrpt    2    740.490          ops/ms
Float16OpsBenchmark.euclideanDistanceFP16                      1024  thrpt    2     83.747          ops/ms
Float16OpsBenchmark.fmaBenchmark                               1024  thrpt    2    256.399          ops/ms
Float16OpsBenchmark.getExponentBenchmark                       1024  thrpt    2   2135.678          ops/ms
Float16OpsBenchmark.isFiniteBenchmark                          1024  thrpt    2   3916.860          ops/ms
Float16OpsBenchmark.isInfiniteBenchmark                        1024  thrpt    2   1497.417          ops/ms
Float16OpsBenchmark.isNaNBenchmark                             1024  thrpt    2   2747.704          ops/ms
Float16OpsBenchmark.maxBenchmark                               1024  thrpt    2   3625.708          ops/ms
Float16OpsBenchmark.minBenchmark                               1024  thrpt    2   3628.261          ops/ms
Float16OpsBenchmark.mulBenchmark                               1024  thrpt    2   6340.403          ops/ms
Float16OpsBenchmark.negateBenchmark                            1024  thrpt    2  25727.870          ops/ms
Float16OpsBenchmark.sqrtBenchmark                              1024  thrpt    2    157.519          ops/ms
Float16OpsBenchmark.subBenchmark                               1024  thrpt    2   6404.047          ops/ms
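
For the second run above, one way to disable the intrinsics per fork without rebuilding is JMH's jvmArgsAppend; a sketch of that wiring is below. -XX:DisableIntrinsic is a diagnostic flag, hence the unlock option, and the intrinsic IDs shown (_floatToFloat16, _float16ToFloat) are the mainline float/half conversion intrinsics standing in for whichever Float16 intrinsic IDs this branch actually defines; the class is again made up for illustration.

import org.openjdk.jmh.annotations.*;
import jdk.incubator.vector.Float16;   // package assumed, as above

// Fork the benchmark with selected intrinsics switched off so C2 falls back to
// the non-intrinsified path while the flat Float16 array layout stays in place.
@State(Scope.Thread)
@Fork(value = 1, jvmArgsAppend = {
        "-XX:+UnlockDiagnosticVMOptions",
        "-XX:DisableIntrinsic=_floatToFloat16,_float16ToFloat"   // placeholder IDs
})
public class Float16NoIntrinsicSketch {
    Float16 x = Float16.valueOf(1.5);
    Float16 y = Float16.valueOf(2.5);

    @Benchmark
    public Float16 addNoIntrinsic() {
        return Float16.add(x, y);
    }
}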

-------------

PR Comment: https://git.openjdk.org/valhalla/pull/1254#issuecomment-2378638423

