[lworld+fp16] RFR: 8341003: [lworld+fp16] Benchmarks for various Float16 operations [v2]

Thu Sep 26 10:46:43 UTC 2024

On Thu, 26 Sep 2024 09:05:00 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> - Adding micro-benchmarks for various Float16 operations.
>> - Adding similarity search targeting micro-benchmarks.
>> 
>> Please find below the results of performance testing on Intel Xeon 6 (Granite Rapids):
>> 
>> 
>> Benchmark                                               (vectorDim)   Mode  Cnt      Score   Error   Units
>> Float16OpsBenchmark.absBenchmark                               1024  thrpt    2  25605.990          ops/ms
>> Float16OpsBenchmark.addBenchmark                               1024  thrpt    2  19222.468          ops/ms
>> Float16OpsBenchmark.cosineSimilarityDequantizedFP16            1024  thrpt    2    528.738          ops/ms
>> Float16OpsBenchmark.cosineSimilarityDoubleRoundingFP16         1024  thrpt    2    660.018          ops/ms
>> Float16OpsBenchmark.cosineSimilaritySingleRoundingFP16         1024  thrpt    2    659.799          ops/ms
>> Float16OpsBenchmark.divBenchmark                               1024  thrpt    2   1974.039          ops/ms
>> Float16OpsBenchmark.euclideanDistanceDequantizedFP16           1024  thrpt    2    743.071          ops/ms
>> Float16OpsBenchmark.euclideanDistanceFP16                      1024  thrpt    2    682.440          ops/ms
>> Float16OpsBenchmark.fmaBenchmark                               1024  thrpt    2  14052.422          ops/ms
>> Float16OpsBenchmark.isFiniteBenchmark                          1024  thrpt    2   3851.234          ops/ms
>> Float16OpsBenchmark.isInfiniteBenchmark                        1024  thrpt    2   1496.207          ops/ms
>> Float16OpsBenchmark.isNaNBenchmark                             1024  thrpt    2   2778.822          ops/ms
>> Float16OpsBenchmark.maxBenchmark                               1024  thrpt    2  19231.326          ops/ms
>> Float16OpsBenchmark.minBenchmark                               1024  thrpt    2  19257.589          ops/ms
>> Float16OpsBenchmark.mulBenchmark                               1024  thrpt    2  19236.498          ops/ms
>> Float16OpsBenchmark.negateBenchmark                            1024  thrpt    2  25938.789          ops/ms
>> Float16OpsBenchmark.sqrtBenchmark                              1024  thrpt    2   1759.051          ops/ms
>> Float16OpsBenchmark.subBenchmark                               1024  thrpt    2  19242.967          ops/ms
>> 
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Update benchmark

Hi @Bhavana-Kilambi, I see vector IR in almost all the micros, apart from three (isNaN, isFinite and isInfinite), with the following command:

`numactl --cpunodebind=1 -l java -jar target/benchmarks.jar  -jvmArgs "-XX:+TraceNewVectors" -p vectorDim=512 -f 1 -i 2 -wi 1 -w 30 org.openjdk.bench.java.lang.Float16OpsBenchmark.<BM_NAME>`

This indicates that the Java implementation is not being auto-vectorized in those three cases; after tuning, we can verify the improvement with this benchmark.
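
For reference, here is a rough sketch of the kind of scalar classification loop in question. It is not the Float16 implementation or the benchmark source: the class name `Fp16Classify` and its helpers are hypothetical, and it only assumes the standard IEEE 754 binary16 layout (1 sign bit, 5 exponent bits, 10 significand bits) stored in a `short`.

```java
// Hypothetical illustration only; not taken from Float16 or Float16OpsBenchmark.
public final class Fp16Classify {
    private static final int ABS_MASK = 0x7FFF; // clears the sign bit
    private static final int EXP_MASK = 0x7C00; // all five exponent bits set

    // NaN: exponent all ones and a non-zero significand.
    static boolean isNaN(short bits) {
        return (bits & ABS_MASK) > EXP_MASK;
    }

    // Infinity: exponent all ones and a zero significand.
    static boolean isInfinite(short bits) {
        return (bits & ABS_MASK) == EXP_MASK;
    }

    // Finite: exponent not all ones.
    static boolean isFinite(short bits) {
        return (bits & ABS_MASK) < EXP_MASK;
    }

    // Simple counted loop over raw binary16 bit patterns.
    static int countNaNs(short[] values) {
        int count = 0;
        for (int i = 0; i < values.length; i++) {
            if (isNaN(values[i])) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 1.0, +Inf, -Inf, a quiet NaN, +0.0
        short[] values = { 0x3C00, 0x7C00, (short) 0xFC00, 0x7E00, 0x0000 };
        System.out.println(countNaNs(values));
    }
}
```

Whether C2 vectorizes a loop shaped like this is exactly what the `-XX:+TraceNewVectors` output from the command above would show; the sketch makes no claim either way.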

Kindly let me know if the micro looks good; if so, I can integrate it.

-------------

PR Comment: https://git.openjdk.org/valhalla/pull/1254#issuecomment-2376589198