[lworld+fp16] RFR: 8341003: [lworld+fp16] Benchmarks for various Float16 operations [v2]
Jatin Bhateja
jbhateja at openjdk.org
Fri Sep 27 08:08:54 UTC 2024
On Fri, 27 Sep 2024 07:51:12 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>>> Hi @jatin-bhateja , thanks for doing the micros. Can I ask why you are benchmarking the cosine similarity kernels specifically? Are there any real-world use cases for FP16 similar to these, for which you have written these smaller benchmark kernels?
>>>
>>> Also, regarding the performance results you posted for the Intel machine, have you compared them against anything else (like the default FP32 implementation of FP16, the case without the intrinsics, or the scalar FP16 version) so that we can better interpret the scores?
>>
>> Hi @Bhavana-Kilambi , this patch adds **micro-benchmarks** for all Float16 APIs optimized so far.
>> The **macro-benchmarks** demonstrate a use case for low-precision semantic-search primitives.
>
> Hey, for the baseline we should not pass --enable-preview, since doing so would prevent the following:
> - Flat layout of Float16 arrays.
> - Creation of the Valhalla-specific IR needed for intrinsification.
>
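> To make that concrete, here is a minimal sketch of the shape one of these micros takes (illustrative, not the PR's exact code; `Float16AddSketch` is a hypothetical name, and the import for the branch's Float16 value class is omitted since its package may differ):
>
> ```java
> import org.openjdk.jmh.annotations.*;
>
> @State(Scope.Thread)
> public class Float16AddSketch {
>     @Param({"1024"})
>     int vectorDim;
>
>     Float16[] a, b, out;
>
>     @Setup
>     public void setup() {
>         a = new Float16[vectorDim];
>         b = new Float16[vectorDim];
>         out = new Float16[vectorDim];
>         for (int i = 0; i < vectorDim; i++) {
>             a[i] = Float16.valueOf(i * 0.5f);
>             b[i] = Float16.valueOf(i * 0.25f);
>         }
>     }
>
>     @Benchmark
>     public void addBenchmark() {
>         // With --enable-preview the arrays get a flat 16-bit-per-element
>         // layout and Float16.add can be intrinsified; without it each
>         // element is a heap reference and the add stays a plain call.
>         for (int i = 0; i < vectorDim; i++) {
>             out[i] = Float16.add(a[i], b[i]);
>         }
>     }
> }
> ```
>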
> Here are the first baseline numbers without --enable-preview.
>
> Benchmark                                                (vectorDim)   Mode  Cnt     Score   Error   Units
> Float16OpsBenchmark.absBenchmark                                1024  thrpt    2    99.424          ops/ms
> Float16OpsBenchmark.addBenchmark                                1024  thrpt    2    97.498          ops/ms
> Float16OpsBenchmark.cosineSimilarityDequantizedFP16             1024  thrpt    2   525.360          ops/ms
> Float16OpsBenchmark.cosineSimilarityDoubleRoundingFP16          1024  thrpt    2    51.132          ops/ms
> Float16OpsBenchmark.cosineSimilaritySingleRoundingFP16          1024  thrpt    2    46.921          ops/ms
> Float16OpsBenchmark.divBenchmark                                1024  thrpt    2    97.186          ops/ms
> Float16OpsBenchmark.euclideanDistanceDequantizedFP16            1024  thrpt    2   583.051          ops/ms
> Float16OpsBenchmark.euclideanDistanceFP16                       1024  thrpt    2    56.133          ops/ms
> Float16OpsBenchmark.fmaBenchmark                                1024  thrpt    2    81.386          ops/ms
> Float16OpsBenchmark.getExponentBenchmark                        1024  thrpt    2  2257.619          ops/ms
> Float16OpsBenchmark.isFiniteBenchmark                           1024  thrpt    2  3086.476          ops/ms
> Float16OpsBenchmark.isInfiniteBenchmark                         1024  thrpt    2  1718.411          ops/ms
> Float16OpsBenchmark.isNaNBenchmark                              1024  thrpt    2  1685.557          ops/ms
> Float16OpsBenchmark.maxBenchma...
> @jatin-bhateja , Thanks! While we are on the topic, can I ask whether there are any real-world use cases or workloads that you are targeting with the FP16 work, and perhaps plan to use for performance testing in the future?
Hey, we have some ideas, but for now my intent is to add micro-benchmarks, plus a few demonstrative macro-benchmarks, for each API we have accelerated.
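For context, here is a minimal sketch of the two cosine-similarity shapes from the table above (illustrative, not the PR's exact code; `Float16CosineSketch` is a hypothetical name, and the import for the branch's Float16 value class is omitted since its package may differ). The dequantized variant widens to FP32 up front and accumulates in single precision; the single-rounding variant stays in FP16 throughout, accumulating with fma so each multiply-add rounds once:

```java
import java.util.Random;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class Float16CosineSketch {
    @Param({"1024"})
    int vectorDim;

    Float16[] a, b;

    @Setup
    public void setup() {
        Random r = new Random(42);
        a = new Float16[vectorDim];
        b = new Float16[vectorDim];
        for (int i = 0; i < vectorDim; i++) {
            a[i] = Float16.valueOf(r.nextFloat());
            b[i] = Float16.valueOf(r.nextFloat());
        }
    }

    // Dequantized: convert each element to FP32 once, then do all
    // arithmetic and accumulation in single precision.
    @Benchmark
    public float cosineSimilarityDequantizedFP16() {
        float dot = 0.0f, na = 0.0f, nb = 0.0f;
        for (int i = 0; i < vectorDim; i++) {
            float x = a[i].floatValue();
            float y = b[i].floatValue();
            dot += x * y;
            na  += x * x;
            nb  += y * y;
        }
        return dot / (float) Math.sqrt(na * nb);
    }

    // Single rounding: keep values and accumulators in FP16 and use
    // Float16.fma, so each multiply-add rounds only once.
    @Benchmark
    public float cosineSimilaritySingleRoundingFP16() {
        Float16 dot = Float16.valueOf(0.0f);
        Float16 na  = Float16.valueOf(0.0f);
        Float16 nb  = Float16.valueOf(0.0f);
        for (int i = 0; i < vectorDim; i++) {
            dot = Float16.fma(a[i], b[i], dot);
            na  = Float16.fma(a[i], a[i], na);
            nb  = Float16.fma(b[i], b[i], nb);
        }
        return dot.floatValue()
               / (float) Math.sqrt(na.floatValue() * nb.floatValue());
    }
}
```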
-------------
PR Comment: https://git.openjdk.org/valhalla/pull/1254#issuecomment-2378652894