[lworld+fp16] RFR: 8341003: [lworld+fp16] Benchmarks for various Float16 operations [v2]
Jatin Bhateja
jbhateja at openjdk.org
Fri Sep 27 08:08:54 UTC 2024
On Fri, 27 Sep 2024 07:51:12 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>>> Hi @jatin-bhateja , thanks for doing the micros. Can I ask why you are benchmarking the cosine similarity kernels specifically? Are there any real-world use cases for FP16 similar to these, for which you have written these smaller benchmark kernels?
>>>
>>> Also, regarding the performance results you posted for the Intel machine, have you compared them against anything else (like the default FP32 implementation of FP16, the case without the intrinsics, or the scalar FP16 version) so that we can better interpret the scores?
>>
>> Hi @Bhavana-Kilambi , this patch adds **micro-benchmarks** for all Float16 APIs optimized so far.
>> The **macro-benchmarks** demonstrate a use case for low-precision semantic-search primitives.
>
> Hey, for the baseline we should not pass --enable-preview, since doing so would prevent the following:
> - Flat layout of Float16 arrays.
> - Creation of the Valhalla-specific IR needed for intrinsification.
>
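> To make that concrete, here is a minimal sketch of the shape one of these micros takes (illustrative, not the PR's exact code; `Float16AddSketch` is a hypothetical name, and the import for the branch's Float16 value class is omitted since its package may differ):
>
> ```java
> import org.openjdk.jmh.annotations.*;
>
> @State(Scope.Thread)
> public class Float16AddSketch {
>     @Param({"1024"})
>     int vectorDim;
>
>     Float16[] a, b, out;
>
>     @Setup
>     public void setup() {
>         a = new Float16[vectorDim];
>         b = new Float16[vectorDim];
>         out = new Float16[vectorDim];
>         for (int i = 0; i < vectorDim; i++) {
>             a[i] = Float16.valueOf(i * 0.5f);
>             b[i] = Float16.valueOf(i * 0.25f);
>         }
>     }
>
>     @Benchmark
>     public void addBenchmark() {
>         // With --enable-preview the arrays get a flat 16-bit-per-element
>         // layout and Float16.add can be intrinsified; without it each
>         // element is a heap reference and the add stays a plain call.
>         for (int i = 0; i < vectorDim; i++) {
>             out[i] = Float16.add(a[i], b[i]);
>         }
>     }
> }
> ```
>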
> Here are the first baseline numbers without --enable-preview.
>
> Benchmark                                                (vectorDim)   Mode  Cnt     Score   Error   Units
> Float16OpsBenchmark.absBenchmark                                1024  thrpt    2    99.424          ops/ms
> Float16OpsBenchmark.addBenchmark                                1024  thrpt    2    97.498          ops/ms
> Float16OpsBenchmark.cosineSimilarityDequantizedFP16             1024  thrpt    2   525.360          ops/ms
> Float16OpsBenchmark.cosineSimilarityDoubleRoundingFP16          1024  thrpt    2    51.132          ops/ms
> Float16OpsBenchmark.cosineSimilaritySingleRoundingFP16          1024  thrpt    2    46.921          ops/ms
> Float16OpsBenchmark.divBenchmark                                1024  thrpt    2    97.186          ops/ms
> Float16OpsBenchmark.euclideanDistanceDequantizedFP16            1024  thrpt    2   583.051          ops/ms
> Float16OpsBenchmark.euclideanDistanceFP16                       1024  thrpt    2    56.133          ops/ms
> Float16OpsBenchmark.fmaBenchmark                                1024  thrpt    2    81.386          ops/ms
> Float16OpsBenchmark.getExponentBenchmark                        1024  thrpt    2  2257.619          ops/ms
> Float16OpsBenchmark.isFiniteBenchmark                           1024  thrpt    2  3086.476          ops/ms
> Float16OpsBenchmark.isInfiniteBenchmark                         1024  thrpt    2  1718.411          ops/ms
> Float16OpsBenchmark.isNaNBenchmark                              1024  thrpt    2  1685.557          ops/ms
> Float16OpsBenchmark.maxBenchma...
> @jatin-bhateja , Thanks! While we are on the topic, can I ask whether there are any real-world use cases or workloads that you are targeting with the FP16 work, and perhaps plan to use for performance testing in the future?
Hey, we have some ideas, but for now my intent is to add micro-benchmarks, plus a few demonstrative macro-benchmarks, for each API we have accelerated.
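For context, here is a minimal sketch of the two cosine-similarity shapes from the table above (illustrative, not the PR's exact code; `Float16CosineSketch` is a hypothetical name, and the import for the branch's Float16 value class is omitted since its package may differ). The dequantized variant widens to FP32 up front and accumulates in single precision; the single-rounding variant stays in FP16 throughout, accumulating with fma so each multiply-add rounds once:

```java
import java.util.Random;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class Float16CosineSketch {
    @Param({"1024"})
    int vectorDim;

    Float16[] a, b;

    @Setup
    public void setup() {
        Random r = new Random(42);
        a = new Float16[vectorDim];
        b = new Float16[vectorDim];
        for (int i = 0; i < vectorDim; i++) {
            a[i] = Float16.valueOf(r.nextFloat());
            b[i] = Float16.valueOf(r.nextFloat());
        }
    }

    // Dequantized: convert each element to FP32 once, then do all
    // arithmetic and accumulation in single precision.
    @Benchmark
    public float cosineSimilarityDequantizedFP16() {
        float dot = 0.0f, na = 0.0f, nb = 0.0f;
        for (int i = 0; i < vectorDim; i++) {
            float x = a[i].floatValue();
            float y = b[i].floatValue();
            dot += x * y;
            na  += x * x;
            nb  += y * y;
        }
        return dot / (float) Math.sqrt(na * nb);
    }

    // Single rounding: keep values and accumulators in FP16 and use
    // Float16.fma, so each multiply-add rounds only once.
    @Benchmark
    public float cosineSimilaritySingleRoundingFP16() {
        Float16 dot = Float16.valueOf(0.0f);
        Float16 na  = Float16.valueOf(0.0f);
        Float16 nb  = Float16.valueOf(0.0f);
        for (int i = 0; i < vectorDim; i++) {
            dot = Float16.fma(a[i], b[i], dot);
            na  = Float16.fma(a[i], a[i], na);
            nb  = Float16.fma(b[i], b[i], nb);
        }
        return dot.floatValue()
               / (float) Math.sqrt(na.floatValue() * nb.floatValue());
    }
}
```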
-------------
PR Comment: https://git.openjdk.org/valhalla/pull/1254#issuecomment-2378652894