[vectorIntrinsics+fp16] RFR: 8365967: C2 compiler support for HalffloatVector operations supported by auto-vectorization flow [v3]
Jatin Bhateja
jbhateja at openjdk.org
Thu Oct 2 04:48:39 UTC 2025
On Tue, 2 Sep 2025 13:38:14 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> Hi All,
>>
>> This patch extends VectorAPI inline expanders to infer Float16 vector IR based on the newly passed operType argument.
>> We intend to leverage the existing IR and backend implementation of auto-vectorized Float16 operations.
>> Various HalffloatVector operators, namely ADD, SUB, MUL, DIV, MAX, MIN, and FMA, now emit FP16 ISA on x86 targets supporting AVX512-FP16 feature and AArch64 SVE targets.
>>
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
> Fix jtreg failures
Performance of JMH micros
System: Model name: INTEL(R) XEON(R) PLATINUM 8581C CPU @ 2.10GHz
Baseline:
Benchmark                     (size)   Mode  Cnt      Score  Error  Units
Halffloat256Vector.ABS          1024  thrpt    2    366.995         ops/ms
Halffloat256Vector.ABSMasked    1024  thrpt    2    345.584         ops/ms
Halffloat256Vector.ACOS         1024  thrpt    2     61.402         ops/ms
Halffloat256Vector.ADD          1024  thrpt    2    259.029         ops/ms
Halffloat256Vector.ADDMasked    1024  thrpt    2    251.257         ops/ms
Halffloat256Vector.ASIN         1024  thrpt    2     61.191         ops/ms
Halffloat256Vector.ATAN         1024  thrpt    2     40.815         ops/ms
Halffloat256Vector.ATAN2        1024  thrpt    2     28.224         ops/ms
Halffloat256Vector.CBRT         1024  thrpt    2     43.547         ops/ms
Halffloat256Vector.COS          1024  thrpt    2     37.414         ops/ms
Halffloat256Vector.COSH         1024  thrpt    2     46.365         ops/ms
Halffloat256Vector.DIV          1024  thrpt    2    221.924         ops/ms
Halffloat256Vector.DIVMasked    1024  thrpt    2    240.560         ops/ms
Halffloat256Vector.EXP          1024  thrpt    2     52.344         ops/ms
Halffloat256Vector.EXPM1        1024  thrpt    2     48.346         ops/ms
Halffloat256Vector.FMA          1024  thrpt    2    206.324         ops/ms
Halffloat256Vector.FMAMasked    1024  thrpt    2    184.678         ops/ms
Halffloat256Vector.HYPOT        1024  thrpt    2     34.096         ops/ms
Halffloat256Vector.LOG          1024  thrpt    2     40.300         ops/ms
Halffloat256Vector.LOG10        1024  thrpt    2     38.886         ops/ms
Halffloat256Vector.LOG1P        1024  thrpt    2     36.438         ops/ms
Halffloat256Vector.MAX          1024  thrpt    2    266.337         ops/ms
Halffloat256Vector.MAXMasked    1024  thrpt    2    245.518         ops/ms
Halffloat256Vector.MIN          1024  thrpt    2    268.963         ops/ms
Halffloat256Vector.MINMasked    1024  thrpt    2    243.136         ops/ms
Halffloat256Vector.MUL          1024  thrpt    2    264.127         ops/ms
Halffloat256Vector.MULMasked    1024  thrpt    2    251.600         ops/ms
Halffloat256Vector.NEG          1024  thrpt    2    365.486         ops/ms
Halffloat256Vector.NEGMasked    1024  thrpt    2    357.070         ops/ms
Halffloat256Vector.POW          1024  thrpt    2     26.809         ops/ms
Halffloat256Vector.SIN          1024  thrpt    2     34.555         ops/ms
Halffloat256Vector.SINH         1024  thrpt    2     53.779         ops/ms
Halffloat256Vector.SQRT         1024  thrpt    2    130.811         ops/ms
Halffloat256Vector.SQRTMasked   1024  thrpt    2    192.628         ops/ms
Halffloat256Vector.SUB          1024  thrpt    2    262.521         ops/ms
Halffloat256Vector.SUBMasked    1024  thrpt    2    254.578         ops/ms
Halffloat256Vector.TAN          1024  thrpt    2     30.002         ops/ms
Halffloat256Vector.TANH         1024  thrpt    2     55.562         ops/ms
Halffloat256Vector.blend        1024  thrpt    2  28002.356         ops/ms
With optimization:
Benchmark                     (size)   Mode  Cnt      Score  Error  Units
Halffloat256Vector.ABS          1024  thrpt    2  24048.638         ops/ms
Halffloat256Vector.ABSMasked    1024  thrpt    2  45085.707         ops/ms
Halffloat256Vector.ACOS         1024  thrpt    2     56.116         ops/ms
Halffloat256Vector.ADD          1024  thrpt    2  19623.250         ops/ms
Halffloat256Vector.ADDMasked    1024  thrpt    2  27462.171         ops/ms
Halffloat256Vector.ASIN         1024  thrpt    2     62.081         ops/ms
Halffloat256Vector.ATAN         1024  thrpt    2     41.352         ops/ms
Halffloat256Vector.ATAN2        1024  thrpt    2     29.173         ops/ms
Halffloat256Vector.CBRT         1024  thrpt    2     39.926         ops/ms
Halffloat256Vector.COS          1024  thrpt    2     37.151         ops/ms
Halffloat256Vector.COSH         1024  thrpt    2     48.309         ops/ms
Halffloat256Vector.DIV          1024  thrpt    2   2805.701         ops/ms
Halffloat256Vector.DIVMasked    1024  thrpt    2   2795.544         ops/ms
Halffloat256Vector.EXP          1024  thrpt    2     55.055         ops/ms
Halffloat256Vector.EXPM1        1024  thrpt    2     50.483         ops/ms
Halffloat256Vector.FMA          1024  thrpt    2  23280.064         ops/ms
Halffloat256Vector.FMAMasked    1024  thrpt    2  21828.932         ops/ms
Halffloat256Vector.HYPOT        1024  thrpt    2     34.266         ops/ms
Halffloat256Vector.LOG          1024  thrpt    2     42.158         ops/ms
Halffloat256Vector.LOG10        1024  thrpt    2     41.335         ops/ms
Halffloat256Vector.LOG1P        1024  thrpt    2     36.291         ops/ms
Halffloat256Vector.MAX          1024  thrpt    2  14960.348         ops/ms
Halffloat256Vector.MAXMasked    1024  thrpt    2  12585.642         ops/ms
Halffloat256Vector.MIN          1024  thrpt    2  14662.769         ops/ms
Halffloat256Vector.MINMasked    1024  thrpt    2  12327.769         ops/ms
Halffloat256Vector.MUL          1024  thrpt    2  27156.965         ops/ms
Halffloat256Vector.MULMasked    1024  thrpt    2  21349.555         ops/ms
Halffloat256Vector.NEG          1024  thrpt    2  24093.711         ops/ms
Halffloat256Vector.NEGMasked    1024  thrpt    2  26889.264         ops/ms
Halffloat256Vector.POW          1024  thrpt    2     27.028         ops/ms
Halffloat256Vector.SIN          1024  thrpt    2     34.280         ops/ms
Halffloat256Vector.SINH         1024  thrpt    2     55.049         ops/ms
Halffloat256Vector.SQRT         1024  thrpt    2   2491.596         ops/ms
Halffloat256Vector.SQRTMasked   1024  thrpt    2   2493.591         ops/ms
Halffloat256Vector.SUB          1024  thrpt    2  29664.499         ops/ms
Halffloat256Vector.SUBMasked    1024  thrpt    2  25384.305         ops/ms
Halffloat256Vector.TAN          1024  thrpt    2     29.754         ops/ms
Halffloat256Vector.TANH         1024  thrpt    2     55.933         ops/ms
Halffloat256Vector.blend        1024  thrpt    2  22681.727         ops/ms
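For a quick read of the two tables, the ratio of the optimized score to the baseline score gives a per-operator speedup. The snippet below is only a minimal sketch of that arithmetic for a handful of rows copied from the tables above. The class name Fp16Speedup and the helper speedup are illustrative and not part of the patch. With Cnt = 2 the runs carry no error estimate, so the ratios should be treated as indicative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class Fp16Speedup {
    // Speedup = optimized throughput / baseline throughput (both in ops/ms).
    static double speedup(double baseline, double optimized) {
        return optimized / baseline;
    }

    public static void main(String[] args) {
        // {baseline, optimized} scores copied from the two tables above.
        Map<String, double[]> scores = new LinkedHashMap<>();
        scores.put("ADD",   new double[] {259.029, 19623.250});
        scores.put("MUL",   new double[] {264.127, 27156.965});
        scores.put("DIV",   new double[] {221.924, 2805.701});
        scores.put("FMA",   new double[] {206.324, 23280.064});
        scores.put("SQRT",  new double[] {130.811, 2491.596});
        scores.put("SIN",   new double[] {34.555, 34.280});
        scores.put("blend", new double[] {28002.356, 22681.727});

        // Print one ratio per operator; values below 1.0 indicate a slowdown.
        for (Map.Entry<String, double[]> e : scores.entrySet()) {
            double[] s = e.getValue();
            System.out.printf("%-6s %8.2fx%n", e.getKey(), speedup(s[0], s[1]));
        }
    }
}
```

Ratios well above 1 appear for the operators that now emit FP16 instructions, ratios near 1 for the transcendental operators such as SIN, and a ratio below 1 for blend, which may deserve a closer look during performance validation.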
**What is remaining?**
Functional validation.
Thorough performance validation.
New IR framework-based tests.
Microbenchmark for FP16-based dot product.
-------------
PR Comment: https://git.openjdk.org/panama-vector/pull/231#issuecomment-3359042772
More information about the panama-dev mailing list