[vectorIntrinsics+fp16] RFR: 8365967: C2 compiler support for HalffloatVector operations supported by auto-vectorization flow [v3]

Jatin Bhateja jbhateja at openjdk.org
Thu Oct 2 04:48:39 UTC 2025


On Tue, 2 Sep 2025 13:38:14 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Hi All,
>> 
>> This patch extends VectorAPI inline expanders to infer Float16 vector IR based on the newly passed operType argument.
>> We intend to leverage the existing IR and backend implementation of auto-vectorized Float16 operations. 
>> Various HalffloatVector operators, namely ADD, SUB, MUL, DIV, MAX, MIN, and FMA, now emit FP16 instructions on x86 targets supporting the AVX512-FP16 feature and on AArch64 SVE targets.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Fix jtreg failures
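For functional comparison, here is a minimal scalar sketch of what one lane of the binary operators listed above computes. It assumes FP16 values are held as raw binary16 bits in `short` and uses `Float.float16ToFloat` / `Float.floatToFloat16` (JDK 20+). Each lane widens both inputs to float, computes, and rounds back to binary16 with round-to-nearest-even. For ADD, SUB, MUL and DIV, float's 24-bit significand is at least 2p + 2 for binary16's p = 11, so this double rounding should not change the result. The class, enum and method names are hypothetical and not part of the Vector API. FMA is left out of this sketch.

```java
public class Fp16LaneReference {
    // Primitive float operator, used to avoid boxing.
    interface FloatOp {
        float apply(float a, float b);
    }

    // Hypothetical lane-wise operators modeled on float arithmetic.
    enum Op {
        ADD((a, b) -> a + b),
        SUB((a, b) -> a - b),
        MUL((a, b) -> a * b),
        DIV((a, b) -> a / b),
        MAX(Math::max),   // Math.max semantics: NaN propagates, -0.0 < +0.0
        MIN(Math::min);   // Math.min semantics: NaN propagates, -0.0 < +0.0

        final FloatOp fn;

        Op(FloatOp fn) {
            this.fn = fn;
        }
    }

    // One lane: widen both binary16 inputs to float, compute,
    // then round back to binary16 (round to nearest even).
    static short lane(Op op, short a, short b) {
        float r = op.fn.apply(Float.float16ToFloat(a), Float.float16ToFloat(b));
        return Float.floatToFloat16(r);
    }

    // Element-wise reference over arrays of raw binary16 bits.
    static short[] apply(Op op, short[] a, short[] b) {
        short[] r = new short[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = lane(op, a[i], b[i]);
        }
        return r;
    }

    public static void main(String[] args) {
        short one = Float.floatToFloat16(1.0f);
        short two = Float.floatToFloat16(2.0f);
        short[] x = {one, two};
        short[] y = {two, two};
        for (Op op : Op.values()) {
            short[] r = apply(op, x, y);
            System.out.println(op + ": " + Float.float16ToFloat(r[0])
                    + ", " + Float.float16ToFloat(r[1]));
        }
    }
}
```

Output of such a reference could be compared bit-for-bit against HalffloatVector results for the same inputs.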

Performance of JMH microbenchmarks (Halffloat256Vector, size = 1024, throughput in ops/ms, higher is better)
System: INTEL(R) XEON(R) PLATINUM 8581C CPU @ 2.10GHz


Baseline:
Benchmark                      (size)   Mode  Cnt      Score   Error   Units
Halffloat256Vector.ABS           1024  thrpt    2    366.995          ops/ms
Halffloat256Vector.ABSMasked     1024  thrpt    2    345.584          ops/ms
Halffloat256Vector.ACOS          1024  thrpt    2     61.402          ops/ms
Halffloat256Vector.ADD           1024  thrpt    2    259.029          ops/ms
Halffloat256Vector.ADDMasked     1024  thrpt    2    251.257          ops/ms
Halffloat256Vector.ASIN          1024  thrpt    2     61.191          ops/ms
Halffloat256Vector.ATAN          1024  thrpt    2     40.815          ops/ms
Halffloat256Vector.ATAN2         1024  thrpt    2     28.224          ops/ms
Halffloat256Vector.CBRT          1024  thrpt    2     43.547          ops/ms
Halffloat256Vector.COS           1024  thrpt    2     37.414          ops/ms
Halffloat256Vector.COSH          1024  thrpt    2     46.365          ops/ms
Halffloat256Vector.DIV           1024  thrpt    2    221.924          ops/ms
Halffloat256Vector.DIVMasked     1024  thrpt    2    240.560          ops/ms
Halffloat256Vector.EXP           1024  thrpt    2     52.344          ops/ms
Halffloat256Vector.EXPM1         1024  thrpt    2     48.346          ops/ms
Halffloat256Vector.FMA           1024  thrpt    2    206.324          ops/ms
Halffloat256Vector.FMAMasked     1024  thrpt    2    184.678          ops/ms
Halffloat256Vector.HYPOT         1024  thrpt    2     34.096          ops/ms
Halffloat256Vector.LOG           1024  thrpt    2     40.300          ops/ms
Halffloat256Vector.LOG10         1024  thrpt    2     38.886          ops/ms
Halffloat256Vector.LOG1P         1024  thrpt    2     36.438          ops/ms
Halffloat256Vector.MAX           1024  thrpt    2    266.337          ops/ms
Halffloat256Vector.MAXMasked     1024  thrpt    2    245.518          ops/ms
Halffloat256Vector.MIN           1024  thrpt    2    268.963          ops/ms
Halffloat256Vector.MINMasked     1024  thrpt    2    243.136          ops/ms
Halffloat256Vector.MUL           1024  thrpt    2    264.127          ops/ms
Halffloat256Vector.MULMasked     1024  thrpt    2    251.600          ops/ms
Halffloat256Vector.NEG           1024  thrpt    2    365.486          ops/ms
Halffloat256Vector.NEGMasked     1024  thrpt    2    357.070          ops/ms
Halffloat256Vector.POW           1024  thrpt    2     26.809          ops/ms
Halffloat256Vector.SIN           1024  thrpt    2     34.555          ops/ms
Halffloat256Vector.SINH          1024  thrpt    2     53.779          ops/ms
Halffloat256Vector.SQRT          1024  thrpt    2    130.811          ops/ms
Halffloat256Vector.SQRTMasked    1024  thrpt    2    192.628          ops/ms
Halffloat256Vector.SUB           1024  thrpt    2    262.521          ops/ms
Halffloat256Vector.SUBMasked     1024  thrpt    2    254.578          ops/ms
Halffloat256Vector.TAN           1024  thrpt    2     30.002          ops/ms
Halffloat256Vector.TANH          1024  thrpt    2     55.562          ops/ms
Halffloat256Vector.blend         1024  thrpt    2  28002.356          ops/ms

With optimization:
Benchmark                      (size)   Mode  Cnt      Score   Error   Units
Halffloat256Vector.ABS           1024  thrpt    2  24048.638          ops/ms
Halffloat256Vector.ABSMasked     1024  thrpt    2  45085.707          ops/ms
Halffloat256Vector.ACOS          1024  thrpt    2     56.116          ops/ms
Halffloat256Vector.ADD           1024  thrpt    2  19623.250          ops/ms
Halffloat256Vector.ADDMasked     1024  thrpt    2  27462.171          ops/ms
Halffloat256Vector.ASIN          1024  thrpt    2     62.081          ops/ms
Halffloat256Vector.ATAN          1024  thrpt    2     41.352          ops/ms
Halffloat256Vector.ATAN2         1024  thrpt    2     29.173          ops/ms
Halffloat256Vector.CBRT          1024  thrpt    2     39.926          ops/ms
Halffloat256Vector.COS           1024  thrpt    2     37.151          ops/ms
Halffloat256Vector.COSH          1024  thrpt    2     48.309          ops/ms
Halffloat256Vector.DIV           1024  thrpt    2   2805.701          ops/ms
Halffloat256Vector.DIVMasked     1024  thrpt    2   2795.544          ops/ms
Halffloat256Vector.EXP           1024  thrpt    2     55.055          ops/ms
Halffloat256Vector.EXPM1         1024  thrpt    2     50.483          ops/ms
Halffloat256Vector.FMA           1024  thrpt    2  23280.064          ops/ms
Halffloat256Vector.FMAMasked     1024  thrpt    2  21828.932          ops/ms
Halffloat256Vector.HYPOT         1024  thrpt    2     34.266          ops/ms
Halffloat256Vector.LOG           1024  thrpt    2     42.158          ops/ms
Halffloat256Vector.LOG10         1024  thrpt    2     41.335          ops/ms
Halffloat256Vector.LOG1P         1024  thrpt    2     36.291          ops/ms
Halffloat256Vector.MAX           1024  thrpt    2  14960.348          ops/ms
Halffloat256Vector.MAXMasked     1024  thrpt    2  12585.642          ops/ms
Halffloat256Vector.MIN           1024  thrpt    2  14662.769          ops/ms
Halffloat256Vector.MINMasked     1024  thrpt    2  12327.769          ops/ms
Halffloat256Vector.MUL           1024  thrpt    2  27156.965          ops/ms
Halffloat256Vector.MULMasked     1024  thrpt    2  21349.555          ops/ms
Halffloat256Vector.NEG           1024  thrpt    2  24093.711          ops/ms
Halffloat256Vector.NEGMasked     1024  thrpt    2  26889.264          ops/ms
Halffloat256Vector.POW           1024  thrpt    2     27.028          ops/ms
Halffloat256Vector.SIN           1024  thrpt    2     34.280          ops/ms
Halffloat256Vector.SINH          1024  thrpt    2     55.049          ops/ms
Halffloat256Vector.SQRT          1024  thrpt    2   2491.596          ops/ms
Halffloat256Vector.SQRTMasked    1024  thrpt    2   2493.591          ops/ms
Halffloat256Vector.SUB           1024  thrpt    2  29664.499          ops/ms
Halffloat256Vector.SUBMasked     1024  thrpt    2  25384.305          ops/ms
Halffloat256Vector.TAN           1024  thrpt    2     29.754          ops/ms
Halffloat256Vector.TANH          1024  thrpt    2     55.933          ops/ms
Halffloat256Vector.blend         1024  thrpt    2  22681.727          ops/ms


**What is remaining?**

- Functional validation
- Thorough performance validation
- New IR framework-based tests
- A microbenchmark for FP16-based dot product
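A scalar reference for the planned FP16 dot-product microbenchmark could look like the minimal sketch below. It contrasts two accumulation strategies, since the choice affects results as well as speed. It assumes inputs are raw binary16 bits in `short[]` and uses `Float.float16ToFloat` / `Float.floatToFloat16` (JDK 20+). The class and method names are hypothetical, and the per-step rounding model in `dotFp16Accum` is one possible assumption about an FP16-accumulating kernel, not the PR's implementation.

```java
import java.util.Arrays;

public class Fp16DotReference {
    // Accumulates in float and rounds to binary16 once at the end.
    static short dotFloatAccum(short[] a, short[] b) {
        float acc = 0.0f;
        for (int i = 0; i < a.length; i++) {
            acc += Float.float16ToFloat(a[i]) * Float.float16ToFloat(b[i]);
        }
        return Float.floatToFloat16(acc);
    }

    // Rounds the product and the running sum to binary16 at every step,
    // modeling a (hypothetical) kernel that keeps its accumulator in FP16 lanes.
    static short dotFp16Accum(short[] a, short[] b) {
        short acc = Float.floatToFloat16(0.0f);
        for (int i = 0; i < a.length; i++) {
            short prod = Float.floatToFloat16(
                    Float.float16ToFloat(a[i]) * Float.float16ToFloat(b[i]));
            acc = Float.floatToFloat16(
                    Float.float16ToFloat(acc) + Float.float16ToFloat(prod));
        }
        return acc;
    }

    public static void main(String[] args) {
        short[] ones = new short[4096];
        Arrays.fill(ones, Float.floatToFloat16(1.0f));
        System.out.println("float accumulator: "
                + Float.float16ToFloat(dotFloatAccum(ones, ones)));
        System.out.println("fp16 accumulator:  "
                + Float.float16ToFloat(dotFp16Accum(ones, ones)));
    }
}
```

With 4096 elements of 1.0, the float accumulator returns 4096.0 while the FP16 accumulator stalls at 2048.0: 2048 + 1 is a tie in binary16 and rounds back to the even value 2048.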

-------------

PR Comment: https://git.openjdk.org/panama-vector/pull/231#issuecomment-3359042772

