RFR: 8294588: Auto vectorize half precision floating point conversion APIs [v7]

Fei Gao fgao at openjdk.org
Wed Dec 7 03:56:00 UTC 2022


On Wed, 7 Dec 2022 00:46:44 GMT, Smita Kamath <svkamath at openjdk.org> wrote:

>> Hi All, 
>> 
>> I have added changes for auto-vectorizing the Float.float16ToFloat and Float.floatToFloat16 APIs.
>> Following are the performance numbers of JMH micro Fp16ConversionBenchmark:
>> Before code changes:
>> Benchmark | (size) | Mode | Cnt | Score | Error | Units
>> Fp16ConversionBenchmark.float16ToFloat       | 2048 | thrpt | 3 |   1044.653 | ±     0.041 | ops/ms
>> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2341529.9  | ± 11765.453 | ops/ms
>> Fp16ConversionBenchmark.floatToFloat16       | 2048 | thrpt | 3 |   2156.662 | ±     0.653 | ops/ms
>> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007988.1  | ±   361.696 | ops/ms
>> 
>> After:
>> Benchmark | (size) | Mode | Cnt | Score | Error | Units
>> Fp16ConversionBenchmark.float16ToFloat       | 2048 | thrpt | 3 |   20460.349 | ±  372.327 | ops/ms
>> Fp16ConversionBenchmark.float16ToFloatMemory | 2048 | thrpt | 3 | 2342125.200 | ± 9250.899 | ops/ms
>> Fp16ConversionBenchmark.floatToFloat16       | 2048 | thrpt | 3 |   22553.977 | ±  483.034 | ops/ms
>> Fp16ConversionBenchmark.floatToFloat16Memory | 2048 | thrpt | 3 | 2007899.797 | ±  150.296 | ops/ms
>> 
>> Kindly review and share your feedback.
>> 
>> Thanks.
>> Smita
>
> Smita Kamath has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Updated test case

src/hotspot/share/opto/vectornode.cpp line 276:

> 274: 
> 275:   default:
> 276:     assert(!VectorNode::is_convert_opcode(sopc),

Hi @smita-kamath, you may need to pay attention to the `default` branch here, because the patch also adds the new opcodes to the function `is_convert_opcode()` below. BTW, superword has another specialized function for cast nodes, namely `VectorCastNode::opcode()`.

The patch adds the new opcodes to the function `is_convert_opcode()`, which routes the code path to `VectorCastNode::implemented()` and then to `VectorCastNode::opcode()` when superword determines whether the vector opcode is implemented, see https://github.com/openjdk/jdk/blob/ce896731d38866c2bf99cd49525062e150d94160/src/hotspot/share/opto/superword.cpp#L2072. I suppose there is no problem with that routing itself. But the patch doesn't specially handle the new opcodes in these two functions, so `implemented()` is in fact checking whether the platform supports `Op_VectorCastS2X` or `Op_VectorCastF2X` rather than `Op_VectorCastH2F` or `Op_VectorCastF2H`. Coincidentally, in the vector node generation stage, namely `SuperWord::output()`, the patch handles the new opcodes before the `is_convert_opcode()` branch and calls `VectorNode::opcode()`, so the right vector nodes are generated and the vectorization succeeds. So I'm afraid there may be failures on platforms which support `Op_VectorCastS2X` or `Op_VectorCastF2X` but do not support `Op_VectorCastH2F` or `Op_VectorCastF2H`.

How about keeping the two stages, namely `SuperWord::implemented()` and `SuperWord::output()`, consistent with each other? My suggestion is to handle the new opcodes uniformly with the other cast nodes. Thanks.
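For reference, the scalar loop shape this transformation targets looks roughly like the following (a minimal sketch; the class and method names are made up for illustration and this is not the actual JMH benchmark source). The per-element calls to the JDK 20+ APIs `Float.floatToFloat16` and `Float.float16ToFloat` are what superword packs into `VectorCastF2H`/`VectorCastH2F` nodes on supporting platforms.

```java
public class Fp16RoundTrip {
    // Scalar conversion loop; a candidate for auto-vectorization by C2.
    static short[] toHalf(float[] in) {
        short[] out = new short[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = Float.floatToFloat16(in[i]);
        }
        return out;
    }

    static float[] toFloat(short[] in) {
        float[] out = new float[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = Float.float16ToFloat(in[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        // These values are exactly representable in binary16, so the
        // round trip through FP16 must reproduce them exactly.
        float[] src = {0.0f, 1.0f, -2.5f, 0.5f};
        float[] round = toFloat(toHalf(src));
        for (int i = 0; i < src.length; i++) {
            if (round[i] != src[i]) {
                throw new AssertionError("mismatch at index " + i);
            }
        }
        System.out.println("ok");
    }
}
```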

-------------

PR: https://git.openjdk.org/jdk/pull/11471


More information about the hotspot-compiler-dev mailing list