RFR: 8329077: C2 SuperWord: Add MoveD2L, MoveL2D, MoveF2I, MoveI2F

Mon Aug 4 15:55:56 UTC 2025

On Fri, 1 Aug 2025 12:52:21 GMT, Bhavana Kilambi <bkilambi at openjdk.org> wrote:

>> I've added support to vectorize `MoveD2L`, `MoveL2D`, `MoveF2I` and `MoveI2F` nodes. The implementation follows a similar pattern to what is done with conversion (`Conv*`) nodes. The tests in `TestCompatibleUseDefTypeSize` have been updated with the new expectations.
>> 
>> Also added a JMH benchmark which measures throughput (the higher the number the better) for methods that exercise these nodes. On darwin/aarch64 it shows:
>> 
>> 
>> Benchmark                                (seed)  (size)   Mode  Cnt      Base      Patch   Units   Diff
>> VectorBitConversion.doubleToLongBits          0    2048  thrpt    8  1168.782   1157.717  ops/ms    -1%
>> VectorBitConversion.doubleToRawLongBits       0    2048  thrpt    8  3999.387   7353.936  ops/ms   +83%
>> VectorBitConversion.floatToIntBits            0    2048  thrpt    8  1200.338   1188.206  ops/ms    -1%
>> VectorBitConversion.floatToRawIntBits         0    2048  thrpt    8  4058.248  14792.474  ops/ms  +264%
>> VectorBitConversion.intBitsToFloat            0    2048  thrpt    8  3050.313  14984.246  ops/ms  +391%
>> VectorBitConversion.longBitsToDouble          0    2048  thrpt    8  3022.691   7379.360  ops/ms  +144%
>> 
>> 
>> The improvements observed are a result of vectorization. The lack of vectorization in `doubleToLongBits` and `floatToIntBits` demonstrates that these changes do not affect their performance. These methods do not vectorize because of flow control.
>> 
>> I've run the tier1-3 tests on linux/aarch64 and didn't observe any regressions.
>
> src/hotspot/share/opto/vectornode.cpp line 1831:
> 
>> 1829: 
>> 1830: bool VectorReinterpretNode::implemented(int opc, uint vlen, BasicType src_type, BasicType dst_type) {
>> 1831:   if ((src_type == T_FLOAT && dst_type == T_INT) ||
> 
> Just a suggestion, do you feel a `switch-case` could be more readable/clear in this case? Something like this -
> 
> 
> bool VectorReinterpretNode::implemented(uint vlen, BasicType src_type, BasicType dst_type) {
>   switch (src_type) {
>     case T_FLOAT:
>       if (dst_type != T_INT)    return false;
>       break;
>     case T_INT:
>       if (dst_type != T_FLOAT)  return false;
>       break;
>     case T_DOUBLE:
>       if (dst_type != T_LONG)   return false;
>       break;
>     case T_LONG:
>       if (dst_type != T_DOUBLE) return false;
>       break;
>     default:
>       return false;
>   }
>   return Matcher::match_rule_supported_auto_vectorization(Op_VectorReinterpret, vlen, dst_type);
> }

Both options look just fine to me, but I'm happy to re-write it like that if others also feel the same way.

> test/micro/org/openjdk/bench/java/lang/VectorBitConversion.java line 67:
> 
>> 65: 
>> 66:     @Benchmark
>> 67:     public long[] doubleToLongBits() {
> 
> Would something like this be more concise (and maybe more readable as well) - 
> 
> @Benchmark
> public long[] doubleToLongBits() {
>     for (int i = 0; i < doubles.length; i++) {
>         resultLongs[i] = Double.doubleToLongBits(doubles[i]);
>     }
>     return resultLongs;
> }
> 
> 
> The loop should still get vectorized (if vectorizable).
> 
> Same for other benchmarks.

Maybe but there's a reason why I wrote these benchmark methods this way. Keeping each line doing one thing makes it easier to map each line to the assembly (e.g. `perfasm`) and related IR nodes (e.g. `PrintIdeal`). That IMO is more important than the conciseness of the benchmark. What do others think?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26457#discussion_r2251893141
PR Review Comment: https://git.openjdk.org/jdk/pull/26457#discussion_r2251889884