RFR: 8342095: Add autovectorizer support for subword vector casts [v3]
Emanuel Peter
epeter at openjdk.org
Fri May 2 06:53:46 UTC 2025
On Fri, 2 May 2025 05:19:41 GMT, Jasmine Karthikeyan <jkarthikeyan at openjdk.org> wrote:
>> @jaskarth Let me know if there is anything we can help you with here :)
>
> @eme64 Thank you for the comments! I've updated the test and benchmark to be more exhaustive, and applied the suggested changes. For the benchmark, I got these results on my machine:
>
> Benchmark                  (SIZE)  Mode  Cnt  Baseline (Score ± Error)    Patch (Score ± Error)    Improvement
> VectorSubword.byteToChar     1024  avgt   12  252.954 ±  4.129 ns/op      24.219 ± 0.453 ns/op     (10.4x)
> VectorSubword.byteToInt      1024  avgt   12  194.707 ±  3.584 ns/op      38.353 ± 0.637 ns/op     (5.07x)
> VectorSubword.byteToLong     1024  avgt   12   73.645 ±  1.418 ns/op      70.521 ± 0.470 ns/op     (no change)
> VectorSubword.byteToShort    1024  avgt   12  252.647 ±  3.738 ns/op      22.664 ± 0.449 ns/op     (11.1x)
> VectorSubword.charToByte     1024  avgt   12  236.396 ±  3.893 ns/op     228.710 ± 1.967 ns/op     (no change)
> VectorSubword.charToInt      1024  avgt   12  179.673 ±  2.811 ns/op     173.764 ± 1.150 ns/op     (no change)
> VectorSubword.charToLong     1024  avgt   12  184.867 ±  3.079 ns/op     177.999 ± 1.312 ns/op     (no change)
> VectorSubword.charToShort    1024  avgt   12   24.385 ±  1.822 ns/op      22.375 ± 1.980 ns/op     (no change)
> VectorSubword.intToByte      1024  avgt   12  190.949 ±  1.475 ns/op      49.376 ± 1.383 ns/op     (3.86x)
> VectorSubword.intToChar      1024  avgt   12  182.862 ±  3.708 ns/op      44.344 ± 4.513 ns/op     (4.12x)
> VectorSubword.intToLong      1024  avgt   12   76.072 ±  1.153 ns/op      73.382 ± 0.294 ns/op     (no change)
> VectorSubword.intToShort     1024  avgt   12  184.362 ±  1.938 ns/op      45.556 ± 3.323 ns/op     (4.04x)
> VectorSubword.longToByte     1024  avgt   12  150.766 ±  3.475 ns/op     146.651 ± 0.742 ns/op     (no change)
> VectorSubword.longToChar     1024  avgt   12  121.764 ±  1.323 ns/op     117.068 ± 1.891 ns/op     (no change)
> VectorSubword.longToInt      1024  avgt   12   83.761 ±  2.140 ns/op      82.084 ± 0.930 ns/op     (no change)
> VectorSubword.longToShort    1024  avgt   12  132.293 ± 23.046 ns/op     115.883 ± 0.834 ns/op     (+12.4%)
> VectorSubword.shortToByte    1024  avgt   12  253.387 ±  5.972 ns/op      27.591 ± 1.311 ns/op     (9.18x)
> VectorSubword.shortToChar    1024  avgt   12   21.446 ±  1.914 ns/op      20.608 ± 1.593 ns/op     (no change)
> VectorSubword.shortToInt     1024  avgt   12  187.109 ±  3.372 ns/op      36.818 ± 0.989 ns/op     (5.08x)
> VectorSubword.shortToLong    1024  avgt   12   75.448 ±  0.930 ns/op      72.835 ± 0.507 ns/op     (no change)
>
> Interestingly, eve...
@jaskarth This looks really solid now.
Another thought I had:
Generally, vectorization is faster because we use fewer instructions. Example: if you had 8 loads, 4 adds and 4 stores, with 4-lane vectors you now have 2 loads, 1 add, and 1 store. The cost per operation is roughly the same, but there are fewer operations, so that makes it faster. It is of course a little more complicated in reality, but this is still a good heuristic.
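For illustration, the kind of scalar loop that SuperWord packs into vector operations looks like this (my own sketch, not code from the PR; class and method names are made up):

```java
import java.util.Arrays;

// Sketch (not from the PR): each iteration of this scalar loop does
// 2 loads, 1 add, and 1 store. With 4-lane vectors, 4 iterations
// collapse into 2 vector loads, 1 vector add, and 1 vector store.
public class ScalarAddLoop {
    static void add(int[] a, int[] b, int[] c) {
        for (int i = 0; i < a.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = {10, 20, 30, 40};
        int[] c = new int[4];
        add(a, b, c);
        System.out.println(Arrays.toString(c)); // [11, 22, 33, 44]
    }
}
```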
But as soon as vectorization requires additional instructions, such as reductions (with shuffles inside) or your subword conversions now, that is additional cost.
Reductions are not always profitable with vectorization; sometimes the shuffles make the vector loop more expensive than the scalar loop. I wonder if there could be a similar edge case with these subword conversions, which might actually lead to a small regression. I'm not saying this should be a blocker here, but I'm interested in it for my future work on the cost model, since we might have some interesting cases here that I'll want to evaluate:
https://bugs.openjdk.org/browse/JDK-8340093
If you can find any such regression case with subword casts, then it would be great if we could keep track of it, so I can try to address it with the cost model later :)
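For reference, the loops in question have this shape (my own illustration of a widening subword cast in a loop; the class and method names are assumptions, not the actual benchmark code):

```java
import java.util.Arrays;

// Sketch of a subword-cast loop of the kind this PR lets C2 vectorize
// (illustration only, not the benchmark's actual code). The implicit
// widening byte -> short conversion previously blocked SuperWord.
public class SubwordCastLoop {
    static void byteToShort(byte[] src, short[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i]; // widening cast; now packable into vector casts
        }
    }

    public static void main(String[] args) {
        byte[] src = {-1, 0, 1, 127};
        short[] dst = new short[src.length];
        byteToShort(src, dst);
        System.out.println(Arrays.toString(dst)); // [-1, 0, 1, 127]
    }
}
```

The vector form needs extra widening/narrowing instructions on top of the loads and stores, which is exactly where a cost-model edge case could hide.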
-------------
PR Comment: https://git.openjdk.org/jdk/pull/23413#issuecomment-2846499207
More information about the hotspot-compiler-dev mailing list