RFR: 8342095: Add autovectorizer support for subword vector casts [v3]
Jasmine Karthikeyan
jkarthikeyan at openjdk.org
Fri May 2 05:22:47 UTC 2025
On Fri, 21 Mar 2025 09:43:14 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
>> @eme64 I think it should be good for another look over! I've addressed your review comments in the last commit.
>>
>> About the potential for performance degradation, I think it would be unlikely since the code generated by the cast is quite small (as it only needs to truncate or sign-extend) and the patch increases the amount of possible code that can auto-vectorize. The one case that I can think of is that it might cause code that would be otherwise unprofitable to become vectorizable, but that would be because we don't have a cost model yet.
>
> @jaskarth Let me know if there is anything we can help you with here :)
@eme64 Thank you for the comments! I've updated the test and benchmark to be more exhaustive, and applied the suggested changes. For the benchmark, I got these results on my machine:
```
Benchmark                  (SIZE)  Mode  Cnt  Baseline (ns/op)   Patch (ns/op)     Improvement
VectorSubword.byteToChar     1024  avgt   12  252.954 ±  4.129    24.219 ± 0.453   10.4x
VectorSubword.byteToInt      1024  avgt   12  194.707 ±  3.584    38.353 ± 0.637    5.07x
VectorSubword.byteToLong     1024  avgt   12   73.645 ±  1.418    70.521 ± 0.470   no change
VectorSubword.byteToShort    1024  avgt   12  252.647 ±  3.738    22.664 ± 0.449   11.1x
VectorSubword.charToByte     1024  avgt   12  236.396 ±  3.893   228.710 ± 1.967   no change
VectorSubword.charToInt      1024  avgt   12  179.673 ±  2.811   173.764 ± 1.150   no change
VectorSubword.charToLong     1024  avgt   12  184.867 ±  3.079   177.999 ± 1.312   no change
VectorSubword.charToShort    1024  avgt   12   24.385 ±  1.822    22.375 ± 1.980   no change
VectorSubword.intToByte      1024  avgt   12  190.949 ±  1.475    49.376 ± 1.383    3.86x
VectorSubword.intToChar      1024  avgt   12  182.862 ±  3.708    44.344 ± 4.513    4.12x
VectorSubword.intToLong      1024  avgt   12   76.072 ±  1.153    73.382 ± 0.294   no change
VectorSubword.intToShort     1024  avgt   12  184.362 ±  1.938    45.556 ± 3.323    4.04x
VectorSubword.longToByte     1024  avgt   12  150.766 ±  3.475   146.651 ± 0.742   no change
VectorSubword.longToChar     1024  avgt   12  121.764 ±  1.323   117.068 ± 1.891   no change
VectorSubword.longToInt      1024  avgt   12   83.761 ±  2.140    82.084 ± 0.930   no change
VectorSubword.longToShort    1024  avgt   12  132.293 ± 23.046   115.883 ± 0.834   +12.4%
VectorSubword.shortToByte    1024  avgt   12  253.387 ±  5.972    27.591 ± 1.311    9.18x
VectorSubword.shortToChar    1024  avgt   12   21.446 ±  1.914    20.608 ± 1.593   no change
VectorSubword.shortToInt     1024  avgt   12  187.109 ±  3.372    36.818 ± 0.989    5.08x
VectorSubword.shortToLong    1024  avgt   12   75.448 ±  0.930    72.835 ± 0.507   no change
```
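For reference, the kernels being measured are simple cast loops along these lines (a minimal sketch of my own; the actual `VectorSubword` JMH benchmark in the PR differs in detail). With this patch, SuperWord can vectorize the subword cast in the loop body:

```java
// Illustrative kernels, not the exact benchmark code from the PR.
public class SubwordCastKernels {
    // byte -> short: the implicit cast is a per-element sign-extension.
    static void byteToShort(byte[] src, short[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i];
        }
    }

    // short -> byte: the explicit cast is a per-element truncation.
    static void shortToByte(short[] src, byte[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = (byte) src[i];
        }
    }
}
```

The cast itself is just a sign-extension (widening) or truncation (narrowing), which is why the per-element cost of the generated code is small.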
Interestingly, even though the `longToType` methods are now vectorizable, the performance difference is very small. I suspect this is because my AVX2 machine can only process 4 long elements per vector iteration, while the overhead of the conversion is fairly high. It's also interesting that `longToInt` is faster than the other `longToType` methods; I'm curious whether something can be done on the backend to improve the speed of the rest. There could also be potential speedups on platforms with wider vectors, like AVX512.
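For concreteness, the lane-count arithmetic behind that guess (a back-of-the-envelope sketch, not anything from the patch itself):

```java
// The element size of the widest type in the loop bounds how many
// elements each vector iteration can process.
public class LaneCount {
    static int lanesPerVector(int vectorBits, int elementBytes) {
        return vectorBits / (8 * elementBytes);
    }

    public static void main(String[] args) {
        System.out.println(lanesPerVector(256, 8)); // long on AVX2:    4 lanes
        System.out.println(lanesPerVector(256, 4)); // int  on AVX2:    8 lanes
        System.out.println(lanesPerVector(256, 1)); // byte on AVX2:   32 lanes
        System.out.println(lanesPerVector(512, 8)); // long on AVX512:  8 lanes
    }
}
```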
-------------
PR Comment: https://git.openjdk.org/jdk/pull/23413#issuecomment-2846335293