RFR: 8342095: Add autovectorizer support for subword vector casts [v3]
Jasmine Karthikeyan
jkarthikeyan at openjdk.org
Fri May 2 05:22:47 UTC 2025
On Fri, 21 Mar 2025 09:43:14 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
>> @eme64 I think it should be good for another look over! I've addressed your review comments in the last commit.
>>
>> About the potential for performance degradation, I think it would be unlikely since the code generated by the cast is quite small (as it only needs to truncate or sign-extend) and the patch increases the amount of possible code that can auto-vectorize. The one case that I can think of is that it might cause code that would be otherwise unprofitable to become vectorizable, but that would be because we don't have a cost model yet.
>
> @jaskarth Let me know if there is anything we can help you with here :)
@eme64 Thank you for the comments! I've updated the test and benchmark to be more exhaustive, and applied the suggested changes. For the benchmark, I got these results on my machine:
```
Benchmark                  (SIZE)  Mode  Cnt  Baseline (ns/op)   Patch (ns/op)     Improvement
VectorSubword.byteToChar     1024  avgt   12  252.954 ±  4.129    24.219 ± 0.453   10.4x
VectorSubword.byteToInt      1024  avgt   12  194.707 ±  3.584    38.353 ± 0.637    5.07x
VectorSubword.byteToLong     1024  avgt   12   73.645 ±  1.418    70.521 ± 0.470   no change
VectorSubword.byteToShort    1024  avgt   12  252.647 ±  3.738    22.664 ± 0.449   11.1x
VectorSubword.charToByte     1024  avgt   12  236.396 ±  3.893   228.710 ± 1.967   no change
VectorSubword.charToInt      1024  avgt   12  179.673 ±  2.811   173.764 ± 1.150   no change
VectorSubword.charToLong     1024  avgt   12  184.867 ±  3.079   177.999 ± 1.312   no change
VectorSubword.charToShort    1024  avgt   12   24.385 ±  1.822    22.375 ± 1.980   no change
VectorSubword.intToByte      1024  avgt   12  190.949 ±  1.475    49.376 ± 1.383    3.86x
VectorSubword.intToChar      1024  avgt   12  182.862 ±  3.708    44.344 ± 4.513    4.12x
VectorSubword.intToLong      1024  avgt   12   76.072 ±  1.153    73.382 ± 0.294   no change
VectorSubword.intToShort     1024  avgt   12  184.362 ±  1.938    45.556 ± 3.323    4.04x
VectorSubword.longToByte     1024  avgt   12  150.766 ±  3.475   146.651 ± 0.742   no change
VectorSubword.longToChar     1024  avgt   12  121.764 ±  1.323   117.068 ± 1.891   no change
VectorSubword.longToInt      1024  avgt   12   83.761 ±  2.140    82.084 ± 0.930   no change
VectorSubword.longToShort    1024  avgt   12  132.293 ± 23.046   115.883 ± 0.834   +12.4%
VectorSubword.shortToByte    1024  avgt   12  253.387 ±  5.972    27.591 ± 1.311    9.18x
VectorSubword.shortToChar    1024  avgt   12   21.446 ±  1.914    20.608 ± 1.593   no change
VectorSubword.shortToInt     1024  avgt   12  187.109 ±  3.372    36.818 ± 0.989    5.08x
VectorSubword.shortToLong    1024  avgt   12   75.448 ±  0.930    72.835 ± 0.507   no change
```
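For reference, the kernels being measured are simple cast loops along these lines (a minimal sketch of my own; the actual `VectorSubword` JMH benchmark in the PR differs in detail). With this patch, SuperWord can vectorize the subword cast in the loop body:

```java
// Illustrative kernels, not the exact benchmark code from the PR.
public class SubwordCastKernels {
    // byte -> short: the implicit cast is a per-element sign-extension.
    static void byteToShort(byte[] src, short[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i];
        }
    }

    // short -> byte: the explicit cast is a per-element truncation.
    static void shortToByte(short[] src, byte[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = (byte) src[i];
        }
    }
}
```

The cast itself is just a sign-extension (widening) or truncation (narrowing), which is why the per-element cost of the generated code is small.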
Interestingly, even though the `longToType` methods are now vectorizable, the performance difference is very small. I suspect this is because my AVX2 machine can only process 4 long elements per vector iteration, while the overhead of the conversion is fairly high. It's also interesting that `longToInt` is faster than the other `longToType` methods; I'm curious whether something can be done on the backend to improve the speed of the rest. There could also be potential speedups on platforms with wider vectors, like AVX512.
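For concreteness, the lane-count arithmetic behind that guess (a back-of-the-envelope sketch, not anything from the patch itself):

```java
// The element size of the widest type in the loop bounds how many
// elements each vector iteration can process.
public class LaneCount {
    static int lanesPerVector(int vectorBits, int elementBytes) {
        return vectorBits / (8 * elementBytes);
    }

    public static void main(String[] args) {
        System.out.println(lanesPerVector(256, 8)); // long on AVX2:    4 lanes
        System.out.println(lanesPerVector(256, 4)); // int  on AVX2:    8 lanes
        System.out.println(lanesPerVector(256, 1)); // byte on AVX2:   32 lanes
        System.out.println(lanesPerVector(512, 8)); // long on AVX512:  8 lanes
    }
}
```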
-------------
PR Comment: https://git.openjdk.org/jdk/pull/23413#issuecomment-2846335293