RFR: 8342095: Add autovectorizer support for subword vector casts [v3]
Emanuel Peter
epeter at openjdk.org
Fri May 2 06:53:46 UTC 2025
On Fri, 2 May 2025 05:19:41 GMT, Jasmine Karthikeyan <jkarthikeyan at openjdk.org> wrote:
>> @jaskarth Let me know if there is anything we can help you with here :)
>
> @eme64 Thank you for the comments! I've updated the test and benchmark to be more exhaustive, and applied the suggested changes. For the benchmark, I got these results on my machine:
>
> Benchmark                  (SIZE)  Mode  Cnt  Baseline (Score ± Error)    Patch (Score ± Error)    Improvement
> VectorSubword.byteToChar     1024  avgt   12  252.954 ±  4.129 ns/op      24.219 ± 0.453 ns/op     (10.4x)
> VectorSubword.byteToInt      1024  avgt   12  194.707 ±  3.584 ns/op      38.353 ± 0.637 ns/op     (5.07x)
> VectorSubword.byteToLong     1024  avgt   12   73.645 ±  1.418 ns/op      70.521 ± 0.470 ns/op     (no change)
> VectorSubword.byteToShort    1024  avgt   12  252.647 ±  3.738 ns/op      22.664 ± 0.449 ns/op     (11.1x)
> VectorSubword.charToByte     1024  avgt   12  236.396 ±  3.893 ns/op     228.710 ± 1.967 ns/op     (no change)
> VectorSubword.charToInt      1024  avgt   12  179.673 ±  2.811 ns/op     173.764 ± 1.150 ns/op     (no change)
> VectorSubword.charToLong     1024  avgt   12  184.867 ±  3.079 ns/op     177.999 ± 1.312 ns/op     (no change)
> VectorSubword.charToShort    1024  avgt   12   24.385 ±  1.822 ns/op      22.375 ± 1.980 ns/op     (no change)
> VectorSubword.intToByte      1024  avgt   12  190.949 ±  1.475 ns/op      49.376 ± 1.383 ns/op     (3.86x)
> VectorSubword.intToChar      1024  avgt   12  182.862 ±  3.708 ns/op      44.344 ± 4.513 ns/op     (4.12x)
> VectorSubword.intToLong      1024  avgt   12   76.072 ±  1.153 ns/op      73.382 ± 0.294 ns/op     (no change)
> VectorSubword.intToShort     1024  avgt   12  184.362 ±  1.938 ns/op      45.556 ± 3.323 ns/op     (4.04x)
> VectorSubword.longToByte     1024  avgt   12  150.766 ±  3.475 ns/op     146.651 ± 0.742 ns/op     (no change)
> VectorSubword.longToChar     1024  avgt   12  121.764 ±  1.323 ns/op     117.068 ± 1.891 ns/op     (no change)
> VectorSubword.longToInt      1024  avgt   12   83.761 ±  2.140 ns/op      82.084 ± 0.930 ns/op     (no change)
> VectorSubword.longToShort    1024  avgt   12  132.293 ± 23.046 ns/op     115.883 ± 0.834 ns/op     (+12.4%)
> VectorSubword.shortToByte    1024  avgt   12  253.387 ±  5.972 ns/op      27.591 ± 1.311 ns/op     (9.18x)
> VectorSubword.shortToChar    1024  avgt   12   21.446 ±  1.914 ns/op      20.608 ± 1.593 ns/op     (no change)
> VectorSubword.shortToInt     1024  avgt   12  187.109 ±  3.372 ns/op      36.818 ± 0.989 ns/op     (5.08x)
> VectorSubword.shortToLong    1024  avgt   12   75.448 ±  0.930 ns/op      72.835 ± 0.507 ns/op     (no change)
>
> Interestingly, eve...
@jaskarth This looks really solid now.
Another thought I had:
Generally, vectorization is faster because we use fewer instructions. Example: if you had 8 loads, 4 adds and 4 stores, with 4-lane vectors you now have 2 loads, 1 add, and 1 store. The cost per operation is roughly the same, but there are fewer operations, so that makes it faster. It is of course a little more complicated in reality, but this is still a good heuristic.
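For illustration, the kind of scalar loop that SuperWord packs into vector operations looks like this (my own sketch, not code from the PR; class and method names are made up):

```java
import java.util.Arrays;

// Sketch (not from the PR): each iteration of this scalar loop does
// 2 loads, 1 add, and 1 store. With 4-lane vectors, 4 iterations
// collapse into 2 vector loads, 1 vector add, and 1 vector store.
public class ScalarAddLoop {
    static void add(int[] a, int[] b, int[] c) {
        for (int i = 0; i < a.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        int[] b = {10, 20, 30, 40};
        int[] c = new int[4];
        add(a, b, c);
        System.out.println(Arrays.toString(c)); // [11, 22, 33, 44]
    }
}
```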
But as soon as vectorization requires additional instructions, such as reductions (with shuffles inside) or your subword conversions now, that is additional cost.
Reductions are not always profitable with vectorization; sometimes the shuffles make the vector loop more expensive than the scalar loop. I wonder if there could be a similar edge case with these subword conversions, which might actually lead to a small regression. I'm not saying this should be a blocker here, but I'm interested in it for my future work on the cost model, since we might have some interesting cases here that I'll want to evaluate:
https://bugs.openjdk.org/browse/JDK-8340093
If you can find any such regression case with subword casts, then it would be great if we could keep track of it, so I can try to address it with the cost model later :)
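For reference, the loops in question have this shape (my own illustration of a widening subword cast in a loop; the class and method names are assumptions, not the actual benchmark code):

```java
import java.util.Arrays;

// Sketch of a subword-cast loop of the kind this PR lets C2 vectorize
// (illustration only, not the benchmark's actual code). The implicit
// widening byte -> short conversion previously blocked SuperWord.
public class SubwordCastLoop {
    static void byteToShort(byte[] src, short[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i]; // widening cast; now packable into vector casts
        }
    }

    public static void main(String[] args) {
        byte[] src = {-1, 0, 1, 127};
        short[] dst = new short[src.length];
        byteToShort(src, dst);
        System.out.println(Arrays.toString(dst)); // [-1, 0, 1, 127]
    }
}
```

The vector form needs extra widening/narrowing instructions on top of the loads and stores, which is exactly where a cost-model edge case could hide.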
-------------
PR Comment: https://git.openjdk.org/jdk/pull/23413#issuecomment-2846499207
More information about the hotspot-compiler-dev mailing list