RFR: 8342095: Add autovectorizer support for subword vector casts [v15]

Fri Jan 23 09:12:27 UTC 2026

On Fri, 23 Jan 2026 06:19:20 GMT, Jasmine Karthikeyan <jkarthikeyan at openjdk.org> wrote:

>> Hi all,
>> This patch adds initial support for the autovectorizer to generate conversions between subword types. Currently, when superword sees two packs that have different basic types, it discards them and bails out of vectorization. This patch changes the behavior to ask the backend if a cast between the conflicting types is supported, and keeps the pack if it is. Later, when the `VTransform` graph is built, a synthetic cast is emitted when packs requiring casts are detected. Currently, only narrowing casts are supported as I wanted to re-use existing `VectorCastX2Y` logic for the initial version, but adding more conversions is simple and can be done with a subsequent RFE. I have attached a JMH benchmark and got these results on my Zen 3 machine:
>> 
>> 
>>                                                   Baseline                    Patch
>> Benchmark                  (SIZE)  Mode  Cnt    Score    Error  Units   Score    Error  Units    Improvement
>> VectorSubword.intToByte      1024  avgt   12  200.049 ± 19.787  ns/op   56.228 ± 3.535  ns/op  (3.56x)
>> VectorSubword.intToShort     1024  avgt   12  179.826 ±  1.539  ns/op   43.332 ± 1.166  ns/op  (4.15x)
>> VectorSubword.shortToByte    1024  avgt   12  245.580 ±  6.150  ns/op   29.757 ± 1.055  ns/op  (8.25x)
>> 
>> 
>> I've also added some IR tests and they pass on my linux x64 machine. Thoughts and reviews would be appreciated!
>
> Jasmine Karthikeyan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 19 commits:
> 
>  - Fix whitespace
>  - Update tests after merge, apply changes from review
>  - Merge from master
>  - Update tests, cleanup logic
>  - Merge branch 'master' into vectorize-subword
>  - Check for AVX2 for byte/long conversions
>  - Whitespace and benchmark tweak
>  - Address more comments, make test and benchmark more exhaustive
>  - Merge from master
>  - Fix copyright after merge
>  - ... and 9 more: https://git.openjdk.org/jdk/compare/de6f35ef...13378368

@jaskarth Wow, I just realized how big the impact of this PR is, by the number of IR rules you were able to adjust. Very exciting!

I left quite a few comments below, but only 3 are about the VM code, so we are not far from the finish line :)

The rest is more about tracking future work. If you don't have the time to file the issues just let me know, and I can file some RFEs for tracking :)

One major improvement for the future would be to track down the cases where we now cast from subword->int, then do int-ops, and cast int->subword. This costs us a factor of 2 (short) or 4 (byte) in vector length, and introduces more ops that we probably don't always need. Optimizing this could be quite a big task, so it is not a high priority, but we should file an issue for it for sure :)

src/hotspot/share/opto/superwordVTransformBuilder.cpp line 264:

> 262:     if (use_bt != def_bt && !p0->is_Convert() && VectorCastNode::is_supported_subword_cast(def_bt, use_bt, pack->size())) {
> 263:       VTransformNode* in = get_vtnode(pack_in->at(0));
> 264:       VTransformNode* cast = new (_vtransform.arena()) VTransformCastVectorNode(_vtransform, pack->size(), def_bt, use_bt);

I just noticed: above, we already handle a cast case, but use `VTransformElementWiseVectorNode`:
https://github.com/openjdk/jdk/pull/23413/files#diff-cd8469676c3f287680696b4dbd87fd02b765f2c9a249bd485c55613b15843435L213-L217

I'm not happy with using `VTransformElementWiseVectorNode` for some casts and `VTransformCastVectorNode` for others. So I see 2 options:
- Use `VTransformCastVectorNode` for both, and refactor the code I linked.
- Somehow try to remove `VTransformCastVectorNode`, and use `VTransformElementWiseVectorNode` here. Do you think that would be possible?

src/hotspot/share/opto/vtransform.cpp line 1313:

> 1311:     }
> 1312: 
> 1313:     if (current_red->in_req(2)->isa_Vector() == nullptr && current_red->in_req(2)->isa_CastVector() == nullptr) {

Having `VTransformCastVectorNode` subtype from `VTransformVectorNode` would make this change unnecessary.

src/hotspot/share/opto/vtransform.hpp line 981:

> 979: };
> 980: 
> 981: class VTransformCastVectorNode : public VTransformNode {

I do wonder if we really need this one, or if we could just use the element-wise operator.

If it's too much work or even impossible: can we at least make it a subtype of `VTransformVectorNode`, analogous to how `VTransformReinterpretVectorNode` does it?

test/hotspot/jtreg/compiler/c2/TestMinMaxSubword.java line 65:

> 63: 
> 64:     @Test
> 65:     @IR(applyIfCPUFeature = { "avx", "true" }, counts = { IRNode.VECTOR_CAST_I2S, IRNode.VECTOR_SIZE_ANY, ">0" })

I think you could get more precise vector size here as well, using `IRNode.VECTOR_SIZE + "min(max_int, max_short)"` as you did in the other test :)
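For reference, here is a small sketch of the arithmetic behind that size expression. It assumes that `max_int` and `max_short` stand for the number of int and short lanes that fit into one vector of a given byte width; that reading is my assumption here, and the class and method names are made up for illustration. It also shows where the factor of 2 or 4 mentioned further up comes from.

```java
public class LaneCount {
    // Number of lanes of the given element size that fit into one vector.
    static int lanes(int vectorBytes, int elementBytes) {
        return vectorBytes / elementBytes;
    }

    // Lane count when a loop mixes int and short elements:
    // the wider type (int) is the limiting one.
    static int minIntShort(int vectorBytes) {
        return Math.min(lanes(vectorBytes, Integer.BYTES),
                        lanes(vectorBytes, Short.BYTES));
    }

    public static void main(String[] args) {
        // Hypothetical vector widths in bytes (e.g. 128, 256 and 512 bit).
        for (int vectorBytes : new int[] {16, 32, 64}) {
            System.out.println(vectorBytes + " byte vector:"
                + " int lanes=" + lanes(vectorBytes, Integer.BYTES)
                + " short lanes=" + lanes(vectorBytes, Short.BYTES)
                + " byte lanes=" + lanes(vectorBytes, Byte.BYTES)
                + " min(int, short)=" + minIntShort(vectorBytes));
        }
    }
}
```

With a rule like that, the test would pin the expected size to the int lane count instead of accepting any size.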

test/hotspot/jtreg/compiler/loopopts/superword/TestReductions.java line 464:

> 462:         applyIf = {"AutoVectorizationOverrideProfitability", "> 0"})
> 463:     @IR(failOn = IRNode.LOAD_VECTOR_B,
> 464:         applyIf = {"AutoVectorizationOverrideProfitability", "= 0"})

Wow, I think I had not noticed this before! This is actually a great win already. Though we could still do better by not casting to int, and rather staying in byte.

I now filed
[JDK-8376176](https://bugs.openjdk.org/browse/JDK-8376176): C2 SuperWord: implement/improve subword reductions

test/hotspot/jtreg/compiler/vectorization/TestRotateByteAndShortVector.java line 122:

> 120:     @IR(counts = { IRNode.LOAD_VECTOR_B, IRNode.VECTOR_SIZE + "min(max_int, max_byte)", "> 0",
> 121:                    IRNode.ROTATE_LEFT_V, "> 0" },
> 122:         applyIfCPUFeature = {"avx512f", "true"})

We could also improve things here, right? Or is there a reason why we need to cast from and to int?

Do you agree that we should file an RFE to track this?

test/hotspot/jtreg/compiler/vectorization/TestSubwordTruncation.java line 77:

> 75:     @Test
> 76:     @IR(counts = { IRNode.LOAD_VECTOR_S, IRNode.VECTOR_SIZE + "min(max_int, max_short)", "> 0" },
> 77:         applyIfCPUFeatureOr = { "avx2", "true", "asimd", "true" })

And how about here? Could we optimize and remove the casts?

test/hotspot/jtreg/compiler/vectorization/runner/ArrayTypeConvertTest.java line 125:

> 123:     @Test
> 124:     @IR(failOn = {IRNode.STORE_VECTOR})
> 125:     // Subword vector casts with char do not work currently, see JDK-8349562.

Ah, you had already filed something about unsigned casts!
I think this is now a possible duplicate of:
[JDK-8375502](https://bugs.openjdk.org/browse/JDK-8375502) C2 SuperWord: implement unsigned casts

But the issues are linked, so just leave the comment as is :)

test/hotspot/jtreg/compiler/vectorization/runner/BasicShortOpTest.java line 216:

> 214: 
> 215:     @Test
> 216:     @IR(applyIfCPUFeature = { "avx", "true" }, counts = { IRNode.VECTOR_CAST_I2S, IRNode.VECTOR_SIZE_ANY, ">0" })

Can we make the size more precise, please? :)

I suspect we might be able to eventually implement this with a short min, rather than an int min?
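To illustrate why I think this is safe at the scalar level, here is a minimal sketch (class and method names are hypothetical, not taken from the test). The min of two shorts is always one of the two inputs, so doing the min in int and narrowing afterwards gives the same value as picking the smaller short directly. This only shows the scalar semantics; it says nothing about what the backend supports.

```java
import java.util.Random;

public class ShortMin {
    // Java source semantics: both operands are widened to int,
    // the min is computed in int, and the result is narrowed back.
    static short minViaInt(short a, short b) {
        int wide = Math.min((int) a, (int) b);
        return (short) wide;
    }

    // Picks one of the two inputs, so no narrowing of a computed value is needed.
    static short minInShort(short a, short b) {
        return a < b ? a : b;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            short a = (short) random.nextInt();
            short b = (short) random.nextInt();
            if (minViaInt(a, b) != minInShort(a, b)) {
                throw new AssertionError("mismatch for " + a + ", " + b);
            }
        }
        System.out.println("int min and short min agree on all sampled pairs");
    }
}
```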

-------------

Changes requested by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/23413#pullrequestreview-3696382185
PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r2720237565
PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r2720256416
PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r2720248998
PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r2720264113
PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r2720294642
PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r2720302974
PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r2720306727
PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r2720320456
PR Review Comment: https://git.openjdk.org/jdk/pull/23413#discussion_r2720326664