RFR: 8342095: Add autovectorizer support for subword vector casts [v12]
Jasmine Karthikeyan
jkarthikeyan at openjdk.org
Sat May 3 17:32:49 UTC 2025
On Sat, 3 May 2025 17:13:32 GMT, Jasmine Karthikeyan <jkarthikeyan at openjdk.org> wrote:
>> Hi all,
>> This patch adds initial support for the autovectorizer to generate conversions between subword types. Currently, when superword sees two packs that have different basic types, it discards them and bails out of vectorization. This patch changes the behavior to ask the backend if a cast between the conflicting types is supported, and keeps the pack if it is. Later, when the `VTransform` graph is built, a synthetic cast is emitted when packs requiring casts are detected. Currently, only narrowing casts are supported as I wanted to re-use existing `VectorCastX2Y` logic for the initial version, but adding more conversions is simple and can be done with a subsequent RFE. I have attached a JMH benchmark and got these results on my Zen 3 machine:
>>
>>
>>                                               Baseline                 Patch
>> Benchmark                  (SIZE)  Mode  Cnt    Score    Error  Units   Score   Error  Units  Improvement
>> VectorSubword.intToByte      1024  avgt   12  200.049 ± 19.787  ns/op  56.228 ± 3.535  ns/op  (3.56x)
>> VectorSubword.intToShort     1024  avgt   12  179.826 ±  1.539  ns/op  43.332 ± 1.166  ns/op  (4.15x)
>> VectorSubword.shortToByte    1024  avgt   12  245.580 ±  6.150  ns/op  29.757 ± 1.055  ns/op  (8.25x)
>>
>>
>> I've also added some IR tests and they pass on my linux x64 machine. Thoughts and reviews would be appreciated!
>
> Jasmine Karthikeyan has updated the pull request incrementally with one additional commit since the last revision:
>
> Whitespace and benchmark tweak
Thanks a lot for running the benchmark on your AVX512 machine! The results are very interesting. In the char cases, it looks like we over-unroll the loop when SuperWord is enabled even though we don't end up vectorizing it, so fixing that could resolve the slowdown. Since you mentioned the unroll factor was 32x, it might be unrolling to fill a vector (`512 bits / 16 bits per char = 32`).
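As a quick check of that arithmetic, here is a minimal sketch of the lane count for a 512-bit vector. The class and method names (`LaneCountSketch`, `lanes`) are hypothetical and only illustrate the division; they are not part of the patch or of HotSpot.

```java
public class LaneCountSketch {
    // Hypothetical helper: how many elements of a given bit width
    // fit into one vector register of the given bit width.
    static int lanes(int vectorBits, int elementBits) {
        return vectorBits / elementBits;
    }

    public static void main(String[] args) {
        // A Java char is 16 bits (Character.SIZE), so a 512-bit vector holds 32.
        System.out.println(lanes(512, Character.SIZE)); // 32
        System.out.println(lanes(512, Byte.SIZE));      // 64
        System.out.println(lanes(512, Integer.SIZE));   // 16
    }
}
```

A 32x unroll therefore matches exactly one full 512-bit vector of chars, which is consistent with the guess above but does not by itself confirm the cause.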
> Wait, but you seem to say that you want to support `casting to T_CHAR`. But is the issue not casting FROM char?
You are correct; I think that was my mistake. It looks like casting to char is supported because stores to both short and char become `StoreC`, but casting from char isn't supported because we have no `VectorCastC2X` node. I'll update the bug to make it more accurate.
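To make the distinction concrete, the loop shapes in question look roughly like the sketch below. The class and method names are hypothetical and are not taken from the patch or its benchmark, and whether any of these loops actually vectorizes depends on the JVM build and the CPU.

```java
public class SubwordCastSketch {
    // Cast TO char: a narrowing int -> char conversion. The result is written
    // with a 16-bit store, the same store width used for short.
    static void intToChar(int[] src, char[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = (char) src[i];
        }
    }

    // Same shape for short; the low 16 bits are kept and the value is signed.
    static void intToShort(int[] src, short[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = (short) src[i];
        }
    }

    // Cast FROM char: the char is zero-extended to int. This is the direction
    // described above as unsupported, since there is no VectorCastC2X node.
    static void charToInt(char[] src, int[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i];
        }
    }
}
```

The first two methods differ only in the signedness of the stored 16-bit value, while the third requires a zero-extending conversion out of char.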
I've also pushed a small commit to remove some extra whitespace and to make the benchmark run faster.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/23413#issuecomment-2848723503
More information about the hotspot-compiler-dev
mailing list