RFR: 8342095: Add autovectorizer support for subword vector casts [v11]
Emanuel Peter
epeter at openjdk.org
Fri May 2 08:59:54 UTC 2025
On Fri, 2 May 2025 04:44:37 GMT, Jasmine Karthikeyan <jkarthikeyan at openjdk.org> wrote:
>> Hi all,
>> This patch adds initial support for the autovectorizer to generate conversions between subword types. Currently, when superword sees two packs that have different basic types, it discards them and bails out of vectorization. This patch changes the behavior to ask the backend if a cast between the conflicting types is supported, and keeps the pack if it is. Later, when the `VTransform` graph is built, a synthetic cast is emitted when packs requiring casts are detected. Currently, only narrowing casts are supported as I wanted to re-use existing `VectorCastX2Y` logic for the initial version, but adding more conversions is simple and can be done with a subsequent RFE. I have attached a JMH benchmark and got these results on my Zen 3 machine:
>>
>>
>>                                                   Baseline                  Patch
>> Benchmark                  (SIZE)  Mode  Cnt    Score    Error  Units     Score   Error  Units  Improvement
>> VectorSubword.intToByte      1024  avgt   12  200.049 ± 19.787  ns/op    56.228 ± 3.535  ns/op  (3.56x)
>> VectorSubword.intToShort     1024  avgt   12  179.826 ±  1.539  ns/op    43.332 ± 1.166  ns/op  (4.15x)
>> VectorSubword.shortToByte    1024  avgt   12  245.580 ±  6.150  ns/op    29.757 ± 1.055  ns/op  (8.25x)
>>
>>
>> I've also added some IR tests and they pass on my linux x64 machine. Thoughts and reviews would be appreciated!
>
> Jasmine Karthikeyan has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 12 commits:
>
> - Address more comments, make test and benchmark more exhaustive
> - Merge from master
> - Fix copyright after merge
> - Fix copyright
> - Merge
> - Implement patch with VectorCastNode::implemented
> - Merge branch 'master' into vectorize-subword
> - Address comments from review, refactor test
> - Add new conversions to benchmark
> - Fix some tests that now vectorize
> - ... and 2 more: https://git.openjdk.org/jdk/compare/bd7c7789...8c00ef84
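To make the quoted description a bit more concrete: the benchmark entries above suggest kernels of roughly the following shape. This is only my sketch (class and method names are made up to mirror the benchmark entries; the actual JMH benchmark attached to the PR may differ) of the narrowing-cast loops that previously made superword drop the mismatched packs:

    public class SubwordCastKernels {
        static final int SIZE = 1024;
        static int[]   ints   = new int[SIZE];
        static short[] shorts = new short[SIZE];
        static byte[]  out    = new byte[SIZE];

        // int -> byte: the byte store pack needs a narrowing vector cast
        // (the VectorCastX2Y logic mentioned in the description).
        static void intToByte() {
            for (int i = 0; i < SIZE; i++) {
                out[i] = (byte) ints[i];
            }
        }

        // short -> byte: same idea, one subword type to another.
        static void shortToByte() {
            for (int i = 0; i < SIZE; i++) {
                out[i] = (byte) shorts[i];
            }
        }

        public static void main(String[] args) {
            for (int i = 0; i < 10_000; i++) {
                intToByte();
                shortToByte();
            }
        }
    }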
It seems the only difference is the level of unrolling: 8x vs 32x. But no vectorization either way.
public class Test {
    public static int SIZE = 1024;
    public static byte[] bytes = new byte[SIZE];
    public static char[] chars = new char[SIZE];

    public static void main(String[] args) {
        for (int i = 0; i < 10_000; i++) {
            test();
        }
    }

    public static void test() {
        for (int i = 0; i < SIZE; i++) {
            bytes[i] = (byte)chars[i]; // char load (LoadUS) narrowed into a byte store (StoreB)
        }
    }
}
`./java -XX:CompileCommand=compileonly,Test::test -XX:CompileCommand=printcompilation,Test::test -XX:+TraceLoopOpts -XX:-UseSuperWord Test.java`
And then it seems that the 32x unrolling leads to some interesting register usage. I think the issue is that all the loads are done first, and since we don't have enough general-purpose registers we start spilling to `xmm` registers, and later move the values back to general-purpose registers. That creates a very long loop body, and that is not very efficient 😬
And we somehow still don't allow vectorization of `LoadUS -> StoreB`.
@jaskarth Do you know why?
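In case a regression probe is useful here: a minimal IR-framework sketch (mine, not one of the tests from the PR; it assumes the jtreg compiler.lib.ir_framework harness and its IRNode.STORE_VECTOR constraint) that currently fails for the LoadUS -> StoreB loop above and should start passing once that pack is kept:

    import compiler.lib.ir_framework.*;

    public class TestCharToByteVectorization {
        static final int SIZE = 1024;
        static byte[] bytes = new byte[SIZE];
        static char[] chars = new char[SIZE];

        public static void main(String[] args) {
            TestFramework.run();
        }

        @Test
        // Expect at least one vectorized store once the char -> byte pack is accepted.
        @IR(counts = {IRNode.STORE_VECTOR, "> 0"})
        public static void test() {
            for (int i = 0; i < SIZE; i++) {
                bytes[i] = (byte) chars[i]; // LoadUS narrowed and stored as StoreB
            }
        }
    }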
-------------
PR Comment: https://git.openjdk.org/jdk/pull/23413#issuecomment-2846708541