RFR: 8373026: C2 SuperWord and Vector API: vector algorithms test and benchmark [v2]
Emanuel Peter
epeter at openjdk.org
Wed Jan 14 07:57:47 UTC 2026
> This is exploratory work. I wanted to use auto vectorization and the Vector API to implement some SIMD algorithms. We don't have many IR tests and benchmarks for such algorithms, so I'm proposing an initial set of them, to be extended in the future.
>
> Note: for now they are all `int` based. And some of them may not use the Vector API optimally, so feel free to propose ideas and integrate them in a follow-up RFE ;)
>
> **Discussion**
>
> Observations:
> - If the loop can be auto vectorized, that is the fastest. If we cannot vectorize, we at least get reasonable scalar performance.
> - If the Vector API code can be fully intrinsified, we get fast code. But sometimes, the Vector API is horribly slow, much slower than scalar loop performance.
>   - `linux_aarch64_server`: `filterI`, `scanAddI`, `reduceAddIFieldsX4` are very slow
>   - `macosx_aarch64`: `filterI`, `scanAddI`, `reduceAddIFieldsX4`, `findMinIndex` are very slow
>   - `linux_x64_oci_server`: Vector API leads to really nice speedups
>   - `windows_x64_oci_server`: the only one that gets good/better performance on all benchmarks
>   - `macosx_x64_sandybridge`: `scanAddI`!, `reduceAddIFieldsX4` are very slow. Other benchmarks benefit.
> - Compact Object Headers has some negative effect on some loop benchmarks.
>   - `linux_aarch64_server`: `reduceAddI`, `copyI`
>   - `macosx_aarch64`: `mapI`, `reduceAddI`, `copyI`
>   - `linux_x64_oci_server`: `reduceAddI`, `copyI`, `findI`?
>   - `windows_x64_oci_server`: `reduceAddI` and some others a little bit
>   - `macosx_x64_sandybridge`: `fillI`, `iotaI`, `mapI`, `reduceAddI`, `copyI`
> - Intrinsics can be much faster than auto vectorized or Vector API code.
>   - `linux_aarch64_server`: `copyI`
>   - `macosx_x64_sandybridge`: actually, `Arrays.fill` seems to suffer with Compact Object Headers as well.
> - `rearrange` often needs to do the `mask load` and `and` operation inside the loop. That has a slight performance impact, I filed [JDK-8373240](https://bugs.openjdk.org/browse/JDK-8373240).
>
> **Benchmark Plots**
>
> Units: nanoseconds per algorithm invocation.
>
> `linux_x64_oci`
> <img width="4500" height="6000" alt="algo_linux_x64_oci_server" src="https://github.com/user-attachments/assets/f2c5bbcb-e009-4c54-a1bf-91af45326cb9" />
>
> `windows_x64_oci`
> <img width="4500" height="6000" alt="algo_windows_x64_oci_server" src="https://github.com/user-attachments/assets/8946d248-4d75-4b16-8f17-627a90dcb6c3" />
>
> `macosx_x64_sandybridge`
> <img width="4500" height="6000" alt="algo_macosx_x64_sandybridge" src="https://github.com/user...
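
For readers unfamiliar with the kernel names in the Discussion above, here is a minimal scalar sketch of three of them. The method names `reduceAddI`, `scanAddI`, and `filterI` come from the benchmark list; the bodies, the class name `ScalarKernels`, and the `threshold` parameter are assumptions for illustration and are not taken from the pull request. The sketch only shows the loop shapes: a plain reduction, a loop whose every output depends on the previous one, and a loop with a data-dependent store index. The last two shapes are plausibly harder to vectorize, which may relate to the slow `scanAddI` and `filterI` results reported above, but that is a guess, not a finding from the PR.

```java
import java.util.Arrays;

final class ScalarKernels {
    // Reduction: sum of all elements. The only loop-carried state is the
    // accumulator, and integer addition can be reordered freely.
    static int reduceAddI(int[] a) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    // Inclusive prefix sum: r[i] = a[0] + ... + a[i].
    // Every stored value depends on the previous iteration's accumulator.
    static int[] scanAddI(int[] a) {
        int[] r = new int[a.length];
        int acc = 0;
        for (int i = 0; i < a.length; i++) {
            acc += a[i];
            r[i] = acc;
        }
        return r;
    }

    // Filter (stream compaction): keep elements >= threshold, in order.
    // The store index j depends on the data seen so far.
    // The predicate ">= threshold" is a hypothetical choice for this sketch.
    static int[] filterI(int[] a, int threshold) {
        int[] r = new int[a.length];
        int j = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] >= threshold) {
                r[j++] = a[i];
            }
        }
        return Arrays.copyOf(r, j);
    }

    public static void main(String[] args) {
        int[] a = {3, -1, 4, 1, -5, 9};
        System.out.println(reduceAddI(a));                  // 11
        System.out.println(Arrays.toString(scanAddI(a)));   // [3, 2, 6, 7, 2, 11]
        System.out.println(Arrays.toString(filterI(a, 1))); // [3, 4, 1, 9]
    }
}
```

Scalar versions like these can also serve as a reference when checking the output of a vectorized implementation on the same input.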
Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 30 additional commits since the last revision:
- Merge branch 'master' into JDK-8373026-vector-algorithms
- Merge branch 'master' into JDK-8373026-vector-algorithms
- another IR rule fix
- more small fixes and comments
- more IR rules
- wip more IR rules
- improve IR rules
- gather benchmark
- gather test
- filterI
- ... and 20 more: https://git.openjdk.org/jdk/compare/d9a39a6f...c057462b
-------------
Changes:
- all: https://git.openjdk.org/jdk/pull/28639/files
- new: https://git.openjdk.org/jdk/pull/28639/files/40c51e8f..c057462b
Webrevs:
- full: https://webrevs.openjdk.org/?repo=jdk&pr=28639&range=01
- incr: https://webrevs.openjdk.org/?repo=jdk&pr=28639&range=00-01
Stats: 67519 lines in 3202 files changed: 34079 ins; 12249 del; 21191 mod
Patch: https://git.openjdk.org/jdk/pull/28639.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/28639/head:pull/28639
PR: https://git.openjdk.org/jdk/pull/28639