RFR: 8373026: C2 SuperWord and Vector API: vector algorithms test and benchmark [v12]

Thu Jan 22 02:26:33 UTC 2026

On Wed, 21 Jan 2026 10:01:08 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> This is an exploratory work. I wanted to use auto vectorization and the Vector API to implement some SIMD algorithms. We don't have too many IR tests and benchmarks, so I'm proposing an initial set of them, to be extended in the future.
>> 
>> Note: for now they are all `int` based. And some of them may not use the Vector API optimally, so feel free to propose ideas and integrate them in a follow-up RFE ;)
>> 
>> **Discussion**
>> 
>> Observations:
>> - If the loop can be auto vectorized, that is the fastest. If we cannot vectorize, we at least get reasonable scalar performance.
>> - If the Vector API code can be fully intrinsified, we get fast code. But sometimes the Vector API is horribly slow, much slower than scalar loop performance.
>>   - `linux_aarch64_server`: `filterI`, `scanAddI`, `reduceAddIFieldsX4` are very slow
>>   - `macosx_aarch64`: `filterI`, `scanAddI`, `reduceAddIFieldsX4`, `findMinIndex` are very slow
>>   - `linux_x64_oci_server`: Vector API leads to really nice speedups
>>   - `windows_x64_oci_server`: the only one that gets good/better performance on all benchmarks
>>   - `macosx_x64_sandybridge`: `scanAddI`!, `reduceAddIFieldsX4` are very slow. Other benchmarks benefit.
>> - Compact Object Headers has some negative effect on some loop benchmarks.
>>   - `linux_aarch64_server`: `reduceAddI`, `copyI`
>>   - `macosx_aarch64`: `mapI`, `reduceAddI`, `copyI`
>>   - `linux_x64_oci_server`: `reduceAddI`, `copyI`, `findI`?
>>   - `windows_x64_oci_server`: `reduceAddI` and some others a little bit
>>   - `macosx_x64_sandybridge`: `fillI`, `iotaI`, `mapI`, `reduceAddI`, `copyI`
>> - Intrinsics can be much faster than auto vectorized or Vector API code.
>>   - `linux_aarch64_server`: `copyI`
>>   - `macosx_x64_sandybridge`: actually, `Arrays.fill` seems to suffer with Compact Object Headers as well.
>> - `rearrange` often needs to do the `mask load` and `and` operation inside the loop. That has a slight performance impact, I filed [JDK-8373240](https://bugs.openjdk.org/browse/JDK-8373240).
>> 
>> **Benchmark Plots**
>> 
>> Units: nanoseconds per algorithm invocation.
>> 
>> Note: the `aarch64` machines all only have `NEON` support. Performance may be much better on `SVE`, I have not benchmarked that yet.
>> 
>> `linux_x64_oci`
>> <img width="4500" height="6000" alt="algo_linux_x64_oci_server" src="https://github.com/user-attachments/assets/f2c5bbcb-e009-4c54-a1bf-91af45326cb9" />
>> 
>> `windows_x64_oci`
>> <img width="4500" height="6000" alt="algo_windows_x64_oci_server" src...
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
> 
>   updates for review

I ran the new tests on my ARM NEON machine with `-XX:MaxVectorSize=8`, and the following tests crashed with the same error:

compiler/vectorization/TestVectorAlgorithms.java#noOptimizeFill
compiler/vectorization/TestVectorAlgorithms.java#noSuperWord
compiler/vectorization/TestVectorAlgorithms.java#vanilla

Here is the log:

Standard Output
---------------
CompileCommand: inline *VectorAlgorithmsImpl*.* bool inline = true
TestVM main() called - about to run tests in class compiler.vectorization.TestVectorAlgorithms
For random generator using seed: 5121565769469166450
To re-run test with same seed value please add "-Djdk.test.lib.random.seed=5121565769469166450" to command line.
  300  Phi  === 103 1050 302  [[ 399 299 ]]  #rawptr:BotPTR !jvms: IntVector::lanewiseTemplate @ bci:154 (line 798) Int64Vector::lanewise @ bci:3 (line 278) Int64Vector::lanewise @ bci:3 (line 43) IntVector::lanewise @ bci:43 (line 944) IntVector::add @ bci:5 (line 1406) VectorAlgorithmsImpl::findMinIndexI_VectorAPI @ bci:96 (line 563)
  300  Phi  === 103 1050 302  [[ 399 299 ]]  #rawptr:BotPTR !jvms: IntVector::lanewiseTemplate @ bci:154 (line 798) Int64Vector::lanewise @ bci:3 (line 278) Int64Vector::lanewise @ bci:3 (line 43) IntVector::lanewise @ bci:43 (line 944) IntVector::add @ bci:5 (line 1406) VectorAlgorithmsImpl::findMinIndexI_VectorAPI @ bci:96 (line 563)
   98  safePoint  === 101 0 401 0 0 99 905 402 403 404 282 0 0 0 0 908 909 912  [[ 100 575 675 ]]  !jvms: VectorAlgorithmsImpl::findMinIndexI_VectorAPI @ bci:113 (line 558)
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (jdk-src/src/hotspot/share/opto/buildOopMap.cpp:371), pid=145228, tid=145250
#  assert(false) failed: there should be an oop in OopMap instead of a live raw oop at safepoint
#
# JRE version: OpenJDK Runtime Environment (27.0) (fastdebug build 27-internal-git-362f4c7acc8)
# Java VM: OpenJDK 64-Bit Server VM (fastdebug 27-internal-git-362f4c7acc8, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
# Problematic frame:
# V  [libjvm.so+0x72ae50]  OopFlow::build_oop_map(Node*, int, PhaseRegAlloc*, int*)+0xf80
#

And the VM options:

-ea -esa -Xmx768m -XX:UseSVE=0 -XX:MaxVectorSize=8  --add-modules=jdk.incubator.vector -XX:CompileCommand=inline,*VectorAlgorithmsImpl*::* -XX:-BackgroundCompilation -XX:CompileCommand=quiet
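
For context on the failing frame: the assert fires while compiling `findMinIndexI_VectorAPI`, at an `IntVector::add` that resolves to `Int64Vector`, i.e. a 64-bit vector holding two `int` lanes, which matches `-XX:MaxVectorSize=8`. The sketch below is not the code from the PR. It is a plain-Java emulation of how a two-lane find-min-index kernel is commonly structured, with a lane-wise index add in the loop body. The class name, method names, and the `LANES` constant are all hypothetical, and the array is assumed to be non-empty.

```java
public class FindMinIndexSketch {
    // Assumption: 2 int lanes, as in a 64-bit vector (8 bytes / 4 bytes per int).
    static final int LANES = 2;

    // Scalar reference: index of the first minimum element (array must be non-empty).
    static int findMinIndexScalar(int[] a) {
        int minVal = a[0];
        int minIdx = 0;
        for (int i = 1; i < a.length; i++) {
            if (a[i] < minVal) {
                minVal = a[i];
                minIdx = i;
            }
        }
        return minIdx;
    }

    // Lane-wise emulation: each lane tracks its own minimum and the index of it.
    static int findMinIndexLanes(int[] a) {
        if (a.length < LANES) {
            return findMinIndexScalar(a);
        }
        int[] minVal = new int[LANES];
        int[] minIdx = new int[LANES];
        int[] idx = new int[LANES];
        // Initialize every lane from the first block of elements.
        for (int l = 0; l < LANES; l++) {
            minVal[l] = a[l];
            minIdx[l] = l;
            idx[l] = l + LANES;
        }
        int i = LANES;
        for (; i + LANES <= a.length; i += LANES) {
            for (int l = 0; l < LANES; l++) {
                // Emulates a lane-wise compare followed by a blend.
                if (a[i + l] < minVal[l]) {
                    minVal[l] = a[i + l];
                    minIdx[l] = idx[l];
                }
                // Emulates the lane-wise add that advances the index vector.
                idx[l] += LANES;
            }
        }
        // Horizontal reduction: smallest value wins, ties go to the smaller index.
        int bestVal = minVal[0];
        int bestIdx = minIdx[0];
        for (int l = 1; l < LANES; l++) {
            if (minVal[l] < bestVal || (minVal[l] == bestVal && minIdx[l] < bestIdx)) {
                bestVal = minVal[l];
                bestIdx = minIdx[l];
            }
        }
        // Scalar tail for elements that do not fill a whole block.
        for (; i < a.length; i++) {
            if (a[i] < bestVal) {
                bestVal = a[i];
                bestIdx = i;
            }
        }
        return bestIdx;
    }
}
```

Both methods should agree on any non-empty input, so the scalar version can serve as a reference when checking a vectorized variant.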

Could you please take a look? Thanks!

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28639#issuecomment-3782145169