RFR: 8367158: C2: create better fill and copy benchmarks, taking alignment into account
Emanuel Peter
epeter at openjdk.org
Tue Nov 25 17:47:01 UTC 2025
**Summary**
I created some `fill` and `copy` style benchmarks, covering both `arrays` and `MemorySegment`s.
Reasons for this benchmark:
- I want to compare auto-vectorization with intrinsics (array assembly style intrinsics, and MemorySegment java level special implementations). This allows us to see if some are slower than others, and if we can manage to improve the slower versions somehow in the future.
- There are some known issues we can demonstrate well with this benchmark:
- Super-Unrolling: unrolling the vectorized loop gains extra performance, but the exact unroll factor may not yet be optimal for auto-vectorization.
- Small iteration count loops: auto-vectorization can lead to slowdowns.
- Many benchmarks do not control for alignment, which creates noise. I instead iterate over all possible alignments, which should smooth out the noise.
- Most benchmarks do not control for 4k aliasing (an x86 store-buffer effect). I make sure that loads and stores are never a multiple of 4k bytes apart, so we avoid the noise from that effect.
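The two measures above can be sketched roughly as follows (hypothetical helper names, not the actual benchmark code): sweep the offset over every alignment within a cache line, and nudge the destination offset whenever the load/store distance would be a multiple of 4096 bytes.

```java
// Sketch of alignment sweeping and 4k-aliasing avoidance
// (hypothetical helpers, not the benchmark's actual code).
public class AlignmentSweep {
    static final int PAGE = 4096;

    // If the byte distance between the load and store streams is a
    // multiple of 4096, shift the destination by one cache line so
    // the 4k-aliasing penalty in the store buffer cannot trigger.
    static int avoid4kAliasing(int srcOffset, int dstOffset) {
        int distance = Math.abs(dstOffset - srcOffset);
        if (distance % PAGE == 0) {
            return dstOffset + 64;
        }
        return dstOffset;
    }

    // Copy with a per-invocation offset; sweeping offset over 0..63
    // covers every alignment within a 64-byte cache line, so
    // alignment effects average out instead of adding noise.
    static void copyWithOffset(byte[] src, byte[] dst, int offset, int len) {
        for (int i = 0; i < len; i++) {
            dst[offset + i] = src[offset + i];
        }
    }
}
```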
----------------------------------------------------------------------
**Analysis based on this Benchmark**
Analysis done in this PR:
- Arrays: auto vectorization vs scalar loops performance
- Arrays: auto vectorization loops vs intrinsics
- MemorySegments: auto vectorization loops vs scalar loops vs `MemorySegment.fill/copy`
Future work:
- Investigate deeper, inspect assembly, etc.
- Impact of `-XX:SuperWordAutomaticAlignment=0` on small iteration count loops.
- Investigate effect of `-XX:-OptimizeFill`. It seems that the loops in this benchmark are not detected automatically, and so the array intrinsics are not used. Why?
- Investigate impact of `CompactObjectHeaders`. Does enabling/disabling change any performance?
- Investigate if adjusting the super-unrolling factor could improve performance for auto-vectorization: [JDK-8368061](https://bugs.openjdk.org/browse/JDK-8368061)
- Performance comparison with Graal.
----------------------------------------------------------------------
**Array Benchmark: auto vectorization vs scalar**
We can see that for arrays, auto vectorization leads to minor regressions for sizes 1-32, and is then generally faster for larger sizes. This holds for both `fill` and `copy`.
Strange: `macosx_aarch64` with `copy_int`: the auto-vectorized performance has a sudden drop around 150 iterations. Also, `fill_long` has a "phase transition" around 64 where the curve gets steeper rather than flatter.
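The kernels under comparison are essentially plain loops like the sketch below; the scalar baseline comes from running the same loops with auto-vectorization disabled (e.g. `-XX:-UseSuperWord`).

```java
// Simple fill and copy loops. C2's SuperWord pass auto-vectorizes
// these; running with -XX:-UseSuperWord gives the scalar baseline.
public class Kernels {
    static void fillInt(int[] a, int v) {
        for (int i = 0; i < a.length; i++) {
            a[i] = v;
        }
    }

    static void copyInt(int[] src, int[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i];
        }
    }
}
```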
`linux_x64_oci`
<img width="9000" height="1500" alt="arrays_sw_linux_x64_oci_server" src="https://github.com/user-attachments/assets/440c4bb2-ad76-47b0-a081-ce27857d8804" />
`windows_x64_oci`
<img width="9000" height="1500" alt="arrays_sw_windows_x64_oci_server" src="https://github.com/user-attachments/assets/1d196822-89dd-43b8-8f09-bc3874c8b5c9" />
`macosx_x64_sandybridge`
<img width="9000" height="1500" alt="arrays_sw_macosx_x64_sandybridge" src="https://github.com/user-attachments/assets/7d2ebfd2-104c-40ea-8ee2-8648f2e837c9" />
`linux_aarch64`
<img width="9000" height="1500" alt="arrays_sw_linux_aarch64_server" src="https://github.com/user-attachments/assets/858b4727-3aca-4990-a029-8105d4cef387" />
`macosx_aarch64`
<img width="9000" height="1500" alt="arrays_sw_macosx_aarch64" src="https://github.com/user-attachments/assets/5c4401f7-0dda-4c39-8696-961e96adeeee" />
----------------------------------------------------------------------
**Array Benchmark: auto vectorization vs intrinsics**
Observations:
- `linux_x64_oci` and `windows_x64_oci`:
- `Objects`:
- `System.arraycopy` has vectorized intrinsic, loop does not auto vectorize.
- `Arrays.fill`: filling with `null` seems fast for 0-70 elements, then slow. Why, and why don't we have faster intrinsics here? ❓
- Null loop seems significantly faster than the others. Why? ❓
- `byte`, `char`, `short`: all behave very similarly.
- Intrinsics perform very well, and have distinct "steps".
- Auto vectorization loops are slower for all except 0 elements. That is not surprising at small iteration counts (0-150), see [JDK-8344085](https://bugs.openjdk.org/browse/JDK-8344085). But for larger iteration counts (150-300), it is probably due to something else, maybe unrolling factor? ❓
- `int`, `long`, `float`, `double`: 4-byte and 8-byte types behave the same on both platforms.
- copy: `linux_x64` consistently performs better with `System.arraycopy` (intrinsic) and worse with auto vectorization. But `windows_x64` has better auto vectorization for 0-50/100 elements, and then performs better with the intrinsic for larger sizes. In some cases the lines are parallel (a constant performance difference); in others they diverge (different unrolling factor?). I suspect we don't get consistent performance because one platform is probably AVX2 and the other AVX512. Investigate ❓
- fill: strangely, the platforms are more consistent here. The intrinsics are a little faster in all cases, compared to auto vectorization. Investigate ❓
- `macosx_x64_sandybridge`: similar to the `x64` platforms above, but a bit different because it has different AVX support. Intrinsics generally perform better, except for the fill null loop, just as above.
- `aarch64`:
- The plots look a little "cleaner", with less noise. The performance is also less "zig-zag-y", especially at larger iteration counts.
- `Object`:
- copy: intrinsics are massively faster, of course no vectorization for loops.
- fill: null cases are much faster, and intrinsic is a little faster still, but not much. But no fast intrinsic for variable fill. How can the intrinsic be so massively faster? ❓
- Primitives:
- copy: the intrinsic is consistently, solidly faster, except for 8-byte types: on one of the two platforms it looks like auto vectorization is only a bit slower for 0-250, and may even become faster above 300 iterations. Investigate ❓
- fill:
- 8-byte types: performance is identical for all versions.
- 1-4 byte types:
- `macosx_aarch64`: seems to have issues with the zero fill intrinsic: it shows very erratic performance behaviour above 256 bytes. Investigate ❓
- `linux_aarch64`: zero fill intrinsic: at first a little slower than the var fill intrinsic, but after about 400 bytes it becomes very significantly faster.
- Auto vectorization is slower than the var fill intrinsic. Investigate ❓
The big questions from above:
- `x64` for `Objects`: What's up with the fill null intrinsic above 70 elements? Why is the intrinsic slower than the fill zero loop for more than 70 elements? Are we using 4 or 8 byte pointers?
- `x64` for `Primitives`: both intrinsic and loop vectorize - but why do we still see a performance difference, both for large and small iteration counts?
- `aarch64` for `Objects`: why are the copy intrinsics so massively faster compared to loop? It is more than what vectorization could explain, it seems.
- `aarch64` for `Primitives`: Why are intrinsics faster than auto vectorization, in many cases?
- `macosx_aarch64`: erratic performance behaviour above 256 bytes, why?
`linux_x64_oci`
<img width="4500" height="9000" alt="arrays_linux_x64_oci" src="https://github.com/user-attachments/assets/76fab8b7-eb1e-49c5-9dbd-e30e515787bc" />
`windows_x64_oci`
<img width="4500" height="9000" alt="arrays_windows_x64_oci" src="https://github.com/user-attachments/assets/8b85860d-7a02-428f-bdb5-8e3cc8eb7dd7" />
`macosx_x64_sandybridge`
<img width="4500" height="9000" alt="arrays_macosx_x64_sandybridge" src="https://github.com/user-attachments/assets/42d6dbd3-5e26-4de9-add1-966c8857f5e1" />
`linux_aarch64`
<img width="4500" height="9000" alt="arrays_linux_aarch64" src="https://github.com/user-attachments/assets/2ca01f24-16eb-4dd7-841e-5195b31b228b" />
`macosx_aarch64`
<img width="4500" height="9000" alt="arrays_macosx_aarch64" src="https://github.com/user-attachments/assets/2e4f6848-b1ba-4a63-931c-6e5adb8246ee" />
----------------------------------------------------------------------
**Memory Segment Benchmark**
Quick analysis:
- Auto vectorization is quite a bit slower than the `MemorySegment.copy/fill`. But there are some strange performance behaviours on x64 machines. I suspect it has to do with memory alignment: `MemorySegment.fill` probably does not align memory, and so it gets penalized for split loads/stores.
- Just like with arrays: for small iteration counts (0-32) we get a regression with auto vectorization, compared to scalar performance.
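The MemorySegment variants compare the bulk `fill`/`copy` methods against element-wise loops along these lines (a sketch using the `java.lang.foreign` API, not the benchmark code):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Bulk MemorySegment.fill/copy vs. an element-wise loop that C2
// may auto-vectorize (sketch).
public class SegmentOps {
    // Loop version: one byte per iteration, candidate for SuperWord.
    static void fillLoop(MemorySegment s, byte v) {
        for (long i = 0; i < s.byteSize(); i++) {
            s.set(ValueLayout.JAVA_BYTE, i, v);
        }
    }

    // Bulk versions: the java-level special implementations.
    static void bulk(MemorySegment src, MemorySegment dst) {
        src.fill((byte) 1);                                // bulk fill
        MemorySegment.copy(src, 0, dst, 0, src.byteSize()); // bulk copy
    }
}
```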
`linux_x64_oci`
<img width="4500" height="6000" alt="ms_linux_x64_oci_server" src="https://github.com/user-attachments/assets/e3cf8951-7aa5-4433-952c-ab4ed1fba7ec" />
`windows_x64_oci`
<img width="4500" height="6000" alt="ms_windows_x64_oci_server" src="https://github.com/user-attachments/assets/77217009-7295-4fd2-958e-0f6158d63cc8" />
`macosx_x64_sandybridge`
<img width="4500" height="6000" alt="ms_macosx_x64_sandybridge" src="https://github.com/user-attachments/assets/b3105f9d-dc64-4910-8ab2-f8a4ada9f530" />
`linux_aarch64`
<img width="4500" height="6000" alt="ms_linux_aarch64_server" src="https://github.com/user-attachments/assets/9f1cec64-4a52-47cb-8e29-00a4033d458b" />
`macosx_aarch64`
<img width="4500" height="6000" alt="ms_macosx_aarch64" src="https://github.com/user-attachments/assets/86caa89d-e6b7-4844-b6f4-e3a9057a9578" />
-------------
Commit messages:
- more MS types
- fix MS fill
- more backing types
- object array benchmarks
- fix bm
- ms bm update
- clean up benchmark
- more types
- improve benchmark
- Merge branch 'master' into JDK-8367158-fill-and-copy-benchmarks
- ... and 4 more: https://git.openjdk.org/jdk/compare/44964181...40a80d79
Changes: https://git.openjdk.org/jdk/pull/27315/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27315&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8367158
Stats: 1055 lines in 2 files changed: 1055 ins; 0 del; 0 mod
Patch: https://git.openjdk.org/jdk/pull/27315.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/27315/head:pull/27315
PR: https://git.openjdk.org/jdk/pull/27315