RFR: 8367158: C2: create better fill and copy benchmarks, taking alignment into account
Emanuel Peter
epeter at openjdk.org
Wed Nov 26 06:12:55 UTC 2025
On Tue, 16 Sep 2025 14:28:12 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
> **Summary**
>
> I created some `fill` and `copy` style benchmarks, covering both `arrays` and `MemorySegment`s.
> Reasons for this benchmark:
> - I want to compare auto-vectorization with intrinsics (assembly-style intrinsics for arrays, and Java-level special implementations for `MemorySegment`). This lets us see whether some variants are slower than others, and whether we can improve the slower ones in the future.
> - There are some known issues we can demonstrate well with this benchmark:
>   - Super-Unrolling: unrolling the vectorized loop gets us extra performance, but the exact factor may not be optimal yet for auto-vectorization.
>   - Small iteration count loops: auto-vectorization can lead to slowdowns.
> - Many benchmarks do not control for alignment, which creates noise. I iterate over all possible alignments, which should smooth out that noise.
> - Most benchmarks do not control for 4k aliasing (an x86 effect in the store buffer). I make sure that loads and stores are not a multiple of 4k bytes apart, so we avoid the noise from that effect.
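To make the last two points concrete, here is a minimal sketch of how an alignment sweep with a 4k-aliasing guard could look. It is not the code from the PR: the class name, the constants (`ALIGNMENTS`, `DST_BASE`) and the helper `isMultipleOf4k` are hypothetical and chosen only for illustration. Source and destination live in the same `byte[]`, so their address distance equals the difference of their offsets.

```java
// Hypothetical sketch, not the benchmark code from the PR.
class AlignmentSweepSketch {
    static final int PAGE = 4096;                // 4k aliasing distance in bytes
    static final int ALIGNMENTS = 64;            // assumed number of byte alignments to sweep
    static final int DST_BASE = 3 * PAGE + 128;  // base distance, deliberately not a multiple of 4k

    // True if the two byte offsets are a multiple of 4k bytes apart.
    static boolean isMultipleOf4k(int srcOffset, int dstOffset) {
        return (dstOffset - srcOffset) % PAGE == 0;
    }

    // Copies 'size' bytes once per (srcAlign, dstAlign) pair and returns the number of copies.
    // Assumes size <= DST_BASE - ALIGNMENTS, so source and destination ranges never overlap,
    // and data.length >= DST_BASE + ALIGNMENTS + size.
    static int sweepCopy(byte[] data, int size) {
        int runs = 0;
        for (int srcAlign = 0; srcAlign < ALIGNMENTS; srcAlign++) {
            for (int dstAlign = 0; dstAlign < ALIGNMENTS; dstAlign++) {
                int src = srcAlign;
                int dst = DST_BASE + dstAlign;
                if (isMultipleOf4k(src, dst)) {
                    throw new IllegalStateException("4k aliasing: " + src + " vs " + dst);
                }
                // Plain copy loop, the kind of loop C2 may auto-vectorize.
                for (int i = 0; i < size; i++) {
                    data[dst + i] = data[src + i];
                }
                runs++;
            }
        }
        return runs;
    }
}
```

With these constants the distance between a load and its store stays between `DST_BASE - 63` and `DST_BASE + 63` bytes, so it is never an exact multiple of 4096. A real benchmark would time the inner loop and average over all pairs.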
>
> ----------------------------------------------------------------------
>
> **Analysis based on this Benchmark**
>
> Analysis done in this PR:
> - Arrays: auto vectorization vs scalar loops performance
> - Arrays: auto vectorization loops vs intrinsics
> - MemorySegments: auto vectorization loops vs scalar loops vs `MemorySegment.fill/copy`
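For readers who have not looked at the benchmark, the variants being compared are roughly of the following shape. This is only a sketch: the method names are made up, and I am assuming `Arrays.fill` and `System.arraycopy` as the intrinsic-backed array variants. Whether a given loop is compiled as scalar code or auto-vectorized is decided by C2 and its flags, not by the source. It needs a JDK where `java.lang.foreign` is final (JDK 22 or later).

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.Arrays;

// Hypothetical sketch of the compared variants, not the benchmark code from the PR.
class FillCopyVariantsSketch {
    // Array fill as a plain loop: a candidate for auto-vectorization.
    static void fillLoop(int[] a, int v) {
        for (int i = 0; i < a.length; i++) {
            a[i] = v;
        }
    }

    // Array fill through the library call.
    static void fillLibrary(int[] a, int v) {
        Arrays.fill(a, v);
    }

    // Array copy as a plain loop; assumes dst.length >= src.length.
    static void copyLoop(int[] src, int[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i];
        }
    }

    // Array copy through System.arraycopy.
    static void copyLibrary(int[] src, int[] dst) {
        System.arraycopy(src, 0, dst, 0, src.length);
    }

    // MemorySegment fill as a plain byte loop.
    static void fillLoop(MemorySegment ms, byte v) {
        for (long i = 0; i < ms.byteSize(); i++) {
            ms.set(ValueLayout.JAVA_BYTE, i, v);
        }
    }

    // MemorySegment fill through the API method.
    static void fillApi(MemorySegment ms, byte v) {
        ms.fill(v);
    }

    // MemorySegment copy as a plain byte loop; assumes dst is at least as large as src.
    static void copyLoop(MemorySegment src, MemorySegment dst) {
        for (long i = 0; i < src.byteSize(); i++) {
            dst.set(ValueLayout.JAVA_BYTE, i, src.get(ValueLayout.JAVA_BYTE, i));
        }
    }

    // MemorySegment copy through the static API method.
    static void copyApi(MemorySegment src, MemorySegment dst) {
        MemorySegment.copy(src, 0, dst, 0, src.byteSize());
    }
}
```

All variants in a pair produce the same result, so any difference the benchmark reports is purely about how each one is compiled or implemented.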
>
> Future work:
> - Investigate deeper, inspect assembly, etc.
> - Impact of `-XX:SuperWordAutomaticAlignment=0` on small iteration count loops.
> - Investigate effect of `-XX:-OptimizeFill`. It seems that the loops in this benchmark are not detected automatically, and so the array intrinsics are not used. Why?
> - Investigate impact of `CompactObjectHeaders`. Does enabling or disabling it change performance?
> - Investigate if adjusting the super-unrolling factor could improve performance for auto-vectorization: [JDK-8368061](https://bugs.openjdk.org/browse/JDK-8368061)
> - Performance comparison with Graal.
>
> ----------------------------------------------------------------------
>
> **Array Benchmark: auto vectorization vs scalar**
>
> We can see that for arrays, auto vectorization leads to minor regressions for sizes 1-32, and then generally auto vectorization is faster for larger sizes. And this is true for both `fill` and `copy`.
>
> Strange: `macosx_aarch64` with `copy_int`. The auto-vectorized performance has a sudden drop around 150 iterations. Also for `fill_long` we have a "phase-transition" around 64, that goes steeper rather...
Note: there are related benchmarks in https://github.com/openjdk/jdk/pull/28260, but they do not take the same approach to "randomize" alignment.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/27315#issuecomment-3579394031
More information about the hotspot-compiler-dev
mailing list