RFR: 8367158: C2: create better fill and copy benchmarks, taking alignment into account

Wed Nov 26 06:27:50 UTC 2025

On Wed, 26 Nov 2025 06:24:38 GMT, Francesco Nigro <duke at openjdk.org> wrote:

>> **Summary**
>> 
>> I created some `fill` and `copy` style benchmarks, covering both `arrays` and `MemorySegment`s.
>> Reasons for this benchmark:
>> - I want to compare auto-vectorization with intrinsics (array assembly style intrinsics, and MemorySegment java level special implementations). This allows us to see if some are slower than others, and if we can manage to improve the slower versions somehow in the future.
>> - There are some known issues we can demonstrate well with this benchmark:
>>   - Super-Unrolling: unrolling the vectorized loop gets us extra performance, but the exact factor may not be optimal yet for auto-vectorization.
>>   - Small iteration count loops: auto-vectorization can lead to slowdowns.
>> - Many benchmarks do not control for alignment, which creates noise. I go over all possible alignments, which should smooth out the noise.
>> - Most benchmarks do not control for 4k aliasing (x86 effect in store buffer). I make sure that load/stores are not a multiple of 4k bytes apart, so we can avoid the noise of that effect.
>> 
>> ----------------------------------------------------------------------
>> 
>> **Analysis based on this Benchmark**
>> 
>> Analysis done in this PR:
>> - Arrays: auto vectorization vs scalar loops performance
>> - Arrays: auto vectorization loops vs intrinsics
>> - MemorySegments: auto vectorization loops vs scalar loops vs `MemorySegment.fill/copy`
>> 
>> Future work:
>> - Investigate deeper, inspect assembly, etc.
>> - Impact of `-XX:SuperWordAutomaticAlignment=0` on small iteration count loops.
>> - Investigate effect of `-XX:-OptimizeFill`. It seems that the loops in this benchmark are not detected automatically, and so the array intrinsics are not used. Why?
>> - Investigate impact of `CompactObjectHeaders`. Does enabling/disabling change any performance?
>> - Investigate if adjusting the super-unrolling factor could improve performance for auto-vectorization: [JDK-8368061](https://bugs.openjdk.org/browse/JDK-8368061)
>> - Performance comparison with Graal.
>> 
>> ----------------------------------------------------------------------
>> 
>> **Array Benchmark: auto vectorization vs scalar**
>> 
>> We can see that for arrays, auto vectorization leads to minor regressions for sizes 1-32, and then generally auto vectorization is faster for larger sizes. And this is true for both `fill` and `copy`.
>> 
>> Strange: `macosx_aarch64` with `copy_int`. The auto vectorized performance has a sudden drop around 150 iterations. Also for `fill_...
>
> test/micro/org/openjdk/bench/vm/compiler/VectorBulkOperationsArray.java line 155:
> 
>> 153: 
>> 154:     @CompilerControl(CompilerControl.Mode.INLINE)
>> 155:     public static int offsetLoad(int i) { return i % 64; }
> 
> It's minor, but consider `& 63`: since `i` is not proven to be non-negative, C2 doesn't strength-reduce the modulus into the cheaper form (`&`).
> You can also mask `i` to strip out the sign bit, and that should work the same.

The same applies elsewhere in the benchmark.
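
To illustrate why the two forms are not interchangeable for arbitrary `int` inputs, here is a minimal sketch. It is not code from the PR: the class `ModVsMask` and the methods `offsetMod` and `offsetMask` are hypothetical names made up for this example. It only shows the Java-level semantics. Java's `%` keeps the sign of the dividend, so `i % 64` can be negative, while `i & 63` is always in `[0, 63]`. The two agree whenever `i >= 0`.

```java
final class ModVsMask {
    // Signed remainder: the result takes the sign of i, so the range is [-63, 63].
    static int offsetMod(int i) { return i % 64; }

    // Bit mask: keeps the low 6 bits, so the range is always [0, 63].
    static int offsetMask(int i) { return i & 63; }

    public static void main(String[] args) {
        int[] samples = {0, 1, 63, 64, 70, -1, -70};
        for (int i : samples) {
            System.out.println(i + ": % 64 = " + offsetMod(i)
                    + ", & 63 = " + offsetMask(i)
                    + ", floorMod = " + Math.floorMod(i, 64));
        }
    }
}
```

For non-negative `i` both methods return the same value. For negative `i` the mask matches `Math.floorMod(i, 64)` rather than `%`, so the replacement is only a drop-in change if the callers never pass a negative index or if a non-negative offset is what is wanted anyway.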

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27315#discussion_r2563462420