RFR: 8367158: C2: create better fill and copy benchmarks, taking alignment into account [v2]

Thu Dec 11 05:40:30 UTC 2025

On Wed, 3 Dec 2025 13:03:02 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> **Summary**
>> 
>> I created some `fill` and `copy` style benchmarks, covering both `arrays` and `MemorySegment`s.
>> Reasons for this benchmark:
>> - I want to compare auto-vectorization with intrinsics (assembly-style intrinsics for arrays, and special Java-level implementations for MemorySegment). This lets us see whether some are slower than others, and whether we can improve the slower versions in the future.
>> - There are some known issues we can demonstrate well with this benchmark:
>>   - Super-Unrolling: unrolling the vectorized loop gets us extra performance, but the exact factor may not be optimal yet for auto-vectorization.
>>   - Small iteration count loops: auto-vectorization can lead to slowdowns.
>> - Many benchmarks do not control for alignment, which creates noise. I iterate over all possible alignments, which should smooth out that noise.
>> - Most benchmarks do not control for 4k aliasing (an x86 effect in the store buffer). I make sure that loads and stores are not a multiple of 4k bytes apart, so we avoid the noise from that effect.
>> 
>> ----------------------------------------------------------------------
>> 
>> **Analysis based on this Benchmark**
>> 
>> Analysis done in this PR:
>> - Arrays: auto vectorization vs scalar loops performance
>> - Arrays: auto vectorization loops vs intrinsics
>> - MemorySegments: auto vectorization loops vs scalar loops vs `MemorySegment.fill/copy`
>> 
>> Future work:
>> - Investigate deeper, inspect assembly, etc.
>> - Impact of `-XX:SuperWordAutomaticAlignment=0` on small iteration count loops.
>> - Investigate effect of `-XX:-OptimizeFill`. It seems that the loops in this benchmark are not detected automatically, and so the array intrinsics are not used. Why?
>> - Investigate impact of `CompactObjectHeaders`. Does enabling/disabling change any performance?
>> - Investigate if adjusting the super-unrolling factor could improve performance for auto-vectorization: [JDK-8368061](https://bugs.openjdk.org/browse/JDK-8368061)
>> - Performance comparison with Graal.
>> 
>> ----------------------------------------------------------------------
>> 
>> **Array Benchmark: auto vectorization vs scalar**
>> 
>> We can see that for arrays, auto vectorization leads to minor regressions for sizes 1-32, and then generally auto vectorization is faster for larger sizes. And this is true for both `fill` and `copy`.
>> 
>> Strange: `macosx_aarch64` with `copy_int`. The auto-vectorized performance has a sudden drop around 150 iterations. Also for `fill_...
>
> Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 16 additional commits since the last revision:
> 
>  - small modulo fix from review suggestion
>  - Merge branch 'master' into JDK-8367158-fill-and-copy-benchmarks
>  - more MS types
>  - fix MS fill
>  - more backing types
>  - object array benchmarks
>  - fix bm
>  - ms bm update
>  - clean up benchmark
>  - more types
>  - ... and 6 more: https://git.openjdk.org/jdk/compare/e6497e63...80378aea

test/micro/org/openjdk/bench/vm/compiler/VectorBulkOperationsArray.java line 61:

> 59: @Fork(value = 1)
> 60: public class VectorBulkOperationsArray {
> 61:     @Param({  "0",  "1",  "2",  "3",  "4",  "5",  "6",  "7",  "8",  "9",

How about larger values?

test/micro/org/openjdk/bench/vm/compiler/VectorBulkOperationsArray.java line 114:

> 112:     public static final int REGION_2_OBJECT_OFFSET = REGION_2_BYTE_OFFSET / 8;
> 113: 
> 114:     // The arrays with the two regions each

Is there a reason you don't want to have 2 arrays, one as `dst` and one as `src`?
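
For context, here is a minimal sketch of the single-array, two-region layout this question is about. The class name `TwoRegionSketch`, the helper `is4kAliased`, and all constant values are hypothetical and not taken from the PR; only the idea of keeping load and store addresses from being a multiple of 4k bytes apart comes from the summary above. With one backing array the distance between the two regions is a fixed constant, whereas the distance between two separately allocated arrays depends on where the allocator places them. Whether that is the author's actual reason is not stated here.

```java
public class TwoRegionSketch {
    // All values below are hypothetical, chosen only for illustration.
    static final int REGION_SIZE = 64 * 1024;   // bytes per region
    static final int NUM_ALIGNMENTS = 64;       // offsets 0..63 are tried for src and dst
    // Start of region 2: deliberately NOT a multiple of 4096 bytes.
    static final int REGION_2_BYTE_OFFSET = REGION_SIZE + 1024;

    // One backing array holding both regions, plus slack for the alignment offsets.
    static final byte[] DATA = new byte[REGION_2_BYTE_OFFSET + REGION_SIZE + NUM_ALIGNMENTS];

    // True if the two offsets are an exact multiple of 4096 bytes apart.
    static boolean is4kAliased(int srcOffset, int dstOffset) {
        return Math.floorMod(dstOffset - srcOffset, 4096) == 0;
    }

    // Plain copy loop from a source offset to a destination offset.
    static void copy(int srcOffset, int dstOffset, int length) {
        for (int i = 0; i < length; i++) {
            DATA[dstOffset + i] = DATA[srcOffset + i];
        }
    }

    public static void main(String[] args) {
        int aliased = 0;
        // Visit every (src, dst) alignment combination.
        for (int srcAlign = 0; srcAlign < NUM_ALIGNMENTS; srcAlign++) {
            for (int dstAlign = 0; dstAlign < NUM_ALIGNMENTS; dstAlign++) {
                int src = srcAlign;
                int dst = REGION_2_BYTE_OFFSET + dstAlign;
                if (is4kAliased(src, dst)) {
                    aliased++;
                }
                copy(src, dst, 1024);
            }
        }
        System.out.println("aliased (src, dst) pairs: " + aliased);
    }
}
```

In this sketch the region offset leaves a remainder of 1024 modulo 4096, so shifting either side by up to 63 bytes can never make the distance an exact multiple of 4096.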

test/micro/org/openjdk/bench/vm/compiler/VectorBulkOperationsArray.java line 202:

> 200: 
> 201:     @Benchmark
> 202:     public void fill_zero_byte_loop() {

Should these benchmarks be annotated with `@OperationsPerInvocation`?
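
To illustrate what the annotation would change: JMH's `@OperationsPerInvocation(N)` declares that one call of the benchmark method performs `N` operations, and JMH adjusts the reported score accordingly. The sketch below reproduces only that normalization in plain Java, under the assumption (not confirmed by the PR) that one invocation performs one fill per alignment offset. `NUM_ALIGNMENTS`, `SIZE`, `fillAllAlignments` and `nsPerOp` are illustrative names, and the single un-warmed timing in `main` is not a real measurement.

```java
import java.util.Arrays;

public class OpsPerInvocationSketch {
    // Hypothetical values, chosen only for illustration.
    static final int NUM_ALIGNMENTS = 64;
    static final int SIZE = 128;
    static final byte[] DATA = new byte[SIZE + NUM_ALIGNMENTS];

    // One "invocation": NUM_ALIGNMENTS fill operations, one per start offset.
    static void fillAllAlignments() {
        for (int offset = 0; offset < NUM_ALIGNMENTS; offset++) {
            Arrays.fill(DATA, offset, offset + SIZE, (byte) 0);
        }
    }

    // The normalization: time per invocation divided by operations per invocation.
    static double nsPerOp(double nsPerInvocation, int opsPerInvocation) {
        return nsPerInvocation / opsPerInvocation;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        fillAllAlignments();
        long elapsed = System.nanoTime() - start;
        System.out.println("ns/invocation: " + elapsed);
        System.out.println("ns/op:         " + nsPerOp(elapsed, NUM_ALIGNMENTS));
    }
}
```

Without the normalization, scores of benchmarks that do different amounts of work per invocation are not directly comparable.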

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27315#discussion_r2609223430
PR Review Comment: https://git.openjdk.org/jdk/pull/27315#discussion_r2609214012
PR Review Comment: https://git.openjdk.org/jdk/pull/27315#discussion_r2609207019