RFR: 8307084: C2: Vectorized drain loop is not executed for some small trip counts
Emanuel Peter
epeter at openjdk.org
Tue Nov 11 15:22:20 UTC 2025
On Mon, 10 Nov 2025 16:07:35 GMT, Fei Gao <fgao at openjdk.org> wrote:
>> @fg1417 Are you still working on this?
>
> Hi @eme64, many thanks for your review. It’s really comprehensive and insightful. I’ve given a thumbs-up to all the comments that have been resolved in this commit.
>
>> I have one concern: We now have changed the branches. There is now a long sequence of branches if we have very few iterations, so that we only go through pre and post loop. It would be interesting to see what the performance difference is between master and patch.
>
> Regarding this concern, I re-ran the microbenchmarks (now merged into the existing `VectorThroughputForIterationCount.java`), named `bench03*_drain_memoryBound`, and collected data across different platforms, including `128-bit` and `256-bit` `AArch64` machines as well as a `512-bit` `x86` machine.
>
> To summarize, I observe a minor performance regression for small-iteration loops on the `128-bit` and `256-bit` `AArch64` platforms. For larger-iteration loops, there is either a performance improvement or no noticeable change. The performance data on the `512-bit x86` machine shows a similar trend, though the regression is more significant.
>
> **The test range of `ITERATION_COUNT` is `0–300`. For larger `ITERATION_COUNT` values, there is either a performance improvement or no noticeable change, so those results are omitted. The following data only shows cases with regressions.**
>
>
> (FIXED_OFFSET) (RANDOMIZE_OFFSETS) (REPETITIONS) (seed) Mode Cnt
> 0 TRUE 1024 42 avgt 3
>
> `Diff = (patch - master) / master`
>
> On `128-bit aarch64` platform:
>
> Benchmark (ITERATION_COUNT) Units Diff
> bench031B_drain_memoryBound 1 ns/op 15.15%
> bench031B_drain_memoryBound 2 ns/op 10.89%
> bench031B_drain_memoryBound 3 ns/op 9.27%
> bench031B_drain_memoryBound 4 ns/op 7.39%
> bench031B_drain_memoryBound 5 ns/op 5.86%
> bench031B_drain_memoryBound 6 ns/op 5.31%
> bench031B_drain_memoryBound 7 ns/op 4.39%
> bench031B_drain_memoryBound 8 ns/op 4.27%
> bench031B_drain_memoryBound 9 ns/op 3.60%
> bench031B_drain_memoryBound 10 ns/op 3.11%
> bench031B_drain_memoryBound 11 ns/op 2.97%
> bench031B_drain_memoryBound 12 ns/op 3.19%
> bench031B_drain_memoryBound 13 ns/op 2.90%
> bench031B_drain_memoryBound 14 ns/op 2.68%
> bench031B_drain_memoryBound 15 ns/op 2.37%
> bench031B_drain_memoryBound 16 ns/op 2.44%
> bench031B_drain_memoryBound 17 ns/op 2.11%
> bench031B_drain_memoryBound 18 ns...
@fg1417 Thanks for running the benchmarks to address my concern 😊 Your plot from above probably shows exactly what I was expecting:
<img width="300" height="486" alt="image" src="https://github.com/user-attachments/assets/3a8b6fb7-0a1b-431a-885d-13023aacc3ea" />
Seeing your results, I also lean toward considering them acceptable: very minor losses for small iteration counts, but a clear win in the middle.
I'll have a look at our smaller conversations now.
FYI: I'm generally really impressed by how clean the results on your plots are :)
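For anyone reading the quoted table, the `Diff` column follows `Diff = (patch - master) / master`, so for `ns/op` a positive value means the patch is slower. A minimal sketch of that arithmetic is below. The class name `RelativeDiff`, its helper methods, and the timings in `main` are hypothetical illustrations, not part of the benchmark or measured data.

```java
import java.util.Locale;

public class RelativeDiff {
    // Relative change of the patched build versus master.
    // For ns/op (lower is better), a positive result is a regression.
    public static double relativeDiff(double masterNsPerOp, double patchNsPerOp) {
        if (masterNsPerOp <= 0.0) {
            throw new IllegalArgumentException("master time must be positive");
        }
        return (patchNsPerOp - masterNsPerOp) / masterNsPerOp;
    }

    // Formats a ratio such as 0.15 as "15.00%".
    public static String asPercent(double diff) {
        return String.format(Locale.ROOT, "%.2f%%", diff * 100.0);
    }

    public static void main(String[] args) {
        // Hypothetical timings chosen for illustration only.
        double master = 20.0;
        double patch = 23.0;
        System.out.println(asPercent(relativeDiff(master, patch))); // prints "15.00%"
    }
}
```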
-------------
PR Comment: https://git.openjdk.org/jdk/pull/22629#issuecomment-3517416813