RFR: 8307084: C2: Vectorized drain loop is not executed for some small trip counts

Emanuel Peter epeter at openjdk.org
Tue Nov 11 15:22:20 UTC 2025


On Mon, 10 Nov 2025 16:07:35 GMT, Fei Gao <fgao at openjdk.org> wrote:

>> @fg1417 Are you still working on this?
>
> Hi @eme64, many thanks for your review. It’s really comprehensive and insightful. I’ve given a thumbs-up to all the comments that have been resolved in this commit.
> 
>> I have one concern: We have now changed the branches. With very few iterations, where we only go through the pre and post loops, there is now a long sequence of branches. It would be interesting to see the performance difference between master and patch.
> 
> Regarding this concern, I re-ran the microbenchmarks (now merged into the existing `VectorThroughputForIterationCount.java`), named `bench03*_drain_memoryBound`, and collected data on different platforms, including `128-bit` and `256-bit` `AArch64` machines as well as a `512-bit` `x86` machine.
> 
> To summarize, I observe a minor performance regression for small-iteration loops on the `128-bit` and `256-bit` `AArch64` platforms. For larger-iteration loops, there is either a performance improvement or no noticeable change. The performance data on the `512-bit x86` machine shows a similar trend, though the regression is more significant.
> 
> **The test range of `ITERATION_COUNT` is `0–300`. For larger `ITERATION_COUNT` values, there is either a performance improvement or no noticeable change, so those results are omitted. The following data only shows cases with regressions.**
> 
> 
> (FIXED_OFFSET)  (RANDOMIZE_OFFSETS)  (REPETITIONS)  (seed)  Mode  Cnt
> 0               TRUE                 1024           42      avgt  3
> 
> `Diff = (patch - master) / master`
> 
> On `128-bit aarch64` platform:
> 
> Benchmark                      (ITERATION_COUNT)    Units    Diff
> bench031B_drain_memoryBound    1                    ns/op    15.15%
> bench031B_drain_memoryBound    2                    ns/op    10.89%
> bench031B_drain_memoryBound    3                    ns/op    9.27%
> bench031B_drain_memoryBound    4                    ns/op    7.39%
> bench031B_drain_memoryBound    5                    ns/op    5.86%
> bench031B_drain_memoryBound    6                    ns/op    5.31%
> bench031B_drain_memoryBound    7                    ns/op    4.39%
> bench031B_drain_memoryBound    8                    ns/op    4.27%
> bench031B_drain_memoryBound    9                    ns/op    3.60%
> bench031B_drain_memoryBound    10                   ns/op    3.11%
> bench031B_drain_memoryBound    11                   ns/op    2.97%
> bench031B_drain_memoryBound    12                   ns/op    3.19%
> bench031B_drain_memoryBound    13                   ns/op    2.90%
> bench031B_drain_memoryBound    14                   ns/op    2.68%
> bench031B_drain_memoryBound    15                   ns/op    2.37%
> bench031B_drain_memoryBound    16                   ns/op    2.44%
> bench031B_drain_memoryBound    17                   ns/op    2.11%
> bench031B_drain_memoryBound    18    ns...

@fg1417 Thanks for benchmarking for my concern 😊 Your plot from above probably shows exactly what I was expecting:
<img width="300" height="486" alt="image" src="https://github.com/user-attachments/assets/3a8b6fb7-0a1b-431a-885d-13023aacc3ea" />

Seeing your results, I also lean toward considering them acceptable: very minor losses for small iteration counts, but a clear win in the middle.

I'll have a look at our smaller conversations now.

FYI: I'm generally really impressed by how clean the results in your plots are :)
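
A minimal sketch of how the `Diff` column in the quoted tables is derived, assuming it is computed per row as `(patch - master) / master` on the `ns/op` averages. Since the unit is time per operation, a positive value means the patch is slower. The class name `RelativeDiff`, the method name, and the sample timings are hypothetical and are not taken from the PR or the benchmark runs.

```java
import java.util.Locale;

// Hypothetical helper, not part of the PR: relative change of patch vs. master.
final class RelativeDiff {

    // Returns (patch - master) / master as a fraction; multiply by 100 for percent.
    static double relativeDiff(double masterNsPerOp, double patchNsPerOp) {
        if (masterNsPerOp <= 0.0) {
            throw new IllegalArgumentException("master time must be positive");
        }
        return (patchNsPerOp - masterNsPerOp) / masterNsPerOp;
    }

    public static void main(String[] args) {
        // Made-up timings for illustration only.
        double masterNsPerOp = 200.0;
        double patchNsPerOp = 210.0;
        double diff = relativeDiff(masterNsPerOp, patchNsPerOp);
        // Locale.ROOT keeps the decimal separator stable across machines.
        System.out.println(String.format(Locale.ROOT, "Diff = %.2f%%", 100.0 * diff));
    }
}
```

Normalizing by the master time makes rows with different `ITERATION_COUNT` values comparable, because the absolute `ns/op` grows with the iteration count.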

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22629#issuecomment-3517416813


More information about the hotspot-compiler-dev mailing list