RFR: 8307084: C2: Vectorized drain loop is not executed for some small trip counts
Fei Gao
fgao at openjdk.org
Mon Nov 10 16:09:54 UTC 2025
On Fri, 17 Oct 2025 13:05:47 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
>>> BTW: I just integrated https://github.com/openjdk/jdk/pull/24278 which may have silent merge conflicts, so it would be good if you merged and tested again.
>>
>> Hi @eme64 , I’ve rebased the patch onto the latest JDK, and all tier1 to tier3 tests have passed on my local AArch64 and x86 machines.
>>
>>> It would be good if you re-ran the benchmarks. It seems the last ones you did in December of 2024.
>>> We should see that we have various benchmarks, both for array and MemorySegment.
>>> You could look at the array benchmarks from here: https://github.com/openjdk/jdk/pull/22070
>>
>> I also re-verified the benchmark from [PR #22070](https://github.com/openjdk/jdk/pull/22070) on 128-bit, 256-bit, and 512-bit vector machines. The results show no significant regressions and performance changes are consistent with the previous round described in [perf results]( https://bugs.openjdk.org/browse/JDK-8307084?focusedId=14729524&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14729524).
>>
>>> Once you do that I could also run some internal testing, if you like :)
>>
>> I’d really appreciate it if you could run some internal testing at a time you think is suitable.
>> Thanks :)
>
> @fg1417 Are you still working on this?
Hi @eme64, many thanks for your review. It’s really comprehensive and insightful. I’ve given a thumbs-up to all the comments that have been resolved in this commit.
> I have one concern: We now have changed the branches. There is now a long sequence of branches if we have very few iterations, so that we only go through pre and post loop. It would be interesting to see what the performance difference is between master and patch.
Regarding this concern, I re-ran the microbenchmarks named `bench03*_drain_memoryBound` (now merged into the existing `VectorThroughputForIterationCount.java`) and collected data on several platforms: `128-bit` and `256-bit` `AArch64` machines and a `512-bit` `x86` machine.
To summarize, I observe a minor performance regression for small-iteration loops on the `128-bit` and `256-bit` `AArch64` platforms. For larger-iteration loops, there is either a performance improvement or no noticeable change. The performance data on the `512-bit x86` machine shows a similar trend, though the regression is more significant.
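To make the small-iteration case concrete, here is a simplified model of how a trip count divides between an unrolled vector main loop, a vectorized drain loop, and a scalar post loop. It is only a sketch: `LoopSplitModel`, `lanes`, and `unroll` are hypothetical names and parameters, the pre-loop is ignored, and it does not reproduce the zero-trip guards that C2 actually emits on master or with this patch.

```java
// Conceptual model only; not C2 code. All names and parameters are assumptions.
final class LoopSplitModel {
    final int mainIters;   // iterations covered by the unrolled vector main loop
    final int drainIters;  // iterations covered by the vectorized drain loop
    final int postIters;   // iterations left for the scalar post loop

    private LoopSplitModel(int mainIters, int drainIters, int postIters) {
        this.mainIters = mainIters;
        this.drainIters = drainIters;
        this.postIters = postIters;
    }

    /** Splits tripCount for a vector of `lanes` elements unrolled `unroll` times. */
    static LoopSplitModel split(int tripCount, int lanes, int unroll) {
        if (tripCount < 0 || lanes <= 0 || unroll <= 0) {
            throw new IllegalArgumentException("invalid model parameters");
        }
        int remaining = tripCount;
        int mainStep = lanes * unroll;                 // elements per main-loop iteration
        int mainIters = (remaining / mainStep) * mainStep;
        remaining -= mainIters;
        int drainIters = (remaining / lanes) * lanes;  // whole vectors that still fit
        remaining -= drainIters;
        return new LoopSplitModel(mainIters, drainIters, remaining);
    }

    @Override
    public String toString() {
        return "main=" + mainIters + " drain=" + drainIters + " post=" + postIters;
    }

    public static void main(String[] args) {
        // lanes=16 models byte elements in a 128-bit vector; unroll=4 is an assumed factor.
        for (int n : new int[] {3, 20, 100}) {
            System.out.println(n + " -> " + split(n, 16, 4));
        }
    }
}
```

In this model, a trip count below `lanes` never reaches a vector loop, so any extra cost there comes only from the guards it has to pass through on the way to the scalar loop.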
**The test range of `ITERATION_COUNT` is `0–300`. For larger `ITERATION_COUNT` values, there is either a performance improvement or no noticeable change, so those results are omitted. The following data only shows cases with regressions.**
(FIXED_OFFSET) (RANDOMIZE_OFFSETS) (REPETITIONS) (seed) Mode Cnt
0 TRUE 1024 42 avgt 3
`Diff = (patch - master) / master`
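As a quick illustration of the formula, the sketch below computes `Diff` as a percentage. The class name `RelativeDiff` and the timings in `main` are made up for the example and are not measurements from these runs. Since the benchmarks report `avgt` in `ns/op`, a positive `Diff` means the patch is slower than master.

```java
// Hypothetical helper for illustration; the timings in main are invented values.
final class RelativeDiff {
    /** Returns (patch - master) / master, expressed as a percentage. */
    static double diffPercent(double masterNsPerOp, double patchNsPerOp) {
        if (masterNsPerOp <= 0.0) {
            throw new IllegalArgumentException("master time must be positive");
        }
        return (patchNsPerOp - masterNsPerOp) / masterNsPerOp * 100.0;
    }

    public static void main(String[] args) {
        double master = 8.0; // assumed ns/op on master
        double patch = 9.0;  // assumed ns/op with the patch
        System.out.printf("Diff = %.2f%%%n", diffPercent(master, patch));
    }
}
```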
On the `128-bit AArch64` platform:
Benchmark (ITERATION_COUNT) Units Diff
bench031B_drain_memoryBound 1 ns/op 15.15%
bench031B_drain_memoryBound 2 ns/op 10.89%
bench031B_drain_memoryBound 3 ns/op 9.27%
bench031B_drain_memoryBound 4 ns/op 7.39%
bench031B_drain_memoryBound 5 ns/op 5.86%
bench031B_drain_memoryBound 6 ns/op 5.31%
bench031B_drain_memoryBound 7 ns/op 4.39%
bench031B_drain_memoryBound 8 ns/op 4.27%
bench031B_drain_memoryBound 9 ns/op 3.60%
bench031B_drain_memoryBound 10 ns/op 3.11%
bench031B_drain_memoryBound 11 ns/op 2.97%
bench031B_drain_memoryBound 12 ns/op 3.19%
bench031B_drain_memoryBound 13 ns/op 2.90%
bench031B_drain_memoryBound 14 ns/op 2.68%
bench031B_drain_memoryBound 15 ns/op 2.37%
bench031B_drain_memoryBound 16 ns/op 2.44%
bench031B_drain_memoryBound 17 ns/op 2.11%
bench031B_drain_memoryBound 18 ns/op 1.57%
bench031B_drain_memoryBound 19 ns/op 1.32%
bench031B_drain_memoryBound 20 ns/op 1.31%
bench031B_drain_memoryBound 21 ns/op 1.32%
bench031B_drain_memoryBound 22 ns/op 1.22%
bench031B_drain_memoryBound 23 ns/op 0.88%
bench031B_drain_memoryBound 24 ns/op 0.98%
bench031B_drain_memoryBound 25 ns/op 1.14%
bench031B_drain_memoryBound 26 ns/op 0.93%
bench031B_drain_memoryBound 27 ns/op 0.84%
bench031B_drain_memoryBound 28 ns/op 0.87%
bench031B_drain_memoryBound 29 ns/op 0.96%
bench031B_drain_memoryBound 30 ns/op 0.82%
bench032S_drain_memoryBound 1 ns/op 15.17%
bench032S_drain_memoryBound 2 ns/op 5.01%
bench032S_drain_memoryBound 3 ns/op 8.95%
bench032S_drain_memoryBound 4 ns/op 7.77%
bench032S_drain_memoryBound 5 ns/op 0.52%
bench032S_drain_memoryBound 6 ns/op -0.67%
bench032S_drain_memoryBound 7 ns/op 4.05%
bench032S_drain_memoryBound 8 ns/op 3.67%
bench032S_drain_memoryBound 9 ns/op -2.89%
bench032S_drain_memoryBound 10 ns/op 2.04%
bench032S_drain_memoryBound 11 ns/op -4.50%
bench032S_drain_memoryBound 12 ns/op -3.11%
bench032S_drain_memoryBound 13 ns/op 1.43%
bench032S_drain_memoryBound 14 ns/op -4.16%
bench032S_drain_memoryBound 15 ns/op -3.80%
bench034I_drain_memoryBound 1 ns/op 15.15%
bench034I_drain_memoryBound 2 ns/op 10.52%
bench034I_drain_memoryBound 3 ns/op 9.04%
bench034I_drain_memoryBound 4 ns/op 7.94%
bench034I_drain_memoryBound 5 ns/op 6.78%
bench034I_drain_memoryBound 6 ns/op 4.12%
bench034I_drain_memoryBound 7 ns/op 3.82%
bench035L_drain_memoryBound 1 ns/op 12.50%
bench035L_drain_memoryBound 2 ns/op 10.57%
bench035L_drain_memoryBound 3 ns/op 9.11%
bench035L_drain_memoryBound 4 ns/op 7.50%
bench035L_drain_memoryBound 5 ns/op 7.02%
On the `256-bit AArch64` platform:
Benchmark (ITERATION_COUNT) Units Diff
bench031B_drain_memoryBound 1 ns/op 14.01%
bench031B_drain_memoryBound 2 ns/op 11.00%
bench031B_drain_memoryBound 3 ns/op 12.57%
bench031B_drain_memoryBound 4 ns/op 8.25%
bench031B_drain_memoryBound 5 ns/op 9.71%
bench031B_drain_memoryBound 6 ns/op 7.00%
bench031B_drain_memoryBound 7 ns/op 4.09%
bench031B_drain_memoryBound 8 ns/op 6.48%
bench031B_drain_memoryBound 9 ns/op 4.30%
bench031B_drain_memoryBound 10 ns/op 5.28%
bench031B_drain_memoryBound 11 ns/op 4.58%
bench031B_drain_memoryBound 12 ns/op 3.84%
bench031B_drain_memoryBound 13 ns/op 3.51%
bench031B_drain_memoryBound 14 ns/op 3.49%
bench031B_drain_memoryBound 15 ns/op 3.21%
bench031B_drain_memoryBound 16 ns/op 2.97%
bench031B_drain_memoryBound 17 ns/op 2.04%
bench031B_drain_memoryBound 18 ns/op 1.75%
bench031B_drain_memoryBound 19 ns/op 0.83%
bench031B_drain_memoryBound 20 ns/op 0.92%
bench031B_drain_memoryBound 21 ns/op 1.67%
bench031B_drain_memoryBound 22 ns/op 0.33%
bench031B_drain_memoryBound 23 ns/op 1.02%
bench032S_drain_memoryBound 1 ns/op 12.33%
bench032S_drain_memoryBound 2 ns/op 8.75%
bench032S_drain_memoryBound 3 ns/op 8.75%
bench032S_drain_memoryBound 4 ns/op 7.40%
bench032S_drain_memoryBound 5 ns/op 6.90%
bench032S_drain_memoryBound 6 ns/op 5.33%
bench032S_drain_memoryBound 7 ns/op 7.30%
bench032S_drain_memoryBound 8 ns/op 3.44%
bench032S_drain_memoryBound 9 ns/op 0.59%
bench032S_drain_memoryBound 10 ns/op 1.81%
bench032S_drain_memoryBound 11 ns/op 0.94%
bench032S_drain_memoryBound 12 ns/op 0.80%
bench032S_drain_memoryBound 13 ns/op 0.08%
bench032S_drain_memoryBound 14 ns/op 1.01%
bench032S_drain_memoryBound 15 ns/op 0.55%
bench032S_drain_memoryBound 16 ns/op 0.14%
bench032S_drain_memoryBound 17 ns/op 0.41%
bench032S_drain_memoryBound 18 ns/op 0.22%
bench032S_drain_memoryBound 19 ns/op 0.44%
bench034I_drain_memoryBound 1 ns/op 15.41%
bench034I_drain_memoryBound 2 ns/op 14.37%
bench034I_drain_memoryBound 3 ns/op 10.95%
bench034I_drain_memoryBound 4 ns/op 9.54%
bench034I_drain_memoryBound 5 ns/op 6.94%
bench034I_drain_memoryBound 6 ns/op 7.16%
bench034I_drain_memoryBound 7 ns/op 5.35%
bench034I_drain_memoryBound 8 ns/op 5.13%
bench034I_drain_memoryBound 9 ns/op 5.42%
bench034I_drain_memoryBound 10 ns/op 4.20%
bench034I_drain_memoryBound 11 ns/op 3.83%
bench035L_drain_memoryBound 1 ns/op 12.94%
bench035L_drain_memoryBound 2 ns/op 11.69%
bench035L_drain_memoryBound 3 ns/op 8.99%
bench035L_drain_memoryBound 4 ns/op 8.67%
bench035L_drain_memoryBound 5 ns/op 6.93%
On the `512-bit x86` machine, for the `byte` type, the regression is quite noticeable. A graph might illustrate this more clearly.
<img width="1659" height="1026" alt="bench031B_drain_memoryBound on 512 x86" src="https://github.com/user-attachments/assets/79588b90-9ecc-4c92-a454-cbf523f0e5b8" />
For the other data types:
Benchmark (ITERATION_COUNT) Units Diff
bench032S_drain_memoryBound 1 ns/op 5.56%
bench032S_drain_memoryBound 2 ns/op 4.30%
bench032S_drain_memoryBound 3 ns/op 15.05%
bench032S_drain_memoryBound 4 ns/op 10.83%
bench032S_drain_memoryBound 5 ns/op 11.13%
bench032S_drain_memoryBound 6 ns/op 2.27%
bench032S_drain_memoryBound 7 ns/op 11.13%
bench032S_drain_memoryBound 8 ns/op 1.29%
bench032S_drain_memoryBound 9 ns/op 12.30%
bench032S_drain_memoryBound 10 ns/op -2.16%
bench032S_drain_memoryBound 11 ns/op 11.14%
bench032S_drain_memoryBound 12 ns/op 4.56%
bench032S_drain_memoryBound 13 ns/op 10.08%
bench032S_drain_memoryBound 14 ns/op -0.14%
bench032S_drain_memoryBound 15 ns/op 10.33%
bench032S_drain_memoryBound 16 ns/op 0.68%
bench032S_drain_memoryBound 17 ns/op 5.01%
bench032S_drain_memoryBound 18 ns/op -0.12%
bench032S_drain_memoryBound 19 ns/op 1.54%
bench032S_drain_memoryBound 20 ns/op 0.38%
bench032S_drain_memoryBound 21 ns/op 0.65%
bench032S_drain_memoryBound 22 ns/op 4.38%
bench032S_drain_memoryBound 23 ns/op 2.54%
bench032S_drain_memoryBound 24 ns/op -0.46%
bench032S_drain_memoryBound 25 ns/op 0.33%
bench032S_drain_memoryBound 26 ns/op 1.06%
bench032S_drain_memoryBound 27 ns/op 4.41%
bench032S_drain_memoryBound 28 ns/op 0.34%
bench032S_drain_memoryBound 29 ns/op 1.35%
bench032S_drain_memoryBound 30 ns/op 0.58%
bench032S_drain_memoryBound 31 ns/op 3.00%
bench032S_drain_memoryBound 32 ns/op -2.67%
bench032S_drain_memoryBound 33 ns/op 3.62%
bench032S_drain_memoryBound 34 ns/op 3.35%
bench032S_drain_memoryBound 35 ns/op 1.01%
bench032S_drain_memoryBound 36 ns/op -1.65%
bench032S_drain_memoryBound 37 ns/op -1.65%
bench032S_drain_memoryBound 38 ns/op 2.91%
bench032S_drain_memoryBound 39 ns/op 3.44%
bench032S_drain_memoryBound 40 ns/op 1.38%
bench032S_drain_memoryBound 41 ns/op -0.18%
bench032S_drain_memoryBound 42 ns/op 1.58%
bench032S_drain_memoryBound 43 ns/op 2.05%
bench032S_drain_memoryBound 44 ns/op 3.22%
bench032S_drain_memoryBound 45 ns/op -1.45%
bench032S_drain_memoryBound 46 ns/op 0.81%
bench032S_drain_memoryBound 47 ns/op 0.67%
bench032S_drain_memoryBound 48 ns/op 0.26%
bench032S_drain_memoryBound 49 ns/op 2.81%
bench032S_drain_memoryBound 50 ns/op -1.97%
bench032S_drain_memoryBound 51 ns/op 3.71%
bench032S_drain_memoryBound 52 ns/op 2.98%
bench032S_drain_memoryBound 53 ns/op -0.54%
bench032S_drain_memoryBound 55 ns/op 8.44%
bench034I_drain_memoryBound 1 ns/op 10.82%
bench034I_drain_memoryBound 2 ns/op 12.22%
bench034I_drain_memoryBound 3 ns/op 6.62%
bench034I_drain_memoryBound 4 ns/op 11.52%
bench034I_drain_memoryBound 5 ns/op 7.84%
bench034I_drain_memoryBound 6 ns/op 9.48%
bench034I_drain_memoryBound 7 ns/op 7.41%
bench034I_drain_memoryBound 8 ns/op 2.55%
bench034I_drain_memoryBound 9 ns/op 4.28%
bench034I_drain_memoryBound 10 ns/op 6.15%
bench034I_drain_memoryBound 11 ns/op 5.07%
bench034I_drain_memoryBound 12 ns/op 6.84%
bench034I_drain_memoryBound 13 ns/op 3.45%
bench034I_drain_memoryBound 14 ns/op 4.99%
bench034I_drain_memoryBound 15 ns/op 4.34%
bench034I_drain_memoryBound 16 ns/op 7.29%
bench034I_drain_memoryBound 17 ns/op 4.74%
bench034I_drain_memoryBound 18 ns/op 2.25%
bench034I_drain_memoryBound 19 ns/op 6.39%
bench034I_drain_memoryBound 20 ns/op 2.52%
bench034I_drain_memoryBound 21 ns/op 3.82%
bench034I_drain_memoryBound 22 ns/op -0.49%
bench034I_drain_memoryBound 23 ns/op 4.22%
bench034I_drain_memoryBound 24 ns/op 3.17%
bench034I_drain_memoryBound 25 ns/op 2.89%
bench034I_drain_memoryBound 26 ns/op 2.05%
bench034I_drain_memoryBound 27 ns/op 3.43%
bench035L_drain_memoryBound 1 ns/op 7.70%
bench035L_drain_memoryBound 2 ns/op 8.36%
bench035L_drain_memoryBound 3 ns/op 5.62%
bench035L_drain_memoryBound 4 ns/op 0.02%
bench035L_drain_memoryBound 5 ns/op 5.58%
bench035L_drain_memoryBound 6 ns/op 13.26%
bench035L_drain_memoryBound 7 ns/op 6.33%
bench035L_drain_memoryBound 8 ns/op 4.58%
bench035L_drain_memoryBound 9 ns/op 8.82%
bench035L_drain_memoryBound 10 ns/op 2.15%
bench035L_drain_memoryBound 11 ns/op 6.71%
bench035L_drain_memoryBound 12 ns/op 15.44%
bench035L_drain_memoryBound 13 ns/op -1.53%
The marginal regressions on the `AArch64` machines, and for most data types on the `x86` machine, are relatively predictable and acceptable. However, the fluctuations observed on the `x86` machine for the `byte` case are somewhat unusual. What do you think?
> It would also be interesting to see a case where the SIZE of the array is not constant, and so the branches become impossible to predict, and there are a lot of branch misses. What do you think?
Regarding this case, I also ran a set of microbenchmarks named `bench03*_drain_dynamic`, which are included in `VectorThroughputForIterationCount.java`. Do these benchmarks make sense to you in the context of this issue?
If so, the results show no noticeable performance regression on either the `x86` or the `AArch64` platforms, only some performance improvements.
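For readers without the benchmark source at hand, the idea of a non-constant trip count can be sketched as follows. This is not the code of `bench03*_drain_dynamic`: the class `DynamicTripCounts`, the add kernel, and the reuse of `REPETITIONS = 1024`, `seed = 42`, and the `0–300` range are assumptions made for illustration only.

```java
import java.util.Random;

// Illustrative sketch only; not the actual JMH benchmark.
final class DynamicTripCounts {
    /** Draws one iteration count in [0, maxCount] per repetition from a seeded generator. */
    static int[] randomCounts(int repetitions, int maxCount, long seed) {
        Random random = new Random(seed);
        int[] counts = new int[repetitions];
        for (int i = 0; i < repetitions; i++) {
            counts[i] = random.nextInt(maxCount + 1);
        }
        return counts;
    }

    /** Assumed kernel: element-wise byte addition over the first n elements. */
    static void kernel(byte[] a, byte[] b, byte[] c, int n) {
        for (int i = 0; i < n; i++) {
            c[i] = (byte) (a[i] + b[i]);
        }
    }

    /** Runs the kernel once per count, so the loop-entry branches see varying trip counts. */
    static long run(int[] counts, int maxCount) {
        byte[] a = new byte[maxCount];
        byte[] b = new byte[maxCount];
        byte[] c = new byte[maxCount];
        for (int i = 0; i < maxCount; i++) {
            a[i] = (byte) i;
            b[i] = (byte) (2 * i);
        }
        long checksum = 0;
        for (int n : counts) {
            kernel(a, b, c, n);
            if (n > 0) {
                checksum += c[n - 1]; // keep the result live
            }
        }
        return checksum;
    }

    public static void main(String[] args) {
        int[] counts = randomCounts(1024, 300, 42L);
        System.out.println("checksum = " + run(counts, 300));
    }
}
```

Because each call sees a different `n`, the guards in front of the main, drain, and post loops cannot settle into one predictable pattern, which is the situation the question above is about.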
Taking the `256-bit AArch64` platform as an example, here are the results:
`Units: ns/op`
<img width="1650" height="1003" alt="256byte" src="https://github.com/user-attachments/assets/e03428a8-451a-4e13-b5fd-fd7ddebb5ff5" />
<img width="1650" height="1003" alt="256 short" src="https://github.com/user-attachments/assets/af89a665-53fe-4c55-a37d-6df59eb32454" />
<img width="1650" height="1003" alt="256int" src="https://github.com/user-attachments/assets/0d367fff-b4ec-486f-b266-cb32586648d1" />
<img width="1650" height="1003" alt="256long" src="https://github.com/user-attachments/assets/a8a6ff91-9828-4ae2-b2a4-57c6be82158d" />
-------------
PR Comment: https://git.openjdk.org/jdk/pull/22629#issuecomment-3512609154