RFR: 8355585: Aarch64: Add aarch64 backend for Float16 vector operations [v4]

Thu May 22 06:19:55 UTC 2025

On Wed, 21 May 2025 16:44:35 GMT, Bhavana Kilambi <bkilambi at openjdk.org> wrote:

>> @Bhavana-Kilambi Ok, yes, 20min is a bit excessive 😆 
>> 
>> Generally, we should periodically run all vector tests with various `MaxVectorSize` settings. But doing that all the time is often too time consuming. For some specific tests, it can make sense though to iterate over multiple sizes.
>> 
>> I wonder if you could also reduce the runtime of the test in other ways? Maybe reduce the warmup? It seems a bit excessive to do `10000` warmup iterations, which each execute a loop with many iterations themselves.
>
> Hi @eme64 I removed the `@Warmup` entirely and the test does pass on aarch64. Although I am a bit afraid to fully remove it as it could sometimes lead to the loop not being warm enough for c2 vectorization to kick in. I haven't tried with different values of the warmup iterations though. Do you think it's ok to remove it entirely?

@Bhavana-Kilambi The TestFramework actually forces C2 compilation:
- It runs the warmup iterations; C2 may trigger on its own if there are enough of them.
- Once warmup is over, the TestFramework checks whether the method is already compiled and, if not, enqueues it for compilation.
- In the end, we know the method is C2 compiled, which gives us the C2 IR we can match against.

In my experience, a low warmup count works in most cases, except when you need profiling data. With zero warmup, we basically get the same compilation as with `-Xcomp`.

So it really depends on your specific case. In general, I would avoid an `-Xcomp` compilation / zero warmup, because then we do not test normal compilation with profiling, and I think compilation with profiling is more important.

But in cases where the test method contains a large loop, we would trigger OSR and normal compilation with profiling rather soon anyway, so lowering the warmup is ok. How many loop iterations do we need for OSR?
`product(intx, Tier4BackEdgeThreshold, 40000`. We could round that up to `100_000`, just to be sure. With `LEN = 2048`, you would thus only need about `50` invocations of the test during warmup to reach C2 compilation. Hence, the current `@Warmup(10000)` is much too high, I think. You could cut down the runtime by about a factor of `100` here, if my math is correct :exploding_head: 
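A quick sketch of that arithmetic, in case it helps to sanity-check the numbers. The class and method names are made up for illustration, and it assumes each test invocation contributes exactly `LEN` back-edges (one loop over the array), which may not hold for every test method. The new warmup value of `100` is also just an assumed round number with some margin, not something the TestFramework requires.

```java
public class WarmupEstimate {
    // Ceiling division: smallest number of invocations whose combined
    // loop iterations reach the given back-edge budget.
    // Assumes one loop of `loopLength` iterations per invocation.
    static long invocationsNeeded(long backEdgeBudget, long loopLength) {
        return (backEdgeBudget + loopLength - 1) / loopLength;
    }

    // How much less warmup work we do when going from one warmup count to another.
    static long reductionFactor(long currentWarmup, long newWarmup) {
        return currentWarmup / newWarmup;
    }

    public static void main(String[] args) {
        long budget = 100_000;        // Tier4BackEdgeThreshold (40000), rounded up generously
        long len = 2048;              // LEN from the test
        long currentWarmup = 10_000;  // current @Warmup value
        long newWarmup = 100;         // assumed new value, with margin over the minimum

        System.out.println("invocations needed: " + invocationsNeeded(budget, len));
        System.out.println("reduction factor:   " + reductionFactor(currentWarmup, newWarmup));
    }
}
```

With these inputs the minimum comes out just under `50` invocations, so a warmup of `100` still leaves roughly a 2x margin.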

What do you think?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/25096#issuecomment-2900043919