RFR: 8308606: C2 SuperWord: remove alignment checks when not required [v4]

Wed Jun 14 06:26:56 UTC 2023

On Fri, 9 Jun 2023 06:46:30 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> This change should strictly expand the set of vectorized loops. And this change also makes `SuperWord` conceptually simpler.
>> 
>> As discussed in https://github.com/openjdk/jdk/pull/12350, we should remove the alignment checks when alignment is actually not required (either by the hardware or explicitly asked for with `-XX:+AlignVector`). We did not do it directly in the same task to avoid too many changes of behavior.
>> 
>> This alignment check was originally there instead of a proper dependency checker. Requiring alignments on the packs per memory slice meant that all vector lanes were aligned, and there could be no cross-iteration dependencies that lead to cycles. But this is not general enough (we may for example allow the vector lanes to cross at some point). And we now have proper independence checks in `SuperWord::combine_packs`, as well as the cycle check in `SuperWord::schedule`.
>> 
>> Alignment is nice when we can make it happen, as it ensures that we do not have memory accesses across cache lines. But we should not prevent vectorization just because we cannot align all memory accesses for the same memory slice. As the benchmark shows below, we get a good speedup from vectorizing unaligned memory accesses.
>> 
>> Note: this reduces the `CompileCommand Option Vectorize` flag to now only controlling if we use the `CloneMap` or not. Read more about that in this PR https://github.com/openjdk/jdk/pull/13930. In the benchmarks below you can find some examples that only vectorize with or only vectorize without the `Vectorize` flag. My goal is to eventually try out both approaches and pick the better one, removing the need for the flag entirely (see "**Unifying multiple SuperWord Strategies and beyond**" below).
>> 
>> **Changes to Tests**
>> I could remove the `CompileCommand Option Vectorize` from `TestDependencyOffsets.java`, which means that those loops now vectorize without the need of the flag.
>> 
>> `LoopArrayIndexComputeTest.java` had a few "negative" tests that expeced that there is no vectorization because of "dependencies". But they were not real dependencies since they were "read forward" cases. I now check that those do vectorize, and added symmetric tests that are "read backward" cases which should currently not vectorize. However, these are still not "real dependencies" either: the arrays that are used could in theory be proven to be not equal, and then the dependencies could be dropped. But I think it is ok to leave them as "negative" tests for...
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Add vm.flagless back in for LoopArrayIndexComputeTest.java

> Alignment is nice when we can make it happen, as it ensures that we do not have memory accesses across cache lines. But we should not prevent vectorization just because we cannot align all memory accesses for the same memory slice. As the benchmark shows below, we get a good speedup from vectorizing unaligned memory accesses.

Hi @eme64 , nice rewrite!

May I ask if you have any benchmark data of misaligned-load-store cases for other data types? For example, `Double` or `Long` on 128-bit machines (maybe aarch64 asimd).

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14096#issuecomment-1590549717