RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory

Tue Feb 18 19:26:04 UTC 2025

On Tue, 18 Feb 2025 10:07:07 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>>> That one is more tricky. Because what if the loop somehow gets folded away? How would we catch that?
>> 
>> There is code that removes the `OuterStripMinedLoop` if the `CountedLoop` goes away and also, if I recall correctly, logic that verifies no `OuterStripMinedLoop` is left behind without a `CountedLoop` so it's probably possible. Question is whether we want that or not. Seems like quite a bit of extra complexity.
> 
> Hmm ok, I see. I wonder how bad it is to leave the slow-loop there until after loop-opts. I mean it was already created, and it now has no loop-opts performed on it (it is stalled), so it just sits there like dead code. So I'm not sure there is really a performance benefit to kill it already a little earlier. Maybe a very small one?

@eme64, my main concern is that the loop multiversioning code will blow up inlining decisions. Our benchmarks may not be affected because we may never trigger the multiversioning code on our hardware (as Roland pointed out). Maybe you can force its generation and then compare performance. Do we really need it for this change? Can we simply generate an un-vectorized loop?

"x86 and aarch64 are unaffected". Which platforms are affected? Should we really take on extra code complexity for platforms we don't support?

Another question: which deoptimization `Action` is taken when the predicate fails? I saw the comment in the code, "We only want to use the auto-vectorization check as a trap once per bci." Does it mean you immediately deoptimize the code? Can we hit the uncommon trap a few times before deoptimizing? Deoptimizing after one trap assumes we will process the same unaligned data again. In a test that could be true, but is it true in reality?
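
To make the "same unaligned data again" question concrete, here is a minimal sketch of the kind of access pattern I have in mind. It is illustrative only and not taken from the PR: the class and method names (`UnalignedSum`, `sum`, `run`) are hypothetical, and it assumes a JDK where `java.lang.foreign` is final (22 or later). The same loop is called once with an aligned base offset and once with a base offset of 1 byte; whether C2 vectorizes it, and which path it takes on a given platform, depends on the build and flags and is not something this sketch demonstrates.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class UnalignedSum {
    // Sums `count` ints read from native memory, starting at byte offset `base`.
    // With base == 1 every 4-byte access is misaligned relative to the segment start.
    static int sum(MemorySegment seg, long base, int count) {
        int total = 0;
        for (int i = 0; i < count; i++) {
            total += seg.get(ValueLayout.JAVA_INT_UNALIGNED, base + 4L * i);
        }
        return total;
    }

    // Fills a native segment with 0..count-1 at the given base offset, then sums it.
    static int run(long base) {
        int count = 1024;
        try (Arena arena = Arena.ofConfined()) {
            // A few spare bytes so a non-zero base offset stays in bounds.
            MemorySegment seg = arena.allocate(4L * count + 8, 8);
            for (int i = 0; i < count; i++) {
                seg.set(ValueLayout.JAVA_INT_UNALIGNED, base + 4L * i, i);
            }
            return sum(seg, base, count);
        }
    }

    public static void main(String[] args) {
        System.out.println(run(0)); // aligned base offset
        System.out.println(run(1)); // base offset shifted by one byte
    }
}
```

In a test, the misaligned call would typically repeat with the same offset; in an application, a caller may pass a misaligned base once and aligned bases afterwards, which is the case I am asking about.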

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2666176147