RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory
Emanuel Peter
epeter at openjdk.org
Tue Feb 18 19:26:07 UTC 2025
On Tue, 18 Feb 2025 16:10:20 GMT, Vladimir Kozlov <kvn at openjdk.org> wrote:
>>> > That one is more tricky. Because what if the loop somehow gets folded away? How would we catch that?
>>
>>>There is code that removes the OuterStripMinedLoop if the CountedLoop goes away and also, if I recall correctly, logic that verifies no `OuterStripMinedLoop` is left behind without a `CountedLoop`, so it's probably possible. Question is whether we want that or not. Seems like quite a bit of extra complexity.
>>
>> Hmm ok, I see. I wonder how bad it is to leave the slow-loop there until after loop-opts. I mean it was already created, and it now has no loop-opts performed on it (it is stalled), so it just sits there like dead code. So I'm not sure there is really a performance benefit to kill it already a little earlier. Maybe a very small one?
>
> @eme64, my main concern is that the loop multi-versioning code will blow up inlining decisions. Our benchmarks may not be affected because we may never trigger the multi-versioned code on our hardware (as Roland pointed out). Maybe you can force its generation and then compare performance. Do we really need it for these changes? Can we simply generate an un-vectorized loop?
>
> " x86 and aarch64 are unaffected". Which platforms are affected? Do we really should sacrifice code complexity for platforms we don't support?
>
> Another question is: what deoptimization `Action` is taken when the predicate fails? I saw a comment in the code: "We only want to use the auto-vectorization check as a trap once per bci." Does it mean you immediately deoptimize the code? Can we hit the uncommon trap a few times before deoptimization? Deoptimization after one trap assumes we will process the same unaligned data again. In a test that could be true, but is it also true in reality?
@vnkozlov
> " x86 and aarch64 are unaffected". Which platforms are affected? Do we really should sacrifice code complexity for platforms we don't support?
I would say most of the code here, i.e. the predicate and multi-versioning parts, is also relevant for the upcoming patch for aliasing-analysis runtime checks. These are especially important for `MemorySegment` cases, where there could basically always be aliasing, and only runtime checks can help us vectorize.
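To illustrate why possible aliasing blocks vectorization unless a runtime check proves the ranges disjoint, here is a small self-contained Java sketch (not HotSpot code; the class and method names are made up). The "blocked" variant simulates vector semantics by loading a group of four elements before storing any of them, which changes the result when source and destination overlap:

```java
public class AliasDemo {
    // Scalar semantics: strictly element-by-element, correct even when
    // the destination and source ranges overlap.
    static void addScalar(int[] a, int dstOff, int srcOff, int n) {
        for (int i = 0; i < n; i++) {
            a[dstOff + i] = a[srcOff + i] + 1;
        }
    }

    // Simulated "vectorized" semantics: load a block of 4 elements, then
    // store the whole block. This reordering is only legal if the compiler
    // can prove (statically, or via a runtime check) that the ranges
    // do not alias.
    static void addBlocked(int[] a, int dstOff, int srcOff, int n) {
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            int v0 = a[srcOff + i],     v1 = a[srcOff + i + 1];
            int v2 = a[srcOff + i + 2], v3 = a[srcOff + i + 3];
            a[dstOff + i]     = v0 + 1; a[dstOff + i + 1] = v1 + 1;
            a[dstOff + i + 2] = v2 + 1; a[dstOff + i + 3] = v3 + 1;
        }
        for (; i < n; i++) {            // scalar tail loop
            a[dstOff + i] = a[srcOff + i] + 1;
        }
    }
}
```

With overlapping ranges (e.g. `dstOff = srcOff + 1`) the two methods produce different results, which is exactly why the vectorized loop version must be guarded by a runtime aliasing check.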
There is really only a small part that is specific to this bug, namely emitting the actual alignment check.
> Do we really need it for this changes? Can we simply generate un-vectorized loop?
The alternatives on architectures that are actually affected by this bug:
- Not fixing the bug, and risking a possible `SIGBUS`. And on our platforms, that just means living with the HALT caused by `VerifyAlignVector`.
- Disabling ALL vectorization of cases where we cannot statically guarantee that accesses are aligned. That would certainly disable vectorization for all uses of `MemorySegment`, and that is probably not preferable.
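Conceptually, the emitted alignment check only needs to test that the effective address is a multiple of the required alignment. A minimal Java sketch (variable names are hypothetical; this is not the actual C2 IR):

```java
public class AlignCheck {
    // A vector access at (base + offset) is safe on strict-alignment
    // hardware only if the address is a multiple of the required
    // alignment, typically the vector width in bytes (a power of two).
    static boolean isAligned(long base, long offset, int alignBytes) {
        return ((base + offset) & (alignBytes - 1)) == 0;
    }
}
```

For statically known bases (e.g. Java arrays) this can often be proven at compile time; for native memory reached via `MemorySegment`, only such a runtime check can decide it.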
> my main concern is that the loop multi-versioning code will blow up inlining decisions. Our benchmarks may not be affected because we may never trigger the multi-versioned code on our hardware (as Roland pointed out). Maybe you can force its generation and then compare performance.
Right. I suppose code size might be slightly affected. But I only multi-version if we are already going to pre-main-post the loop. And that means the loop is already copied 3x, so doing 4x is not that noticeable, I would suspect. Also, with OSR we currently don't generate predicates at all, so we already generate the multi-versioning for those cases. And I really could not measure any difference in the performance benchmarking. I doubt it is even noticeable in compile time.
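A rough Java sketch of the multi-versioned shape (purely illustrative: in reality the selection happens in compiled code, and the "fast" path stands in for the vectorized main loop plus its pre/post loops; `pseudoBaseAddr` is a made-up stand-in for the runtime-checked address):

```java
public class MultiVersion {
    static boolean aligned(long addr, int alignBytes) {
        return (addr & (alignBytes - 1)) == 0;
    }

    // One runtime check selects between a "fast" version (here: a 4-way
    // unrolled main loop plus a scalar post loop, standing in for the
    // vectorized loop) and a "slow" scalar version.
    static long sum(int[] a, long pseudoBaseAddr) {
        long s = 0;
        if (aligned(pseudoBaseAddr, 64)) {
            int i = 0;
            for (; i + 4 <= a.length; i += 4) {   // "vectorized" main loop
                s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
            }
            for (; i < a.length; i++) {           // post loop
                s += a[i];
            }
        } else {
            for (int i = 0; i < a.length; i++) {  // slow scalar version
                s += a[i];
            }
        }
        return s;
    }
}
```

Both versions compute the same result; the point is only that the loop body exists twice, which is the code-size cost being discussed.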
> Another question is: what deoptimization Action is taken when the predicate fails? I saw a comment in the code: "We only want to use the auto-vectorization check as a trap once per bci." Does it mean you immediately deoptimize the code? Can we hit the uncommon trap a few times before deoptimization? Deoptimization after one trap assumes we will process the same unaligned data again. In a test that could be true, but is it also true in reality?
Yes, when we deopt for the bci, we recompile immediately. The alternative is to make the check per method, but then the risk is that one loop deopting causes other loops to be multi-versioned instead of using predicates, too. Counting deopts per bci is currently not done at all. But I suppose we could make it a bit more "forgiving"... but is that worth it? I suppose if in reality we do see non-aligned cases (or, in the future, cases where we have problematic aliasing), then it will probably repeat, and it is worth recompiling to handle both cases. But that is speculation, and we can discuss :)
TLDR:
@vnkozlov I would not have fixed the bug with such a heavy mechanism if I did not intend to use it for the runtime checks for aliasing analysis. And 90% of the code here is reusable for that.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2666357998