RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory
Emanuel Peter
epeter at openjdk.org
Wed Feb 19 07:19:56 UTC 2025
On Tue, 18 Feb 2025 19:18:34 GMT, Vladimir Kozlov <kvn at openjdk.org> wrote:
>> Note: the approach with Predicates and Multiversioning prepares us well for Runtime Checks for Aliasing Analysis, see more below.
>>
>> **Background**
>>
>> With `-XX:+AlignVector`, all vector loads/stores must be aligned. We try to statically determine if we can always align the vectors. One condition is that the address `base` is already aligned. For arrays, we know that this always holds, because they are `ObjectAlignmentInBytes` aligned. But with native memory, the `base` is just some arbitrarily aligned pointer.
>>
>> **Problem**
>>
>> So far, we have naively assumed that the `base` is always `ObjectAlignmentInBytes` aligned. But that does not hold for `native` memory segments: the `base` can also be unaligned. I have constructed such an example, and with `-XX:+AlignVector -XX:+VerifyAlignVector` it hits the verification code.
>>
>>
>>     MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1);
>>     MemorySegment nativeUnaligned = nativeAligned.asSlice(1);
>>     test3(nativeUnaligned);
>>
>>
>> When compiling the test method, we assume that the `nativeUnaligned.address()` is aligned - but it is not!
>>
>>     static void test3(MemorySegment ms) {
>>         for (int i = 0; i < RANGE; i++) {
>>             long adr = i * 4L;
>>             int v = ms.get(ELEMENT_LAYOUT, adr);
>>             ms.set(ELEMENT_LAYOUT, adr, (int)(v + 1));
>>         }
>>     }
>>
>>
>> **Solution: Runtime Checks - Predicate and Multiversioning**
>>
>> Of course, we could simply forbid vectorization whenever the base is `native`. But that would cause regressions: in most cases the `base` is in fact aligned, and we currently vectorize those loops. Since we cannot statically determine whether the `base` is aligned, we need a runtime check.
>>
>> I came up with 2 options where to place the runtime checks:
>> - A new "auto vectorization" Parse Predicate:
>> - This only works when predicates are available.
>> - If we fail the predicate, then we recompile without the predicate. That means we cannot add a check to the predicate any more, and we would have to do multiversioning at that point if we still want to have a vectorized loop.
>> - Multiversion the loop:
>> - Create 2 copies of the loop (fast and slow loops).
>> - The `fast_loop` can make speculative alignment assumptions, and add the corresponding check to the `multiversion_if` which decides which loop we take
>> - In the `slow_loop`, we make no assumptions, which means we cannot vectorize, but we still compile - so even ...
>
> About the actual probability value: I was thinking of PROB_LIKELY_MAG(3). PROB_LIKELY_MAG(1) will only guarantee that the vectorized loop comes first, but that could be enough without moving the other loop off the hot path. Needs testing.
@vnkozlov I suggest that I change the probability to something quite low now, just to make sure that the fast-loop is placed nicely. When I do the experiments for the aliasing-analysis runtime-checks, I will be able to benchmark both cases much better, since it is much easier to create many different cases there. At that point, I could still change the probabilities to a different constant. Or maybe I can adjust the probabilities in the chain so that they are balanced: if there is 1 condition, give it `0.5`; if there are 2, give each `sqrt(0.5)`; if there are `n`, give each `pow(0.5, 1/n)`, so that once you multiply them you get `pow(pow(0.5, 1/n), n) = 0.5`. We could also pick a "target" probability other than `0.5`. The issue is that experimenting now is a little difficult, because I only have the alignment-checks to play with, and I think those very rarely fail in the "real world". Aliasing-checks are more likely to fail, so the benchmark results could be more interesting there.
Does that sound ok?
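
To make the balancing idea concrete, here is a minimal sketch in plain Java (not HotSpot code). The class and method names are hypothetical and only illustrate the arithmetic: each of `n` chained checks gets probability `pow(target, 1/n)`, so the product over the whole chain is `target` again. Note that the exponent has to be computed in floating point (`1.0 / n`), since an integer `1/n` would be `0` for `n > 1`.

```java
// Hypothetical illustration of the "balanced chain" probabilities.
// None of these names exist in HotSpot; this only shows the arithmetic.
class BalancedProbabilities {

    // Probability to assign to each of n chained checks so that the
    // probability of passing all of them multiplies out to `target`.
    static double perCheckProbability(double target, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        if (!(target > 0.0 && target <= 1.0)) {
            throw new IllegalArgumentException("target must be in (0, 1]");
        }
        return Math.pow(target, 1.0 / n); // floating-point exponent
    }

    // Probability of passing a chain of n checks that each pass
    // with probability perCheck (treated as independent).
    static double chainProbability(double perCheck, int n) {
        return Math.pow(perCheck, n);
    }

    public static void main(String[] args) {
        double target = 0.5;
        for (int n = 1; n <= 4; n++) {
            double p = perCheckProbability(target, n);
            System.out.printf("n=%d per-check=%.4f chain=%.4f%n",
                              n, p, chainProbability(p, n));
        }
    }
}
```

With this scheme the per-check probability moves towards 1 as `n` grows, while the probability of entering the fast loop stays at the chosen target.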
> Can we profile alignment in Interpreter (and C1)?
It would be nice if we could profile alignment or aliasing. Maybe that is possible. But I suppose there are always cases where profiling is not available (Xcomp ?), and we should have reasonable defaults there. We could investigate profiling in a second step, to improve things if we think that is worth it. Profiling these things would also be additional complexity - I'm not convinced yet it is worth it.
What do you think?
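
For readers following along, the alignment runtime check under discussion can be pictured at the Java source level roughly as below. This is only an analogy under stated assumptions, not what C2 emits: the class name, `isAligned`, the returned labels, and the 8-byte `ASSUMED_ALIGNMENT` are all hypothetical, `ELEMENT_LAYOUT` is assumed to be an unaligned `int` layout, and the allocation requests 8-byte alignment explicitly so that the slice at offset 1 is guaranteed to be misaligned.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Source-level analogy of a multiversioned loop guarded by an alignment check.
// All names here are hypothetical; C2 does this on its IR, not in Java source.
class MultiversionSketch {
    static final int RANGE = 1024;
    // Assumption: an unaligned int layout, so access at any byte address is legal.
    static final ValueLayout.OfInt ELEMENT_LAYOUT = ValueLayout.JAVA_INT_UNALIGNED;
    // Assumption: alignment the fast loop speculates on (8 bytes here).
    static final long ASSUMED_ALIGNMENT = 8;

    // The runtime check: is the native base address suitably aligned?
    static boolean isAligned(MemorySegment ms) {
        return ms.isNative() && ms.address() % ASSUMED_ALIGNMENT == 0;
    }

    // Returns which version ran, so the choice is observable.
    static String test3(MemorySegment ms) {
        if (isAligned(ms)) {
            // "fast loop": the compiler may rely on the alignment assumption here.
            for (int i = 0; i < RANGE; i++) {
                long adr = i * 4L;
                int v = ms.get(ELEMENT_LAYOUT, adr);
                ms.set(ELEMENT_LAYOUT, adr, v + 1);
            }
            return "fast";
        } else {
            // "slow loop": same semantics, no alignment assumption.
            for (int i = 0; i < RANGE; i++) {
                long adr = i * 4L;
                int v = ms.get(ELEMENT_LAYOUT, adr);
                ms.set(ELEMENT_LAYOUT, adr, v + 1);
            }
            return "slow";
        }
    }

    public static void main(String[] args) {
        // Explicitly request 8-byte alignment for the base segment.
        MemorySegment nativeAligned = Arena.ofAuto().allocate(RANGE * 4 + 1, 8);
        MemorySegment nativeUnaligned = nativeAligned.asSlice(1);
        System.out.println(test3(nativeAligned));
        System.out.println(test3(nativeUnaligned));
    }
}
```

Both branches compute the same result; only the assumptions available to the compiler differ, which is why a failed check costs performance but not correctness.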
-------------
PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2667703955