RFR: 8323582: C2 SuperWord AlignVector: misaligned vector memory access with unaligned native memory

Mon Feb 24 07:25:59 UTC 2025

On Wed, 19 Feb 2025 16:14:09 GMT, Vladimir Kozlov <kvn at openjdk.org> wrote:

>> @vnkozlov I suggest that I change the probability to something quite low now, just to make sure that the fast-loop is placed nicely. When I do the experiments for aliasing-analysis runtime-checks, then I will be able to benchmark much better for both cases, since it is much easier to create many different cases. At that point, I could still adapt the probabilities to a different constant. Or maybe I can somehow adjust the probabilities in the chain such that they are balanced. Like if there is 1 condition, give it `0.5`, if there are 2 give them each `sqrt(0.5)`, if there are `n` then `pow(0.5, 1/n)`, so that once you multiply them you get `pow(pow(0.5, 1/n),n) = 0.5`. We could also set another "target" probability than `0.5`. The issue is that experimenting now is a little difficult, because I only have the alignment-checks to play with, which are really really rare to fail in the "real world", I think. But aliasing-checks are more likely to fail, so there could be more interesting benchmark results there.
>> 
>> Does that sound ok?
>> 
>>> Can we profile alignment in Interpreter (and C1)?
>> 
>> It would be nice if we could profile alignment or aliasing. Maybe that is possible. But I suppose there are always cases where profiling is not available (Xcomp ?), and we should have reasonable defaults there. We could investigate profiling in a second step, to improve things if we think that is worth it. Profiling these things would also be additional complexity - I'm not convinced yet it is worth it.
>> 
>> What do you think?
>
> You should not worry about `-Xcomp`: it is a testing flag, so we can use some default there.
> I am fine if you think profiling will not bring us much benefit. Note, I am not asking to create counters, just a bit to indicate whether we had an unaligned access to native memory in a method. In that case we may skip the predicate and generate the multiversioned loop during compilation. On the other hand, we may have unaligned accesses only during startup and not later when we compile the method. Anyway, it does not affect these changes.
> 
> I will look at the changes more later.
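
For reference, the balanced-probability idea quoted above can be written out as a small standalone calculation. This is only a sketch of the arithmetic, not C2 code: the class and method names (`BalancedProbability`, `perCheck`, `chain`) are hypothetical, and the target of `0.5` is just the example value from the comment.

```java
public final class BalancedProbability {
    // Per-check pass probability such that a chain of n independent checks
    // passes with probability `target` overall: pow(target, 1/n).
    static double perCheck(double target, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("need at least one check");
        }
        if (!(target > 0.0 && target <= 1.0)) {
            throw new IllegalArgumentException("target must be in (0, 1]");
        }
        return Math.pow(target, 1.0 / n);
    }

    // Probability that all n checks pass, assuming they are independent.
    static double chain(double perCheckProbability, int n) {
        double product = 1.0;
        for (int i = 0; i < n; i++) {
            product *= perCheckProbability;
        }
        return product;
    }

    public static void main(String[] args) {
        double target = 0.5; // example target from the discussion
        for (int n : new int[] {1, 2, 4, 8}) {
            double p = perCheck(target, n);
            System.out.println("n=" + n + " perCheck=" + p + " chain=" + chain(p, n));
        }
    }
}
```

With this scheme every check in the chain gets the same probability, and the product stays at the chosen target no matter how many checks are added.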

@vnkozlov I'll think about the "stall" vs "delay" suggestion.

> How profitable (performance-wise) is it to optimize the slow path loop? Can we skip any optimizations for it and treat it as not-Counted?

I suppose that depends on whether the slow path loop will be taken. Imagine we are working on some unaligned MemorySegment (or with aliasing runtime-checks failing). In these cases, without optimizing the slow path loop, we would for example not unroll it. But unrolling can give quite a speedup, of course at the cost of more compile time and code size. Also, some RangeCheck eliminations only happen if you have a pre-main-post loop structure. There are probably other optimizations as well. So yes, if the slow path loop is taken often, then optimizing it is probably worth it. What do you think?
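
To make the trade-off concrete, here is a conceptual sketch of the two-version shape being discussed. It is plain Java, not what C2 actually emits: the names (`MultiversionSketch`, `isAligned`, `fastSum`, `slowSum`), the 16-byte alignment, and the unroll factor of 4 are all assumptions for illustration, and the unrolled loop merely stands in for a vectorized main loop followed by a post loop.

```java
public final class MultiversionSketch {
    // Hypothetical vector width in bytes; a power of two.
    static final int VECTOR_BYTES = 16;

    // Runtime check that selects between the two loop versions.
    static boolean isAligned(long address) {
        return (address & (VECTOR_BYTES - 1)) == 0;
    }

    // Fast version: unrolled by 4, followed by a post loop for the tail.
    static long fastSum(int[] data) {
        long sum = 0;
        int i = 0;
        int limit = data.length - (data.length % 4);
        for (; i < limit; i += 4) {
            sum += (long) data[i] + data[i + 1] + data[i + 2] + data[i + 3];
        }
        for (; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Slow version: plain scalar loop, as if no further optimization were applied.
    static long slowSum(int[] data) {
        long sum = 0;
        for (int value : data) {
            sum += value;
        }
        return sum;
    }

    // Both versions compute the same result; only the chosen loop differs.
    static long sum(long address, int[] data) {
        return isAligned(address) ? fastSum(data) : slowSum(data);
    }
}
```

If the check fails on most calls, for example because the input is consistently unaligned, all the time is spent in `slowSum`, which is why leaving that version unoptimized could be costly.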

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22016#issuecomment-2677607527