RFR: 8324751: C2 SuperWord: Aliasing Analysis runtime check [v22]

Emanuel Peter epeter at openjdk.org
Mon Aug 25 08:37:08 UTC 2025


On Fri, 22 Aug 2025 16:18:17 GMT, Vladimir Kozlov <kvn at openjdk.org> wrote:

>> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   add test for related report for JDK-8365982
>
> This looks like "rabbit hole" :(
> 
> May be file a separate RFE to investigate this behavior later by some other engineer.  Most concerning is that reproduced on different platforms.
> 
> I agree that we may accept this regression since it happened in corner case. I assume our benchmarks are not affected by this. Right?

@vnkozlov Thanks for having a look!

> I noticed in the first (no_patch, fastest) assembly we don't have the "strip mining" outer loop, while in the other cases we have it. Do you know why?

I've seen that too, and I don't know why. The percentages on the "strip mining" outer loop are very low, though: about 0.3%, next to the hot block at 96.60%. Maybe the profiler just does not pick it up in one of the runs? Still a little strange.

The `patch` version has a "strip mined" outer loop for both the fast and slow path:
Loop: N0/N0  has_sfpt
  Loop: N2375/N2377  counted [0,int),+1 (4 iters)  pre multiversion_slow
  Loop: N361/N362  limit_check sfpts={ 364 }
    Loop: N4208/N359  limit_check counted [int,int),+64 (10966 iters)  main multiversion_slow has_sfpt strip_mined
  Loop: N2240/N2242  limit_check counted [int,int),+1 (4 iters)  post multiversion_slow
  Loop: N639/N651  counted [0,int),+1 (4 iters)  pre multiversion_fast
  Loop: N202/N201  limit_check sfpts={ 204 }
    Loop: N3244/N178  limit_check counted [int,int),+512 (10966 iters)  main vector multiversion_fast has_sfpt strip_mined
  Loop: N2591/N2594  limit_check counted [int,int),+64 (64 iters)  post vector multiversion_fast
  Loop: N504/N516  limit_check counted [int,int),+1 (4 iters)  post multiversion_fast


And so does the `not_profitable` version:

Loop: N0/N0  has_sfpt
  Loop: N489/N501  predicated counted [0,int),+1 (4 iters)  pre
  Loop: N213/N212  limit_check sfpts={ 215 }
    Loop: N1879/N189  limit_check counted [int,int),+64 (10034 iters)  main has_sfpt strip_mined
  Loop: N354/N366  limit_check counted [int,int),+1 (4 iters)  post

And in a debug build, `perfasm` also confirms this: it says that the inner main loop is strip mined, but it still does not show the assembly for the outer loop:

            ;; B15: #	out( B15 B16 ) <- in( B14 B15 ) Loop( B15-B15 inner main of N117 strip mined) Freq: 4.35414e+08
          ↗  0x00007efee0baad20:   vmovd  %xmm0,%r11d
          │  0x00007efee0baad25:   add    %esi,%r11d
   0.03%  │  0x00007efee0baad28:   movslq %r11d,%r11
          │  0x00007efee0baad2b:   vmovd  %xmm3,%r10d
   2.03%  │  0x00007efee0baad30:   add    %esi,%r10d
          │  0x00007efee0baad33:   movslq %r10d,%r8
   0.13%  │  0x00007efee0baad36:   movslq %esi,%r10
          │  0x00007efee0baad39:   lea    (%rax,%r10,1),%r9
   1.20%  │  0x00007efee0baad3d:   lea    (%r10,%rbp,1),%rbx
          │  0x00007efee0baad41:   movsbl 0x10(%rdx,%rbx,1),%r10d
   1.23%  │  0x00007efee0baad47:   mov    %r10b,0x10(%rcx,%r9,1)
   0.73%  │  0x00007efee0baad4c:   movsbl 0x11(%rdx,%rbx,1),%r10d
   1.63%  │  0x00007efee0baad52:   mov    %r10b,0x11(%rcx,%r9,1)
   0.17%  │  0x00007efee0baad57:   movsbl 0x12(%rdx,%rbx,1),%r10d


> Yes, it could be a lot of reasons we get such regression.

Sadly, yes. Hard to chase them all.

> Did you try reducing the unrolling of the slow path?

I quickly tried that for `patch`. It only makes things worse: the loop overhead grows the less we unroll.

LoopMaxUnroll=64 -> 3341.382 ns/op (default)
LoopMaxUnroll=32 -> 3456.612 ns/op
LoopMaxUnroll=16 -> 3711.292 ns/op
LoopMaxUnroll=8  -> 3883.523 ns/op
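
(For reference: such a sweep just passes the flag through to the forked benchmark JVM, e.g. something like `java -jar benchmarks.jar ... -jvmArgs -XX:LoopMaxUnroll=32`; the exact JMH invocation is setup-dependent.)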


>> Might it be the runtime check and related branch misprediction?
>
> Could be, since you added an outer loop in the slow path.

I don't think so. The strip-mined loop still sits inside the slow path; we never go back to the runtime check. Both the fast and the slow path have a PreMainPost loop structure, where the main loop is strip-mined. It was the simplest solution to just unswitch/multiversion at the single-iteration step, and otherwise keep the loop structures as before. We made that decision back in https://github.com/openjdk/jdk/pull/22016.
While in some cases we can see the strip-mined loop in the `perfasm` assembly, we cannot see the runtime check at all.
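
Conceptually, the multiversioned shape is something like this (a simplified source-level sketch, not the actual IR; `arraysDoNotAlias` is my placeholder for the speculative aliasing runtime check, and each `copyLoop` call stands for a full pre/main/post structure):

if (arraysDoNotAlias(a, offsetA, b, offsetB, size)) {
    copyLoop(a, offsetA, b, offsetB, size); // multiversion_fast: vectorized, strip-mined main loop
} else {
    copyLoop(a, offsetA, b, offsetB, size); // multiversion_slow: scalar unrolled, strip-mined main loop
}

The check is decided once, before either loop version starts, which fits with it never showing up in the hot `perfasm` blocks.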


>> tma_backend_bound: 21.3 vs 24.8 - there seems to be a bottleneck in the backend for patch of 10%
>
> This seems to indicate more time spent on data access. Does the main loop start copying from the same offset/element in the no_patch vs patch loops?

More time spent on data access -> yes, that is what the number seems to claim. But I don't think there are more data accesses. Rather, `not_profitable` just executes more efficiently, and `patch` executes fewer instructions per cycle. In the JMH run, they execute for roughly the same time, i.e. about the same number of cycles. But the instruction counts differ by about 10%, and so does `tma_retiring`.

> cycles: 18,641,133,534 vs 18,247,472,016 - similar number of cycles
> instructions: 42,579,432,139 vs 38,553,272,686 - significant deviation in work per time (10%), but why?
> tma_retiring: 42.4 vs 37.7 - clearly not_profitable executes code more efficiently
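
Doing the division on those counters makes the gap concrete:

IPC(not_profitable) = 42,579,432,139 / 18,641,133,534 ≈ 2.28
IPC(patch)          = 38,553,272,686 / 18,247,472,016 ≈ 2.11

So `patch` retires roughly 8% fewer instructions per cycle, in the same direction as the `tma_retiring` gap.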

Plus: I have a lot of correctness tests that check we access the right bytes, with many hand-written examples and even fuzzing-style tests.
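
A rough sketch of what such a fuzzing-style check can look like (everything here is illustrative, not the actual test code from the PR; the reference loop has the same sequential, possibly-aliasing semantics as the copy under test):

import java.util.Arrays;
import java.util.Random;

public class AliasingCopyFuzz {
    // The kind of copy loop SuperWord vectorizes behind the aliasing runtime check.
    static void copy(byte[] a, int offsetA, byte[] b, int offsetB, int size) {
        for (int i = 0; i < size; i++) {
            a[offsetA + i] = b[offsetB + i];
        }
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        for (int iter = 0; iter < 10_000; iter++) {
            byte[] data = new byte[1 << 14];
            random.nextBytes(data);
            // Random, possibly overlapping, offsets into the SAME array, so that
            // both the multiversion_fast and multiversion_slow paths get exercised.
            int size = random.nextInt(1 << 10);
            int offsetA = random.nextInt(data.length - size + 1);
            int offsetB = random.nextInt(data.length - size + 1);
            // Sequential byte-by-byte reference result.
            byte[] expected = data.clone();
            for (int i = 0; i < size; i++) {
                expected[offsetA + i] = expected[offsetB + i];
            }
            copy(data, offsetA, data, offsetB, size);
            if (!Arrays.equals(data, expected)) {
                throw new RuntimeException("Wrong bytes at iteration " + iter);
            }
        }
    }
}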

> This looks like a "rabbit hole" :(
>
> Maybe file a separate RFE to investigate this behavior later by some other engineer.

Yes, it is quite a rabbit hole. And yes, at this point it would be good to get the benefits out the door, and see if we can do something about the edge-case regression later.

> Most concerning is that reproduced on different platforms.

I was hoping that would not be the case. But yes, it reproduces on different platforms, though in slightly different ways, which is strange too.

> I agree that we may accept this regression since it happened in corner case.

Ok, good. And if someone really has an issue with it, they can revert to the old behavior with the product/diagnostic flag `UseAutoVectorizationSpeculativeAliasingChecks`.
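
(Concretely, something along the lines of `java -XX:-UseAutoVectorizationSpeculativeAliasingChecks ...`, possibly together with `-XX:+UnlockDiagnosticVMOptions` if the flag ends up diagnostic rather than product.)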

> I assume our benchmarks are not affected by this. Right?

I ran them a while ago, but I need to run them once more now, after all the recent integrations.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24278#issuecomment-3219332076

