RFR: 8324751: C2 SuperWord: Aliasing Analysis runtime check [v22]
Emanuel Peter
epeter at openjdk.org
Mon Aug 25 08:37:08 UTC 2025
On Fri, 22 Aug 2025 16:18:17 GMT, Vladimir Kozlov <kvn at openjdk.org> wrote:
>> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
>>
>> add test for related report for JDK-8365982
>
> This looks like "rabbit hole" :(
>
> Maybe file a separate RFE to investigate this behavior later by some other engineer. Most concerning is that it reproduced on different platforms.
>
> I agree that we may accept this regression since it happened in a corner case. I assume our benchmarks are not affected by this. Right?
@vnkozlov Thanks for having a look!
> I noticed in the first (no_patch, fastest) assembly we don't have a "strip mining" outer loop, while in the other cases we have it. Do you know why?
I've seen that too. I don't know why. The percentages on the "strip mining" outer loop are very low though (about 0.3%, within a block that totals 96.60%). Maybe it just does not get picked up in one of them? Still a little strange.
The `patch` version has a "strip mined" outer loop for both the fast and slow path:
Loop: N0/N0 has_sfpt
Loop: N2375/N2377 counted [0,int),+1 (4 iters) pre multiversion_slow
Loop: N361/N362 limit_check sfpts={ 364 }
Loop: N4208/N359 limit_check counted [int,int),+64 (10966 iters) main multiversion_slow has_sfpt strip_mined
Loop: N2240/N2242 limit_check counted [int,int),+1 (4 iters) post multiversion_slow
Loop: N639/N651 counted [0,int),+1 (4 iters) pre multiversion_fast
Loop: N202/N201 limit_check sfpts={ 204 }
Loop: N3244/N178 limit_check counted [int,int),+512 (10966 iters) main vector multiversion_fast has_sfpt strip_mined
Loop: N2591/N2594 limit_check counted [int,int),+64 (64 iters) post vector multiversion_fast
Loop: N504/N516 limit_check counted [int,int),+1 (4 iters) post multiversion_fast
And so does the `not_profitable` version:
Loop: N0/N0 has_sfpt
Loop: N489/N501 predicated counted [0,int),+1 (4 iters) pre
Loop: N213/N212 limit_check sfpts={ 215 }
Loop: N1879/N189 limit_check counted [int,int),+64 (10034 iters) main has_sfpt strip_mined
Loop: N354/N366 limit_check counted [int,int),+1 (4 iters) post
And in debug, `perfasm` also confirms that: it says that the inner main loop is strip mined, but it still does not show the assembly for it:
;; B15: # out( B15 B16 ) <- in( B14 B15 ) Loop( B15-B15 inner main of N117 strip mined) Freq: 4.35414e+08
↗ 0x00007efee0baad20: vmovd %xmm0,%r11d
│ 0x00007efee0baad25: add %esi,%r11d
0.03% │ 0x00007efee0baad28: movslq %r11d,%r11
│ 0x00007efee0baad2b: vmovd %xmm3,%r10d
2.03% │ 0x00007efee0baad30: add %esi,%r10d
│ 0x00007efee0baad33: movslq %r10d,%r8
0.13% │ 0x00007efee0baad36: movslq %esi,%r10
│ 0x00007efee0baad39: lea (%rax,%r10,1),%r9
1.20% │ 0x00007efee0baad3d: lea (%r10,%rbp,1),%rbx
│ 0x00007efee0baad41: movsbl 0x10(%rdx,%rbx,1),%r10d
1.23% │ 0x00007efee0baad47: mov %r10b,0x10(%rcx,%r9,1)
0.73% │ 0x00007efee0baad4c: movsbl 0x11(%rdx,%rbx,1),%r10d
1.63% │ 0x00007efee0baad52: mov %r10b,0x11(%rcx,%r9,1)
0.17% │ 0x00007efee0baad57: movsbl 0x12(%rdx,%rbx,1),%r10d
> Yes, there could be a lot of reasons we get such a regression.
Sadly, yes. Hard to chase them all.
> Did you try reducing the unrolling of the slow path?
I can quickly try that for `patch`. It seems that this only makes things worse: the loop overhead gets worse the less we unroll.
LoopMaxUnroll=64 -> 3341.382 ns/op (default)
LoopMaxUnroll=32 -> 3456.612 ns/op
LoopMaxUnroll=16 -> 3711.292 ns/op
LoopMaxUnroll=8  -> 3883.523 ns/op
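(For reference, this is roughly how I vary the unroll factor per fork. The benchmark class below is only a placeholder sketch, not the actual benchmark from the regression report; the relevant part is the `jvmArgsAppend` line.)

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// Placeholder JMH benchmark, only meant to show how the unroll factor is
// varied per run; swap in -XX:LoopMaxUnroll=16/32/64 to reproduce the numbers above.
@Fork(value = 1, jvmArgsAppend = {"-XX:LoopMaxUnroll=32"})
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class UnrollSketch {
    byte[] src = new byte[16 * 1024];
    byte[] dst = new byte[16 * 1024];

    @Benchmark
    public void copy() {
        // Simple copy loop, just so the fork has something to compile and unroll.
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i];
        }
    }
}
```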
>> Might it be the runtime check and related branch misprediction?
>
> Could be, since you added an outer loop in the slow path.
I don't think so. The strip-mined loop still happens inside the slow path; we don't go back to the runtime check. Both the fast and slow path have a PreMainPost loop structure, where the main loop is strip-mined. It was the simplest solution to just unswitch/multiversion at the single-iteration step, and otherwise keep the loop structures as before. We made that decision back in https://github.com/openjdk/jdk/pull/22016.
While in some cases we can see the strip-mined loop in the `perfasm` assembly, we cannot see the runtime check at all.
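For intuition, here is a hand-written Java sketch of the shape that multiversioning gives us. This is not the actual IR and not the exact check C2 emits (the real check is built from the addresses of the vectorized memory accesses); the names are made up for illustration.

```java
// Illustrative only: the real transformation happens on C2 IR, and the actual
// speculative aliasing check works on memory addresses, not on source-level
// array references.
public class MultiversionSketch {
    static void copy(byte[] dst, int dstOff, byte[] src, int srcOff, int len) {
        // Speculative aliasing runtime check: can the accessed ranges overlap?
        boolean noOverlap = (dst != src)
                         || (dstOff + len <= srcOff)
                         || (srcOff + len <= dstOff);
        if (noOverlap) {
            // multiversion_fast: pre/main/post structure, main loop vectorized
            // and strip mined (the "+512 ... vector" main loop in the dump above).
            for (int i = 0; i < len; i++) {
                dst[dstOff + i] = src[srcOff + i];
            }
        } else {
            // multiversion_slow: same pre/main/post structure, main loop only
            // unrolled and strip mined (the "+64" main loop), not vectorized.
            // We never jump back from here to the runtime check.
            for (int i = 0; i < len; i++) {
                dst[dstOff + i] = src[srcOff + i];
            }
        }
    }
}
```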
>> tma_backend_bound: 21.3 vs 24.8 - there seems to be a bottleneck in the backend for patch of 10%
>
> This seems to indicate more time spent on data access. Does the main loop start copying from the same offset/element in the no_patch vs patch loops?
More time spent on data access -> yes, that is what the number seems to claim. But I don't think there are more data accesses. Rather, `not_profitable` just executes more efficiently, and `patch` executes fewer instructions per cycle. In the JMH benchmark, they run for roughly the same time, i.e. a similar number of cycles. But the number of instructions differs by about 10%, and so does the `tma_retiring`.
> cycles: 18,641,133,534 vs 18,247,472,016 - similar number of cycles
> instructions: 42,579,432,139 vs 38,553,272,686 - significant deviation in work per time (10%), but why?
> tma_retiring: 42.4 vs 37.7 - clearly not_profitable executes code more efficiently
Plus: I have a lot of correctness tests that check that we access the right bytes. I have a lot of examples, and even fuzzing-style tests.
> This looks like "rabbit hole" :(
>
> Maybe file a separate RFE to investigate this behavior later by some other engineer.
Yes, it is quite a rabbit hole. At this point it could be good to get the benefits out the door, and see if we can do something about the edge-case regression later.
> Most concerning is that it reproduced on different platforms.
I was hoping that this was not the case. But yes, it reproduces on different platforms - though in slightly different ways, which is strange too.
> I agree that we may accept this regression since it happened in a corner case.
Ok, good. And if someone really has an issue with it, they can revert to the old behavior with the product/diagnostic flag `UseAutoVectorizationSpeculativeAliasingChecks`.
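For example, something like this if they drive benchmarks through the JMH API (a sketch; `MyBenchmark` is a placeholder, and an extra `-XX:+UnlockDiagnosticVMOptions` may be needed if the flag ends up being diagnostic rather than product):

```java
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

// Sketch: forking the benchmark JVM with the speculative aliasing checks
// disabled, to get back the old (non-multiversioned) behavior.
public class RunWithoutAliasingChecks {
    public static void main(String[] args) throws RunnerException {
        Options opts = new OptionsBuilder()
            .include("MyBenchmark")   // placeholder benchmark name
            .jvmArgsAppend("-XX:-UseAutoVectorizationSpeculativeAliasingChecks")
            .build();
        new Runner(opts).run();
    }
}
```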
> I assume our benchmarks are not affected by this. Right?
I ran it a while ago, but need to run it once more now after all the recent integrations.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/24278#issuecomment-3219332076