RFR: 8334724: C2: remove PhaseIdealLoop::cast_incr_before_loop() [v2]

Thu Jul 11 10:01:59 UTC 2024

On Mon, 24 Jun 2024 12:29:38 GMT, Roland Westrelin <roland at openjdk.org> wrote:

>> I propose removing `PhaseIdealLoop::cast_incr_before_loop()` and the
>> `CastII` nodes that it adds at counted loop heads.
>> 
>> They were added to prevent nodes to float above the zero trip guard
>> when the backedge of a counted loop is removed. In particular, when a
>> range check is hoisted by predication, pre/main/post loops are created
>> and if one of the main or post loops lose its backedge, an array load
>> that's control dependent on a predicate above the pre loop could float
>> above the zero trip guard of the main or post loop. That can no longer
>> happen AFAICT with changes related to assert predicates. The array
>> load is now updated to have a control dependency that's below the zero
>> trip guard.
>> 
>> The reason I'm revisiting this is that I noticed that
>> `PhaseIdealLoop::cast_incr_before_loop()` has a bug. When it adds the
>> `CastII`, it looks for the loop phi and picks input 1 of the phi it
>> finds as input to the `CastII`. To find the loop phi, it starts from
>> the loop incremement and loop for a use that's a phi and has the loop
>> head as control. It never checks that the phi it finds is the loop
>> phi. There can be more than one phi as uses of the increment at the
>> loop head and it can pick the wrong one. I tried to write a test case
>> where this would cause a bug but couldn't actually find any use for
>> the `CastII` anymore.
>> 
>> In my testing, the only issue when the `CastII` are not added is that
>> some IR tests for vectorization fails:
>> 
>> compiler/vectorization/TestPopulateIndex.java
>> compiler/vectorization/runner/ArrayShiftOpTest.java
>> compiler/vectorization/runner/LoopArrayIndexComputeTest.java
>> 
>> because removing the `CastII` causes split if to occur with some nodes
>> that take the loop phi as input. That then causes pattern matching
>> during superword to break. I added logic to prevent split if for those
>> cases.
>
> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision:
> 
>   review

Christian ran performance testing with this change and reported 2 regressions:

Crypto-ChaCha20Poly1305.decrypt - Windows x64:
-1.97%, - 2.25%, -2.27%
openjdk.bench.javax.crypto.small.SecureRandomBench.nextBytes-algorithm:NativePRNG-dataSize:64-provider:-shared:true - MacOSX aarch64:
-7.63%, -6.37%, -12.27%

For the regression on windows:
Most of the time is spent in stub routines. Only about 20% of the time is spent in a compiled method `javax.crypto.Cipher::doFinal`. This patch doesn't affect the stub routines so `javax.crypto.Cipher::doFinal` have to be slowed down significantly (10%) to have a total slow down of 2%. Christian provided the profiler output of running the benchmark with and without the patch on the system where the regression was observed. The snippets of code where time is spent in `javax.crypto.Cipher::doFinal` are identical with or without the patch so it's hard to see how this patch could cause a 10% regression on that particular method.

For the regression on macos:
Running the benchmark on a similar system with or without the patch, I see the score can vary quite a bite from one run to the other (by ~10%). Running with a profiler, I see that ~50% of the time is spent in compiled code. I observed that while 50% of the time is spent in a compiled method, it's not always the same method. Inlining decisions change from one run to another. Christian ran performance testing to check the effect of inlining with:

-XX:CompileCommand=option,sun.security.provider.SecureRandom::engineNextBytes,inline

he reported a smaller regression (-2%)
with:

-XX:CompileCommand=option,sun.security.provider.SecureRandom::engineNextBytes,dontinline 

no regression anymore.
I looked at the snippet of compiled code that's reported as the hottest (a counted loop). While the code is largely similar it's a bit different (I see an extra spill and blocks are not laid out the same way) so I also looked at the IR for the same runs and I see no difference. So again, it's hard to see how the patch can cause the regression.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/19831#issuecomment-2222522453