RFR: 8331717: C2: Crash with SIGFPE Because Loop Predication Wrongly Hoists Division Requiring Zero Check [v2]

Thu Dec 26 13:31:46 UTC 2024

On Thu, 12 Dec 2024 08:48:14 GMT, Theo Weidmann <tweidmann at openjdk.org> wrote:

>> Fixes a bug in loop predication where not strictly invariant tests involving divisions or modulo are pulled out of the loop.
>> 
>> The bug can be seen in this code:
>> 
>> 
>> public class Reduced {
>>     static int iArr[] = new int[100];
>> 
>>     public static void main(String[] strArr) {
>>         for (int i = 0; i < 10000; i++) {
>>             test();
>>         }
>>     }
>> 
>>     static void test() {
>>         int i1 = 0;
>> 
>>         for (int i4 : iArr) {
>>             i4 = i1;
>>             try {
>>                 iArr[0] = 1 / i4;
>>                 i4 = iArr[2 / i4]; // Source of the crash
>>             } catch (ArithmeticException a_e) {
>>             }
>>         }
>>     }
>> }
>> 
>> 
>> The crucial element is the division `2 / i4`. Since it is used to access an array, it is the input to a range check. See node 230:
>> <img width="699" alt="Screenshot 2024-12-11 at 15 14 47" src="https://github.com/user-attachments/assets/0b2ed978-7135-4a7e-bd10-25c0ffe7a9bb" />
>> 
>> Loop predication will try to pull this range check together with its input, the division, before the `for` loop. Due to a bug in Invariance::compute_invariance loop predication is allowed to do so, which results in the division being pulled out without its non-zero check. 322 is a clone of 230 placed before the loop head without any zero check for the divisor:
>> 
>> <img width="798" alt="Screenshot 2024-12-11 at 15 11 48" src="https://github.com/user-attachments/assets/9d4911cc-9967-4b7a-9969-98e01a55cd0d" />
>> 
>> 
>> More specifically, this bug occurs because 230's zero check (174 If) is not its direct control. Between the zero check and the division is another unrelated check (293 RangeCheck), which can be hoisted:
>> 
>> <img width="825" alt="Screenshot 2024-12-12 at 09 14 37" src="https://github.com/user-attachments/assets/bd572556-1d30-41d5-b74c-22b673963dc2" />
>> 
>> Due to the way the Invariance class works, a check that can be hoisted will be marked as invariant. Then, to determine if any given node is invariant, Invariance::compute_invariance checks if all its inputs are invariant:
>> 
>> https://github.com/openjdk/jdk/blob/ceb4366ebf02f64165acc4a23195e9e3a7398a5c/src/hotspot/share/opto/loopPredicate.cpp#L456-L475
>> 
>> Therefore, when recursively traversing the inputs for 230 Div, the hoisted, unrelated check 293 RangeCheck is hit before the zero check. As that check has been hoisted before already, it is marked invariant and `all_inputs_invariant` will be set to true. (All other inputs are...
>
> Theo Weidmann has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Combine test files

I have looked more deeply into the issue with `depends_only_on_test` in general and I have a really deep concern regarding the current state of how it is handled.

To start with, what is `depends_only_on_test`? Given a node `n` with the control input `c`, if `c` can be deduced from `c'` and `n->depends_only_on_test() == true`, then we can rewire the control input of `n` to `c'`. This means that `depends_only_on_test` does not mean that the node depends on a test, it means that the node depends on the test that is its control input.

For example:

    if (y != 0) {
        if (x > 0) {
            if (y != 0) {
                x / y;
            }
        }
    }

Then `x/y` `depends_only_on_test` because its control input is the test `y != 0`. Then, we can rewire the control input of the division to the outer `y != 0`, resulting in:

    if (y != 0) {
        x / y;
        if (x > 0) {
        }
    }

On the other hand, consider this case:

    if (x > 0) {
        if (y != 0) {
            if (x > 0) {
                x / y;
            }
        }
    }

Then `x/y` does not `depends_only_on_test` because its control input is the test `x > 0` which is unrelated, we can see that if we rewire the division to the outer `x > 0` test, the division floats above the actual test `y != 0`. This means that `depends_only_on_test` is a dynamic property of a node, and not a static property of the division operation. It can change when we transform the graph and it can be different for different nodes of the same kind.

So, what is the issue in [JDK-8257822](https://bugs.openjdk.org/browse/JDK-8257822)? When the zero check of the division is split through `Phi` but the division is not, it is wired to the merge point, not the tests themselves. This means that the division no longer `depends_only_on_test` and should return `false`. However, it still reports that it `depends_only_on_test`, which makes `PhaseIdealLoop::dominated_by` move it out of the loop when its control input is moved.

The fix was to make `PhaseIdealLoop::dominated_by` treat a division as if it does not `depends_only_on_test` when we cannot prove that the divisor is non-zero. This fixed the issue in that particular instance, as it achieved the same result as an actual correct fix of making `depends_only_on_test` return `false`. However, the node still reports itself as `depends_only_on_test`, and that opens more opportunities of miscompilation.

One important consideration regarding nodes that `depends_only_on_test` is that we have to move them from a test to an equivalent test if we want to keep its property of `depends_only_on_test`. Look at the previous example:

    if (y != 0) {
        if (x > 0) {
            if (y != 0) {
                x / y;
            }
        }
    }

If we want to keep `x/y` being `depends_only_on_test`, we have to move it to the outer `y != 0`, and not to the `x > 0`, because in that case:

    if (y != 0) {
        if (x > 0) {
            x / y;
        }
    }

It can easily be seen that the division now does not `depends_only_on_test`. The fix for [JDK-8257822](https://bugs.openjdk.org/browse/JDK-8257822) violated this, which resulted in more cases when the division mistakenly report `depends_only_on_test`.

For example:

    for () {
        if (y != 0) {
            x / y;
        }
    }

Without the fix, `x/y` will be moved out of the loop together with the test `y != 0`, and as it still has `y != 0` as its control input, it is correct when reporting itself as `depends_only_on_test`. However, with the fix:

    if (y != 0) {
    }
    for () {
        x / y;
    }

Now `x/y` has its control input being the loop head, but it still `depends_only_on_test`. This is wrong and opens more chances of miscompilation.

For the case in this issue, it is precisely caused by this wrong approach:

    for () {
        if (y != 0) {
            1 / y;
            if rangecheck {
                if (y != 0) {
                    2 / y;
                }
            }
        }
    }

Because `IfNode::dominated_by` checks for `s->depends_only_on_test() && igvn->no_dependent_zero_check(s)`, we rewire `2 / y` to the `rangecheck`. This makes it wrongly report itself as `depends_only_on_test` when its control input is a `rangecheck`, which is unrelated, which leads to this issue. Without this wrong approach, `2 / y` will be correctly rewired to the outer `y != 0`, and there would be no mistaken hoist of `2 / y` here.

As a result, this will inevitably converge to us doing `s->depends_only_on_test() && igvn->no_dependent_zero_check(s)` at every place `depends_only_on_test` being used, which is really equivalent to divisions being not `depends_only_on_test`. And this would be a safe conservative fix: to make division not `depends_only_on_test`, which effectively pins them to where they are created.

However, it is still not enough. The core issue here is that a node being disconnected from its test but still reporting itself as `depends_only_on_test`, this applies for `Div`, `Mod`, `ConstraintCast`, .... For example:

    int x;
    if (b) {} else {}
    if (x > 3) {
        cast(x, [4, max_int];
    }

Then if the test `x > 3` is split through the region, but the `CastIINode` is not, then its control input would be the merge point, which is not the test it depends on, but the node still reports itself as `depends_only_on_test`. The difference is that a division will immediately complain when its divisor is 0, while a `CastIINode` will not complain anything when the value of the input does not actually lie inside the range of the `CastIINode`. I have tried adding a flag that verifies this and it seems there are multiple failures #22880 . 

I have skimmed through the code and it seems we do actually pin the node when the new control input is not equivalent to the old control input, but only for array accesses? I believe this should be applied to all `depends_only_on_test` nodes, we need to pin all of them when they are no longer directly connected to their tests.

In conclusion, I would strongly prefer a proper fix in which we pin all `depends_only_on_test` nodes when the new control input is not equivalent to the old control input, I believe looking at the places where we currently pin array accesses in `Node::pin_array_access_node` should be adequate. Another approach is to make divisions not `depends_only_on_test`. This should fix most issues we have with `SIGFPE` but it may introduce regressions and it does not solve similar issues for other `depends_only_on_test` nodes, which often silently miscompile.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22666#issuecomment-2562723669