RFR: 8315916: assert(C->live_nodes() <= C->max_node_limit()) failed: Live Node limit exceeded
Dhamoder Nalla
dhanalla at openjdk.org
Tue Aug 20 23:55:03 UTC 2024
On Fri, 16 Aug 2024 21:28:57 GMT, Dhamoder Nalla <dhanalla at openjdk.org> wrote:
>> Hi @dhanalla, this is not the right way to handle this assertion failure. The assertion is here to catch real issues when creating too many nodes due to a bug in the code. For example, in [JDK-8256934](https://bugs.openjdk.org/browse/JDK-8256934), we hit this assert due to an inefficient cloning algorithm in Partial Peeling. We should not remove the assert.
>>
>> For such bugs, you first need to investigate why we hit the node limit with your reproducer. Once you find the problem, it can usually be put into one of the following categories:
>> 1. We have a real bug and by fixing it, we no longer create this many nodes.
>> 2. It is a false positive and it is expected to create this many nodes (note that the node limit of 80000 is quite large, so it needs to be explained well why it is a false positive - more often than not, there is still a bug somewhere that was missed at first).
>> 3. We have a real bug but the fix is too hard, too risky, or just not worth the complexity, especially for a real edge case (this also needs to be explained and justified well).
>>
>> Note that for categories 2 and 3, when we cannot easily fix the problem of creating too many nodes, we should bail out of the current optimization rather than the entire compilation, to reduce the performance impact. This was, for example, done in [JDK-8256934](https://bugs.openjdk.org/browse/JDK-8256934), where a fix was too risky at that point during the release and a proper fix was delayed. The fix was to bail out of Partial Peeling when hitting a critically high number of live nodes (an estimate to ensure we never hit the node limit).
>>
>> You should then describe your analysis in the PR and explain your proposed solution. You should also add the reproducer as a test case to your patch.
>
> Thanks @chhagedorn for reviewing this PR.
> This scenario corresponds to Case 2 mentioned above, where more than 80,000 nodes are expected to be created.
> As an alternative solution, could we consider limiting the JVM option `EliminateAllocationArraySizeLimit` (in `c2_globals.hpp`) to a range between 0 and 1024, instead of the current range of 0 to `max_jint`, since an upper limit of `max_jint` may not be practical?
>
> Hi @dhanalla, can you elaborate more on why it is expected and not an actual bug where we unnecessarily create too many nodes?
The test case (ReductionPerf.java) involves multiple arrays, each with 8K elements. Running it with the JVM option `-XX:EliminateAllocationArraySizeLimit=10240` (larger than the 8K array size) makes all of these arrays eligible for scalar replacement. Since scalar replacement tracks each array element as a separate node, this can result in a graph with more than 80K live nodes.
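To make the shape concrete, here is a minimal sketch of the pattern. The class name, constants, and loop are hypothetical illustrations rather than the actual ReductionPerf.java, and whether C2 really scalar-replaces the arrays still depends on escape analysis and the loop optimizations applied:

```java
// Hypothetical sketch (not the actual ReductionPerf.java): several
// non-escaping 8K-element arrays, each of which becomes a scalar
// replacement candidate once EliminateAllocationArraySizeLimit > 8192.
// Run with: java -XX:EliminateAllocationArraySizeLimit=10240 ScalarReplacementSketch
public class ScalarReplacementSketch {
    static final int SIZE = 8 * 1024; // 8K elements per array

    static int compute() {
        // None of these arrays escapes compute(), so C2's escape analysis
        // may eliminate the allocations and then track the element values
        // individually; a few such arrays can already approach the
        // 80,000 live-node limit checked by the assert.
        int[] a = new int[SIZE];
        int[] b = new int[SIZE];
        int[] c = new int[SIZE];
        int sum = 0;
        for (int i = 0; i < SIZE; i++) {
            a[i] = i;
            b[i] = i << 1;
            c[i] = a[i] + b[i];
            sum += c[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int total = 0;
        // Warm up so compute() gets C2-compiled.
        for (int i = 0; i < 20_000; i++) {
            total += compute();
        }
        System.out.println(total);
    }
}
```

If I read `c2_globals.hpp` correctly, the 80,000 figure is C2's default `MaxNodeLimit`, so three scalar-replaced 8K-element arrays alone would account for roughly 24K element values before counting any other IR. This is why restricting the range of `EliminateAllocationArraySizeLimit` to 0-1024, as suggested above, would keep the node count bounded.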
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20504#issuecomment-2299947813