RFR(S): 8159016: Over-unrolled loop is partially removed

Tobias Hartmann tobias.hartmann at oracle.com
Fri Jun 24 12:43:13 UTC 2016


Hi Vladimir,

On 23.06.2016 20:29, Vladimir Kozlov wrote:
> On 6/23/16 1:33 AM, Tobias Hartmann wrote:
>> Hi Vladimir,
>>
>> On 22.06.2016 21:45, Vladimir Kozlov wrote:
>>> On 6/22/16 6:22 AM, Tobias Hartmann wrote:
>>>> Hi Vladimir,
>>>>
>>>> thanks for the review!
>>>>
>>>> On 21.06.2016 21:59, Vladimir Kozlov wrote:
>>>>> Thank you, Tobias
>>>>>
>>>>> Your changes are good to have.
>>>>>
>>>>> You can still have a problem if unrolling is done before the CCP phase, which calculates the array size and/or limit.
>>>>>
>>>>> Try to delay the size calculation in the test until the CCP phase. Something like the following (please verify):
>>>>>
>>>>>         // Simplified during CCP:
>>>>>         int size = 2;
>>>>>         for (; size < 4; size *= 2);
>>>>>         Object arr[] = new Object[size - 1];
>>>>
>>>> With a similar piece of code, I'm able to delay the size calculation to after CCP, but then the out-of-bounds stores are not removed because the trip count Phi has type "int" (instead of "int:>=1"). I think this is because the CastIINode::Value() optimization for CastIIs with _carry_dependency is not able to determine a more restricted type before CCP.
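(For illustration only: a variant along those lines, with the size computation only folded during CCP and the array escaping via the return value, could look roughly like the sketch below. The class name is made up for this sketch; it is not the code from the webrev.)

  public class TestSizeKnownAfterCCP {

      static Object[] test(int arg) {
          // As suggested above, the array size is computed by a small loop so
          // that its constant value (4) should only become known during CCP.
          int size = 2;
          for (; size < 4; size *= 2) { }
          Object[] arr = new Object[size - 1]; // length 3
          int lim = (arg & 3);
          for (int i = 0; i < lim; ++i) {
              arr[i] = new Object();
          }
          return arr; // let the array escape to avoid EA
      }
  }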
>>>
>>> Got it.
>>>
>>>>
>>>>> Why does the over-unrolling check code in IdealLoopTree::policy_unroll() not work?
>>>>
>>>> The code after "// Protect against over-unrolling when init or/and limit are not constant" can only detect over-unrolling if the type of the iv-Phi is narrow enough. In this case, the Phi's type is "#int:>=1:www".
>>>
>>> Should we enhance this check?
>>
>> I'm not sure how. The problem is that we don't have enough type information about the loop Phi because the loop predicate is guarded by OpaqueNodes. We know that the pre-loop is executed once and therefore the initial value of the main loop induction variable is 1, but we are unable to determine the maximum number of iterations (hence "#int:>=1:www").
> 
> Okay, agree that we can do nothing if type is not more concrete.
> 
>>
>>>>> I think the trip count check before the main loop should collapse the whole main loop code. Then we would not have this problem. Why is it not collapsed?
>>>>
>>>> The trip count check before the main loop is guarded by Opaque nodes and not folded until after loop unrolling. I think it's also better to unroll the loop at least twice instead of over-unrolling and then removing the loop.
>>>
>>> Okay.
>>>
>>>
>>>>
>>>>> About test.
>>>>>
>>>>> You do & 3 two times:
>>>>>
>>>>> int lim = (arg & 3);
>>>>> test(i & 3);
>>>>>
>>>>> Use 11_000 iterations because 10000 may not trigger compilation.
>>>>
>>>> Right, I already fixed this but accidentally uploaded an old version of the test. We actually don't even need 10_000 iterations because the test runs with -Xcomp to avoid gathering "too much" profiling information for the loop. A loop with 42 iterations (for some reason we need some iterations) triggers the problem 100% on my machine.
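(As a rough illustration of that setup only: a jtreg-style driver of the shape described above might look like the sketch below. The class name, the summary and the exact flags are assumptions for this sketch, not the actual test from the webrev.)

  /*
   * @test
   * @bug 8159016
   * @summary Illustrative sketch of a driver for the over-unrolling scenario
   * @run main/othervm -Xcomp TestOverUnrollingSketch
   */
  public class TestOverUnrollingSketch {

      static Object[] test(int lim) {
          Object[] arr = new Object[3];
          for (int i = 0; i < lim; ++i) {
              arr[i] = new Object();
          }
          return arr; // let the array escape to avoid EA
      }

      public static void main(String[] args) {
          // With -Xcomp, a few dozen invocations are enough (42 in the discussion above).
          for (int i = 0; i < 42; ++i) {
              test(i & 3);
          }
      }
  }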
>>>
>>> Good.
>>>
>>>>
>>>>> Let array escape to avoid EA. For example, return it.
>>>>
>>>> Done, but this triggers another problem: we now crash in PhaseMacroExpand::expand_allocate_common() at line 1600 while adding a membar after an InitializeNode, because all of its memory users were removed. The problem is that parts of the post loop are now removed before we can determine that the entire loop is dead (and it is dead because the pre-loop executes one iteration and the unrolled main loop executes the remaining two iterations). The post loop is only removed after the Opaque1Node that guards the predicate is removed, which happens later during macro expansion.
>>>
>>> Hmm, all Opaque nodes are removed before allocation node expansion. But maybe that is not enough to collapse the post loop, which apparently only happens during the IGVN transformation after all expansions.
>>
>> Yes, the post loop only collapses after IGVN (i.e. after macro node expansion).
> 
> OK.
> 
>>
>>> But this should not happen. The allocation is done before the loop, and there are stores in the pre-loop and main loop which should be memory users of the allocation. Are those stores collected by the InitializeNode? In that case your changes are good (let the IGVN code decide about MB removal).
>>
>> Actually, the InitializeNode belongs to the "new Object()" allocation *in* the loop. The StoreNNode stores this new Object in arr[i]. Because the store is removed before the loop is collapsed, the InitializeNode has no memory users and the barrier insertion code fails. I think it should be safe to ignore this and let the subsequent IGVN remove the barrier.
> 
> Can you confirm that MB is removed by IGVN?

I verified that the IGVN after macro expansion removes the entire post loop, i.e., the barrier is removed as well.

Thanks,
Tobias

>>>> I think it should be safe to just add a guard for != NULL like we do in PhaseMacroExpand::process_users_of_allocation(). I verified that the post loop is removed after the Opaque1Node goes away.
>>>>
>>>> New webrev:
>>>> http://cr.openjdk.java.net/~thartmann/8159016/webrev.01/
>>>
>>> Good.
>>>
>>>>
>>>> I'm re-running the testing.
>>>
>>> Add testing link to the bug report.
>>
>> Sure, done (no new failures).
> 
> Good.
> 
> Thanks,
> Vladimir
> 
>>
>> Best regards,
>> Tobias
>>
>>>
>>> Thanks,
>>> Vladimir
>>>
>>>>
>>>> Thanks,
>>>> Tobias
>>>>
>>>>>
>>>>> Thanks,
>>>>> Vladimir
>>>>>
>>>>> On 6/21/16 8:19 AM, Tobias Hartmann wrote:
>>>>>> Hi,
>>>>>>
>>>>>> please review the following fix:
>>>>>>
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8159016
>>>>>> http://cr.openjdk.java.net/~thartmann/8159016/webrev.00/
>>>>>>
>>>>>> I was able to reproduce the problem with a simple test:
>>>>>>
>>>>>>  Object arr[] = new Object[3];
>>>>>>  int lim = (arg & 3);
>>>>>>  for (int i = 0; i < lim; ++i) {
>>>>>>    arr[i] = new Object();
>>>>>>  }
>>>>>>
>>>>>> The loop is split into pre, main and post loops. The pre loop is executed for at least one iteration, initializing arr[0]. The main loop is unrolled twice and initializes arr[1], arr[2], arr[3] and arr[4]. The arr[3] and arr[4] stores are always out of bounds and are removed. However, C2 is unable to remove the "over-unrolled", dead main loop. As a result, there is a control path from the main loop to the PostScalarRce loop (see JDK-8151573) without a memory path, since the last store was replaced by TOP. We crash because we use a memory edge from a non-dominating region.
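(To make the shape of the split concrete, the hand-written Java below gives a rough conceptual picture of the pre/main/post decomposition with two rounds of unrolling for this loop. It is not C2 output; zero-trip guards, limit adjustments and Opaque nodes are elided.)

  // Conceptual sketch only, not actual C2 output.
  static void sketchOfSplitLoop(Object[] arr, int lim) {
      int i = 0;
      // Pre-loop: peels off the first iteration, initializing arr[0].
      for (; i < 1 && i < lim; ++i) {
          arr[i] = new Object();
      }
      // Main loop after unrolling twice: one trip writes four elements.
      // With arr.length == 3, the stores to arr[i + 2] and arr[i + 3]
      // (arr[3] and arr[4] for i == 1) can never be in bounds.
      for (; i + 3 < lim; i += 4) {
          arr[i]     = new Object();
          arr[i + 1] = new Object();
          arr[i + 2] = new Object();
          arr[i + 3] = new Object();
      }
      // Post-loop: handles any remaining iterations.
      for (; i < lim; ++i) {
          arr[i] = new Object();
      }
  }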
>>>>>>
>>>>>> The out-of-bounds stores in the over-unrolled loop are removed because their range-check-dependent CastII nodes are removed by the first optimization in CastIINode::Ideal():
>>>>>>  CastII (AddI Phi 2) -> AddI (CastII Phi) 2 -> AddI TOP 2
>>>>>>
>>>>>> The CastII nodes are replaced by TOP because the type of the loop Phi is >= 1 while the CastII's range is [1..2], i.e., there is no value >= 1 that still falls into the [1..2] range once 2 is added.
>>>>>>
>>>>>> The underlying problem is the over-unrolling of the loop. Since lim <= 3, we only ever execute at most 3 loop iterations. With the pre loop executing at least one iteration, the main loop should not be unrolled more than once. This problem is similar to JDK-4834191, where we over-unrolled a loop with a constant limit.
>>>>>>
>>>>>> I changed the implementation of IdealLoopTree::compute_exact_trip_count() to not only compute the exact trip count if the loop limit is constant, but to also set the _trip_count field if we are able to determine an upper bound. This way, the checks in IdealLoopTree::policy_unroll() prevent additional loop unrolling once the trip count becomes too small, and we keep the maximally unrolled version.
>>>>>>
>>>>>> An alternative fix would be to disable the CastII optimization for range check dependent CastII nodes but this does not fix the underlying problem of over-unrolling.
>>>>>>
>>>>>> Tested with regression test, JPRT and RBT (running).
>>>>>>
>>>>>> Thanks,
>>>>>> Tobias
>>>>>>

