RFR(S): 8230402: Allocation of compile task fails with assert: "Leaking compilation tasks?"

Vladimir Kozlov vladimir.kozlov at oracle.com
Wed Apr 29 17:07:39 UTC 2020


webrev.02 looks good to me.

Thanks,
Vladimir

On 4/29/20 2:26 AM, Christian Hagedorn wrote:
> Hi Vladimir
> 
> On 29.04.20 03:43, Vladimir Kozlov wrote:
>> On 4/28/20 8:36 AM, Christian Hagedorn wrote:
>>> Hi Vladimir
>>>
>>>>>> Maybe we should mark methods that are removed from the queue, or use counter decay, or use some other mechanism to 
>>>>>> prevent methods from being put back into the queue immediately because their counters are high. You may not need to 
>>>>>> remove half of the queue in that case.
>>>>>
>>>>> You mean we could, for example, just reset the invocation and backedge counters of the methods removed from the queue? 
>>>>> This would probably be beneficial in a more general case than in my test case, where each method is only executed 
>>>>> twice. As for the number of tasks to drop, it was just a best guess. We could also choose to drop fewer. But it is 
>>>>> probably hard to determine the best value in general.
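>>>>>
>>>>> For illustration, a minimal sketch of what resetting the counters of a dropped task could look like (using the 
>>>>> Method/MethodCounters accessors; this is a rough sketch, not code from the webrev):
>>>>>
>>>>>     // Push the dropped method back towards "cold" so that the
>>>>>     // compilation policy does not immediately re-enqueue it.
>>>>>     Method* m = task->method();
>>>>>     MethodCounters* mcs = m->method_counters();
>>>>>     if (mcs != NULL) {
>>>>>       mcs->invocation_counter()->reset();
>>>>>       mcs->backedge_counter()->reset();
>>>>>     }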
>>>>
>>>> Another thought: instead of removing tasks from the queue, maybe we should not put new tasks on the queue when it 
>>>> becomes almost full (but continue profiling methods). For that we need a parameter (or diagnostic flag) instead of the 
>>>> hard-coded 10000.
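>>>>
>>>> As a rough sketch (the flag name is made up for illustration; it is not an existing option, and a real version would 
>>>> also have to recycle the CompileTask):
>>>>
>>>>     // Hypothetical diagnostic flag replacing the hard-coded 10000:
>>>>     //   diagnostic(intx, CIMaxCompileQueueSize, 10000, ...)
>>>>     void CompileQueue::add(CompileTask* task) {
>>>>       if (_size >= CIMaxCompileQueueSize) {
>>>>         // Queue is (almost) full: drop the request but keep profiling.
>>>>         // Clear the JVM_ACC_QUEUED bit so the method can be re-submitted
>>>>         // once its counters overflow again.
>>>>         task->method()->clear_queued_for_compilation();
>>>>         return;
>>>>       }
>>>>       // ... existing enqueue logic ...
>>>>     }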
>>>
>>> That also sounds reasonable. But then we might miss new hot methods while the queue still contains many cold methods.
>>
>> The hotter a method is, the sooner it will be put on the queue for compilation. The only case I can think of is 
>> recompilation of a hot method due to deoptimization. Maybe spending more time in the interpreter is not a bad thing.
> 
> Yes, that seems okay.
> 
>>>
>>>> We do not use counter decay in tiered mode because it loses/corrupts profiling data. We should 
>>>> avoid that. I just gave an example of what could be done.
>>>
>>> Okay.
>>>
>>>> One concern I have is that before, this was a check in the debug VM only. Now we are putting a limitation on 
>>>> compilations in the product VM, which may affect performance in some cases. We should check that.
>>>
>>> I ran some standard benchmarks and have not observed any regressions. However, I also ran these benchmarks with the 
>>> original code, with the assert substituted by a guarantee. It was never hit (i.e. the new code never executed and thus 
>>> had no influence). It could still affect other benchmarks and programs in some unexpected way, but I think it is very 
>>> unlikely that a normal program hits the threshold.
>>>
>>> Therefore, I think it is a trade-off between the complexity of the solution and the likelihood that those special cases 
>>
>> I completely agree with you on this.
>>
>>> occur. So, to summarize all the current options:
>>> 1) Just remove the assert. But then we miss cases where we actually have duplicates in the queue.
>>> 2) webrev.01 to drop half of the tasks. We can check for duplicates before dropping. We can also change the number of 
>>> tasks to drop.
>>>      a) do it only in debug builds. The performance impact would not be significant.
>>>      b) do it also in product builds. A new limitation that might impact performance on some other benchmarks/programs, 
>>> but I think that is not very likely.
>>> 3a/b) Do the same without excluding WhiteBoxAPI and/or UseJVMCICompiler.
>>> 4) Use your suggestion to stop enqueuing tasks at a certain threshold with a parameter or flag. A simple solution, but 
>>> it might miss some new hot methods, and it needs a different trigger for the duplicate check to avoid running it too 
>>> many times.
>>>      a) do it only in debug builds. The performance impact would not be significant.
>>>      b) do it also in product builds. Performance might be impacted on some other benchmarks/programs, but I think that 
>>> is not very likely.
>>> 5a/b) Do the same without excluding WhiteBoxAPI and/or UseJVMCICompiler.
>>
>> There should be no duplicates in the queue - we check for that:
>> http://hg.openjdk.java.net/jdk/jdk/file/06745527c7b8/src/hotspot/share/compiler/compileBroker.cpp#l1120
>> Unless we screwed up the JVM_ACC_QUEUED bit setting.
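>>
>> (The shape of that guard, paraphrased rather than quoted verbatim:)
>>
>>     if (compilation_is_in_queue(method)) {   // tests the JVM_ACC_QUEUED bit
>>       return;                                // already queued, never added twice
>>     }
>>     ...
>>     method->set_queued_for_compilation();    // sets JVM_ACC_QUEUED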
> 
> I see, then it might not really be necessary to check for duplicates yet again.
> 
>> I think the original assert was added in 8040798 to check that a task is always put back into _task_free_list when we 
>> finish a compilation (or abort it).
> 
> Thanks for clearing that up. I was not aware of that intention. I first thought its only purpose was to catch 
> strange things inside the compile queue.
> 
>> I don't think we should have different behavior (removing tasks from the queue or not putting tasks on the queue) in 
>> product and debug builds. If we do it, we have to do it in both. We mostly use fastdebug builds for testing - they 
>> should execute code that is as close to product as possible.
> 
> I agree with that. Doing it only in debug builds would make the behavior diverge too much.
> 
>> The real case is when all C2 compiler threads hang (or take a very long time to compile) and no progress is made on 
>> compiling other methods in the queue. But it should be very rare that all compiler threads hang. And we can catch such 
>> cases with other checks.
>>
>> Based on all that, I would go with 1). As you pointed out in the bug report, we are
>> not observing this assert anymore (only with a hand-made test).
> 
> Thank you for explaining it in more detail. If we have other checks that can detect a situation where all 
> compiler threads are hanging, then we are probably fine with just removing the assert.
> 
> I updated my webrev with that option. I kept the updated stress test in it:
> http://cr.openjdk.java.net/~chagedorn/8230402/webrev.02/
> 
> What do others think?
> 
> Best regards,
> Christian
> 
>> Thanks,
>> Vladimir
>>
>>>
>>> What do you think?
>>>
>>> Best regards,
>>> Christian
>>>
>>>>>
>>>>>>
>>>>>> On 4/24/20 7:37 AM, Christian Hagedorn wrote:
>>>>>>> Hi
>>>>>>>
>>>>>>> Please review the following patch:
>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8230402
>>>>>>> http://cr.openjdk.java.net/~chagedorn/8230402/webrev.00/
>>>>>>>
>>>>>>> This assert was hit very intermittently in an internal test until jdk-14+19. The test was changed afterwards and 
>>>>>>> the assert was not observed to fail anymore. However, the problem of having too many tasks in the queue is still 
>>>>>>> present (i.e. the compile queue grows too quickly and the compiler(s) are too slow to catch up). This assert can 
>>>>>>> easily be hit by creating many class loaders that load many methods which are immediately compiled due to a low 
>>>>>>> compilation threshold, as done in runA() in the test case.
>>>>>>>
>>>>>>> Therefore, I suggest tackling this problem with a general solution that drops half of the compilation tasks in 
>>>>>>> CompileQueue::add() when a queue size of 10000 is reached and none of the other conditions of this assert hold 
>>>>>>> (no WhiteBox or JVMCI compiler). With tiered compilation, the tasks with the lowest method weight() or whose 
>>>>>>> methods are unloaded are removed from the queue (without altering the order of the remaining tasks in the queue). 
>>>>>>> Without tiered compilation (i.e. SimpleCompPolicy), tasks are removed from the tail of the queue. An additional 
>>>>>>> verification in debug builds ensures that there are no duplicate tasks. I assume that part of the purpose of the 
>>>>>>> original assert was to detect such duplicates.
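>>>>>>>
>>>>>>> Roughly, the dropping pass looks like this (an illustrative sketch with a hypothetical should_drop() helper; see 
>>>>>>> the webrev for the actual code):
>>>>>>>
>>>>>>>     // Walk the queue once and unlink low-value tasks in place, so the
>>>>>>>     // relative order of the surviving tasks is preserved.
>>>>>>>     int dropped = 0;
>>>>>>>     CompileTask* task = _first;
>>>>>>>     while (task != NULL && dropped < to_drop) {
>>>>>>>       CompileTask* next = task->next();
>>>>>>>       if (should_drop(task)) {   // unloaded method or low weight()
>>>>>>>         remove(task);            // unlinks without reordering the rest
>>>>>>>         dropped++;
>>>>>>>       }
>>>>>>>       task = next;
>>>>>>>     }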
>>>>>>>
>>>>>>> Thank you!
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Christian
>>>>>>>

