RFR(S): 8230402: Allocation of compile task fails with assert: "Leaking compilation tasks?"

Christian Hagedorn christian.hagedorn at oracle.com
Wed Apr 29 09:26:30 UTC 2020


Hi Vladimir

On 29.04.20 03:43, Vladimir Kozlov wrote:
> On 4/28/20 8:36 AM, Christian Hagedorn wrote:
>> Hi Vladimir
>>
>>>>> May be we should mark methods which are removed from queue or use 
>>>>> counters decay or use other mechanisms to prevent methods be put 
>>>>> back into queue immediately because their counters are high. You 
>>>>> may not need to remove half of queue in such case.
>>>>
>>>> You mean we could, for example, just reset the invocation and 
>>>> backedge counters of removed methods from the queue? This would 
>>>> probably be beneficial in a more general case than in my test case 
>>>> where each method is only executed twice. About the number of tasks 
>>>> to drop, it was just a best guess. We can also choose to drop fewer. 
>>>> But it is probably hard to determine a best value in general.
>>>
>>> An other thought. Instead of removing tasks from queue may be we 
>>> should not put new tasks on queue when it become almost full (but 
>>> continue profiling methods). For that we need a parameter (or 
>>> diagnostic flag) instead of 10000 number.
>>
>> That also sounds reasonable. But then we might miss new hot methods 
>> while the queue still contains many cold methods.
> 
> The hotter a method the sooner it will be put on queue for compilation. 
> The only case I can think of is recompilation of hot method due to 
> deoptimization. May be spending more time in Interpreter is not bad thing.

Yes, that seems okay.

>>
>>> We are not using counters decay in Tiered mode because we are 
>>> losing/corrupting profiling data with it. We should avoid this. I 
>>> just gave an example of what could be done.
>>
>> Okay.
>>
>>> One concern I have is that before it was a check in the debug VM. Now 
>>> we are putting a limitation on compilations in the product VM, which 
>>> may affect performance in some cases. We should check that.
>>
>> I ran some standard benchmarks and have not observed any regressions. 
>> However, I also ran these benchmarks with the original code and 
>> substituted the assert by a guarantee. It was never hit (i.e. the new 
>> code never executed and thus had no influence). It could still affect 
>> other benchmarks and programs in some unexpected way. But I think it 
>> is very unlikely to hit the threshold in a normal program.
>>
>> Therefore, I think it is kind of a trade-off between the complexity of 
>> the solution and the likelihood that those special cases 
> 
> I completely agree with you on this.
> 
>> occur. So, to summarize all the current options:
>> 1) Just remove the assert. But then we miss cases where we actually 
>> have duplicates in the queue
>> 2) webrev.01 to drop half of the tasks. We can check for duplicates 
>> before dropping. Can also change the number of tasks to drop.
>>      a) do it only in debug builds. The performance impact would not be significant.
>>      b) do it also in product builds. New limitation that might impact 
>> performance on some other benchmarks/programs but I think it's not 
>> very likely.
>> 3a/b) Do the same without excluding WhiteBoxAPI and/or UseJVMCICompiler
>> 4) Use your suggestion to stop enqueuing tasks at a certain threshold 
>> with a parameter or flag. Simple solution but might miss some new hot 
>> methods and it needs a different trigger to check for duplicates to 
>> avoid checking it too many times.
>>      a) do it only in debug builds. The performance impact would not be significant.
>>      b) do it also in product builds. Performance might be impacted on 
>> some other benchmarks/programs but I think not very likely.
>> 5a/b) Do the same without excluding WhiteBoxAPI and/or UseJVMCICompiler
> 
> There should be no duplicates in queue - we check for that:
> http://hg.openjdk.java.net/jdk/jdk/file/06745527c7b8/src/hotspot/share/compiler/compileBroker.cpp#l1120 
> 
> Unless we screwed up JVM_ACC_QUEUED bit setting.

I see, then it might not be necessary to check for duplicates yet 
again.

> I think original assert was added in 8040798 to check that a task is 
> always put into _task_free_list when we finish compilation (or abort 
> compilation).

Thanks for clearing that up. I was not aware of that intention. I first 
thought its only purpose was to find strange things inside the compile 
queue.

> I don't think we should have different behavior (remove tasks from the 
> queue or not put tasks in the queue) in product and debug builds. If we 
> do that we have to do it in both. We mostly use the fastdebug build for 
> testing - it should execute the same code as product as closely as possible.

I agree with that. Doing it only in debug builds would make the behavior 
diverge too much from product builds.

> The real case is when all C2 compiler threads hang (or take a very long 
> time to compile) and no progress is made on compiling other methods in 
> the queue. But it should be very rare that all compiler threads hang. And 
> we can catch such cases by other checks.
> 
> Based on all that I would go with 1). As you pointed out in the bug 
> report, we are not observing this assert anymore (only with a hand-made test).

Thank you for explaining it in more detail. If we have other checks that 
can detect a situation where all compiler threads are hanging, then we 
are probably fine with just removing that assert.

I updated my webrev with that option. I left the updated stress test there:
http://cr.openjdk.java.net/~chagedorn/8230402/webrev.02/

What do others think?

Best regards,
Christian

> Thanks,
> Vladimir
> 
>>
>> What do you think?
>>
>> Best regards,
>> Christian
>>
>>>>
>>>>>
>>>>> On 4/24/20 7:37 AM, Christian Hagedorn wrote:
>>>>>> Hi
>>>>>>
>>>>>> Please review the following patch:
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8230402
>>>>>> http://cr.openjdk.java.net/~chagedorn/8230402/webrev.00/
>>>>>>
>>>>>> This assert was hit very intermittently in an internal test until 
>>>>>> jdk-14+19. The test was changed afterwards and the assert was not 
>>>>>> observed to fail anymore. However, the problem of having too many 
>>>>>> tasks in the queue is still present (i.e. the compile queue is 
>>>>>> growing too quickly and the compiler(s) too slow to catch up). 
>>>>>> This assert can easily be hit by creating many class loaders which 
>>>>>> load many methods which are immediately compiled by setting a low 
>>>>>> compilation threshold as used in runA() in the testcase.
>>>>>>
>>>>>> Therefore, I suggest tackling this problem with a general 
>>>>>> solution that drops half of the compilation tasks in 
>>>>>> CompileQueue::add() when a queue size of 10000 is reached and none 
>>>>>> of the other conditions of this assert hold (no Whitebox or JVMCI 
>>>>>> compiler). For tiered compilation, the tasks with the lowest 
>>>>>> method weight() or which are unloaded are removed from the queue 
>>>>>> (without altering the order of the remaining tasks in the queue). 
>>>>>> Without tiered compilation (i.e. SimpleCompPolicy), the tasks at 
>>>>>> the tail of the queue are removed. An additional verification in 
>>>>>> debug builds should ensure that there are no duplicate tasks. I 
>>>>>> assume that part of the reason for the original assert was to 
>>>>>> detect such duplicates.
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> Best regards,
>>>>>> Christian
>>>>>>

