RFR: 8213486: SIGSEGV in CompiledMethod::cleanup_inline_caches_impl with AOT

Wed Nov 21 18:33:03 UTC 2018

Sounds good.

Vladimir

On 11/21/18 10:21 AM, Erik Osterlund wrote:
> Hi Vladimir,
> 
> Thank you for the background. It sounds like we agree about the solution.
> 
> Note though as I replied to Dean that we do need to call do_unloading on all AOT methods seen in iterators to clean their IC caches referring to unloading nmethods. Seems like the runtime stub AOT methods are hidden from iterators (according to an assert I added), and hence go under the radar due to not having IC caches.
> 
> But still the solution is to just return false for is_unloading() on AOT methods and move the state fiddling to nmethod.
> 
> /Erik
> 
>> On 21 Nov 2018, at 18:49, Vladimir Kozlov <vladimir.kozlov at oracle.com> wrote:
>>
>>  From the start AOT methods were never scanned for unloading because they don't have embedded oops.
>> We don't need to explicitly "unload" aot methods to avoid GC to scan them - they don't have oops. Oops and metadata referenced by aot methods are kept in separate arrays. Scanning aot oops and metadata arrays are done by separate code.
>> At least this is what I remember. If you look on original aot unloading code - it did nothing:
>> http://hg.openjdk.java.net/jdk/jdk11/file/1ddf9a99e4ad/src/hotspot/share/aot/aotCompiledMethod.cpp#l78
>>
>> Based on that I think it is correct to move unloading logic to nmethod as you suggested.
>>
>> I still can't remember how we proved that should not be any "activations"/call to such aot methods if they are not marked as 'unloaded'.
>>
>> Vladimir
>>
>>> On 11/21/18 8:32 AM, dean.long at oracle.com wrote:
>>> I think the problem is CompiledMethodIterator vs NMethodIterator. CompiledMethodIterator will visit AOT methods, but NMethodIterator will not.  As long as AOT methods can't be unloaded, it makes sense to continue to use NMethodIterator and move unloading logic from CompiledMethod to nmethod.
>>> dl
>>>> On 11/21/18 6:26 AM, Erik Österlund wrote:
>>>> Hi Vladimir,
>>>>
>>>> Thank you for the explanation.
>>>>
>>>> I inserted some debugging code to be able to draw better conclusions.
>>>> I put in a bool that tags CompiledMethods that are visited during code cache unloading.
>>>>
>>>> When I crash, I find the AOTCompiledMethod in the following state:
>>>>
>>>> * The CodeCache::_unloading_cycle is 2, meaning there has been at least one unloading that occurred since the VM started.
>>>> * My new boolean saying if it was found during code cache walking said false; it has never been observed by GC code cache unloading.
>>>> * The _is_unloading_state of the AOT method is 2, meaning it was from CodeCache::_unloading_cycle number 1. So there has been at least one code cache unloading cycle since the AOT method was created.
>>>> * It is in the CompiledMethod state in_use.
>>>> * The corresponding CodeToAMethod entry has _state equal to in_use, and the _aot pointer points right back to the AOT method as expected.
>>>>
>>>> So the conclusions I can draw from this is that we are looking at an AOTCompiledMethod that is in_use, alive, and published, but was created at least one unloading cycle earlier, yet was *not* found by the evident code cache unloading that has occurred between creating the AOT method and the crash. In fact it has not been found in any code cache unloading event at all.
>>>>
>>>> So I am certain about the problem being that the AOT method is not found during our code cache walk, and that the solution to that problem is to move the epoch business to nmethod instead. However, it would be desirable to know exactly why the AOT method isn't showing up in the code cache walk, but I'm afraid I can't see right now exactly why it has been missed out, only that it definitely has been missed out. :/
>>>>
>>>> /Erik
>>>>
>>>>> On 2018-11-20 19:08, Vladimir Kozlov wrote:
>>>>> To clarify.
>>>>>
>>>>> 'invalid' state is set only for AOT methods which never was registers (their state is 'not_set' and corresponding class was not initialized yet). Such methods are never visible to iterators and IC caches.
>>>>>
>>>>> If aot method is registered its state become 'in_use' and in such case it is treated and processed as normal nmethod. And AOTCompiledMethod structure is created only for such case.
>>>>>
>>>>> Vladimir
>>>>>
>>>>>> On 11/20/18 3:12 AM, Erik Österlund wrote:
>>>>>> Hi Dean,
>>>>>>
>>>>>> Yeah I think I misunderstood what I observed. So what I know is that in the code cache unloading, I'm not getting all is_alive() AOT methods into my iterator, which messes up the assumptions made by the epoch based scheme for AOT methods.
>>>>>>
>>>>>> I noticed that in AOTCodeHeap::sweep_dependent_methods(int* indexes, int methods_cnt) we make AOTCompiledMethods "invalid" in the AOT heap, making it no longer observable from the iterators. Then it calls the VM_Deoptimize vm operation after. Throughout all this, the AOTCompiledMethod is alive(), yet when the iterators ask for all is_alive() AOTCompiledMethods, it won't be visible. But I suppose IC caches may still reference these methods and check if it is_unloading, and then we blow up. There may possibly be multiple ways for is_alive() AOTCompiledMethods to not be visible from iterators yet be visible through IC caches using the "invalid" state in the .
>>>>>>
>>>>>> Anyway, the fix is the same: stop doing the epoch state thingey for is_unloading() on AOTCompiledMethod where it isn't needed, and doesn't seem to play well with the rather different life cycle it has, and just return false instead.
>>>>>>
>>>>>> Thanks,
>>>>>> /Erik
>>>>>>
>>>>>>> On 2018-11-20 00:00, dean.long at oracle.com wrote:
>>>>>>> Hi Erik,
>>>>>>>
>>>>>>>> On 11/19/18 12:42 PM, Erik Österlund wrote:
>>>>>>>> ...except it looks like for AOTCompiledMethods when running with tiered compilation, may first be is_alive(), then become !is_alive() for a while, and then get resurrected to is_alive() using make_entrant().
>>>>>>>
>>>>>>> this doesn't sounds quite right.  AOTCompiledMethods aren't allowed to transition to zombie (!alive), only not_used or not_entrant, which should still have is_alive() returning true. Maybe some code is using is_not_entrant() instead of !is_alive()?
>>>>>>>
>>>>>>> dl
>>>>>>
>>>>
>