[9] RFR (S): 8173151: Code heap corruption due to incorrect inclusion test

Wed Feb 8 09:18:29 UTC 2017

Hi Vladimir,

On 02/07/2017 07:08 PM, Vladimir Kozlov wrote:
> Yes, this looks good.

thank you (and also Dean and Tobias) for the review.

>
> I think it is good to push this first and see if we hit this problem 
> again later. The problem could be related to segmented codecache and 
> how we manipulate free and used segments. So the fix may not solve it 
> but at least it will close one issue you found.

OK, I'll push this once I have the results of RBT testing available and 
they look good.

I'm also wondering if we hit this (or a similar problem) later. Maybe 
the guarantees added with the patch will help to get a better 
understanding of the problem later on.

Thank you!

Best regards,

Zoltan

>
> Thanks,
> Vladimir
>
> On 2/7/17 5:17 AM, Zoltán Majó wrote:
>> Hi Vladimir,
>>
>>
>> thank you for the feedback!
>>
>> On 02/06/2017 09:22 PM, Vladimir Kozlov wrote:
>>> On 2/6/17 9:18 AM, Zoltán Majó wrote:
>>>> [...]
>>>>
>>>> As we discussed offline, it seems that AOTCompiledMethods are 
>>>> allocated
>>>> on the C heap (instead in the code cache). That is, only the code 
>>>> of an
>>>> AOT method is allocated in the code cache. Could somebody from the AOT
>>>> team please confirm that?
>>>
>>> Yes, CodeBlob descriptors of AOT methods are allocated in C heap since
>>> AOT code is in RO memory (in loaded AOT library) and we can't put
>>> modifiable (addresses are unknown during AOT lib generation) CodeBlob
>>> there.
>>>
>>> It is different from nmethods where CodeBlob is "wrapper" of nmethod
>>> code in CodeCache. That is why we have to use cb->code_begin().
>>
>> Thank you for clarifying.
>>
>>>
>>> Sorry but your changes for get_code_heap() are incorrect for AOT code.
>>>
>>> There is an other similar code which could be problematic -
>>> CodeBlobIterator:
>>>
>>> http://hg.openjdk.java.net/jdk9/hs/hotspot/file/8d77eb1a669d/src/share/vm/code/codeCache.hpp#l298 
>>>
>>>
>>
>> No problem, thank you for pointing me to the other code location.
>>
>>>
>>> Note, CodeHeap::contains() is virtual and aotCodeHeap overwrite it but
>>> still accept an address. Based on usage cases (how many) we may change
>>> it to accept CodeBlob type instead of address and do appropriate
>>> checks depending on AOT or not AOT heap. Or we add an other CodeHeap
>>> method contains_blob(CodeBlob *) which do that.
>>
>> OK, I added the CodeHeap::contains_blob(CodeBlob*) virtual method, as
>> you suggested. I updated both CodeCache::get_code_heap() and
>> CodeBlobIterator::CodeBlobIterator() to use that method instead of the
>> CodeHeap::contains(void*) method (which is also virtual, so we did not
>> increase the number of virtual calls by using contains_blob()).
>>
>> Here is the updated webrev:
>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.02/
>>
>>>
>>>>
>>>> If the above is true, we could add a virtual method
>>>> CodeBlob::blob_begin() to CodeBlob, which would return 'this'.
>>>> AOTCompiledMethod would then override blob_begin() to return
>>>> code_begin(). In CodeCache::get_code_heap() we would then use
>>>> ...contains(cb->blob_begin()). Would that make sense? (I'm sure we can
>>>> find some better for that method.)
>>>
>>> We worked very hard to avoid virtual calls in AOT code since we saw
>>> performance regression when scanning code cache. So we can't do
>>> virtual calls.
>>
>> I see. The solution in webrev.02 above does not increase the number of
>> virtual calls (unless I miss something, of course).
>>
>> I tested the new patch with JPRT, all tests pass. I'll start RBT soon.
>>
>> Thank you!
>>
>> Best regards,
>>
>>
>> Zoltan
>>
>>>
>>> Thanks,
>>> Vladimir
>>>
>>>>
>>>>> - I'm not sure if we should make CodeCacheMinBlockLength a product
>>>>> flag because we would then also need more testing with different flag
>>>>> values. What about making it diagnostic or experimental?
>>>>
>>>> Let's make it diagnostic then.
>>>>
>>>> Here is the updated webrev:
>>>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.01/
>>>>
>>>> JPRT testing is in progress.
>>>>
>>>> Thank you!
>>>>
>>>> Best regards,
>>>>
>>>>
>>>> Zoltan
>>>>
>>>>>
>>>>> Thanks,
>>>>> Tobias
>>>>>
>>>>> On 06.02.2017 16:29, Zoltán Majó wrote:
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> please review the fix for 8173151.
>>>>>>
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8173151
>>>>>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.00/
>>>>>>
>>>>>> The crash reported in the bug is caused by the corruption of the
>>>>>> 'non-profiled nmethods' code heap. CodeHeap::_freelist for that code
>>>>>> heap points to an address one segment before the heap's address
>>>>>> space. The sweeper starts iterating through the code heap from the
>>>>>> beginning of the heap's address space. Thus, the sweeper assumes 
>>>>>> that
>>>>>> the first item in the code heap is a HeapBlock/FreeBlock (with the
>>>>>> appropriate length and usage information). However, that is not the
>>>>>> case, as the first item in the heap is actually *before* the 
>>>>>> heap. So
>>>>>> the sweeper crashes.
>>>>>>
>>>>>> This is a hard-to-reproduce problem (I managed to reproduce it only
>>>>>> once in 350 iterations, each iteration taking ~25 minutes). So the
>>>>>> fix I propose is based on core file debugging and source code
>>>>>> investigation. But I managed to write a regression test that 
>>>>>> triggers
>>>>>> a problem similar to the original problem.
>>>>>>
>>>>>> I think that problem happens because a CodeBlob allocated in one 
>>>>>> code
>>>>>> heap (A) is returned to a different code heap (B). When the CodeBlob
>>>>>> is returned B, it is added to B's freelist. However, as the CodeBlob
>>>>>> was allocated in A, the freelist of B now points into A (i.e., B is
>>>>>> corrupted).
>>>>>>
>>>>>> The code cache uses CodeCache::get_code_heap(const CodeBlob* cb) to
>>>>>> determine to which code heap a 'cb' is supposed to be returned to.
>>>>>> Since 8171008 (AOT) [1], the check is:
>>>>>>
>>>>>> CodeHeap* CodeCache::get_code_heap(const CodeBlob* cb) {
>>>>>> assert(cb != NULL, "CodeBlob is null");
>>>>>> FOR_ALL_HEAPS(heap) {
>>>>>> - if ((*heap)->contains(cb)) {
>>>>>> + if ((*heap)->contains(cb->code_begin())) {
>>>>>> return *heap;
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> The blob 'cb' can be returned to the wrong heap if, for example:
>>>>>> - 'cb' is at the end code heap A and
>>>>>> - the size of the code contained in 'cb' is 0 (therefore 
>>>>>> code_begin()
>>>>>> returns the address after 'cb', i.e., the first address in code
>>>>>> heap B).
>>>>>>
>>>>>> The fix proposes to restore CodeCache::get_code_heap() to its 
>>>>>> pre-AOT
>>>>>> state (as I'm not aware of the reasons why AOT changed that 
>>>>>> check). I
>>>>>> also propose to add some guarantees after allocation/deallocation in
>>>>>> the code heap to possibly easier catch this or related problems in
>>>>>> the future.
>>>>>>
>>>>>> The regression test I propose achieves the above condition and
>>>>>> results in a crash. The regression test works only with product
>>>>>> builds, because in a product build a BufferBlob fits into one 
>>>>>> segment
>>>>>> whereas in a fastdebug build it does not.
>>>>>>
>>>>>> The test needs to set the CodeCacheMinBlockLength flag to 1. The 
>>>>>> flag
>>>>>> is currently develop and we would need to make it product for the
>>>>>> test to work. (Other flags controlling the code cache, e.g.,
>>>>>> CodeCacheExpansionSize, are also product.) I could also experiment
>>>>>> with reproducing the problem with different block lengths/segment
>>>>>> sizes, but that would most likely make the test more fragile (and
>>>>>> CodeCacheSegmentSize is anyway develop as well).
>>>>>>
>>>>>> I tested the fix with JPRT, RBT is in progress.
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>>
>>>>>> Zoltan
>>>>>>
>>>>>> [1]
>>>>>> http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/777aaa19c4b1#l116.71 
>>>>>>
>>>>>>
>>>>
>>