[9] RFR (S): 8173151: Code heap corruption due to incorrect inclusion test

Tue Feb 7 18:08:49 UTC 2017

Yes, this looks good.

I think it is good to push this first and see if we hit this problem 
again later. The problem could be related to segmented codecache and how 
we manipulate free and used segments. So the fix may not solve it but at 
least it will close one issue you found.

Thanks,
Vladimir

On 2/7/17 5:17 AM, Zoltán Majó wrote:
> Hi Vladimir,
>
>
> thank you for the feedback!
>
> On 02/06/2017 09:22 PM, Vladimir Kozlov wrote:
>> On 2/6/17 9:18 AM, Zoltán Majó wrote:
>>> [...]
>>>
>>> As we discussed offline, it seems that AOTCompiledMethods are allocated
>>> on the C heap (instead in the code cache). That is, only the code of an
>>> AOT method is allocated in the code cache. Could somebody from the AOT
>>> team please confirm that?
>>
>> Yes, CodeBlob descriptors of AOT methods are allocated in C heap since
>> AOT code is in RO memory (in loaded AOT library) and we can't put
>> modifiable (addresses are unknown during AOT lib generation) CodeBlob
>> there.
>>
>> It is different from nmethods where CodeBlob is "wrapper" of nmethod
>> code in CodeCache. That is why we have to use cb->code_begin().
>
> Thank you for clarifying.
>
>>
>> Sorry but your changes for get_code_heap() are incorrect for AOT code.
>>
>> There is an other similar code which could be problematic -
>> CodeBlobIterator:
>>
>> http://hg.openjdk.java.net/jdk9/hs/hotspot/file/8d77eb1a669d/src/share/vm/code/codeCache.hpp#l298
>>
>
> No problem, thank you for pointing me to the other code location.
>
>>
>> Note, CodeHeap::contains() is virtual and aotCodeHeap overwrite it but
>> still accept an address. Based on usage cases (how many) we may change
>> it to accept CodeBlob type instead of address and do appropriate
>> checks depending on AOT or not AOT heap. Or we add an other CodeHeap
>> method contains_blob(CodeBlob *) which do that.
>
> OK, I added the CodeHeap::contains_blob(CodeBlob*) virtual method, as
> you suggested. I updated both CodeCache::get_code_heap() and
> CodeBlobIterator::CodeBlobIterator() to use that method instead of the
> CodeHeap::contains(void*) method (which is also virtual, so we did not
> increase the number of virtual calls by using contains_blob()).
>
> Here is the updated webrev:
> http://cr.openjdk.java.net/~zmajo/8173151/webrev.02/
>
>>
>>>
>>> If the above is true, we could add a virtual method
>>> CodeBlob::blob_begin() to CodeBlob, which would return 'this'.
>>> AOTCompiledMethod would then override blob_begin() to return
>>> code_begin(). In CodeCache::get_code_heap() we would then use
>>> ...contains(cb->blob_begin()). Would that make sense? (I'm sure we can
>>> find some better for that method.)
>>
>> We worked very hard to avoid virtual calls in AOT code since we saw
>> performance regression when scanning code cache. So we can't do
>> virtual calls.
>
> I see. The solution in webrev.02 above does not increase the number of
> virtual calls (unless I miss something, of course).
>
> I tested the new patch with JPRT, all tests pass. I'll start RBT soon.
>
> Thank you!
>
> Best regards,
>
>
> Zoltan
>
>>
>> Thanks,
>> Vladimir
>>
>>>
>>>> - I'm not sure if we should make CodeCacheMinBlockLength a product
>>>> flag because we would then also need more testing with different flag
>>>> values. What about making it diagnostic or experimental?
>>>
>>> Let's make it diagnostic then.
>>>
>>> Here is the updated webrev:
>>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.01/
>>>
>>> JPRT testing is in progress.
>>>
>>> Thank you!
>>>
>>> Best regards,
>>>
>>>
>>> Zoltan
>>>
>>>>
>>>> Thanks,
>>>> Tobias
>>>>
>>>> On 06.02.2017 16:29, Zoltán Majó wrote:
>>>>> Hi,
>>>>>
>>>>>
>>>>> please review the fix for 8173151.
>>>>>
>>>>> https://bugs.openjdk.java.net/browse/JDK-8173151
>>>>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.00/
>>>>>
>>>>> The crash reported in the bug is caused by the corruption of the
>>>>> 'non-profiled nmethods' code heap. CodeHeap::_freelist for that code
>>>>> heap points to an address one segment before the heap's address
>>>>> space. The sweeper starts iterating through the code heap from the
>>>>> beginning of the heap's address space. Thus, the sweeper assumes that
>>>>> the first item in the code heap is a HeapBlock/FreeBlock (with the
>>>>> appropriate length and usage information). However, that is not the
>>>>> case, as the first item in the heap is actually *before* the heap. So
>>>>> the sweeper crashes.
>>>>>
>>>>> This is a hard-to-reproduce problem (I managed to reproduce it only
>>>>> once in 350 iterations, each iteration taking ~25 minutes). So the
>>>>> fix I propose is based on core file debugging and source code
>>>>> investigation. But I managed to write a regression test that triggers
>>>>> a problem similar to the original problem.
>>>>>
>>>>> I think that problem happens because a CodeBlob allocated in one code
>>>>> heap (A) is returned to a different code heap (B). When the CodeBlob
>>>>> is returned B, it is added to B's freelist. However, as the CodeBlob
>>>>> was allocated in A, the freelist of B now points into A (i.e., B is
>>>>> corrupted).
>>>>>
>>>>> The code cache uses CodeCache::get_code_heap(const CodeBlob* cb) to
>>>>> determine to which code heap a 'cb' is supposed to be returned to.
>>>>> Since 8171008 (AOT) [1], the check is:
>>>>>
>>>>> CodeHeap* CodeCache::get_code_heap(const CodeBlob* cb) {
>>>>> assert(cb != NULL, "CodeBlob is null");
>>>>> FOR_ALL_HEAPS(heap) {
>>>>> - if ((*heap)->contains(cb)) {
>>>>> + if ((*heap)->contains(cb->code_begin())) {
>>>>> return *heap;
>>>>> }
>>>>> }
>>>>>
>>>>> The blob 'cb' can be returned to the wrong heap if, for example:
>>>>> - 'cb' is at the end code heap A and
>>>>> - the size of the code contained in 'cb' is 0 (therefore code_begin()
>>>>> returns the address after 'cb', i.e., the first address in code
>>>>> heap B).
>>>>>
>>>>> The fix proposes to restore CodeCache::get_code_heap() to its pre-AOT
>>>>> state (as I'm not aware of the reasons why AOT changed that check). I
>>>>> also propose to add some guarantees after allocation/deallocation in
>>>>> the code heap to possibly easier catch this or related problems in
>>>>> the future.
>>>>>
>>>>> The regression test I propose achieves the above condition and
>>>>> results in a crash. The regression test works only with product
>>>>> builds, because in a product build a BufferBlob fits into one segment
>>>>> whereas in a fastdebug build it does not.
>>>>>
>>>>> The test needs to set the CodeCacheMinBlockLength flag to 1. The flag
>>>>> is currently develop and we would need to make it product for the
>>>>> test to work. (Other flags controlling the code cache, e.g.,
>>>>> CodeCacheExpansionSize, are also product.) I could also experiment
>>>>> with reproducing the problem with different block lengths/segment
>>>>> sizes, but that would most likely make the test more fragile (and
>>>>> CodeCacheSegmentSize is anyway develop as well).
>>>>>
>>>>> I tested the fix with JPRT, RBT is in progress.
>>>>>
>>>>> Thank you!
>>>>>
>>>>> Best regards,
>>>>>
>>>>>
>>>>> Zoltan
>>>>>
>>>>> [1]
>>>>> http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/777aaa19c4b1#l116.71
>>>>>
>>>
>