[9] RFR (S): 8173151: Code heap corruption due to incorrect inclusion test

Tue Feb 7 13:17:36 UTC 2017

Hi Vladimir,

thank you for the feedback!

On 02/06/2017 09:22 PM, Vladimir Kozlov wrote:
> On 2/6/17 9:18 AM, Zoltán Majó wrote:
>> [...]
>>
>> As we discussed offline, it seems that AOTCompiledMethods are allocated
>> on the C heap (instead in the code cache). That is, only the code of an
>> AOT method is allocated in the code cache. Could somebody from the AOT
>> team please confirm that?
>
> Yes, CodeBlob descriptors of AOT methods are allocated in C heap since 
> AOT code is in RO memory (in loaded AOT library) and we can't put 
> modifiable (addresses are unknown during AOT lib generation) CodeBlob 
> there.
>
> It is different from nmethods where CodeBlob is "wrapper" of nmethod 
> code in CodeCache. That is why we have to use cb->code_begin().

Thank you for clarifying.

>
> Sorry but your changes for get_code_heap() are incorrect for AOT code.
>
> There is an other similar code which could be problematic - 
> CodeBlobIterator:
>
> http://hg.openjdk.java.net/jdk9/hs/hotspot/file/8d77eb1a669d/src/share/vm/code/codeCache.hpp#l298 
>

No problem, thank you for pointing me to the other code location.

>
> Note, CodeHeap::contains() is virtual and aotCodeHeap overwrite it but 
> still accept an address. Based on usage cases (how many) we may change 
> it to accept CodeBlob type instead of address and do appropriate 
> checks depending on AOT or not AOT heap. Or we add an other CodeHeap 
> method contains_blob(CodeBlob *) which do that.

OK, I added the CodeHeap::contains_blob(CodeBlob*) virtual method, as 
you suggested. I updated both CodeCache::get_code_heap() and 
CodeBlobIterator::CodeBlobIterator() to use that method instead of the 
CodeHeap::contains(void*) method (which is also virtual, so we did not 
increase the number of virtual calls by using contains_blob()).

Here is the updated webrev:
http://cr.openjdk.java.net/~zmajo/8173151/webrev.02/

>
>>
>> If the above is true, we could add a virtual method
>> CodeBlob::blob_begin() to CodeBlob, which would return 'this'.
>> AOTCompiledMethod would then override blob_begin() to return
>> code_begin(). In CodeCache::get_code_heap() we would then use
>> ...contains(cb->blob_begin()). Would that make sense? (I'm sure we can
>> find some better for that method.)
>
> We worked very hard to avoid virtual calls in AOT code since we saw 
> performance regression when scanning code cache. So we can't do 
> virtual calls.

I see. The solution in webrev.02 above does not increase the number of 
virtual calls (unless I miss something, of course).

I tested the new patch with JPRT, all tests pass. I'll start RBT soon.

Thank you!

Best regards,

Zoltan

>
> Thanks,
> Vladimir
>
>>
>>> - I'm not sure if we should make CodeCacheMinBlockLength a product
>>> flag because we would then also need more testing with different flag
>>> values. What about making it diagnostic or experimental?
>>
>> Let's make it diagnostic then.
>>
>> Here is the updated webrev:
>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.01/
>>
>> JPRT testing is in progress.
>>
>> Thank you!
>>
>> Best regards,
>>
>>
>> Zoltan
>>
>>>
>>> Thanks,
>>> Tobias
>>>
>>> On 06.02.2017 16:29, Zoltán Majó wrote:
>>>> Hi,
>>>>
>>>>
>>>> please review the fix for 8173151.
>>>>
>>>> https://bugs.openjdk.java.net/browse/JDK-8173151
>>>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.00/
>>>>
>>>> The crash reported in the bug is caused by the corruption of the
>>>> 'non-profiled nmethods' code heap. CodeHeap::_freelist for that code
>>>> heap points to an address one segment before the heap's address
>>>> space. The sweeper starts iterating through the code heap from the
>>>> beginning of the heap's address space. Thus, the sweeper assumes that
>>>> the first item in the code heap is a HeapBlock/FreeBlock (with the
>>>> appropriate length and usage information). However, that is not the
>>>> case, as the first item in the heap is actually *before* the heap. So
>>>> the sweeper crashes.
>>>>
>>>> This is a hard-to-reproduce problem (I managed to reproduce it only
>>>> once in 350 iterations, each iteration taking ~25 minutes). So the
>>>> fix I propose is based on core file debugging and source code
>>>> investigation. But I managed to write a regression test that triggers
>>>> a problem similar to the original problem.
>>>>
>>>> I think that problem happens because a CodeBlob allocated in one code
>>>> heap (A) is returned to a different code heap (B). When the CodeBlob
>>>> is returned B, it is added to B's freelist. However, as the CodeBlob
>>>> was allocated in A, the freelist of B now points into A (i.e., B is
>>>> corrupted).
>>>>
>>>> The code cache uses CodeCache::get_code_heap(const CodeBlob* cb) to
>>>> determine to which code heap a 'cb' is supposed to be returned to.
>>>> Since 8171008 (AOT) [1], the check is:
>>>>
>>>> CodeHeap* CodeCache::get_code_heap(const CodeBlob* cb) {
>>>> assert(cb != NULL, "CodeBlob is null");
>>>> FOR_ALL_HEAPS(heap) {
>>>> - if ((*heap)->contains(cb)) {
>>>> + if ((*heap)->contains(cb->code_begin())) {
>>>> return *heap;
>>>> }
>>>> }
>>>>
>>>> The blob 'cb' can be returned to the wrong heap if, for example:
>>>> - 'cb' is at the end code heap A and
>>>> - the size of the code contained in 'cb' is 0 (therefore code_begin()
>>>> returns the address after 'cb', i.e., the first address in code 
>>>> heap B).
>>>>
>>>> The fix proposes to restore CodeCache::get_code_heap() to its pre-AOT
>>>> state (as I'm not aware of the reasons why AOT changed that check). I
>>>> also propose to add some guarantees after allocation/deallocation in
>>>> the code heap to possibly easier catch this or related problems in
>>>> the future.
>>>>
>>>> The regression test I propose achieves the above condition and
>>>> results in a crash. The regression test works only with product
>>>> builds, because in a product build a BufferBlob fits into one segment
>>>> whereas in a fastdebug build it does not.
>>>>
>>>> The test needs to set the CodeCacheMinBlockLength flag to 1. The flag
>>>> is currently develop and we would need to make it product for the
>>>> test to work. (Other flags controlling the code cache, e.g.,
>>>> CodeCacheExpansionSize, are also product.) I could also experiment
>>>> with reproducing the problem with different block lengths/segment
>>>> sizes, but that would most likely make the test more fragile (and
>>>> CodeCacheSegmentSize is anyway develop as well).
>>>>
>>>> I tested the fix with JPRT, RBT is in progress.
>>>>
>>>> Thank you!
>>>>
>>>> Best regards,
>>>>
>>>>
>>>> Zoltan
>>>>
>>>> [1]
>>>> http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/777aaa19c4b1#l116.71
>>>>
>>