[9] RFR (S): 8173151: Code heap corruption due to incorrect inclusion test

Wed Feb 8 18:33:35 UTC 2017

Thank you for looking on this, Zoltan

 > Also, the failure with RandomAllocationTest appeared only after AOT was
 > pushed. Otherwise code cache management was not changed recently and
 > we've been running RandomAllocationTest for two years now. That makes it
 > likely that we would have hit this issue earlier.

Good point.

Looking on the RandomAllocationTest test and it uses Random to generate 
size of CodeBlob and indeed could be 0 (+ header size). So lets hope 
this change fix the problem.

Thanks,
Vladimir

On 2/8/17 4:39 AM, Zoltán Majó wrote:
> Hi Vladimir,
>
>
> On 02/07/2017 06:27 PM, Vladimir Kozlov wrote:
>> Hi, Zoltan
>>
>> Is it possible that the problem exist in free list maintenance code?
>> Ot allocation?
>>
>> Or free size at the end of heap calculated incorrectly when CodeBlob
>> is allocated there and it can cross boundary?
>
> I'm not sure. AOT did not change allocation / free list management code.
> It only changed CodeCache::get_code_heap() to use cb->code_begin()
> instead of cb directly.
>
> Also, the failure with RandomAllocationTest appeared only after AOT was
> pushed. Otherwise code cache management was not changed recently and
> we've been running RandomAllocationTest for two years now. That makes it
> likely that we would have hit this issue earlier.
>
>>
>> Can you look in core file that all 3 heaps have consistent start
>> address and size so that next heap start at correct address. Like
>> _number_of_committed_segments, _number_of_reserved_segments,
>> _next_segment
>>
>> I am concern that may be there is a code which check status of next
>> segment and use it regardless if it belong to current heap or not.
>
> Thank you for your suggestions!
>
> The test failed with -XX:+TieredCompilation -XX:TieredStopAtLevel=1 and
> AOT disabled. So there are only two heaps. Here is the information for
> about both:
>
>
> (1) non-profiled nmethods heap
>
> CodeHeap 'non-profiled nmethods': size=238384Kb used=178496Kb
> max_used=224080Kb free=59887Kb
>  bounds [0x000000011b20c000, 0x0000000129ad8000, 0x0000000129ad8000]
>
> Length of segment: 0xE8CC000 = 0x1D1980 segments
>
> _next_segment:                          0x00000000001d13e6 -- < 0x1D1980
> so within current heap
> _number_of_committed_segments:          0x00000000001d1980 -- OK
> _number_of_reserved_segments:           0x00000000001d1980 -- OK
> _freelist_segments:                     0x00000000000749e1 -- < 0x1D1980
> so fits into current heap
> _freelist:                              0x000000011b20bf80 -- points
> into other code heap
>
>
> (2) non nmethod code heap
>
> CodeHeap 'non-nmethods': size=7376Kb used=5822Kb max_used=7376Kb
> free=1553Kb
>  bounds [0x000000011aad8000, 0x000000011b20c000, 0x000000011b20c000]
>
> Length of segment: 0x734000 = 0xe680 segments -- OK
>
> _next_segment:                          0x000000000000e680 -- size of
> current heap
> _number_of_committed_segments:          0x000000000000e680 -- size of
> current heap
> _number_of_reserved_segments:           0x000000000000e680 -- size of
> current heap
> _freelist_segments:                     0x000000000000308d -- < 0xe680
> so fits into current heap
> _freelist:                              0x000000011ada6780 -- points
> into the current code heap
>
>
> There are two (possibly) suspicious elements:
> - For (1), the _freelist points to into (2) -- we've discussed this before.
> - For (2), _next_segment, _number_of_committed_segments, and
> _number_of_reserved_segments is equal to the size of the code heap. That
> is, _next_segment of (2) points into (1).
>
> So I checked places where _next_segment is modified. Other than cases
> when _next_segment is initialized to 0, only CodeHeap::allocate()
> changes _next_segment.
>
> http://hg.openjdk.java.net/jdk9/hs/hotspot/file/9b2ab23d5676/src/share/vm/memory/heap.cpp#l202
>
>
> But I don't think an allocation can cause the code heap to overflow
> there. An overflow would imply that
> - (a) the number of committed segments is equal to the size of the
> memory space reserved for the code heap (in segments) and
> - (b) the newly allocated block points to (or after)
> _number_of_committed_segments.
>
> We can return a block pointing to _number_of_committed_segments if
> _next_segment==_number_of_committed_segments and number_of_segments <= 0
> so that the condition
>
> _next_segment + number_of_segments <= _number_of_committed_segments
>
> holds. But we have an assert before that checks that number_of_segments
>>= sizeof(FreeBlock) (which is in turn > 0).
>
> Unfortunately, I don't see other ways the heap can become corrupted. But
> let's hope the guarantees the patch plans to add help catch this problem
> in the future (in case the patch does not already fix the problem).
>
> Thank you!
>
> Best regards,
>
>
> Zoltan
>
>
>>
>> Thanks,
>> Vladimir
>>
>> On 2/7/17 5:17 AM, Zoltán Majó wrote:
>>> Hi Dean,
>>> Hi Tobias,
>>>
>>>
>>> thank you for the feedback!
>>>
>>> On 02/07/2017 07:01 AM, Tobias Hartmann wrote:
>>>> Hi Dean,
>>>>
>>>> On 07.02.2017 00:14, dean.long at oracle.com wrote:
>>>>> When do we allocate a CodeBlob with a code size of 0?  Is it really
>>>>> useful?  Would having a minimum code size of 1 fix the problem?
>>>> Yes, I've discussed that with Zoltan offline and I agree that we
>>>> should enforce a minimum code size of 1 and fix the stress test(s)
>>>> accordingly. It shouldn't be necessary to be able to create regular
>>>> CodeBlobs with code size 0.
>>>
>>> It is indeed a valid point that code sizes of 0 are not useful in
>>> practice.
>>>
>>> However, the problem is a bit more complicated than the regression test
>>> (and my description) suggests it to be.  The regression test works only
>>> if CodeCacheMinBlockLength set to 1.  Only that way is it possible to
>>> allocate BufferBlobs with size == 1 segment and set
>>> CodeBlob::_code_begin to point to the next segment (which is possibly in
>>> the next code heap).
>>>
>>> For both crashes I've seen, however, CodeCacheMinBlockLength was equal
>>> to 4.  With that CodeCacheMinBlockLength value, every allocation in the
>>> code heap is at least 4 segments long.  It is therefore impossible to
>>> create a BufferBlob whose _code_begin points "outside" of the space
>>> reserved for that BufferBlob.  (Because BufferBlob::_code_begin will
>>> either point to the beginning of second segment of the BufferBlob --
>>> with product -- or somewhere into the second segment -- with fastdebug.)
>>>
>>> Other blob types can have their _code_begin point to the boundary of 4
>>> segments.  E.g., nmethods are between 2 and 3 segments long.  If the
>>> relocation information in the nmethod fills up the space to the 4
>>> segment boundary, _code_begin will be exactly at the segment boundary.
>>>
>>> But I don't know the exact sequence of steps lead to the code heap
>>> corruption (because the crash reproduced only twice so far).  I just
>>> know that the first free block in the (corrupted) code heap is of length
>>> 8 segments (the contents of the first freelist item and of the segmap
>>> both confirm this); the heap is corrupted because the freelist starts
>>> before the code heap.
>>>
>>> What sequence of steps lead to this?  Some nmethod
>>> allocations/deallocations interleaved with BufferBlob
>>> allocations/deallocations (possibly of size 0)?  I'm not sure.  In what
>>> way were blocks in the heap merged/split by the
>>> allocations/deallocations leading to the crash?  I don't know.
>>>
>>> Did allocation/deallocation of BufferBlobs of size 0 play a role at
>>> all?  I don't know either and I won't be able to tell unless the problem
>>> reproduces more often.  However, by using code size == 0 it is possible
>>> to write a regression test that triggers a real problem (which is
>>> hopefully related to the crashes we've been seeing).
>>>
>>> So I'd like to allow code sizes of 0 for now and attempt to fix
>>> CodeCache::get_code_heap() and the CodeBlobIterator constructor the way
>>> Vladimir suggested.  If that turns out to be difficult/risky or require
>>> large code changes, I'll propose a fix instead that disallows code sizes
>>> of 1 and fixes the test(s) -- just as you suggested.
>>>
>>> Best regards,
>>>
>>>
>>> Zoltan
>>>
>>>> Best regards,
>>>> Tobias
>>>>
>>>>> dl
>>>>>
>>>>>
>>>>> On 2/6/17 7:29 AM, Zoltán Majó wrote:
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> please review the fix for 8173151.
>>>>>>
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8173151
>>>>>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.00/
>>>>>>
>>>>>> The crash reported in the bug is caused by the corruption of the
>>>>>> 'non-profiled nmethods' code heap. CodeHeap::_freelist for that code
>>>>>> heap points to an address one segment before the heap's address
>>>>>> space. The sweeper starts iterating through the code heap from the
>>>>>> beginning of the heap's address space. Thus, the sweeper assumes
>>>>>> that the first item in the code heap is a HeapBlock/FreeBlock (with
>>>>>> the appropriate length and usage information). However, that is not
>>>>>> the case, as the first item in the heap is actually *before* the
>>>>>> heap. So the sweeper crashes.
>>>>>>
>>>>>> This is a hard-to-reproduce problem (I managed to reproduce it only
>>>>>> once in 350 iterations, each iteration taking ~25 minutes). So the
>>>>>> fix I propose is based on core file debugging and source code
>>>>>> investigation. But I managed to write a regression test that
>>>>>> triggers a problem similar to the original problem.
>>>>>>
>>>>>> I think that problem happens because a CodeBlob allocated in one
>>>>>> code heap (A) is returned to a different code heap (B). When the
>>>>>> CodeBlob is returned B, it is added to B's freelist. However, as the
>>>>>> CodeBlob was allocated in A, the freelist of B now points into A
>>>>>> (i.e., B is corrupted).
>>>>>>
>>>>>> The code cache uses CodeCache::get_code_heap(const CodeBlob* cb) to
>>>>>> determine to which code heap a 'cb' is supposed to be returned to.
>>>>>> Since 8171008 (AOT) [1], the check is:
>>>>>>
>>>>>> CodeHeap* CodeCache::get_code_heap(const CodeBlob* cb) {
>>>>>> assert(cb != NULL, "CodeBlob is null");
>>>>>> FOR_ALL_HEAPS(heap) {
>>>>>> - if ((*heap)->contains(cb)) {
>>>>>> + if ((*heap)->contains(cb->code_begin())) {
>>>>>> return *heap;
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> The blob 'cb' can be returned to the wrong heap if, for example:
>>>>>> - 'cb' is at the end code heap A and
>>>>>> - the size of the code contained in 'cb' is 0 (therefore
>>>>>> code_begin() returns the address after 'cb', i.e., the first address
>>>>>> in code heap B).
>>>>>>
>>>>>> The fix proposes to restore CodeCache::get_code_heap() to its
>>>>>> pre-AOT state (as I'm not aware of the reasons why AOT changed that
>>>>>> check). I also propose to add some guarantees after
>>>>>> allocation/deallocation in the code heap to possibly easier catch
>>>>>> this or related problems in the future.
>>>>>>
>>>>>> The regression test I propose achieves the above condition and
>>>>>> results in a crash. The regression test works only with product
>>>>>> builds, because in a product build a BufferBlob fits into one
>>>>>> segment whereas in a fastdebug build it does not.
>>>>>>
>>>>>> The test needs to set the CodeCacheMinBlockLength flag to 1. The
>>>>>> flag is currently develop and we would need to make it product for
>>>>>> the test to work. (Other flags controlling the code cache, e.g.,
>>>>>> CodeCacheExpansionSize, are also product.) I could also experiment
>>>>>> with reproducing the problem with different block lengths/segment
>>>>>> sizes, but that would most likely make the test more fragile (and
>>>>>> CodeCacheSegmentSize is anyway develop as well).
>>>>>>
>>>>>> I tested the fix with JPRT, RBT is in progress.
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>>
>>>>>> Zoltan
>>>>>>
>>>>>> [1]
>>>>>> http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/777aaa19c4b1#l116.71
>>>>>>
>>>
>