[9] RFR (S): 8173151: Code heap corruption due to incorrect inclusion test

Fri Feb 10 07:24:02 UTC 2017

On 02/08/2017 07:33 PM, Vladimir Kozlov wrote:
> Thank you for looking on this, Zoltan
>
> > Also, the failure with RandomAllocationTest appeared only after AOT was
> > pushed. Otherwise code cache management was not changed recently and
> > we've been running RandomAllocationTest for two years now. That 
> makes it
> > likely that we would have hit this issue earlier.
>
> Good point.
>
> Looking on the RandomAllocationTest test and it uses Random to 
> generate size of CodeBlob and indeed could be 0 (+ header size). So 
> lets hope this change fix the problem.

Thank you, Vladimir!

Best regards,

Zoltan

>
> Thanks,
> Vladimir
>
> On 2/8/17 4:39 AM, Zoltán Majó wrote:
>> Hi Vladimir,
>>
>>
>> On 02/07/2017 06:27 PM, Vladimir Kozlov wrote:
>>> Hi, Zoltan
>>>
>>> Is it possible that the problem exist in free list maintenance code?
>>> Ot allocation?
>>>
>>> Or free size at the end of heap calculated incorrectly when CodeBlob
>>> is allocated there and it can cross boundary?
>>
>> I'm not sure. AOT did not change allocation / free list management code.
>> It only changed CodeCache::get_code_heap() to use cb->code_begin()
>> instead of cb directly.
>>
>> Also, the failure with RandomAllocationTest appeared only after AOT was
>> pushed. Otherwise code cache management was not changed recently and
>> we've been running RandomAllocationTest for two years now. That makes it
>> likely that we would have hit this issue earlier.
>>
>>>
>>> Can you look in core file that all 3 heaps have consistent start
>>> address and size so that next heap start at correct address. Like
>>> _number_of_committed_segments, _number_of_reserved_segments,
>>> _next_segment
>>>
>>> I am concern that may be there is a code which check status of next
>>> segment and use it regardless if it belong to current heap or not.
>>
>> Thank you for your suggestions!
>>
>> The test failed with -XX:+TieredCompilation -XX:TieredStopAtLevel=1 and
>> AOT disabled. So there are only two heaps. Here is the information for
>> about both:
>>
>>
>> (1) non-profiled nmethods heap
>>
>> CodeHeap 'non-profiled nmethods': size=238384Kb used=178496Kb
>> max_used=224080Kb free=59887Kb
>>  bounds [0x000000011b20c000, 0x0000000129ad8000, 0x0000000129ad8000]
>>
>> Length of segment: 0xE8CC000 = 0x1D1980 segments
>>
>> _next_segment:                          0x00000000001d13e6 -- < 0x1D1980
>> so within current heap
>> _number_of_committed_segments:          0x00000000001d1980 -- OK
>> _number_of_reserved_segments:           0x00000000001d1980 -- OK
>> _freelist_segments:                     0x00000000000749e1 -- < 0x1D1980
>> so fits into current heap
>> _freelist:                              0x000000011b20bf80 -- points
>> into other code heap
>>
>>
>> (2) non nmethod code heap
>>
>> CodeHeap 'non-nmethods': size=7376Kb used=5822Kb max_used=7376Kb
>> free=1553Kb
>>  bounds [0x000000011aad8000, 0x000000011b20c000, 0x000000011b20c000]
>>
>> Length of segment: 0x734000 = 0xe680 segments -- OK
>>
>> _next_segment:                          0x000000000000e680 -- size of
>> current heap
>> _number_of_committed_segments:          0x000000000000e680 -- size of
>> current heap
>> _number_of_reserved_segments:           0x000000000000e680 -- size of
>> current heap
>> _freelist_segments:                     0x000000000000308d -- < 0xe680
>> so fits into current heap
>> _freelist:                              0x000000011ada6780 -- points
>> into the current code heap
>>
>>
>> There are two (possibly) suspicious elements:
>> - For (1), the _freelist points to into (2) -- we've discussed this 
>> before.
>> - For (2), _next_segment, _number_of_committed_segments, and
>> _number_of_reserved_segments is equal to the size of the code heap. That
>> is, _next_segment of (2) points into (1).
>>
>> So I checked places where _next_segment is modified. Other than cases
>> when _next_segment is initialized to 0, only CodeHeap::allocate()
>> changes _next_segment.
>>
>> http://hg.openjdk.java.net/jdk9/hs/hotspot/file/9b2ab23d5676/src/share/vm/memory/heap.cpp#l202 
>>
>>
>>
>> But I don't think an allocation can cause the code heap to overflow
>> there. An overflow would imply that
>> - (a) the number of committed segments is equal to the size of the
>> memory space reserved for the code heap (in segments) and
>> - (b) the newly allocated block points to (or after)
>> _number_of_committed_segments.
>>
>> We can return a block pointing to _number_of_committed_segments if
>> _next_segment==_number_of_committed_segments and number_of_segments <= 0
>> so that the condition
>>
>> _next_segment + number_of_segments <= _number_of_committed_segments
>>
>> holds. But we have an assert before that checks that number_of_segments
>>> = sizeof(FreeBlock) (which is in turn > 0).
>>
>> Unfortunately, I don't see other ways the heap can become corrupted. But
>> let's hope the guarantees the patch plans to add help catch this problem
>> in the future (in case the patch does not already fix the problem).
>>
>> Thank you!
>>
>> Best regards,
>>
>>
>> Zoltan
>>
>>
>>>
>>> Thanks,
>>> Vladimir
>>>
>>> On 2/7/17 5:17 AM, Zoltán Majó wrote:
>>>> Hi Dean,
>>>> Hi Tobias,
>>>>
>>>>
>>>> thank you for the feedback!
>>>>
>>>> On 02/07/2017 07:01 AM, Tobias Hartmann wrote:
>>>>> Hi Dean,
>>>>>
>>>>> On 07.02.2017 00:14, dean.long at oracle.com wrote:
>>>>>> When do we allocate a CodeBlob with a code size of 0?  Is it really
>>>>>> useful?  Would having a minimum code size of 1 fix the problem?
>>>>> Yes, I've discussed that with Zoltan offline and I agree that we
>>>>> should enforce a minimum code size of 1 and fix the stress test(s)
>>>>> accordingly. It shouldn't be necessary to be able to create regular
>>>>> CodeBlobs with code size 0.
>>>>
>>>> It is indeed a valid point that code sizes of 0 are not useful in
>>>> practice.
>>>>
>>>> However, the problem is a bit more complicated than the regression 
>>>> test
>>>> (and my description) suggests it to be.  The regression test works 
>>>> only
>>>> if CodeCacheMinBlockLength set to 1.  Only that way is it possible to
>>>> allocate BufferBlobs with size == 1 segment and set
>>>> CodeBlob::_code_begin to point to the next segment (which is 
>>>> possibly in
>>>> the next code heap).
>>>>
>>>> For both crashes I've seen, however, CodeCacheMinBlockLength was equal
>>>> to 4.  With that CodeCacheMinBlockLength value, every allocation in 
>>>> the
>>>> code heap is at least 4 segments long.  It is therefore impossible to
>>>> create a BufferBlob whose _code_begin points "outside" of the space
>>>> reserved for that BufferBlob.  (Because BufferBlob::_code_begin will
>>>> either point to the beginning of second segment of the BufferBlob --
>>>> with product -- or somewhere into the second segment -- with 
>>>> fastdebug.)
>>>>
>>>> Other blob types can have their _code_begin point to the boundary of 4
>>>> segments.  E.g., nmethods are between 2 and 3 segments long.  If the
>>>> relocation information in the nmethod fills up the space to the 4
>>>> segment boundary, _code_begin will be exactly at the segment boundary.
>>>>
>>>> But I don't know the exact sequence of steps lead to the code heap
>>>> corruption (because the crash reproduced only twice so far).  I just
>>>> know that the first free block in the (corrupted) code heap is of 
>>>> length
>>>> 8 segments (the contents of the first freelist item and of the segmap
>>>> both confirm this); the heap is corrupted because the freelist starts
>>>> before the code heap.
>>>>
>>>> What sequence of steps lead to this?  Some nmethod
>>>> allocations/deallocations interleaved with BufferBlob
>>>> allocations/deallocations (possibly of size 0)?  I'm not sure.  In 
>>>> what
>>>> way were blocks in the heap merged/split by the
>>>> allocations/deallocations leading to the crash?  I don't know.
>>>>
>>>> Did allocation/deallocation of BufferBlobs of size 0 play a role at
>>>> all?  I don't know either and I won't be able to tell unless the 
>>>> problem
>>>> reproduces more often.  However, by using code size == 0 it is 
>>>> possible
>>>> to write a regression test that triggers a real problem (which is
>>>> hopefully related to the crashes we've been seeing).
>>>>
>>>> So I'd like to allow code sizes of 0 for now and attempt to fix
>>>> CodeCache::get_code_heap() and the CodeBlobIterator constructor the 
>>>> way
>>>> Vladimir suggested.  If that turns out to be difficult/risky or 
>>>> require
>>>> large code changes, I'll propose a fix instead that disallows code 
>>>> sizes
>>>> of 1 and fixes the test(s) -- just as you suggested.
>>>>
>>>> Best regards,
>>>>
>>>>
>>>> Zoltan
>>>>
>>>>> Best regards,
>>>>> Tobias
>>>>>
>>>>>> dl
>>>>>>
>>>>>>
>>>>>> On 2/6/17 7:29 AM, Zoltán Majó wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>>
>>>>>>> please review the fix for 8173151.
>>>>>>>
>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8173151
>>>>>>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.00/
>>>>>>>
>>>>>>> The crash reported in the bug is caused by the corruption of the
>>>>>>> 'non-profiled nmethods' code heap. CodeHeap::_freelist for that 
>>>>>>> code
>>>>>>> heap points to an address one segment before the heap's address
>>>>>>> space. The sweeper starts iterating through the code heap from the
>>>>>>> beginning of the heap's address space. Thus, the sweeper assumes
>>>>>>> that the first item in the code heap is a HeapBlock/FreeBlock (with
>>>>>>> the appropriate length and usage information). However, that is not
>>>>>>> the case, as the first item in the heap is actually *before* the
>>>>>>> heap. So the sweeper crashes.
>>>>>>>
>>>>>>> This is a hard-to-reproduce problem (I managed to reproduce it only
>>>>>>> once in 350 iterations, each iteration taking ~25 minutes). So the
>>>>>>> fix I propose is based on core file debugging and source code
>>>>>>> investigation. But I managed to write a regression test that
>>>>>>> triggers a problem similar to the original problem.
>>>>>>>
>>>>>>> I think that problem happens because a CodeBlob allocated in one
>>>>>>> code heap (A) is returned to a different code heap (B). When the
>>>>>>> CodeBlob is returned B, it is added to B's freelist. However, as 
>>>>>>> the
>>>>>>> CodeBlob was allocated in A, the freelist of B now points into A
>>>>>>> (i.e., B is corrupted).
>>>>>>>
>>>>>>> The code cache uses CodeCache::get_code_heap(const CodeBlob* cb) to
>>>>>>> determine to which code heap a 'cb' is supposed to be returned to.
>>>>>>> Since 8171008 (AOT) [1], the check is:
>>>>>>>
>>>>>>> CodeHeap* CodeCache::get_code_heap(const CodeBlob* cb) {
>>>>>>> assert(cb != NULL, "CodeBlob is null");
>>>>>>> FOR_ALL_HEAPS(heap) {
>>>>>>> - if ((*heap)->contains(cb)) {
>>>>>>> + if ((*heap)->contains(cb->code_begin())) {
>>>>>>> return *heap;
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> The blob 'cb' can be returned to the wrong heap if, for example:
>>>>>>> - 'cb' is at the end code heap A and
>>>>>>> - the size of the code contained in 'cb' is 0 (therefore
>>>>>>> code_begin() returns the address after 'cb', i.e., the first 
>>>>>>> address
>>>>>>> in code heap B).
>>>>>>>
>>>>>>> The fix proposes to restore CodeCache::get_code_heap() to its
>>>>>>> pre-AOT state (as I'm not aware of the reasons why AOT changed that
>>>>>>> check). I also propose to add some guarantees after
>>>>>>> allocation/deallocation in the code heap to possibly easier catch
>>>>>>> this or related problems in the future.
>>>>>>>
>>>>>>> The regression test I propose achieves the above condition and
>>>>>>> results in a crash. The regression test works only with product
>>>>>>> builds, because in a product build a BufferBlob fits into one
>>>>>>> segment whereas in a fastdebug build it does not.
>>>>>>>
>>>>>>> The test needs to set the CodeCacheMinBlockLength flag to 1. The
>>>>>>> flag is currently develop and we would need to make it product for
>>>>>>> the test to work. (Other flags controlling the code cache, e.g.,
>>>>>>> CodeCacheExpansionSize, are also product.) I could also experiment
>>>>>>> with reproducing the problem with different block lengths/segment
>>>>>>> sizes, but that would most likely make the test more fragile (and
>>>>>>> CodeCacheSegmentSize is anyway develop as well).
>>>>>>>
>>>>>>> I tested the fix with JPRT, RBT is in progress.
>>>>>>>
>>>>>>> Thank you!
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>>
>>>>>>> Zoltan
>>>>>>>
>>>>>>> [1]
>>>>>>> http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/777aaa19c4b1#l116.71 
>>>>>>>
>>>>>>>
>>>>
>>