[9] RFR (S): 8173151: Code heap corruption due to incorrect inclusion test

Wed Feb 8 12:39:35 UTC 2017

Hi Vladimir,

On 02/07/2017 06:27 PM, Vladimir Kozlov wrote:
> Hi, Zoltan
>
> Is it possible that the problem exist in free list maintenance code? 
> Ot allocation?
>
> Or free size at the end of heap calculated incorrectly when CodeBlob 
> is allocated there and it can cross boundary?

I'm not sure. AOT did not change allocation / free list management code. 
It only changed CodeCache::get_code_heap() to use cb->code_begin() 
instead of cb directly.

Also, the failure with RandomAllocationTest appeared only after AOT was 
pushed. Otherwise code cache management was not changed recently and 
we've been running RandomAllocationTest for two years now. That makes it 
likely that we would have hit this issue earlier.

>
> Can you look in core file that all 3 heaps have consistent start 
> address and size so that next heap start at correct address. Like 
> _number_of_committed_segments, _number_of_reserved_segments, 
> _next_segment
>
> I am concern that may be there is a code which check status of next 
> segment and use it regardless if it belong to current heap or not.

Thank you for your suggestions!

The test failed with -XX:+TieredCompilation -XX:TieredStopAtLevel=1 and 
AOT disabled. So there are only two heaps. Here is the information for 
about both:

(1) non-profiled nmethods heap

CodeHeap 'non-profiled nmethods': size=238384Kb used=178496Kb 
max_used=224080Kb free=59887Kb
  bounds [0x000000011b20c000, 0x0000000129ad8000, 0x0000000129ad8000]

Length of segment: 0xE8CC000 = 0x1D1980 segments

_next_segment:                          0x00000000001d13e6 -- < 0x1D1980 
so within current heap
_number_of_committed_segments:          0x00000000001d1980 -- OK
_number_of_reserved_segments:           0x00000000001d1980 -- OK
_freelist_segments:                     0x00000000000749e1 -- < 0x1D1980 
so fits into current heap
_freelist:                              0x000000011b20bf80 -- points 
into other code heap

(2) non nmethod code heap

CodeHeap 'non-nmethods': size=7376Kb used=5822Kb max_used=7376Kb free=1553Kb
  bounds [0x000000011aad8000, 0x000000011b20c000, 0x000000011b20c000]

Length of segment: 0x734000 = 0xe680 segments -- OK

_next_segment:                          0x000000000000e680 -- size of 
current heap
_number_of_committed_segments:          0x000000000000e680 -- size of 
current heap
_number_of_reserved_segments:           0x000000000000e680 -- size of 
current heap
_freelist_segments:                     0x000000000000308d -- < 0xe680 
so fits into current heap
_freelist:                              0x000000011ada6780 -- points 
into the current code heap

There are two (possibly) suspicious elements:
- For (1), the _freelist points to into (2) -- we've discussed this before.
- For (2), _next_segment, _number_of_committed_segments, and 
_number_of_reserved_segments is equal to the size of the code heap. That 
is, _next_segment of (2) points into (1).

So I checked places where _next_segment is modified. Other than cases 
when _next_segment is initialized to 0, only CodeHeap::allocate() 
changes _next_segment.

http://hg.openjdk.java.net/jdk9/hs/hotspot/file/9b2ab23d5676/src/share/vm/memory/heap.cpp#l202

But I don't think an allocation can cause the code heap to overflow 
there. An overflow would imply that
- (a) the number of committed segments is equal to the size of the 
memory space reserved for the code heap (in segments) and
- (b) the newly allocated block points to (or after) 
_number_of_committed_segments.

We can return a block pointing to _number_of_committed_segments if 
_next_segment==_number_of_committed_segments and number_of_segments <= 0 
so that the condition

_next_segment + number_of_segments <= _number_of_committed_segments

holds. But we have an assert before that checks that number_of_segments 
 >= sizeof(FreeBlock) (which is in turn > 0).

Unfortunately, I don't see other ways the heap can become corrupted. But 
let's hope the guarantees the patch plans to add help catch this problem 
in the future (in case the patch does not already fix the problem).

Thank you!

Best regards,

Zoltan

>
> Thanks,
> Vladimir
>
> On 2/7/17 5:17 AM, Zoltán Majó wrote:
>> Hi Dean,
>> Hi Tobias,
>>
>>
>> thank you for the feedback!
>>
>> On 02/07/2017 07:01 AM, Tobias Hartmann wrote:
>>> Hi Dean,
>>>
>>> On 07.02.2017 00:14, dean.long at oracle.com wrote:
>>>> When do we allocate a CodeBlob with a code size of 0?  Is it really
>>>> useful?  Would having a minimum code size of 1 fix the problem?
>>> Yes, I've discussed that with Zoltan offline and I agree that we
>>> should enforce a minimum code size of 1 and fix the stress test(s)
>>> accordingly. It shouldn't be necessary to be able to create regular
>>> CodeBlobs with code size 0.
>>
>> It is indeed a valid point that code sizes of 0 are not useful in 
>> practice.
>>
>> However, the problem is a bit more complicated than the regression test
>> (and my description) suggests it to be.  The regression test works only
>> if CodeCacheMinBlockLength set to 1.  Only that way is it possible to
>> allocate BufferBlobs with size == 1 segment and set
>> CodeBlob::_code_begin to point to the next segment (which is possibly in
>> the next code heap).
>>
>> For both crashes I've seen, however, CodeCacheMinBlockLength was equal
>> to 4.  With that CodeCacheMinBlockLength value, every allocation in the
>> code heap is at least 4 segments long.  It is therefore impossible to
>> create a BufferBlob whose _code_begin points "outside" of the space
>> reserved for that BufferBlob.  (Because BufferBlob::_code_begin will
>> either point to the beginning of second segment of the BufferBlob --
>> with product -- or somewhere into the second segment -- with fastdebug.)
>>
>> Other blob types can have their _code_begin point to the boundary of 4
>> segments.  E.g., nmethods are between 2 and 3 segments long.  If the
>> relocation information in the nmethod fills up the space to the 4
>> segment boundary, _code_begin will be exactly at the segment boundary.
>>
>> But I don't know the exact sequence of steps lead to the code heap
>> corruption (because the crash reproduced only twice so far).  I just
>> know that the first free block in the (corrupted) code heap is of length
>> 8 segments (the contents of the first freelist item and of the segmap
>> both confirm this); the heap is corrupted because the freelist starts
>> before the code heap.
>>
>> What sequence of steps lead to this?  Some nmethod
>> allocations/deallocations interleaved with BufferBlob
>> allocations/deallocations (possibly of size 0)?  I'm not sure.  In what
>> way were blocks in the heap merged/split by the
>> allocations/deallocations leading to the crash?  I don't know.
>>
>> Did allocation/deallocation of BufferBlobs of size 0 play a role at
>> all?  I don't know either and I won't be able to tell unless the problem
>> reproduces more often.  However, by using code size == 0 it is possible
>> to write a regression test that triggers a real problem (which is
>> hopefully related to the crashes we've been seeing).
>>
>> So I'd like to allow code sizes of 0 for now and attempt to fix
>> CodeCache::get_code_heap() and the CodeBlobIterator constructor the way
>> Vladimir suggested.  If that turns out to be difficult/risky or require
>> large code changes, I'll propose a fix instead that disallows code sizes
>> of 1 and fixes the test(s) -- just as you suggested.
>>
>> Best regards,
>>
>>
>> Zoltan
>>
>>> Best regards,
>>> Tobias
>>>
>>>> dl
>>>>
>>>>
>>>> On 2/6/17 7:29 AM, Zoltán Majó wrote:
>>>>> Hi,
>>>>>
>>>>>
>>>>> please review the fix for 8173151.
>>>>>
>>>>> https://bugs.openjdk.java.net/browse/JDK-8173151
>>>>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.00/
>>>>>
>>>>> The crash reported in the bug is caused by the corruption of the
>>>>> 'non-profiled nmethods' code heap. CodeHeap::_freelist for that code
>>>>> heap points to an address one segment before the heap's address
>>>>> space. The sweeper starts iterating through the code heap from the
>>>>> beginning of the heap's address space. Thus, the sweeper assumes
>>>>> that the first item in the code heap is a HeapBlock/FreeBlock (with
>>>>> the appropriate length and usage information). However, that is not
>>>>> the case, as the first item in the heap is actually *before* the
>>>>> heap. So the sweeper crashes.
>>>>>
>>>>> This is a hard-to-reproduce problem (I managed to reproduce it only
>>>>> once in 350 iterations, each iteration taking ~25 minutes). So the
>>>>> fix I propose is based on core file debugging and source code
>>>>> investigation. But I managed to write a regression test that
>>>>> triggers a problem similar to the original problem.
>>>>>
>>>>> I think that problem happens because a CodeBlob allocated in one
>>>>> code heap (A) is returned to a different code heap (B). When the
>>>>> CodeBlob is returned B, it is added to B's freelist. However, as the
>>>>> CodeBlob was allocated in A, the freelist of B now points into A
>>>>> (i.e., B is corrupted).
>>>>>
>>>>> The code cache uses CodeCache::get_code_heap(const CodeBlob* cb) to
>>>>> determine to which code heap a 'cb' is supposed to be returned to.
>>>>> Since 8171008 (AOT) [1], the check is:
>>>>>
>>>>> CodeHeap* CodeCache::get_code_heap(const CodeBlob* cb) {
>>>>> assert(cb != NULL, "CodeBlob is null");
>>>>> FOR_ALL_HEAPS(heap) {
>>>>> - if ((*heap)->contains(cb)) {
>>>>> + if ((*heap)->contains(cb->code_begin())) {
>>>>> return *heap;
>>>>> }
>>>>> }
>>>>>
>>>>> The blob 'cb' can be returned to the wrong heap if, for example:
>>>>> - 'cb' is at the end code heap A and
>>>>> - the size of the code contained in 'cb' is 0 (therefore
>>>>> code_begin() returns the address after 'cb', i.e., the first address
>>>>> in code heap B).
>>>>>
>>>>> The fix proposes to restore CodeCache::get_code_heap() to its
>>>>> pre-AOT state (as I'm not aware of the reasons why AOT changed that
>>>>> check). I also propose to add some guarantees after
>>>>> allocation/deallocation in the code heap to possibly easier catch
>>>>> this or related problems in the future.
>>>>>
>>>>> The regression test I propose achieves the above condition and
>>>>> results in a crash. The regression test works only with product
>>>>> builds, because in a product build a BufferBlob fits into one
>>>>> segment whereas in a fastdebug build it does not.
>>>>>
>>>>> The test needs to set the CodeCacheMinBlockLength flag to 1. The
>>>>> flag is currently develop and we would need to make it product for
>>>>> the test to work. (Other flags controlling the code cache, e.g.,
>>>>> CodeCacheExpansionSize, are also product.) I could also experiment
>>>>> with reproducing the problem with different block lengths/segment
>>>>> sizes, but that would most likely make the test more fragile (and
>>>>> CodeCacheSegmentSize is anyway develop as well).
>>>>>
>>>>> I tested the fix with JPRT, RBT is in progress.
>>>>>
>>>>> Thank you!
>>>>>
>>>>> Best regards,
>>>>>
>>>>>
>>>>> Zoltan
>>>>>
>>>>> [1]
>>>>> http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/777aaa19c4b1#l116.71
>>>>>
>>