[9] RFR (S): 8173151: Code heap corruption due to incorrect inclusion test
Zoltán Majó
zoltan.majo at oracle.com
Tue Feb 7 13:17:19 UTC 2017
Hi Dean,
Hi Tobias,
thank you for the feedback!
On 02/07/2017 07:01 AM, Tobias Hartmann wrote:
> Hi Dean,
>
> On 07.02.2017 00:14, dean.long at oracle.com wrote:
>> When do we allocate a CodeBlob with a code size of 0? Is it really useful? Would having a minimum code size of 1 fix the problem?
> Yes, I've discussed that with Zoltan offline and I agree that we should enforce a minimum code size of 1 and fix the stress test(s) accordingly. It shouldn't be necessary to be able to create regular CodeBlobs with code size 0.
It is indeed a valid point that code sizes of 0 are not useful in practice.
However, the problem is a bit more complicated than the regression test
(and my description) suggests it to be. The regression test works only
if CodeCacheMinBlockLength set to 1. Only that way is it possible to
allocate BufferBlobs with size == 1 segment and set
CodeBlob::_code_begin to point to the next segment (which is possibly in
the next code heap).
For both crashes I've seen, however, CodeCacheMinBlockLength was equal
to 4. With that CodeCacheMinBlockLength value, every allocation in the
code heap is at least 4 segments long. It is therefore impossible to
create a BufferBlob whose _code_begin points "outside" of the space
reserved for that BufferBlob. (Because BufferBlob::_code_begin will
either point to the beginning of second segment of the BufferBlob --
with product -- or somewhere into the second segment -- with fastdebug.)
Other blob types can have their _code_begin point to the boundary of 4
segments. E.g., nmethods are between 2 and 3 segments long. If the
relocation information in the nmethod fills up the space to the 4
segment boundary, _code_begin will be exactly at the segment boundary.
But I don't know the exact sequence of steps lead to the code heap
corruption (because the crash reproduced only twice so far). I just
know that the first free block in the (corrupted) code heap is of length
8 segments (the contents of the first freelist item and of the segmap
both confirm this); the heap is corrupted because the freelist starts
before the code heap.
What sequence of steps lead to this? Some nmethod
allocations/deallocations interleaved with BufferBlob
allocations/deallocations (possibly of size 0)? I'm not sure. In what
way were blocks in the heap merged/split by the
allocations/deallocations leading to the crash? I don't know.
Did allocation/deallocation of BufferBlobs of size 0 play a role at
all? I don't know either and I won't be able to tell unless the problem
reproduces more often. However, by using code size == 0 it is possible
to write a regression test that triggers a real problem (which is
hopefully related to the crashes we've been seeing).
So I'd like to allow code sizes of 0 for now and attempt to fix
CodeCache::get_code_heap() and the CodeBlobIterator constructor the way
Vladimir suggested. If that turns out to be difficult/risky or require
large code changes, I'll propose a fix instead that disallows code sizes
of 1 and fixes the test(s) -- just as you suggested.
Best regards,
Zoltan
> Best regards,
> Tobias
>
>> dl
>>
>>
>> On 2/6/17 7:29 AM, Zoltán Majó wrote:
>>> Hi,
>>>
>>>
>>> please review the fix for 8173151.
>>>
>>> https://bugs.openjdk.java.net/browse/JDK-8173151
>>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.00/
>>>
>>> The crash reported in the bug is caused by the corruption of the 'non-profiled nmethods' code heap. CodeHeap::_freelist for that code heap points to an address one segment before the heap's address space. The sweeper starts iterating through the code heap from the beginning of the heap's address space. Thus, the sweeper assumes that the first item in the code heap is a HeapBlock/FreeBlock (with the appropriate length and usage information). However, that is not the case, as the first item in the heap is actually *before* the heap. So the sweeper crashes.
>>>
>>> This is a hard-to-reproduce problem (I managed to reproduce it only once in 350 iterations, each iteration taking ~25 minutes). So the fix I propose is based on core file debugging and source code investigation. But I managed to write a regression test that triggers a problem similar to the original problem.
>>>
>>> I think that problem happens because a CodeBlob allocated in one code heap (A) is returned to a different code heap (B). When the CodeBlob is returned B, it is added to B's freelist. However, as the CodeBlob was allocated in A, the freelist of B now points into A (i.e., B is corrupted).
>>>
>>> The code cache uses CodeCache::get_code_heap(const CodeBlob* cb) to determine to which code heap a 'cb' is supposed to be returned to. Since 8171008 (AOT) [1], the check is:
>>>
>>> CodeHeap* CodeCache::get_code_heap(const CodeBlob* cb) {
>>> assert(cb != NULL, "CodeBlob is null");
>>> FOR_ALL_HEAPS(heap) {
>>> - if ((*heap)->contains(cb)) {
>>> + if ((*heap)->contains(cb->code_begin())) {
>>> return *heap;
>>> }
>>> }
>>>
>>> The blob 'cb' can be returned to the wrong heap if, for example:
>>> - 'cb' is at the end code heap A and
>>> - the size of the code contained in 'cb' is 0 (therefore code_begin() returns the address after 'cb', i.e., the first address in code heap B).
>>>
>>> The fix proposes to restore CodeCache::get_code_heap() to its pre-AOT state (as I'm not aware of the reasons why AOT changed that check). I also propose to add some guarantees after allocation/deallocation in the code heap to possibly easier catch this or related problems in the future.
>>>
>>> The regression test I propose achieves the above condition and results in a crash. The regression test works only with product builds, because in a product build a BufferBlob fits into one segment whereas in a fastdebug build it does not.
>>>
>>> The test needs to set the CodeCacheMinBlockLength flag to 1. The flag is currently develop and we would need to make it product for the test to work. (Other flags controlling the code cache, e.g., CodeCacheExpansionSize, are also product.) I could also experiment with reproducing the problem with different block lengths/segment sizes, but that would most likely make the test more fragile (and CodeCacheSegmentSize is anyway develop as well).
>>>
>>> I tested the fix with JPRT, RBT is in progress.
>>>
>>> Thank you!
>>>
>>> Best regards,
>>>
>>>
>>> Zoltan
>>>
>>> [1] http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/777aaa19c4b1#l116.71
>>>
More information about the hotspot-compiler-dev
mailing list