[9] RFR (S): 8173151: Code heap corruption due to incorrect inclusion test
Zoltán Majó
zoltan.majo at oracle.com
Fri Feb 10 14:33:01 UTC 2017
Hi Rickard,
On 02/10/2017 11:34 AM, Rickard Bäckman wrote:
> Sorry for being a bit late here...
thank you for the clarification. Unfortunately, I was not able to
mention you as a reviewer in the changeset, as pushing it has already
succeeded by the time I've received your message.
>
> The reason for the change in the contains() check from cb to
> cb->code_begin() is that with the changes for AOT we no longer require
> the metadata (AOTCompiledMethod *) and code to be in one blob.
>
> AOT does things like:
>
> code-heap-start
> code-method-1
> code-method-2
> code-method-3
> code-method-4
> code-heap-end
>
> and when we link those methods we create AOTCompiledMethods which just
> have pointers to code-method-1.
>
> The AOTCodeHeap then knows the boundaries of the code heap.
> Checking if the heap contains the metadata (AOTCompiledMethod *) will
> fail. While checking if it contains the code will succeed.
That was my understanding as well, I think: The changeset keeps the
inclusion test based on code_begin() for AOT-compiled methods
http://hg.openjdk.java.net/jdk9/hs/hotspot/rev/ed9e29cd84f5#l1.8
and changes the inclusion test only for other code blob types.
http://hg.openjdk.java.net/jdk9/hs/hotspot/rev/ed9e29cd84f5#l5.9
Thank you!
Best regards,
Zoltan
>
> /R
>
> On 02/06, Zoltán Majó wrote:
>> Hi,
>>
>>
>> please review the fix for 8173151.
>>
>> https://bugs.openjdk.java.net/browse/JDK-8173151
>> http://cr.openjdk.java.net/~zmajo/8173151/webrev.00/
>>
>> The crash reported in the bug is caused by the corruption of the
>> 'non-profiled nmethods' code heap. CodeHeap::_freelist for that code
>> heap points to an address one segment before the heap's address
>> space. The sweeper starts iterating through the code heap from the
>> beginning of the heap's address space. Thus, the sweeper assumes
>> that the first item in the code heap is a HeapBlock/FreeBlock (with
>> the appropriate length and usage information). However, that is not
>> the case, as the first item in the heap is actually *before* the
>> heap. So the sweeper crashes.
>>
>> This is a hard-to-reproduce problem (I managed to reproduce it only
>> once in 350 iterations, each iteration taking ~25 minutes). So the
>> fix I propose is based on core file debugging and source code
>> investigation. But I managed to write a regression test that
>> triggers a problem similar to the original problem.
>>
>> I think that problem happens because a CodeBlob allocated in one
>> code heap (A) is returned to a different code heap (B). When the
>> CodeBlob is returned B, it is added to B's freelist. However, as the
>> CodeBlob was allocated in A, the freelist of B now points into A
>> (i.e., B is corrupted).
>>
>> The code cache uses CodeCache::get_code_heap(const CodeBlob* cb) to
>> determine to which code heap a 'cb' is supposed to be returned to.
>> Since 8171008 (AOT) [1], the check is:
>>
>> CodeHeap* CodeCache::get_code_heap(const CodeBlob* cb) {
>> assert(cb != NULL, "CodeBlob is null");
>> FOR_ALL_HEAPS(heap) {
>> - if ((*heap)->contains(cb)) {
>> + if ((*heap)->contains(cb->code_begin())) {
>> return *heap;
>> }
>> }
>>
>> The blob 'cb' can be returned to the wrong heap if, for example:
>> - 'cb' is at the end code heap A and
>> - the size of the code contained in 'cb' is 0 (therefore
>> code_begin() returns the address after 'cb', i.e., the first address
>> in code heap B).
>>
>> The fix proposes to restore CodeCache::get_code_heap() to its
>> pre-AOT state (as I'm not aware of the reasons why AOT changed that
>> check). I also propose to add some guarantees after
>> allocation/deallocation in the code heap to possibly easier catch
>> this or related problems in the future.
>>
>> The regression test I propose achieves the above condition and
>> results in a crash. The regression test works only with product
>> builds, because in a product build a BufferBlob fits into one
>> segment whereas in a fastdebug build it does not.
>>
>> The test needs to set the CodeCacheMinBlockLength flag to 1. The
>> flag is currently develop and we would need to make it product for
>> the test to work. (Other flags controlling the code cache, e.g.,
>> CodeCacheExpansionSize, are also product.) I could also experiment
>> with reproducing the problem with different block lengths/segment
>> sizes, but that would most likely make the test more fragile (and
>> CodeCacheSegmentSize is anyway develop as well).
>>
>> I tested the fix with JPRT, RBT is in progress.
>>
>> Thank you!
>>
>> Best regards,
>>
>>
>> Zoltan
>>
>> [1] http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/777aaa19c4b1#l116.71
>>
More information about the hotspot-compiler-dev
mailing list