[9] RFR (S): 8173151: Code heap corruption due to incorrect inclusion test
Zoltán Majó
zoltan.majo at oracle.com
Mon Feb 6 15:29:15 UTC 2017
Hi,
please review the fix for 8173151.
https://bugs.openjdk.java.net/browse/JDK-8173151
http://cr.openjdk.java.net/~zmajo/8173151/webrev.00/
The crash reported in the bug is caused by the corruption of the
'non-profiled nmethods' code heap. CodeHeap::_freelist for that code
heap points to an address one segment before the heap's address space.
The sweeper starts iterating through the code heap from the beginning of
the heap's address space. Thus, the sweeper assumes that the first item
in the code heap is a HeapBlock/FreeBlock (with the appropriate length
and usage information). However, that is not the case, as the first item
in the heap is actually *before* the heap. So the sweeper crashes.
This is a hard-to-reproduce problem (I managed to reproduce it only once
in 350 iterations, each iteration taking ~25 minutes). So the fix I
propose is based on core file debugging and source code investigation.
But I managed to write a regression test that triggers a problem similar
to the original problem.
I think the problem happens because a CodeBlob allocated in one code
heap (A) is returned to a different code heap (B). When the CodeBlob is
returned to B, it is added to B's freelist. However, as the CodeBlob was
allocated in A, the freelist of B now points into A (i.e., B is corrupted).
The code cache uses CodeCache::get_code_heap(const CodeBlob* cb) to
determine the code heap to which 'cb' is supposed to be returned. Since
8171008 (AOT) [1], the check is:
 CodeHeap* CodeCache::get_code_heap(const CodeBlob* cb) {
   assert(cb != NULL, "CodeBlob is null");
   FOR_ALL_HEAPS(heap) {
-    if ((*heap)->contains(cb)) {
+    if ((*heap)->contains(cb->code_begin())) {
       return *heap;
     }
   }
The blob 'cb' can be returned to the wrong heap if, for example:
- 'cb' is at the end of code heap A and
- the size of the code contained in 'cb' is 0 (therefore code_begin()
returns the address after 'cb', i.e., the first address in code heap B).
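
The boundary case above can be illustrated with a small standalone
sketch. Blob, Heap and the two lookup helpers are hypothetical
stand-ins, not the HotSpot classes; they model only the address
arithmetic and assume that heap B starts directly after heap A:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a CodeBlob: a header followed by code.
struct Blob {
  std::uintptr_t start;        // address of the blob header
  std::size_t    header_size;  // bytes occupied by the header
  std::size_t    code_size;    // bytes of code after the header (may be 0)
  std::uintptr_t code_begin() const { return start + header_size; }
};

// Hypothetical stand-in for a CodeHeap covering [low, high).
struct Heap {
  std::uintptr_t low;
  std::uintptr_t high;
  bool contains(std::uintptr_t p) const { return low <= p && p < high; }
};

// Lookup using the blob's own address (the pre-AOT style of check).
inline int find_heap_by_blob(const Heap* heaps, int n, const Blob& b) {
  assert(heaps != nullptr);
  for (int i = 0; i < n; i++) {
    if (heaps[i].contains(b.start)) return i;
  }
  return -1;
}

// Lookup using code_begin() (the style of check introduced with AOT).
inline int find_heap_by_code_begin(const Heap* heaps, int n, const Blob& b) {
  assert(heaps != nullptr);
  for (int i = 0; i < n; i++) {
    if (heaps[i].contains(b.code_begin())) return i;
  }
  return -1;
}
```

With heap A covering [0x1000, 0x2000), heap B covering
[0x2000, 0x3000), and a blob whose header occupies the last 0x80 bytes
of A with a code size of 0, the first helper selects A and the second
selects B.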
The fix proposes to restore CodeCache::get_code_heap() to its pre-AOT
state (as I'm not aware of the reasons why AOT changed that check). I
also propose to add some guarantees after allocation/deallocation in the
code heap to make it easier to catch this or related problems in the
future.
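
As a rough illustration of what such a check could look like (this is
not the code in the webrev), the sketch below uses a hypothetical
ToyHeap whose deallocate() refuses a block that lies outside the heap's
address range. In HotSpot a failed check would presumably go through
guarantee() rather than return false:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical toy heap covering [low, high) with a simple free list.
struct ToyHeap {
  std::uintptr_t low;
  std::uintptr_t high;
  std::vector<std::uintptr_t> freelist;  // addresses of free blocks

  bool contains(std::uintptr_t p) const { return low <= p && p < high; }

  // Adds 'block' to the free list only if it belongs to this heap.
  // Returns false and leaves the free list untouched otherwise.
  bool deallocate(std::uintptr_t block) {
    if (!contains(block)) return false;
    freelist.push_back(block);
    return true;
  }

  // True if every free-list entry lies inside this heap.
  bool freelist_is_consistent() const {
    for (std::uintptr_t p : freelist) {
      if (!contains(p)) return false;
    }
    return true;
  }
};
```

Returning a block that was allocated in a neighboring heap is rejected
here, so the free list never points outside the heap.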
The regression test I propose achieves the above condition and results
in a crash. The regression test works only with product builds, because
in a product build a BufferBlob fits into one segment whereas in a
fastdebug build it does not.
The test needs to set the CodeCacheMinBlockLength flag to 1. The flag is
currently develop and we would need to make it product for the test to
work. (Other flags controlling the code cache, e.g.,
CodeCacheExpansionSize, are also product.) I could also experiment with
reproducing the problem with different block lengths/segment sizes, but
that would most likely make the test more fragile (and
CodeCacheSegmentSize is anyway develop as well).
I tested the fix with JPRT; RBT testing is in progress.
Thank you!
Best regards,
Zoltan
[1] http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/777aaa19c4b1#l116.71