Crash on super large heap size using CMS and its fix
Mo Jianhao (Hal)
mo_jianhao at hotmail.com
Wed Sep 12 01:00:15 UTC 2012
Hi all,
This is Hal Mo <kungu.mjh at taobao.com> from Alibaba Group (with OCA).
Our Hadoop NameNode crashed when we set the heap size to 135G using the CMS GC.
Please find the crash log (hs_err_pid.log) attached.
I can reliably reproduce the crash on a test machine with 190G of physical memory with a simple command:
$ java -Xmx135g -XX:+UseConcMarkSweepGC
Then I built a debug JVM and used gdb to track down the problem.
Call stack:
C [libc.so.6+0x7a9b0] memset+0x40
V [libjvm.so+0x2b6c42] BlockOffsetArray::set_remainder_to_point_to_start_incl(unsigned long, unsigned long, bool)+0xce
V [libjvm.so+0x2b7043] BlockOffsetArray::set_remainder_to_point_to_start(HeapWord*, HeapWord*, bool)+0x71
V [libjvm.so+0x2b728d] BlockOffsetArray::BlockOffsetArray(BlockOffsetSharedArray*, MemRegion, bool)+0x9f
V [libjvm.so+0x3c089f] BlockOffsetArrayNonContigSpace::BlockOffsetArrayNonContigSpace(BlockOffsetSharedArray*, MemRegion)+0x37
V [libjvm.so+0x3be56f] CompactibleFreeListSpace::CompactibleFreeListSpace(BlockOffsetSharedArray*, MemRegion, bool, FreeBlockDictionary::DictionaryChoice)+0x9b
V [libjvm.so+0x3fd2e1] ConcurrentMarkSweepGeneration::ConcurrentMarkSweepGeneration(ReservedSpace, unsigned long, int, CardTableRS*, bool, FreeBlockDictionary::DictionaryChoice)+0x1df
V [libjvm.so+0x4dc03e] GenerationSpec::init(ReservedSpace, int, GenRemSet*)+0x37c
V [libjvm.so+0x4ced40] GenCollectedHeap::initialize()+0x510
V [libjvm.so+0x7c23c3] Universe::initialize_heap()+0x31d
V [libjvm.so+0x7c27ec] universe_init()+0xa6
V [libjvm.so+0x5056e2] init_globals()+0x34
V [libjvm.so+0x7ac926] Threads::create_vm(JavaVMInitArgs*, bool*)+0x23a
V [libjvm.so+0x53f3d4] JNI_CreateJavaVM+0x7a
In the function BlockOffsetArray::set_remainder_to_point_to_start_incl, inside the for loop:
size_t reach = start_card - 1 + (power_to_cards_back(i+1) - 1);
when i = 7 the value of reach was 0, so the loop could never break, and
_array->set_offset_array(start_card_for_region, reach, offset, reducing);
accessed the wrong address and crashed.
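To make the failure mode concrete, here is a small standalone simulation of that loop (my own sketch, not HotSpot code: LogBase = 4, N_powers = 14 and the reach formula are taken from blockOffsetTable.hpp/.cpp, everything else is simplified, and the buggy variant models the x86 behavior of the 32-bit shift explicitly so the demo stays well-defined). With the overflowing power_to_cards_back the reach >= end_card exit is never taken, so the loop runs through all N_powers strides while reach jumps backwards:

#include <cstddef>
#include <cstdio>

static const unsigned LogBase  = 4;   // from blockOffsetTable.hpp
static const unsigned N_powers = 14;  // from blockOffsetTable.hpp

// Buggy form: the shift is done on a 32-bit int.  Shifting a 32-bit int by
// 32 bits is undefined behavior; on x86 the shift count is masked to 5 bits,
// which is modeled explicitly here to keep the demo well-defined.
static size_t cards_back_buggy(unsigned i) {
  return (size_t)(1u << ((LogBase * i) & 31));
}

// Fixed form: widen the literal to 64-bit size_t before shifting.
static size_t cards_back_fixed(unsigned i) {
  return (size_t)1 << (LogBase * i);
}

static void simulate(const char* name, size_t (*cards_back)(unsigned)) {
  // A card count just above 2^28, i.e. an old generation a bit larger than
  // 128G at 512 bytes per card (illustrative numbers only).
  size_t start_card = 1;
  size_t end_card   = ((size_t)1 << 28) + 1000;
  size_t start_card_for_region = start_card;
  for (unsigned i = 0; i < N_powers; i++) {
    size_t reach = start_card - 1 + (cards_back(i + 1) - 1);
    if (reach >= end_card) {
      std::printf("%s: loop breaks at i=%u, reach=%zu\n", name, i, reach);
      return;
    }
    // In the real code, set_offset_array(start_card_for_region, reach, ...)
    // is called here; once reach wraps below start_card_for_region it marks
    // a bogus card range and the memset in the stack trace crashes.
    start_card_for_region = reach + 1;
  }
  std::printf("%s: loop never breaks, last start_card_for_region=%zu\n",
              name, start_card_for_region);
}

int main() {
  simulate("buggy", cards_back_buggy);
  simulate("fixed", cards_back_fixed);
  return 0;
}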
The root cause was:
static size_t power_to_cards_back(uint i) {
  return (size_t)(1 << (LogBase * i));
}
The literal 1 is a 32-bit int, and with LogBase = 4 the shift amount reaches 32 at i = 8, so 1 << 32 overflows before the result is cast to size_t.
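To see the overflow in isolation, here is my own two-line check (assuming a 64-bit build; the masked variant again models what the 32-bit shift does on x86, since shifting a 32-bit int by 32 bits is undefined behavior):

#include <cstddef>
#include <cstdio>

int main() {
  unsigned shift = 4 * 8;  // LogBase * i with LogBase = 4 and i = 8
  std::printf("32-bit shift: %zu\n", (size_t)(1u << (shift & 31)));  // prints 1
  std::printf("64-bit shift: %zu\n", (size_t)1 << shift);            // prints 4294967296
  return 0;
}

With the fix below, the cast binds to the literal before the shift, so the whole computation is done in 64 bits and power_to_cards_back(8) is 2^32 as intended.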
Here is my fix (it has been tested); it is also in the attached file cms_large_heap_crash.patch:
+++ b/src/share/vm/memory/blockOffsetTable.hpp
@@ -289,7 +289,7 @@
};
static size_t power_to_cards_back(uint i) {
- return (size_t)(1 << (LogBase * i));
+ return (size_t)1 << (LogBase * i);
}
static size_t power_to_words_back(uint i) {
return power_to_cards_back(i) * N_words;
Contributed-by: Hal Mo <kungu.mjh at taobao.com>
A similar pattern exists in G1, but there the sizes are megabyte-based (2^20), so the same overflow would require a capacity of 2^(32+20) bytes (about 4 PB), which is far too large to hit in practice.
Krystal reminded me that this changeset covers the same code: http://hg.openjdk.java.net/hsx/hotspot-gc/hotspot/rev/c3a720eefe82
I did not build this with Visual Studio; could someone please help review its compatibility with VS?
Regards,
Hal
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hs_err_pid.log
Type: application/octet-stream
Size: 27650 bytes
Desc: not available
URL: <https://mail.openjdk.org/pipermail/hotspot-gc-dev/attachments/20120912/865fe7fa/hs_err_pid.log>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cms_large_heap_crash.patch
Type: application/octet-stream
Size: 446 bytes
Desc: not available
URL: <https://mail.openjdk.org/pipermail/hotspot-gc-dev/attachments/20120912/865fe7fa/cms_large_heap_crash.patch>