RFR(M): 7188263: G1: Excessive c_heap (malloc) consumption
John Cuthbertson
john.cuthbertson at oracle.com
Wed Sep 26 00:10:07 UTC 2012
Hi Bengt,
Thanks for the review. Replies inline...
On 09/21/12 07:17, Bengt Rutisson wrote:
>
> Hi John,
>
> Thanks for doing the thorough analysis and providing the numbers.
> Interesting that G1 does 21703 mallocs at startup whereas Parallel
> only does 3500...
>
> However, I think I need some more background on this change. We are
> choosing to allocate into a single VirtualSpace instead of using
> separate mallocs. I understand that this will reduce the number of
> mallocs we do, but is that really the problem? The CR says that we run
> out of memory due to the fact that the GC threads allocate too much
> memory. We will still allocate the same amount of memory just mmapped
> instead of malloced, right? Do we have any test cases that fail before
> your fix but pass now?
We didn't run out of memory - we ran out of C heap.
There is an issue on Solaris (I have to check whether Ramki's email
describes the underlying problem) whereby we asked for a 26GB heap
(which we got) but the C heap was only given 2GB. Vladimir saw the issue
during COOPs testing. In this situation we ran out of C heap trying to
allocate the work queues for the marking threads.
Azeem submitted an issue a few weeks ago as well that ran into the same
problem (this time with ParallelGC/ParallelOld) and the process died
while trying to allocate the PSPromotionManager instances (and their
internal work queue) - see CR 7167969. I believe that 7197666 is similar.
A real reduction in memory use will come from reducing the task queue
size. I want to do that in a separate CR (for some or all of the
collectors) provided there is no real performance difference.
>
> In fact, isn't the risk higher (at least on 32 bit platforms) that we
> fail to allocate the bitmaps now that we try to allocate them from one
> consecutive chunk of memory instead of several smaller ones? I'm
> thinking that if the memory is fragmented we might not get the
> contiguous memory we need for the VirtualSpace. But I am definitely no
> expert in this.
Good question, and I wish I knew the answer. The total footprint of the
per-task card bitmaps depends on the number of GC threads and the heap
size (1 bit per card) - so I guess such a failure might be possible.
Should I then only be allocating out of the backing store in a 64-bit JVM?
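
For a sense of scale, here is a rough back-of-the-envelope sketch
(standalone code, not HotSpot source) of the per-worker card bitmap
footprint, assuming the usual 512-byte card size, the 26GB heap from the
Solaris report and 10 worker threads:

#include <stdio.h>

int main() {
  const unsigned long long heap_bytes    = 26ULL * 1024 * 1024 * 1024; // 26GB heap
  const unsigned long long card_size     = 512;  // CardTableModRefBS::card_size
  const unsigned long long bits_per_byte = 8;
  const unsigned long long num_workers   = 10;   // e.g. -XX:ParallelGCThreads=10

  unsigned long long bits_per_bitmap  = heap_bytes / card_size;          // one bit per card
  unsigned long long bytes_per_bitmap = bits_per_bitmap / bits_per_byte; // ~6.5MB per worker
  unsigned long long total_bytes      = bytes_per_bitmap * num_workers;  // ~65MB for 10 workers

  printf("per-worker card bitmap: %lluKB, total for all workers: %lluKB\n",
         bytes_per_bitmap / 1024, total_bytes / 1024);
  return 0;
}

Against a C heap capped at 2GB, allocations on that scale add up quickly.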
>
> With the above reservation I think your change looks good. Here are a
> couple of minor comments:
>
> CMMarkStack::allocate(size_t size)
> "size" is kind of an overloaded name for an "allocate" method. Could
> we call it "capacity" or "number_of_entries"?
Sure. Done. I renamed 'size' to 'capacity' - which matches the field
name. But you bring up a good point about this confusion. If we look at
the CMS version of the same routine:
bool CMSMarkStack::allocate(size_t size) {
  // allocate a stack of the requisite depth
  ReservedSpace rs(ReservedSpace::allocation_align_size_up(
                     size * sizeof(oop)));
  if (!rs.is_reserved()) {
    warning("CMSMarkStack allocation failure");
    return false;
  }
  if (!_virtual_space.initialize(rs, rs.size())) {
    warning("CMSMarkStack backing store failure");
    return false;
  }
  assert(_virtual_space.committed_size() == rs.size(),
         "didn't reserve backing store for all of CMS stack?");
  _base = (oop*)(_virtual_space.low());
  _index = 0;
  _capacity = size;
  NOT_PRODUCT(_max_depth = 0);
  return true;
}
it used the parameter as the real size in bytes of the marking stack. In
G1, since we used NEW_C_HEAP_ARRAY, the size was multiplied by the size
of an oop and we got a larger marking stack than requested. Instead of a
16MB stack (128 * 128K), we got a 128MB stack (8 * 128 * 128K).
Basically we asked for the equivalent of the task queues for another 128
threads:
dr-evil{jcuthber}:265> ./test.sh -d64 -XX:-ZapUnusedHeapArea \
    -XX:CICompilerCount=1 -XX:ParallelGCThreads=10 -Xms20g -Xmx20g \
    -XX:+UseG1GC -XX:+PrintMallocStatistics
Using java runtime at:
/net/jre.us.oracle.com/p/v42/jdk/7/fcs/b147/binaries/solaris-sparcv9/jre
size: 16777216, MarkStackSize: 16777216, TASKQUEUE_SIZE: 131072
os::malloc 134217728 bytes --> 0x00000001010709e8
java version "1.7.0"
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b03-internal-fastdebug,
mixed mode)
allocation stats: 24489 mallocs (212MB), 1147 frees (0MB), 4MB resrc
Mimicking CMS is probably the way to go here.
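For concreteness, a rough sketch (not the actual webrev; member names are
taken from the CMS snippet above and may differ slightly from G1's
CMMarkStack) of what mimicking CMS could look like:

bool CMMarkStack::allocate(size_t capacity) {
  // Back the mark stack with a reserved/committed VirtualSpace, as CMS does,
  // rather than NEW_C_HEAP_ARRAY'ing capacity * sizeof(oop) bytes of C heap.
  ReservedSpace rs(ReservedSpace::allocation_align_size_up(
                     capacity * sizeof(oop)));
  if (!rs.is_reserved()) {
    warning("ConcurrentMark MarkStack allocation failure");
    return false;
  }
  if (!_virtual_space.initialize(rs, rs.size())) {
    warning("ConcurrentMark MarkStack backing store failure");
    return false;
  }
  _base = (oop*)(_virtual_space.low());
  _index = 0;
  _capacity = capacity;
  return true;
}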
>
>
> In ConcurrentMark::ConcurrentMark() we use both shifting and division
> to accomplish the same thing on the lines after each other. Could we
> maybe use the same style for both cases?
>
> // Card liveness bitmap size (in bits)
> BitMap::idx_t card_bm_size = (heap_rs.size() +
> CardTableModRefBS::card_size - 1)
> >> CardTableModRefBS::card_shift;
> // Card liveness bitmap size (in bytes)
> size_t card_bm_size_bytes = (card_bm_size + (BitsPerByte - 1)) /
> BitsPerByte;
Sure. Changed the latter one to use LogBitsPerByte.
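I.e., something along the lines of (illustrative, not necessarily the
exact webrev code):

// Card liveness bitmap size (in bits)
BitMap::idx_t card_bm_size = (heap_rs.size() + CardTableModRefBS::card_size - 1)
                               >> CardTableModRefBS::card_shift;
// Card liveness bitmap size (in bytes)
size_t card_bm_size_bytes = (card_bm_size + (BitsPerByte - 1)) >> LogBitsPerByte;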
> Also in ConcurrentMark::ConcurrentMark() I think that
> "marked_bytes_size_bytes" is not such a great name. Probably we could
> skip the first "bytes" in "marked_bytes_size" and just call
> "marked_bytes_size_bytes" for "marked_size_bytes".
OK. What about just marked_bytes_size?
>
> I think it would be nice to factor some of the new stuff in
> ConcurrentMark::ConcurrentMark() out into methods. Both the
> calculations of the sizes and the creation/setting of the bitmaps. But
> I admit that this is just a style issue.
OK. I'll try a few things to pretty it up.
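As a rough illustration (the helper name is hypothetical, not from the
webrev), the card bitmap sizing could move into something like:

// Hypothetical helper: card liveness bitmap size (in bits) for a given heap size.
static BitMap::idx_t card_live_bitmap_size_in_bits(size_t heap_size_bytes) {
  return (heap_size_bytes + CardTableModRefBS::card_size - 1)
           >> CardTableModRefBS::card_shift;
}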
Expect a new webrev soon.
JohnC