RFR(M): 7188263: G1: Excessive c_heap (malloc) consumption
Bengt Rutisson
bengt.rutisson at oracle.com
Fri Sep 21 14:17:16 UTC 2012
Hi John,
Thanks for doing the thorough analysis and providing the numbers.
Interesting that G1 does 21703 mallocs at startup whereas Parallel only
does 3500...
However, I think I need some more background on this change. We are
choosing to allocate into a single VirtualSpace instead of using
separate mallocs. I understand that this will reduce the number of
mallocs we do, but is that really the problem? The CR says that we run
out of memory because the GC threads allocate too much memory. We will
still allocate the same amount of memory, just mmapped instead of
malloced, right? Do we have any test cases that fail before your fix
but pass now?
In fact, isn't the risk higher (at least on 32 bit platforms) that we
fail to allocate the bitmaps now that we try to allocate them from one
consecutive chunk of memory instead of several smaller ones? I'm
thinking that if the memory is fragmented we might not get the
contiguous memory we need for the VirtualSpace. But I am definitely no
expert in this.
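To make the fragmentation concern concrete, here is a minimal sketch of the single-reservation approach. The names and layout are mine, invented for illustration (the real code uses HotSpot's ReservedSpace/VirtualSpace, not malloc); the point is that all the bitmaps are carved out of one block, so the whole scheme depends on one large contiguous range being available up front:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Illustrative stand-in for a single contiguous reservation; invented
// for this sketch, not HotSpot's actual VirtualSpace API.
struct SingleReservation {
  char*  base;
  size_t capacity;
  size_t used;

  explicit SingleReservation(size_t cap)
    : base(static_cast<char*>(std::malloc(cap))),  // must succeed as ONE chunk
      capacity(cap), used(0) {}
  ~SingleReservation() { std::free(base); }

  // Carve a sub-range for one bitmap. Individual carves never touch the
  // OS allocator; only the initial reservation can fail.
  void* carve(size_t bytes) {
    if (base == nullptr || used + bytes > capacity) return nullptr;
    void* p = base + used;
    used += bytes;
    return p;
  }
};
```

With separate mallocs, each bitmap only needs its own smaller free range; with the single reservation, the address space must contain one range large enough for all of them together, which is the 32-bit fragmentation worry above.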
With the above reservation I think your change looks good. Here are a
couple of minor comments:
CMMarkStack::allocate(size_t size)
"size" is kind of an overloaded name for an "allocate" method. Could we
call it "capacity" or "number_of_entries"?
In ConcurrentMark::ConcurrentMark() we use both shifting and division
to accomplish the same thing on consecutive lines. Could we maybe use
the same style for both cases?
  // Card liveness bitmap size (in bits)
  BitMap::idx_t card_bm_size = (heap_rs.size() + CardTableModRefBS::card_size - 1)
                               >> CardTableModRefBS::card_shift;
  // Card liveness bitmap size (in bytes)
  size_t card_bm_size_bytes = (card_bm_size + (BitsPerByte - 1)) / BitsPerByte;
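As an illustration of using one style for both lines, both rounding computations can be written as shifts. The constants below are stand-ins I picked for the sketch, not the real CardTableModRefBS values:

```cpp
#include <cassert>
#include <cstddef>

// Assumed stand-in values for the sketch (not HotSpot's constants).
static const size_t card_size   = 512;  // bytes per card
static const int    card_shift  = 9;    // log2(card_size)
static const size_t BitsPerByte = 8;

// Card liveness bitmap size in bits: round the heap size up to a whole
// number of cards, expressed as a shift.
size_t card_bm_size_bits(size_t heap_bytes) {
  return (heap_bytes + card_size - 1) >> card_shift;
}

// Card liveness bitmap size in bytes: same rounding idiom, also as a
// shift (log2(BitsPerByte) == 3), matching the style of the line above.
size_t card_bm_size_bytes(size_t heap_bytes) {
  return (card_bm_size_bits(heap_bytes) + BitsPerByte - 1) >> 3;
}
```

Either style works; the point is only that picking one makes the two adjacent lines read uniformly.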
Also in ConcurrentMark::ConcurrentMark(), I think
"marked_bytes_size_bytes" is not such a great name. We could probably
drop the first "bytes" from "marked_bytes_size" and call it simply
"marked_size_bytes".
I think it would be nice to factor some of the new stuff in
ConcurrentMark::ConcurrentMark() out into methods. Both the calculations
of the sizes and the creation/setting of the bitmaps. But I admit that
this is just a style issue.
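For example (purely a sketch with invented helper names, not a patch proposal), the constructor logic could be split along those lines:

```cpp
#include <cassert>
#include <cstddef>

static const size_t BitsPerByte = 8;

// One helper owns the size calculation (invented name)...
size_t liveness_bitmap_size_bytes(size_t bm_size_bits) {
  return (bm_size_bits + BitsPerByte - 1) / BitsPerByte;
}

// ...and a second step owns creating/setting the bitmaps, so the
// constructor body becomes a short sequence of named calls.
struct MarkingStructures {
  size_t card_bm_bytes;

  void initialize(size_t card_bm_bits) {
    card_bm_bytes = liveness_bitmap_size_bytes(card_bm_bits);
    // allocation/commit of the backing store would follow here
  }
};
```

The benefit is mostly readability: each helper name documents which quantity is being computed, instead of a long run of inline arithmetic in the constructor.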
Thanks,
Bengt
On 2012-09-20 21:15, John Cuthbertson wrote:
> Hi Everyone,
>
> Can I have a couple of volunteers review the changes for this CR - the
> webrev can be found at:
> http://cr.openjdk.java.net/~johnc/7188263/webrev.0 ?
>
> Summary:
> Compared to the other collectors, G1 consumes much more C heap (even
> during start up):
>
> ParallelGC (w/o ParallelOld):
>
> dr-evil{jcuthber}:210> ./test.sh -d64 -XX:-ZapUnusedHeapArea
> -XX:CICompilerCount=1 -XX:ParallelGCThreads=10 -Xms20g -Xmx20g
> -XX:+UseParallelGC -XX:+PrintMallocStatistics -XX:-UseParallelOldGC
> java version "1.7.0"
> Java(TM) SE Runtime Environment (build 1.7.0-b147)
> Java HotSpot(TM) 64-Bit Server VM (build 24.0-b20-internal-fastdebug,
> mixed mode)
> allocation stats: 3488 mallocs (12MB), 1161 frees (0MB), 4MB resrc
>
> ParallelGC (w/ ParallelOld):
>
>
> dr-evil{jcuthber}:211> ./test.sh -d64 -XX:-ZapUnusedHeapArea
> -XX:CICompilerCount=1 -XX:ParallelGCThreads=10 -Xms20g -Xmx20g
> -XX:+UseParallelGC -XX:+PrintMallocStatistics
> java version "1.7.0"
> Java(TM) SE Runtime Environment (build 1.7.0-b147)
> Java HotSpot(TM) 64-Bit Server VM (build 24.0-b20-internal-fastdebug,
> mixed mode)
> allocation stats: 3553 mallocs (36MB), 1160 frees (0MB), 4MB resrc
>
> G1:
>
> dr-evil{jcuthber}:212> ./test.sh -d64 -XX:-ZapUnusedHeapArea
> -XX:CICompilerCount=1 -XX:ParallelGCThreads=10 -Xms20g -Xmx20g
> -XX:+UseG1GC -XX:+PrintMallocStatistics
> java version "1.7.0"
> Java(TM) SE Runtime Environment (build 1.7.0-b147)
> Java HotSpot(TM) 64-Bit Server VM (build 24.0-b20-internal-fastdebug,
> mixed mode)
> allocation stats: 21703 mallocs (212MB), 1158 frees (0MB), 4MB resrc
>
> With the parallel collector, the main culprit is the work queues. For
> ParallelGC (without ParallelOldGC) the amount of space allocated is
> around 1Mb per GC thread. For ParallelGC (with ParallelOldGC) this
> increases to around 3Mb per worker thread. In G1, the main culprits
> are the global marking stack, the work queues (for both GC threads and
> marking threads), and some per-worker structures used for liveness
> accounting. This results in an additional 128Mb being allocated for
> the global marking stack and the amount allocated per-worker thread
> increases to around 7Mb. On some systems (specifically large T-series
> SPARC) this increase in C heap consumption can result in
> out-of-system-memory errors. These marking data structures are
> critical for G1. Reducing the sizes is a possible solution but
> increases the possibility of restarting marking due to overflowing the
> marking stack(s), lengthening marking durations, and increasing the
> chance of an evacuation failure and/or a Full GC.
>
> The solution we have adopted, therefore, is to allocate some of these
> marking data structures from virtual memory. This reduces the C heap
> consumption during start-up to:
>
> dr-evil{jcuthber}:216> ./test.sh -d64 -XX:-ZapUnusedHeapArea
> -XX:CICompilerCount=1 -XX:ParallelGCThreads=10 -Xms20g -Xmx20g
> -XX:+UseG1GC -XX:+PrintMallocStatistics
> java version "1.7.0"
> Java(TM) SE Runtime Environment (build 1.7.0-b147)
> Java HotSpot(TM) 64-Bit Server VM (build 24.0-b18-internal-fastdebug,
> mixed mode)
> allocation stats: 21682 mallocs (29MB), 1158 frees (0MB), 4MB resrc
>
> The memory is still allocated - just not from C heap. With these
> changes, G1's C heap consumption is now approximately 2Mb for each
> worker thread (which are the work queues themselves):
>
> C heap consumption (Mb) / # of GC threads:
>
>   Collector / # GC Threads        1     2     3     4     5
>   ParallelGC w/o ParallelOldGC    3     4     5     6     7
>   ParallelGC w/ ParallelOldGC     7    11    14    17    20
>   G1 before changes             149   156   163   170   177
>   G1 after changes               11    13    15    17    19
>
>
> We shall also investigate reducing the work queue sizes, to further
> reduce the amount of C heap consumed, for some or all of the
> collectors - but in a separate CR.
>
> Testing:
> GC test suite on x64 and sparc T-series with a low marking threshold
> and marking verification enabled; jprt. Our reference workload was
> used to verify that there was no significant performance difference.
>
> Thanks,
>
> JohnC