RFR(M): 7188263: G1: Excessive c_heap (malloc) consumption
Bengt Rutisson
bengt.rutisson at oracle.com
Fri Sep 21 14:17:16 UTC 2012
Hi John,
Thanks for doing the thorough analysis and providing the numbers.
Interesting that G1 does 21703 mallocs at startup whereas Parallel only
does 3500...
However, I think I need some more background on this change. We are
choosing to allocate into a single VirtualSpace instead of using
separate mallocs. I understand that this will reduce the number of
mallocs we do, but is that really the problem? The CR says that we run
out of memory because the GC threads allocate too much memory. We will
still allocate the same amount of memory, just mmapped instead of
malloced, right? Do we have any test cases that fail before your fix
but pass now?
In fact, isn't the risk higher (at least on 32 bit platforms) that we
fail to allocate the bitmaps now that we try to allocate them from one
consecutive chunk of memory instead of several smaller ones? I'm
thinking that if the memory is fragmented we might not get the
contiguous memory we need for the VirtualSpace. But I am definitely no
expert in this.
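To make the fragmentation concern concrete, here is a minimal sketch of the single-reservation approach. The names and layout are mine, invented for illustration (the real code uses HotSpot's ReservedSpace/VirtualSpace, not malloc); the point is that all the bitmaps are carved out of one block, so the whole scheme depends on one large contiguous range being available up front:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Illustrative stand-in for a single contiguous reservation; invented
// for this sketch, not HotSpot's actual VirtualSpace API.
struct SingleReservation {
  char*  base;
  size_t capacity;
  size_t used;

  explicit SingleReservation(size_t cap)
    : base(static_cast<char*>(std::malloc(cap))),  // must succeed as ONE chunk
      capacity(cap), used(0) {}
  ~SingleReservation() { std::free(base); }

  // Carve a sub-range for one bitmap. Individual carves never touch the
  // OS allocator; only the initial reservation can fail.
  void* carve(size_t bytes) {
    if (base == nullptr || used + bytes > capacity) return nullptr;
    void* p = base + used;
    used += bytes;
    return p;
  }
};
```

With separate mallocs, each bitmap only needs its own smaller free range; with the single reservation, the address space must contain one range large enough for all of them together, which is the 32-bit fragmentation worry above.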
With the above reservation I think your change looks good. Here are a
couple of minor comments:
CMMarkStack::allocate(size_t size)
"size" is kind of an overloaded name for an "allocate" method. Could we
call it "capacity" or "number_of_entries"?
In ConcurrentMark::ConcurrentMark() we use both shifting and division
to accomplish the same thing on consecutive lines. Could we maybe use
the same style for both cases?
  // Card liveness bitmap size (in bits)
  BitMap::idx_t card_bm_size = (heap_rs.size() + CardTableModRefBS::card_size - 1)
                               >> CardTableModRefBS::card_shift;
  // Card liveness bitmap size (in bytes)
  size_t card_bm_size_bytes = (card_bm_size + (BitsPerByte - 1)) / BitsPerByte;
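As an illustration of using one style for both lines, both rounding computations can be written as shifts. The constants below are stand-ins I picked for the sketch, not the real CardTableModRefBS values:

```cpp
#include <cassert>
#include <cstddef>

// Assumed stand-in values for the sketch (not HotSpot's constants).
static const size_t card_size   = 512;  // bytes per card
static const int    card_shift  = 9;    // log2(card_size)
static const size_t BitsPerByte = 8;

// Card liveness bitmap size in bits: round the heap size up to a whole
// number of cards, expressed as a shift.
size_t card_bm_size_bits(size_t heap_bytes) {
  return (heap_bytes + card_size - 1) >> card_shift;
}

// Card liveness bitmap size in bytes: same rounding idiom, also as a
// shift (log2(BitsPerByte) == 3), matching the style of the line above.
size_t card_bm_size_bytes(size_t heap_bytes) {
  return (card_bm_size_bits(heap_bytes) + BitsPerByte - 1) >> 3;
}
```

Either style works; the point is only that picking one makes the two adjacent lines read uniformly.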
Also in ConcurrentMark::ConcurrentMark(), I think
"marked_bytes_size_bytes" is not such a great name. We could probably
drop the first "bytes" from "marked_bytes_size" and call it simply
"marked_size_bytes".
I think it would be nice to factor some of the new stuff in
ConcurrentMark::ConcurrentMark() out into methods. Both the calculations
of the sizes and the creation/setting of the bitmaps. But I admit that
this is just a style issue.
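For example (purely a sketch with invented helper names, not a patch proposal), the constructor logic could be split along those lines:

```cpp
#include <cassert>
#include <cstddef>

static const size_t BitsPerByte = 8;

// One helper owns the size calculation (invented name)...
size_t liveness_bitmap_size_bytes(size_t bm_size_bits) {
  return (bm_size_bits + BitsPerByte - 1) / BitsPerByte;
}

// ...and a second step owns creating/setting the bitmaps, so the
// constructor body becomes a short sequence of named calls.
struct MarkingStructures {
  size_t card_bm_bytes;

  void initialize(size_t card_bm_bits) {
    card_bm_bytes = liveness_bitmap_size_bytes(card_bm_bits);
    // allocation/commit of the backing store would follow here
  }
};
```

The benefit is mostly readability: each helper name documents which quantity is being computed, instead of a long run of inline arithmetic in the constructor.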
Thanks,
Bengt
On 2012-09-20 21:15, John Cuthbertson wrote:
> Hi Everyone,
>
> Can I have a couple of volunteers review the changes for this CR - the
> webrev can be found at:
> http://cr.openjdk.java.net/~johnc/7188263/webrev.0 ?
>
> Summary:
> Compared to the other collectors, G1 consumes much more C heap (even
> during start up):
>
> ParallelGC (w/o ParallelOld):
>
> dr-evil{jcuthber}:210> ./test.sh -d64 -XX:-ZapUnusedHeapArea
> -XX:CICompilerCount=1 -XX:ParallelGCThreads=10 -Xms20g -Xmx20g
> -XX:+UseParallelGC -XX:+PrintMallocStatistics -XX:-UseParallelOldGC
> java version "1.7.0"
> Java(TM) SE Runtime Environment (build 1.7.0-b147)
> Java HotSpot(TM) 64-Bit Server VM (build 24.0-b20-internal-fastdebug,
> mixed mode)
> allocation stats: 3488 mallocs (12MB), 1161 frees (0MB), 4MB resrc
>
> ParallelGC (w/ ParallelOld):
>
>
> dr-evil{jcuthber}:211> ./test.sh -d64 -XX:-ZapUnusedHeapArea
> -XX:CICompilerCount=1 -XX:ParallelGCThreads=10 -Xms20g -Xmx20g
> -XX:+UseParallelGC -XX:+PrintMallocStatistics
> java version "1.7.0"
> Java(TM) SE Runtime Environment (build 1.7.0-b147)
> Java HotSpot(TM) 64-Bit Server VM (build 24.0-b20-internal-fastdebug,
> mixed mode)
> allocation stats: 3553 mallocs (36MB), 1160 frees (0MB), 4MB resrc
>
> G1:
>
> dr-evil{jcuthber}:212> ./test.sh -d64 -XX:-ZapUnusedHeapArea
> -XX:CICompilerCount=1 -XX:ParallelGCThreads=10 -Xms20g -Xmx20g
> -XX:+UseG1GC -XX:+PrintMallocStatistics
> java version "1.7.0"
> Java(TM) SE Runtime Environment (build 1.7.0-b147)
> Java HotSpot(TM) 64-Bit Server VM (build 24.0-b20-internal-fastdebug,
> mixed mode)
> allocation stats: 21703 mallocs (212MB), 1158 frees (0MB), 4MB resrc
>
> With the parallel collector, the main culprit is the work queues. For
> ParallelGC (without ParallelOldGC) the amount of space allocated is
> around 1Mb per GC thread. For ParallelGC (with ParallelOldGC) this
> increases to around 3Mb per worker thread. In G1, the main culprits
> are the global marking stack, the work queues (for both GC threads and
> marking threads), and some per-worker structures used for liveness
> accounting. This results in an additional 128Mb being allocated for
> the global marking stack and the amount allocated per-worker thread
> increases to around 7Mb. On some systems (specifically large T-series
> SPARC) this increase in C heap consumption can result in
> out-of-system-memory errors. These marking data structures are
> critical for G1. Reducing the sizes is a possible solution but
> increases the possibility of restarting marking due to overflowing the
> marking stack(s), lengthening marking durations, and increasing the
> chance of an evacuation failure and/or a Full GC.
>
> The solution we have adopted, therefore, is to allocate some of these
> marking data structures from virtual memory. This reduces the C heap
> consumption during start-up to:
>
> dr-evil{jcuthber}:216> ./test.sh -d64 -XX:-ZapUnusedHeapArea
> -XX:CICompilerCount=1 -XX:ParallelGCThreads=10 -Xms20g -Xmx20g
> -XX:+UseG1GC -XX:+PrintMallocStatistics
> java version "1.7.0"
> Java(TM) SE Runtime Environment (build 1.7.0-b147)
> Java HotSpot(TM) 64-Bit Server VM (build 24.0-b18-internal-fastdebug,
> mixed mode)
> allocation stats: 21682 mallocs (29MB), 1158 frees (0MB), 4MB resrc
>
> The memory is still allocated - just not from C heap. With these
> changes, G1's C heap consumption is now approximately 2Mb for each
> worker thread (which are the work queues themselves):
>
> C heap consumption (Mb) / # of GC threads:
>
>   Collector / # GC Threads        1     2     3     4     5
>   ParallelGC w/o ParallelOldGC    3     4     5     6     7
>   ParallelGC w/ ParallelOldGC     7    11    14    17    20
>   G1 before changes             149   156   163   170   177
>   G1 after changes               11    13    15    17    19
>
>
> We shall also investigate reducing the work queue sizes, to further
> reduce the amount of C heap consumed, for some or all of the
> collectors - but in a separate CR.
>
> Testing:
> GC test suite on x64 and sparc T-series with a low marking threshold
> and marking verification enabled; jprt. Our reference workload was
> used to verify that there was no significant performance difference.
>
> Thanks,
>
> JohnC