Hi John,

Thanks for doing the thorough analysis and providing the numbers.
Interesting that G1 does 21703 mallocs at startup whereas Parallel
only does about 3500...

However, I think I need some more background on this change. We are
choosing to allocate into a single VirtualSpace instead of using
separate mallocs. I understand that this will reduce the number of
mallocs we do, but is that really the problem? The CR says that we
run out of memory because the GC threads allocate too much memory.
We will still allocate the same amount of memory, just mmapped
instead of malloced, right? Do we have any test cases that fail
before your fix but pass with it?

In fact, isn't the risk higher (at least on 32-bit platforms) that we
fail to allocate the bitmaps now that we try to allocate them from
one contiguous chunk of memory instead of several smaller ones? I'm
thinking that if the address space is fragmented we might not get the
contiguous memory we need for the VirtualSpace. But I am definitely
no expert in this.

With the above reservation I think your change looks good. Here are a
couple of minor comments:

CMMarkStack::allocate(size_t size)

"size" is kind of an overloaded name for an "allocate" method. Could
we call it "capacity" or "number_of_entries"?
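
Something like this, purely to illustrate the rename (not a patch
against your webrev):

  CMMarkStack::allocate(size_t capacity)   // or: size_t number_of_entries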

In ConcurrentMark::ConcurrentMark() we use both shifting and division
to accomplish the same thing on two adjacent lines. Could we maybe
use the same style in both cases?

  // Card liveness bitmap size (in bits)
  BitMap::idx_t card_bm_size = (heap_rs.size() + CardTableModRefBS::card_size - 1)
                                 >> CardTableModRefBS::card_shift;
  // Card liveness bitmap size (in bytes)
  size_t card_bm_size_bytes = (card_bm_size + (BitsPerByte - 1)) / BitsPerByte;
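
For example, both could use the same round-up-by-division style (just
a sketch of the idea; the shift form would work equally well for
both, since card_size is 1 << card_shift):

  // Card liveness bitmap size (in bits)
  BitMap::idx_t card_bm_size = (heap_rs.size() + CardTableModRefBS::card_size - 1) /
                               CardTableModRefBS::card_size;
  // Card liveness bitmap size (in bytes)
  size_t card_bm_size_bytes = (card_bm_size + (BitsPerByte - 1)) /
                              BitsPerByte;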

Also in ConcurrentMark::ConcurrentMark() I think that
"marked_bytes_size_bytes" is not such a great name. We could probably
drop the first "bytes" in "marked_bytes_size" and call it
"marked_size_bytes" instead.

I think it would be nice to factor some of the new code in
ConcurrentMark::ConcurrentMark() out into methods - both the
calculation of the sizes and the creation/setting of the bitmaps. But
I admit that this is just a style issue.
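
For instance, something along these lines (the names here are
invented, purely to illustrate the kind of factoring I mean):

  // Size calculations in one place...
  BitMap::idx_t ConcurrentMark::card_live_bitmap_size_in_bits(size_t heap_bytes) {
    return (heap_bytes + CardTableModRefBS::card_size - 1) >>
           CardTableModRefBS::card_shift;
  }

  // ...and the creation/setting of the bitmaps in another.
  void ConcurrentMark::allocate_liveness_bitmaps(uint num_workers);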

Thanks,
Bengt


On 2012-09-20 21:15, John Cuthbertson wrote:

Hi Everyone,

Can I have a couple of volunteers review the changes for this CR? The
webrev can be found at:
http://cr.openjdk.java.net/~johnc/7188263/webrev.0

Summary:
Compared to the other collectors, G1 consumes much more C heap (even
during start-up):

ParallelGC (w/o ParallelOldGC):

dr-evil{jcuthber}:210> ./test.sh -d64 -XX:-ZapUnusedHeapArea -XX:CICompilerCount=1 -XX:ParallelGCThreads=10 -Xms20g -Xmx20g -XX:+UseParallelGC -XX:+PrintMallocStatistics -XX:-UseParallelOldGC
java version "1.7.0"
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) 64-Bit Server VM (build 24.0-b20-internal-fastdebug, mixed mode)
allocation stats: 3488 mallocs (12MB), 1161 frees (0MB), 4MB resrc

ParallelGC (w/ ParallelOldGC):

dr-evil{jcuthber}:211> ./test.sh -d64 -XX:-ZapUnusedHeapArea -XX:CICompilerCount=1 -XX:ParallelGCThreads=10 -Xms20g -Xmx20g -XX:+UseParallelGC -XX:+PrintMallocStatistics
java version "1.7.0"
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) 64-Bit Server VM (build 24.0-b20-internal-fastdebug, mixed mode)
allocation stats: 3553 mallocs (36MB), 1160 frees (0MB), 4MB resrc

G1:

dr-evil{jcuthber}:212> ./test.sh -d64 -XX:-ZapUnusedHeapArea -XX:CICompilerCount=1 -XX:ParallelGCThreads=10 -Xms20g -Xmx20g -XX:+UseG1GC -XX:+PrintMallocStatistics
java version "1.7.0"
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) 64-Bit Server VM (build 24.0-b20-internal-fastdebug, mixed mode)
allocation stats: 21703 mallocs (212MB), 1158 frees (0MB), 4MB resrc

With the parallel collector, the main culprit is the work queues. For
ParallelGC (without ParallelOldGC) the amount of space allocated is
around 1MB per GC thread. For ParallelGC (with ParallelOldGC) this
increases to around 3MB per worker thread. In G1 the main culprits
are the global marking stack, the work queues (for both GC threads
and marking threads), and some per-worker structures used for
liveness accounting. This results in an additional 128MB being
allocated for the global marking stack, and the amount allocated per
worker thread increases to around 7MB. On some systems (specifically
large T-series SPARC) this increase in C heap consumption can result
in out-of-system-memory errors. These marking data structures are
critical for G1. Reducing their sizes is a possible solution, but it
increases the likelihood of having to restart marking because the
marking stack(s) overflow, which lengthens marking durations and
increases the chance of an evacuation failure and/or a Full GC.
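
(As a rough cross-check of the numbers above: 128MB for the global
marking stack plus ~7MB for each of the 10 worker threads comes to
about 198MB, which accounts for most of the 212MB reported in the G1
run.)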

The solution we have adopted, therefore, is to allocate some of these
marking data structures from virtual memory (a rough sketch of the
pattern follows the table below). This reduces the C heap consumption
during start-up to:

dr-evil{jcuthber}:216> ./test.sh -d64 -XX:-ZapUnusedHeapArea -XX:CICompilerCount=1 -XX:ParallelGCThreads=10 -Xms20g -Xmx20g -XX:+UseG1GC -XX:+PrintMallocStatistics
java version "1.7.0"
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) 64-Bit Server VM (build 24.0-b18-internal-fastdebug, mixed mode)
allocation stats: 21682 mallocs (29MB), 1158 frees (0MB), 4MB resrc

The memory is still allocated - just not from C heap. With these
changes, G1's C heap consumption is now approximately 2MB per worker
thread (essentially the work queues themselves):

C heap consumption (MB) by number of GC threads:

  Collector / # GC threads         1     2     3     4     5
  ParallelGC w/o ParallelOldGC     3     4     5     6     7
  ParallelGC w/ ParallelOldGC      7    11    14    17    20
  G1 before changes              149   156   163   170   177
  G1 after changes                11    13    15    17    19
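
The general shape of the change is the usual reserve-and-commit idiom
instead of malloc; roughly like this (an illustrative sketch only -
simplified, and not the actual webrev code):

  // Sketch: back a marking data structure with a VirtualSpace instead
  // of NEW_C_HEAP_ARRAY. "requested_bytes" stands for whatever size
  // the structure needs; alignment and error handling are simplified.
  size_t bytes = align_size_up(requested_bytes, os::vm_allocation_granularity());

  ReservedSpace rs(bytes);              // reserve the address range
  if (!rs.is_reserved()) {
    vm_exit_during_initialization("Could not reserve space for marking data");
  }

  VirtualSpace vs;
  if (!vs.initialize(rs, rs.size())) {  // commit the whole range up front
    vm_exit_during_initialization("Could not commit space for marking data");
  }

  oop* base = (oop*) vs.low();          // backed by mmap, not the C heap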

We shall also investigate reducing the work queue sizes, to further
reduce the amount of C heap consumed, for some or all of the
collectors - but in a separate CR.

Testing:
GC test suite on x64 and SPARC T-series with a low marking threshold
and marking verification enabled; jprt. Our reference workload was
used to verify that there was no significant performance difference.

Thanks,

JohnC