Hi John,

Thanks for the extra explanations. They helped a lot! And I think your
suggestions for addressing my comments sound good. In particular, it
makes sense to treat the task queue size the same way in CMS and G1.

I'll look at your updated webrev when you send it out.

Regarding whether or not to only use VirtualSpaces on 64-bit, I feel a
bit unsure. Using VirtualSpaces already makes the code more complex
than it was before with C heap allocations. Introducing platform
dependencies on top of that seems likely to create a big mess. If it
is somehow possible to abstract the allocation away so that we keep it
in one place, it might be worth treating 32-bit and 64-bit differently
- something like the sketch below.
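
Rough, untested sketch of what I mean (all names made up, and modulo
the exact allocation macros) - the 32/64-bit choice confined to one
helper:

  // Hypothetical helper: the only place that knows whether the
  // marking data structures get their backing store from a
  // VirtualSpace or from the C heap.
  static char* allocate_marking_store(size_t bytes, VirtualSpace* vs) {
  #ifdef _LP64  // or a per-OS test, if we only do this on Solaris
    // 64-bit: reserve and commit one contiguous chunk of address
    // space, keeping the allocation out of the C heap.
    ReservedSpace rs(ReservedSpace::allocation_align_size_up(bytes));
    if (!rs.is_reserved() || !vs->initialize(rs, rs.size())) {
      return NULL;  // caller reports the failure
    }
    return vs->low();
  #else
    // 32-bit: plain C heap, so we do not depend on finding a large
    // contiguous range in a fragmented address space.
    return NEW_C_HEAP_ARRAY(char, bytes);
  #endif
  }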

Not sure if this is a good thought, but if we were to make it optional
to use VirtualSpaces or the C heap to handle the 32/64-bit
differences, would it make sense to only use VirtualSpaces on Solaris?
You mentioned that the out-of-C-heap issue seems to happen due to a
Solaris bug.

Cheers,
Bengt

On 2012-09-26 02:10, John Cuthbertson wrote:
> Hi Bengt,
>
> Thanks for the review. Replies inline...
>
> On 09/21/12 07:17, Bengt Rutisson wrote:
>> Hi John,
>>
>> Thanks for doing the thorough analysis and providing the numbers.
>> Interesting that G1 does 21703 mallocs at startup whereas Parallel
>> only does 3500...
>>
>> However, I think I need some more background on this change. We are
>> choosing to allocate into a single VirtualSpace instead of using
>> separate mallocs. I understand that this will reduce the number of
>> mallocs we do, but is that really the problem? The CR says that we
>> run out of memory because the GC threads allocate too much memory.
>> We will still allocate the same amount of memory, just mmapped
>> instead of malloced, right? Do we have any test cases that fail
>> before your fix but pass now?
>
> We didn't run out of memory - we ran out of C heap.
>
> There is an issue on Solaris (I have to check whether Ramki's email
> describes the underlying problem) whereby we asked for a 26Gb heap
> (which we got) but the C heap was only given 2Gb. Vladimir saw the
> issue during COOPs testing. In this situation we ran out of C heap
> trying to allocate the work queues for the marking threads.
>
> Azeem submitted an issue a few weeks ago as well that ran into the
> same problem (this time with ParallelGC/ParallelOld), where the
> process died while trying to allocate the PSPromotionManager
> instances (and their internal work queues) - see CR 7167969. I
> believe that 7197666 is similar.
>
> A real reduction in memory use will come from reducing the task
> queue size. I want to do that in a separate CR (for some or all of
> the collectors) provided there is no real performance difference.
>
>> In fact, isn't the risk higher (at least on 32-bit platforms) that
>> we fail to allocate the bitmaps now that we try to allocate them
>> from one consecutive chunk of memory instead of several smaller
>> ones? I'm thinking that if the memory is fragmented we might not
>> get the contiguous memory we need for the VirtualSpace. But I am
>> definitely no expert in this.
>
> Good question and I wish I knew the answer. The occupancy of the
> task card bitmaps depends on the # of GC threads and the heap size
> (1 bit per card) - I guess it might be possible. Should I then only
> be allocating out of the backing store in a 64-bit JVM?
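>
> (Back-of-the-envelope: with the default 512-byte cards, the 26Gb
> heap from the Solaris issue has 26Gb / 512 = ~54.5M cards, so each
> per-worker card bitmap is ~54.5M bits = ~6.5Mb, and with 10 marking
> threads that is ~65Mb of contiguous backing store to find.)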
>
>> With the above reservation I think your change looks good. Here are
>> a couple of minor comments:
>>
>> CMMarkStack::allocate(size_t size)
>> "size" is kind of an overloaded name for an "allocate" method.
>> Could we call it "capacity" or "number_of_entries"?
>
> Sure. Done. I renamed 'size' to 'capacity' - which matches the field
> name. But you bring up a good point about this confusion. If we look
> at the CMS version of the same routine:
>
>   bool CMSMarkStack::allocate(size_t size) {
>     // allocate a stack of the requisite depth
>     ReservedSpace rs(ReservedSpace::allocation_align_size_up(
>                        size * sizeof(oop)));
>     if (!rs.is_reserved()) {
>       warning("CMSMarkStack allocation failure");
>       return false;
>     }
>     if (!_virtual_space.initialize(rs, rs.size())) {
>       warning("CMSMarkStack backing store failure");
>       return false;
>     }
>     assert(_virtual_space.committed_size() == rs.size(),
>            "didn't reserve backing store for all of CMS stack?");
>     _base = (oop*)(_virtual_space.low());
>     _index = 0;
>     _capacity = size;
>     NOT_PRODUCT(_max_depth = 0);
>     return true;
>   }
>
> it used the parameter as the real size in bytes of the marking
> stack. In G1, since we used NEW_C_HEAP_ARRAY, the size was
> multiplied by the size of an oop and we got a larger marking stack
> than requested. Instead of a 16Mb stack (128 * 128K), we got a 128Mb
> stack (8 * 128 * 128K). Basically we asked for the equivalent of the
> task queues for another 128 threads:
>
>   dr-evil{jcuthber}:265> ./test.sh -d64 -XX:-ZapUnusedHeapArea -XX:CICompilerCount=1 -XX:ParallelGCThreads=10 -Xms20g -Xmx20g -XX:+UseG1GC -XX:+PrintMallocStatistics
>   Using java runtime at: /net/jre.us.oracle.com/p/v42/jdk/7/fcs/b147/binaries/solaris-sparcv9/jre
>   size: 16777216, MarkStackSize: 16777216, TASKQUEUE_SIZE: 131072
>   os::malloc 134217728 bytes --> 0x00000001010709e8
>   java version "1.7.0"
>   Java(TM) SE Runtime Environment (build 1.7.0-b147)
>   Java HotSpot(TM) 64-Bit Server VM (build 25.0-b03-internal-fastdebug, mixed mode)
>   allocation stats: 24489 mallocs (212MB), 1147 frees (0MB), 4MB resrc
>
> Mimicking CMS is probably the way to go here.
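>
> Something along these lines, mirroring the CMS code above - just a
> sketch of what I have in mind, not the final code (the renamed
> 'capacity' parameter is the number of entries, and the backing store
> is sized in bytes):
>
>   bool CMMarkStack::allocate(size_t capacity) {
>     // Reserve and commit capacity * sizeof(oop) bytes for the
>     // stack, instead of letting NEW_C_HEAP_ARRAY scale the request.
>     ReservedSpace rs(ReservedSpace::allocation_align_size_up(
>                        capacity * sizeof(oop)));
>     if (!rs.is_reserved()) {
>       warning("ConcurrentMark MarkStack allocation failure");
>       return false;
>     }
>     if (!_virtual_space.initialize(rs, rs.size())) {
>       warning("ConcurrentMark MarkStack backing store failure");
>       return false;
>     }
>     _base = (oop*)(_virtual_space.low());
>     _capacity = (jint) capacity;
>     _index = 0;
>     return true;
>   }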
>
>> In ConcurrentMark::ConcurrentMark() we use both shifting and
>> division to accomplish the same thing on consecutive lines. Could
>> we maybe use the same style for both cases?
>>
>>   // Card liveness bitmap size (in bits)
>>   BitMap::idx_t card_bm_size = (heap_rs.size() +
>>                                 CardTableModRefBS::card_size - 1)
>>                                  >> CardTableModRefBS::card_shift;
>>   // Card liveness bitmap size (in bytes)
>>   size_t card_bm_size_bytes = (card_bm_size + (BitsPerByte - 1))
>>                                  / BitsPerByte;
>
> Sure. Changed the latter one to use LogBitsPerByte.
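>
> i.e. both lines now shift - quoting from memory, modulo formatting:
>
>   // Card liveness bitmap size (in bits)
>   BitMap::idx_t card_bm_size = (heap_rs.size() +
>                                 CardTableModRefBS::card_size - 1)
>                                  >> CardTableModRefBS::card_shift;
>   // Card liveness bitmap size (in bytes)
>   size_t card_bm_size_bytes = (card_bm_size + (BitsPerByte - 1))
>                                  >> LogBitsPerByte;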
>
>> Also in ConcurrentMark::ConcurrentMark() I think that
>> "marked_bytes_size_bytes" is not such a great name. We could
>> probably skip the first "bytes" and just call it
>> "marked_size_bytes".
>
> OK. What about just marked_bytes_size?
>
>> I think it would be nice to factor some of the new stuff in
>> ConcurrentMark::ConcurrentMark() out into methods - both the
>> calculations of the sizes and the creation/setting of the bitmaps.
>> But I admit that this is just a style issue.
>
> OK. I'll try a few things to pretty it up.
>
> Expect a new webrev soon.
>
> JohnC