RFR(M): 7188263: G1: Excessive c_heap (malloc) consumption

John Cuthbertson john.cuthbertson at oracle.com
Wed Sep 26 00:10:07 UTC 2012


Hi Bengt,

Thanks for the review. Replies inline...

On 09/21/12 07:17, Bengt Rutisson wrote:
>
> Hi John,
>
> Thanks for doing the thorough analysis and providing the numbers. 
> Interesting that G1 does 21703 mallocs at startup whereas Parallel 
> only does 3500...
>
> However, I think I need some more background on this change. We are
> choosing to allocate into a single VirtualSpace instead of using
> separate mallocs. I understand that this will reduce the number of
> mallocs we do, but is that really the problem? The CR says that we run
> out of memory because the GC threads allocate too much memory. We will
> still allocate the same amount of memory, just mmapped instead of
> malloced, right? Do we have any test cases that fail before your fix
> but pass now?
We didn't run out of memory - we ran out of C heap.

There is an issue on Solaris (I have to check whether Ramki's email
describes the underlying problem) whereby we asked for a 26Gb heap
(which we got) but the C heap was only given 2Gb. Vladimir saw the
issue during compressed oops (COOPs) testing. In this situation we ran
out of C heap while trying to allocate the work queues for the marking
threads.

Azeem submitted an issue a few weeks ago as well that ran into the same 
problem (this time with ParallelGC/ParallelOld) and the process died 
while trying to allocate the PSPromotionManager instances (and their 
internal work queue) - see CR 7167969. I believe that 7197666 is similar.

A real reduction in memory use will come from reducing the task queue
size. I want to do that in a separate CR (for some or all of the
collectors) provided there is no real performance difference.
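
For scale (using the numbers from the run further down, and assuming
uncompressed 8-byte oop entries on a 64-bit JVM):

  per queue:   TASKQUEUE_SIZE * sizeof(oop) = 131072 * 8 = 1Mb
  10 threads:  10 * 1Mb = 10Mb of C heap for the task queues alone

so shrinking TASKQUEUE_SIZE would be a direct saving across all the
collectors that use these queues.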

>
> In fact, isn't the risk higher (at least on 32 bit platforms) that we 
> fail to allocate the bitmaps now that we try to allocate them from one 
> consecutive chunk of memory instead of several smaller ones? I'm 
> thinking that if the memory is fragmented we might not get the 
> contiguous memory we need for the VirtualSpace. But I am definitely no 
> expert in this.

Good question and I wish I knew the answer. The size of the per-task
card bitmaps depends on the number of GC threads and the heap size (1
bit per card), so I guess a fragmentation failure might be possible.
Should I then only be allocating out of the backing store in a 64-bit
JVM?
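
For the 26Gb case above, assuming the usual 512-byte cards, the numbers
come out roughly as:

  bits per task bitmap:   26Gb / 512 bytes per card = ~54.5M bits
  bytes per task bitmap:  ~54.5M / 8 = ~6.5Mb
  10 marking tasks:       ~65Mb of contiguous VirtualSpace

On a 32-bit JVM the heap (and hence the bitmaps) would be
correspondingly smaller, but it would still be a single reservation of
several Mb, so fragmentation is a fair concern.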
>
> With the above reservation I think your change looks good. Here are a 
> couple of minor comments:
>
> CMMarkStack::allocate(size_t size)
> "size" is kind of an overloaded name for an "allocate" method. Could 
> we call it "capacity" or "number_of_entries"?

Sure. Done. I renamed 'size' to 'capacity' - which matches the field 
name. But you bring up a good point about this confusion. If we look at 
the CMS version of the same routine:

bool CMSMarkStack::allocate(size_t size) {
  // allocate a stack of the requisite depth
  ReservedSpace rs(ReservedSpace::allocation_align_size_up(
                   size * sizeof(oop)));
  if (!rs.is_reserved()) {
    warning("CMSMarkStack allocation failure");
    return false;
  }
  if (!_virtual_space.initialize(rs, rs.size())) {
    warning("CMSMarkStack backing store failure");
    return false;
  }
  assert(_virtual_space.committed_size() == rs.size(),
         "didn't reserve backing store for all of CMS stack?");
  _base = (oop*)(_virtual_space.low());
  _index = 0;
  _capacity = size;
  NOT_PRODUCT(_max_depth = 0);
  return true;
}

it uses the parameter as the number of entries in the marking stack
and explicitly reserves size * sizeof(oop) bytes of backing store for
it. In G1, since we used NEW_C_HEAP_ARRAY, the size was also multiplied
by the size of an oop and we got a marking stack 8x larger than
intended. Instead of a 16Mb stack (128 * 128K), we got a 128Mb stack
(8 * 128 * 128K). Basically we asked for the equivalent of the task
queues for another 128 threads:

dr-evil{jcuthber}:265> ./test.sh -d64 -XX:-ZapUnusedHeapArea 
-XX:CICompilerCount=1 -XX:ParallelGCThreads=10 -Xms20g -Xmx20g 
-XX:+UseG1GC -XX:+PrintMallocStatistics
Using java runtime at: 
/net/jre.us.oracle.com/p/v42/jdk/7/fcs/b147/binaries/solaris-sparcv9/jre
size: 16777216, MarkStackSize: 16777216, TASKQUEUE_SIZE: 131072
os::malloc 134217728 bytes --> 0x00000001010709e8
java version "1.7.0"
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b03-internal-fastdebug, 
mixed mode)
allocation stats: 24489 mallocs (212MB), 1147 frees (0MB), 4MB resrc
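
The 134217728-byte os::malloc above is exactly that 128Mb stack.
NEW_C_HEAP_ARRAY(oop, size) boils down to something like this (modulo
the exact AllocateHeap arguments in this tree):

  // What CMMarkStack::allocate(size) effectively did before the fix:
  _base = (oop*) AllocateHeap(size * sizeof(oop), mtGC);
  // 16777216 entries * 8 bytes = 134217728 bytes (128Mb) of C heap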

Mimicking CMS is probably the way to go here.
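
Something along these lines, i.e. the CMS pattern transplanted into
CMMarkStack (just a sketch, not the final code):

bool CMMarkStack::allocate(size_t capacity) {
  // Reserve (and commit) backing store for 'capacity' oop entries.
  ReservedSpace rs(ReservedSpace::allocation_align_size_up(
                   capacity * sizeof(oop)));
  if (!rs.is_reserved()) {
    warning("ConcurrentMark MarkStack allocation failure");
    return false;
  }
  if (!_virtual_space.initialize(rs, rs.size())) {
    warning("ConcurrentMark MarkStack backing store failure");
    return false;
  }
  assert(_virtual_space.committed_size() == rs.size(),
         "didn't reserve backing store for all of the marking stack?");
  _base = (oop*) _virtual_space.low();
  _index = 0;
  _capacity = (jint) capacity;
  return true;
}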
>
> In ConcurrentMark::ConcurrentMark() we use both shifting and division
> to accomplish the same thing on consecutive lines. Could we maybe use
> the same style for both cases?
>
>   // Card liveness bitmap size (in bits)
>   BitMap::idx_t card_bm_size = (heap_rs.size() + CardTableModRefBS::card_size - 1)
>                                 >> CardTableModRefBS::card_shift;
>   // Card liveness bitmap size (in bytes)
>   size_t card_bm_size_bytes = (card_bm_size + (BitsPerByte - 1)) / BitsPerByte;

Sure. Changed the latter one to use LogBitsPerByte.
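
i.e. both calculations now use the shift style:

  // Card liveness bitmap size (in bits)
  BitMap::idx_t card_bm_size = (heap_rs.size() + CardTableModRefBS::card_size - 1)
                                >> CardTableModRefBS::card_shift;
  // Card liveness bitmap size (in bytes)
  size_t card_bm_size_bytes = (card_bm_size + (BitsPerByte - 1)) >> LogBitsPerByte;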

> Also in ConcurrentMark::ConcurrentMark() I think that
> "marked_bytes_size_bytes" is not such a great name. Probably we could
> drop the first "bytes" and just rename "marked_bytes_size_bytes" to
> "marked_size_bytes".

OK. What about just marked_bytes_size?
>
> I think it would be nice to factor some of the new stuff in 
> ConcurrentMark::ConcurrentMark() out into methods. Both the 
> calculations of the sizes and the creation/setting of the bitmaps. But 
> I admit that this is just a style issue.

OK. I'll try a few things to pretty it up.

Expect a new webrev soon.

JohnC