RFR(M): 7188263: G1: Excessive c_heap (malloc) consumption

Bengt Rutisson bengt.rutisson at oracle.com
Thu Sep 27 19:28:30 UTC 2012


Hi John,

Thanks for the extra explanations. They helped a lot! And I think your 
suggestions for addressing the comments I had sound good. In 
particular, it makes sense to treat the task queue size the same way 
in CMS and G1.

I'll look at your updated webrev when you send it out.

Regarding whether or not to only use VirtualSpaces on 64 bit, I feel a 
bit unsure. Using VirtualSpaces already makes the code more complex 
than it was before with C heap allocations. Introducing platform 
dependencies on top of that seems likely to create a big mess. If it 
is somehow possible to abstract the allocation away so that we keep it 
in one place, it might be worth treating 32 and 64 bit differently.
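
Just to illustrate what I mean (completely untested, and all the names 
are made up), something along these lines would at least keep the 
platform decision in one place:

  // Hypothetical helper - none of this exists in the webrev.
  static char* allocate_marking_backing_store(size_t bytes,
                                              VirtualSpace* vs) {
#ifdef _LP64
    // On 64 bit, reserve and commit a VirtualSpace for the store.
    ReservedSpace rs(ReservedSpace::allocation_align_size_up(bytes));
    if (!rs.is_reserved() || !vs->initialize(rs, rs.size())) {
      return NULL;
    }
    return vs->low();
#else
    // On 32 bit, fall back to C heap so we don't have to find a
    // large contiguous range in an already fragmented address space.
    return NEW_C_HEAP_ARRAY(char, bytes);
#endif
  }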

Not sure if this is a good thought, but if we do make it optional to 
use VirtualSpaces or C heap to handle the 32/64 bit differences, would 
it make sense to only use VirtualSpaces on Solaris? You mentioned that 
the out-of-C-heap issue seems to be due to a Solaris bug.

Cheers,
Bengt

On 2012-09-26 02:10, John Cuthbertson wrote:
> Hi Bengt,
>
> Thanks for the review. Replies inline...
>
> On 09/21/12 07:17, Bengt Rutisson wrote:
>>
>> Hi John,
>>
>> Thanks for doing the thorough analysis and providing the numbers. 
>> Interesting that G1 does 21703 mallocs at startup whereas Parallel 
>> only does 3500...
>>
>> However, I think I need some more background on this change. We are 
>> choosing to allocate into a single VirtualSpace instead of using 
>> separate mallocs. I understand that this will reduce the number of 
>> mallocs we do, but is that really the problem? The CR says that we 
>> run out of memory because the GC threads allocate too much memory. 
>> We will still allocate the same amount of memory, just mmapped 
>> instead of malloced, right? Do we have any test cases that fail 
>> before your fix but pass now?
> We didn't run out of memory - we ran out of C heap.
>
> There is an issue on Solaris (I have to check whether Ramki's email 
> describes the underlying problem) whereby we asked for a 26Gb heap 
> (which we got) but the C heap was only given 2Gb. Vladimir saw the 
> issue during COOPs testing. In this situation we ran out of C heap 
> trying to allocate the work queues for the marking threads.
>
> Azeem also submitted an issue a few weeks ago that ran into the same 
> problem (this time with ParallelGC/ParallelOld); the process died 
> while trying to allocate the PSPromotionManager instances (and their 
> internal work queues) - see CR 7167969. I believe that 7197666 is 
> similar.
>
> A real reduction in memory use will come from reducing the task queue 
> size. I want to do that in a separate CR (for some or all of the 
> collectors), provided there is no real performance difference.
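>
> For example (back-of-the-envelope, assuming 8-byte entries on a 
> 64-bit JVM):
>
>   131072 entries per queue * 8 bytes = 1 Mb of C heap per queue
>
> so every GC thread costs us a megabyte up front, independent of how 
> full its queue ever gets.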
>
>>
>> In fact, isn't the risk higher (at least on 32 bit platforms) that we 
>> fail to allocate the bitmaps now that we try to allocate them from 
>> one consecutive chunk of memory instead of several smaller ones? I'm 
>> thinking that if the memory is fragmented we might not get the 
>> contiguous memory we need for the VirtualSpace. But I am definitely 
>> no expert in this.
>
> Good question and I wish I knew the answer. The occupancy of the task 
> card bitmaps depends on the # of GC threads and the heap size (1 bit 
> per card) - I guess it might be possible. Should I then only be 
> allocating out of the backing store in a 64-bit JVM?
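>
> For a feel of the sizes involved (rough numbers, assuming the default 
> 512-byte cards and the 26Gb heap from the failing case):
>
>   26 Gb / 512 bytes per card    = ~54.5M cards
>   ~54.5M cards * 1 bit per card = ~6.5 Mb per task card bitmap
>   ~6.5 Mb * 10 worker threads   = ~65 Mb of contiguous backing store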
>>
>> With the above reservation I think your change looks good. Here are a 
>> couple of minor comments:
>>
>> CMMarkStack::allocate(size_t size)
>> "size" is kind of an overloaded name for an "allocate" method. Could 
>> we call it "capacity" or "number_of_entries"?
>
> Sure. Done. I renamed 'size' to 'capacity' - which matches the field 
> name. But you bring up a good point about this confusion. If we look 
> at the CMS version of the same routine:
>
> bool CMSMarkStack::allocate(size_t size) {
>   // allocate a stack of the requisite depth
>   ReservedSpace rs(ReservedSpace::allocation_align_size_up(
>                    size * sizeof(oop)));
>   if (!rs.is_reserved()) {
>     warning("CMSMarkStack allocation failure");
>     return false;
>   }
>   if (!_virtual_space.initialize(rs, rs.size())) {
>     warning("CMSMarkStack backing store failure");
>     return false;
>   }
>   assert(_virtual_space.committed_size() == rs.size(),
>          "didn't reserve backing store for all of CMS stack?");
>   _base = (oop*)(_virtual_space.low());
>   _index = 0;
>   _capacity = size;
>   NOT_PRODUCT(_max_depth = 0);
>   return true;
> }
>
> it used the parameter as the real size in bytes of the marking stack. 
> In G1, since we used NEW_C_HEAP_ARRAY, the size was multiplied by the 
> size of an oop and we got a larger marking stack than requested. 
> Instead of a 16 Mb stack (128 * 128K), we got a 128Mb stack (8 * 128 * 
> 128K). Basically we asked for the equivalent of the task queues for 
> another 128 threads:
>
> dr-evil{jcuthber}:265> ./test.sh -d64 -XX:-ZapUnusedHeapArea 
> -XX:CICompilerCount=1 -XX:ParallelGCThreads=10 -Xms20g -Xmx20g 
> -XX:+UseG1GC -XX:+PrintMallocStatistics
> Using java runtime at: 
> /net/jre.us.oracle.com/p/v42/jdk/7/fcs/b147/binaries/solaris-sparcv9/jre
> size: 16777216, MarkStackSize: 16777216, TASKQUEUE_SIZE: 131072
> os::malloc 134217728 bytes --> 0x00000001010709e8
> java version "1.7.0"
> Java(TM) SE Runtime Environment (build 1.7.0-b147)
> Java HotSpot(TM) 64-Bit Server VM (build 25.0-b03-internal-fastdebug, 
> mixed mode)
> allocation stats: 24489 mallocs (212MB), 1147 frees (0MB), 4MB resrc
>
> Mimicking CMS is probably the way to go here.
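> Something along these lines for the G1 version (untested sketch, and 
> I may be misremembering the exact field names):
>
> bool CMMarkStack::allocate(size_t capacity) {
>   // As in CMS: 'capacity' is a number of entries, so size the
>   // backing store explicitly in bytes.
>   ReservedSpace rs(ReservedSpace::allocation_align_size_up(
>                    capacity * sizeof(oop)));
>   if (!rs.is_reserved()) {
>     warning("ConcurrentMark MarkStack allocation failure");
>     return false;
>   }
>   if (!_virtual_space.initialize(rs, rs.size())) {
>     warning("ConcurrentMark MarkStack backing store failure");
>     return false;
>   }
>   _base = (oop*)(_virtual_space.low());
>   _index = 0;
>   _capacity = (jint) capacity;
>   return true;
> }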
>>
>>
>> In ConcurrentMark::ConcurrentMark() we use both shifting and 
>> division to accomplish the same thing on two consecutive lines. 
>> Could we maybe use the same style for both cases?
>>
>>   // Card liveness bitmap size (in bits)
>>   BitMap::idx_t card_bm_size = (heap_rs.size() + CardTableModRefBS::card_size - 1)
>>                                 >> CardTableModRefBS::card_shift;
>>   // Card liveness bitmap size (in bytes)
>>   size_t card_bm_size_bytes = (card_bm_size + (BitsPerByte - 1)) / BitsPerByte;
>
> Sure. Changed the latter one to use LogBitsPerByte.
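> That is, something like (paraphrasing, not the exact webrev line):
>
>   // Card liveness bitmap size (in bytes), now also using a shift
>   size_t card_bm_size_bytes = (card_bm_size + (BitsPerByte - 1))
>                                 >> LogBitsPerByte;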
>
>> Also in ConcurrentMark::ConcurrentMark() I think that 
>> "marked_bytes_size_bytes" is not such a great name. Probably we 
>> could drop the first "bytes" and rename "marked_bytes_size_bytes" 
>> to "marked_size_bytes".
>
> OK. What about just marked_bytes_size?
>>
>> I think it would be nice to factor some of the new stuff in 
>> ConcurrentMark::ConcurrentMark() out into methods. Both the 
>> calculations of the sizes and the creation/setting of the bitmaps. 
>> But I admit that this is just a style issue.
>
> OK. I'll try a few things to pretty it up.
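>
> Perhaps something along these lines (completely untested, and the 
> helper names are invented - just to show the shape):
>
> // Hypothetical helpers; none of these exist in the current code.
> // 1) Keep the size calculations together...
> static BitMap::idx_t card_bm_size_in_bits(size_t heap_bytes) {
>   // One bit per card, rounding the heap size up to a whole card.
>   return (heap_bytes + CardTableModRefBS::card_size - 1)
>            >> CardTableModRefBS::card_shift;
> }
>
> static size_t card_bm_size_in_bytes(BitMap::idx_t bits) {
>   // Round the bit count up to whole bytes.
>   return (bits + (BitsPerByte - 1)) >> LogBitsPerByte;
> }
>
> // 2) ...and move the creation/setting of the per-worker bitmaps into
> //    its own method, so the constructor just calls these.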
>
> Expect a new webrev soon.
>
> JohnC
