<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <div class="moz-cite-prefix"><br>

      Hi John,<br>

      <br>

      Thanks for doing the thorough analysis and providing the numbers.

      Interesting that G1 does 21703 mallocs at startup whereas Parallel

      only does 3500...<br>

      <br>

      However, I think I need some more background on this change. We

      are choosing to allocate into a single VirtualSpace instead of

      using separate mallocs. I understand that this will reduce the

      number of mallocs we do, but is that really the problem? The CR

      says that we run out of memory due to the fact that the GC threads

      allocate too much memory. We will still allocate the same amount

      of memory, just mmapped instead of malloced, right? Do we have any

      test cases that fail before your fix but pass now?<br>

      <br>

      In fact, isn't the risk higher (at least on 32-bit platforms) that

      we fail to allocate the bitmaps now that we try to allocate them

      from one contiguous chunk of memory instead of several smaller

      ones? I'm thinking that if the memory is fragmented we might not

      get the contiguous memory we need for the VirtualSpace. But I am

      definitely no expert in this.<br>

      <br>

      With the above reservation I think your change looks good. Here

      are a couple of minor comments:<br>

      <br>

      CMMarkStack::allocate(size_t size)

      <br>

      "size" is kind of an overloaded name for an "allocate" method.

      Could we call it "capacity" or "number_of_entries"?

      <br>

      <br>

      <br>


      In ConcurrentMark::ConcurrentMark() we use both shifting and

      division to accomplish the same thing (a rounding-up division) on

      consecutive lines. Could we maybe use the same style for both cases?<br>

      <br>

      <pre>  // Card liveness bitmap size (in bits)
  BitMap::idx_t card_bm_size = (heap_rs.size() + CardTableModRefBS::card_size - 1)
                                  &gt;&gt; CardTableModRefBS::card_shift;
  // Card liveness bitmap size (in bytes)
  size_t card_bm_size_bytes = (card_bm_size + (BitsPerByte - 1)) / BitsPerByte;</pre>

      <br>
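      To illustrate what I mean by "the same style", here is a minimal
      standalone sketch (not the HotSpot code) where both sizes go through
      one round-up helper. The helper name div_round_up, the typedef and
      the constants are assumptions made up for this example; the values 9
      and 512 are only placeholders for the card shift and card size.<br>

```cpp
// Standalone sketch; names and constants are illustrative, not HotSpot's.
typedef unsigned long long usize;

const usize kCardShift   = 9;                  // assumed: log2 of the card size
const usize kCardSize    = 1ULL << kCardShift; // assumed: bytes covered by one card
const usize kBitsPerByte = 8;

// Divide n by d, rounding up. d must be non-zero.
inline usize div_round_up(usize n, usize d) {
  return (n + d - 1) / d;
}

// Card liveness bitmap size (in bits): one bit per card in the heap.
inline usize card_bm_size_bits(usize heap_bytes) {
  return div_round_up(heap_bytes, kCardSize);
}

// Card liveness bitmap size (in bytes).
inline usize card_bm_size_bytes(usize heap_bytes) {
  return div_round_up(card_bm_size_bits(heap_bytes), kBitsPerByte);
}
```

      For unsigned values, dividing by the card size gives the same
      result as shifting right by the card shift, so either style would
      work as long as both lines use the same one.<br>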

      <br>

      <br>

      <br>

      Also in ConcurrentMark::ConcurrentMark() I think that

      "marked_bytes_size_bytes" is not such a great name. Could we

      drop the first "bytes" and rename "marked_bytes_size_bytes" to

      "marked_size_bytes"?

      <br>

      <br>

      <br>

      I think it would be nice to factor some of the new code in

      ConcurrentMark::ConcurrentMark() out into methods, both the

      calculations of the sizes and the creation/setting of the bitmaps.

      But I admit that this is just a style issue.

      <br>

      <br>

      <br>

      Thanks,<br>

      Bengt<br>

      <br>

      <br>

      <br>

      <br>

      <br>

      On 2012-09-20 21:15, John Cuthbertson wrote:<br>

    </div>

    <blockquote cite="mid:505B6B65.1060201@oracle.com" type="cite">

      Hi Everyone,<br>

      <br>

      Can I have a couple of volunteers review the changes for this CR -

      the

      webrev can be found at:

      <a moz-do-not-send="true" class="moz-txt-link-freetext"

        href="http://cr.openjdk.java.net/%7Ejohnc/7188263/webrev.0">http://cr.openjdk.java.net/~johnc/7188263/webrev.0</a>

      ?<br>

      <br>

      Summary:<br>

      Compared to the other collectors, G1 consumes much more C heap

      (even

      during start up):<br>

      <br>

      ParallelGC (w/o ParallelOld):<br>

      <br>

      <tt>dr-evil{jcuthber}:210> ./test.sh -d64

        -XX:-ZapUnusedHeapArea

        -XX:CICompilerCount=1 -XX:ParallelGCThreads=10 -Xms20g -Xmx20g

        -XX:+UseParallelGC -XX:+PrintMallocStatistics

        -XX:-UseParallelOldGC<br>

        java version "1.7.0"<br>

        Java(TM) SE Runtime Environment (build 1.7.0-b147)<br>

        Java HotSpot(TM) 64-Bit Server VM (build

        24.0-b20-internal-fastdebug,

        mixed mode)<br>

        allocation stats: 3488 mallocs (12MB), 1161 frees (0MB), 4MB

        resrc</tt><br>

      <br>

      ParallelGC (w/ ParallelOld):<br>

      <br>

      <br>

      <tt>dr-evil{jcuthber}:211> ./test.sh -d64

        -XX:-ZapUnusedHeapArea

        -XX:CICompilerCount=1 -XX:ParallelGCThreads=10 -Xms20g -Xmx20g

        -XX:+UseParallelGC -XX:+PrintMallocStatistics<br>

        java version "1.7.0"<br>

        Java(TM) SE Runtime Environment (build 1.7.0-b147)<br>

        Java HotSpot(TM) 64-Bit Server VM (build

        24.0-b20-internal-fastdebug,

        mixed mode)<br>

        allocation stats: 3553 mallocs (36MB), 1160 frees (0MB), 4MB

        resrc</tt><br>

      <br>

      G1:<br>

      <br>

      <tt>dr-evil{jcuthber}:212> ./test.sh -d64

        -XX:-ZapUnusedHeapArea

        -XX:CICompilerCount=1 -XX:ParallelGCThreads=10 -Xms20g -Xmx20g

        -XX:+UseG1GC -XX:+PrintMallocStatistics<br>

        java version "1.7.0"<br>

        Java(TM) SE Runtime Environment (build 1.7.0-b147)<br>

        Java HotSpot(TM) 64-Bit Server VM (build

        24.0-b20-internal-fastdebug,

        mixed mode)<br>

        allocation stats: 21703 mallocs (212MB), 1158 frees (0MB), 4MB

        resrc</tt><br>

      <br>

      With the parallel collector, the main culprit is the work queues.

      For

      ParallelGC (without ParallelOldGC) the amount of space allocated

      is

      around 1Mb per GC thread. For ParallelGC (with ParallelOldGC) this

      increases to around 3Mb per worker thread.  In G1, the main

      culprits

      are the global marking stack, the work queues (for both GC threads

      and

      marking threads), and some per-worker structures used for liveness

      accounting. This results in an additional 128Mb being allocated

      for the

      global marking stack and the amount allocated per-worker thread

      increases to around 7Mb. On some systems (specifically large

      T-series

      SPARC) this increase in C heap consumption can result in

      out-of-system-memory errors. These marking data structures are

      critical

      for G1. Reducing the sizes is a possible solution but increases

      the

      possibility of restarting marking due to overflowing the marking

      stack(s), lengthening marking durations, and increasing the chance

      of

      an

      evacuation failure and/or a Full GC.<br>

      <br>

      The solution we have adopted, therefore, is to allocate some of

      these

      marking data structures from virtual memory. This reduces the C

      heap

      consumption during start-up to:<br>

      <br>

      <tt>dr-evil{jcuthber}:216> ./test.sh -d64

        -XX:-ZapUnusedHeapArea

        -XX:CICompilerCount=1 -XX:ParallelGCThreads=10 -Xms20g -Xmx20g

        -XX:+UseG1GC -XX:+PrintMallocStatistics<br>

        java version "1.7.0"<br>

        Java(TM) SE Runtime Environment (build 1.7.0-b147)<br>

        Java HotSpot(TM) 64-Bit Server VM (build

        24.0-b18-internal-fastdebug,

        mixed mode)<br>

        allocation stats: 21682 mallocs (29MB), 1158 frees (0MB), 4MB

        resrc</tt><br>

      <br>

      The memory is still allocated - just not from C heap. With these

      changes, G1's C heap consumption is now approximately 2Mb for each

      worker thread (which are the work queues themselves):<br>

      <br>

      C heap consumption (Mb) / # of GC threads<br>

      <br>

      <table border="1" cellpadding="2" cellspacing="2" width="50%">

        <tbody>

          <tr>

            <td valign="top"><b>Collector / # GC Threads<br>

              </b> </td>

            <td valign="top"><b>1<br>

              </b> </td>

            <td valign="top"><b>2<br>

              </b> </td>

            <td valign="top"><b>3<br>

              </b> </td>

            <td valign="top"><b>4<br>

              </b> </td>

            <td valign="top"><b>5<br>

              </b> </td>

          </tr>

          <tr>

            <td valign="top">ParallelGC w/o ParallelOldGC<br>

            </td>

            <td valign="top">3<br>

            </td>

            <td valign="top">4<br>

            </td>

            <td valign="top">5<br>

            </td>

            <td valign="top">6<br>

            </td>

            <td valign="top">7<br>

            </td>

          </tr>

          <tr>

            <td valign="top">ParallelGC w/ ParallelOldGC<br>

            </td>

            <td valign="top">7<br>

            </td>

            <td valign="top">11<br>

            </td>

            <td valign="top">14<br>

            </td>

            <td valign="top">17<br>

            </td>

            <td valign="top">20<br>

            </td>

          </tr>

          <tr>

            <td valign="top">G1 before changes<br>

            </td>

            <td valign="top">149<br>

            </td>

            <td valign="top">156<br>

            </td>

            <td valign="top">163<br>

            </td>

            <td valign="top">170<br>

            </td>

            <td valign="top">177<br>

            </td>

          </tr>

          <tr>

            <td valign="top">G1 after changes<br>

            </td>

            <td valign="top">11<br>

            </td>

            <td valign="top">13<br>

            </td>

            <td valign="top">15<br>

            </td>

            <td valign="top">17<br>

            </td>

            <td valign="top">19<br>

            </td>

          </tr>

        </tbody>

      </table>
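      <br>
      As a rough cross-check of the per-thread figures, the sketch below
      derives the average growth per additional GC thread from the rows
      of the table. The function name and array names are made up for
      this example, and integer division truncates, so the results are
      only approximate.<br>

```cpp
// Rows copied from the table above: C heap (Mb) for 1..5 GC threads.
const int kColumns = 5;
const int kParallelNoOld[kColumns]   = {3, 4, 5, 6, 7};
const int kParallelWithOld[kColumns] = {7, 11, 14, 17, 20};
const int kG1Before[kColumns]        = {149, 156, 163, 170, 177};
const int kG1After[kColumns]         = {11, 13, 15, 17, 19};

// Average growth in C heap (Mb) per additional GC thread across one row.
// Uses integer division, so fractional megabytes are dropped.
inline int avg_per_thread_mb(const int* row) {
  return (row[kColumns - 1] - row[0]) / (kColumns - 1);
}
```

      With these rows the helper gives 7 for G1 before the changes and 2
      after, in line with the per-worker numbers quoted in the text.<br>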

      <br>

      We shall also investigate reducing the work queue sizes, to

      further

      reduce the amount of C heap consumed,  for some or all of the

      collectors - but in a separate CR.<br>

      <br>

      Testing:<br>

      GC test suite on x64 and sparc T-series with a low marking

      threshold

      and marking verification enabled; jprt. Our reference workload was

      used

      to verify that there was no significant performance difference.<br>

      <br>

      Thanks,<br>

      <br>

      JohnC<br>

    </blockquote>

    <br>

  </body>

</html>