seem "SafepointTimeout" doesn't work well if sync takes too long ?

Fri Apr 26 21:10:28 PDT 2013

Hi runtime-dev,

we hit a long STW issue on our hbase product cluster this morning. 

$ java -version
java version "1.6.0_37"
Java(TM) SE Runtime Environment (build 1.6.0_37-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.12-b01, mixed mode)

per gc log :

013-04-27T01:53:20.573+0800: 908762.747: [GC 908762.747: [ParNew
Desired survivor size 286313672 bytes, new threshold 15 (max 15)
- age   1:    8829448 bytes,    8829448 total
- age   2:    6856168 bytes,   15685616 total
- age   3:    4441968 bytes,   20127584 total
- age   4:    6649776 bytes,   26777360 total
- age   5:    6648056 bytes,   33425416 total
- age   6:   10841760 bytes,   44267176 total
- age   7:    4549016 bytes,   48816192 total
- age   8:    8867144 bytes,   57683336 total
- age   9:   10832696 bytes,   68516032 total
- age  10:   17123880 bytes,   85639912 total
- age  11:    8733744 bytes,   94373656 total
- age  12:    8737944 bytes,  103111600 total
- age  13:   15025816 bytes,  118137416 total
- age  14:    8706400 bytes,  126843816 total
- age  15:    4552584 bytes,  131396400 total
: 470209K->130073K(699072K), 0.1957920 secs] 1037249K->701557K(10136256K), 0.1961640 secs] [Times: user=0.22 sys=0.00, real=0.20 secs]
Heap after GC invocations=667 (full 0):
 par new generation   total 699072K, used 130073K [0x0000000560000000, 0x00000005a0000000, 0x00000005a0000000)
  eden space 349568K,   0% used [0x0000000560000000, 0x0000000560000000, 0x0000000575560000)
  from space 349504K,  37% used [0x000000058aab0000, 0x00000005929b66e0, 0x00000005a0000000)
  to   space 349504K,   0% used [0x0000000575560000, 0x0000000575560000, 0x000000058aab0000)
 concurrent mark-sweep generation total 9437184K, used 571484K [0x00000005a0000000, 0x00000007e0000000, 0x00000007e0000000)
 concurrent-mark-sweep perm gen total 524288K, used 39683K [0x00000007e0000000, 0x0000000800000000, 0x0000000800000000)
}
Total time for which application threads were stopped: 148.6415470 seconds

Fortunately, we enabled "-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1" options before, 

         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
908614.312: GenCollectForAllocation          [     328          1              2    ]      [     0     0148444     0   196    ]  0

seems we took too long for "sync" op (could any guys shed more lights about sync?),  then we found "SafepointTimeout" production option, but per my current
understanding it doesn't play on my scenario(sync took too long), just works well on spin&block op, right?  

Here is the relatived code snippet:

      if (SafepointTimeout && safepoint_limit_time < os::javaTimeNanos()) {
        print_safepoint_timeout(_spinning_timeout);
      }
      ...
      if (remaining_time < 0 || Safepoint_lock->wait(true, remaining_time / MICROUNITS)) {
        print_safepoint_timeout(_blocking_timeout);
      }
      ...
      if (PrintSafepointStatistics) {
        update_statistics_on_sync_end(os::javaTimeNanos());
      }

To me seems in PRODUCT version it doesn't invoke any "print_safepoint_timeout" methods on "sync" stage  .

How could i diagnoise such long sync issue, any pointers will be highly welcome!

Thanks,
Liang