RFR: JDK-8133706: Kitchensink hanged

Sat Sep 19 11:19:42 UTC 2015

Hi everyone,

Could I have a couple of reviews for this change?

http://cr.openjdk.java.net/~brutisso/8133706/webrev.00/
https://bugs.openjdk.java.net/browse/JDK-8133706

The bug report contains some good analysis from Eric Caspole, David 
Holmes and Yumin Qi.

Thanks for this valuable investigation! Basically the problem is that 
the heap regions that can be reclaimed by the cleanup phase are not made 
available until they have been cleaned up in a concurrent phase. If a GC 
happens while we are doing the concurrent cleaning up of the free 
regions, the GC will wait for the concurrent cleaning to finish, either 
by calling new_region_try_secondary_free_list() or 
wait_while_free_regions_coming(). But the logging that happens before we 
start the concurrent cleaning now joins the STS (the suspendible thread 
set), so the cleaning is blocked by the GC safepoint while the GC waits 
for the cleaning to finish. So, we have a deadlock.

This was actually documented in the ConcurrentMarkThread::run() method, 
but I missed that when I added the cm_log() calls.

         // Notify anyone who's waiting that there are no more free
         // regions coming. We have to do this before we join the STS
         // (in fact, we should not attempt to join the STS in the
         // interval between finishing the cleanup pause and clearing
         // the free_regions_coming flag) otherwise we might deadlock:
         // a GC worker could be blocked waiting for the notification
         // whereas this thread will be blocked for the pause to finish
         // while it's trying to join the STS, which is conditional on
         // the GC workers finishing.
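
To make the ordering constraint in that comment concrete, here is a 
minimal toy model in standard C++. It is not HotSpot code: ToyHeap, 
gc_pause(), concurrent_cleanup() and run_safe_ordering() are 
hypothetical names, and a plain mutex and condition variable stand in 
for the real STS and the free-list monitor. It only shows the safe 
ordering, where the flag is cleared and waiters are notified before the 
thread does anything that can block on the pause.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Toy model only; every name here is hypothetical, not a HotSpot API.
struct ToyHeap {
    std::mutex lock;
    std::condition_variable cv;
    bool free_regions_coming = true;  // as if set by the cleanup pause
    bool safepoint_active = false;    // true while a GC pause is running
};

// GC pause: runs "at a safepoint" and waits for the free regions.
void gc_pause(ToyHeap& h) {
    std::unique_lock<std::mutex> g(h.lock);
    h.safepoint_active = true;
    h.cv.wait(g, [&] { return !h.free_regions_coming; });
    assert(!h.free_regions_coming);
    h.safepoint_active = false;
    g.unlock();
    h.cv.notify_all();
}

// Concurrent thread: clear the flag and notify FIRST, and only then
// do the step that can block on a pause (the stand-in for joining the STS).
void concurrent_cleanup(ToyHeap& h) {
    {
        std::lock_guard<std::mutex> g(h.lock);
        h.free_regions_coming = false;
    }
    h.cv.notify_all();

    std::unique_lock<std::mutex> g(h.lock);
    h.cv.wait(g, [&] { return !h.safepoint_active; });
}

// Runs both threads once; returns true when both finished cleanly.
bool run_safe_ordering() {
    ToyHeap h;
    std::thread gc(gc_pause, std::ref(h));
    std::thread cm(concurrent_cleanup, std::ref(h));
    gc.join();
    cm.join();
    return !h.free_regions_coming && !h.safepoint_active;
}
```

In this model, swapping the two steps in concurrent_cleanup() (waiting 
for the pause first, clearing the flag afterwards) hangs whenever the 
pause starts first, which is the shape of the deadlock described in the 
bug report.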

The simplest fix that I could come up with was to move the logging from 
the ConcurrentMarkThread::run() method to the very end of the stopped 
part of the cleanup phase. This ensures that we don’t mix the log output 
with any logging that a GC does, but it does not require joining the STS 
since we are already at a safepoint.

I left the timing (logged as part of the “GC concurrent-cleanup-end” 
entry) unchanged. This means that there could be a slight mismatch 
between the timestamps for “concurrent-cleanup-start” and 
“concurrent-cleanup-end” and the time logged by 
“concurrent-cleanup-end”. I hope the simplicity of the change outweighs 
the disadvantage of this mismatch.

Thanks,
Bengt