RFR: JDK-8133706: Kitchensink hanged
Bengt Rutisson
bengt.rutisson at oracle.com
Sat Sep 19 11:19:42 UTC 2015
Hi everyone,
Could I have a couple of reviews for this change?
http://cr.openjdk.java.net/~brutisso/8133706/webrev.00/
https://bugs.openjdk.java.net/browse/JDK-8133706
The bug report contains some good analysis from Eric Caspole, David
Holmes and Yumin Qi.
Thanks for this valuable investigation! Basically the problem is that
the heap regions that can be reclaimed by the cleanup phase are not made
available until they have been cleaned up in a concurrent phase. If a GC
happens while we are doing the concurrent cleaning up of the free
regions, the GC will wait for the concurrent cleaning to finish, either
by calling new_region_try_secondary_free_list() or
wait_while_free_regions_coming(). But since the logging before we start
the concurrent cleaning is now joining the STS, that cleaning is getting
blocked by the GC safepoint. So, we have a deadlock.
This was actually documented in the ConcurrentMarkThread::run() method,
but I missed that when I added the cm_log() calls.
// Notify anyone who's waiting that there are no more free
// regions coming. We have to do this before we join the STS
// (in fact, we should not attempt to join the STS in the
// interval between finishing the cleanup pause and clearing
// the free_regions_coming flag) otherwise we might deadlock:
// a GC worker could be blocked waiting for the notification
// whereas this thread will be blocked for the pause to finish
// while it's trying to join the STS, which is conditional on
// the GC workers finishing.
The simplest fix that I could come up with was to move the logging from
the ConcurrentMarkThread::run() method into the very end of the
stopped part of the cleanup phase. This ensures that we don’t mix the
log output with any logging that a GC does, but it does not require
joining the STS since we are already at a safepoint.
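To illustrate the ordering this relies on, here is a minimal toy model in standard C++. It is not HotSpot code: ToyHeap, cleanup_pause(), concurrent_cleanup() and run_scenario() are hypothetical names, and a plain mutex and condition variable stand in for the real locking. It only sketches the idea that the logging happens inside the pause, and that the concurrent part clears the free_regions_coming flag and notifies waiters before it does anything that could block on a safepoint.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Toy model only; every name here is hypothetical, not the HotSpot API.
struct ToyHeap {
    std::mutex lock;
    std::condition_variable cv;
    bool free_regions_coming = false;
    std::vector<std::string> events;  // guarded by lock
};

// Stopped part of the cleanup phase. In the model we are "at a safepoint",
// so logging here needs no STS join and cannot interleave with GC logging.
void cleanup_pause(ToyHeap& h) {
    std::lock_guard<std::mutex> g(h.lock);
    h.free_regions_coming = true;
    h.events.push_back("log: concurrent-cleanup-start");
}

// Concurrent part: free the regions, clear the flag, notify waiters.
// Nothing in this interval may block on a safepoint.
void concurrent_cleanup(ToyHeap& h) {
    {
        std::lock_guard<std::mutex> g(h.lock);
        h.events.push_back("regions freed");
        h.free_regions_coming = false;
    }
    h.cv.notify_all();
    // Only after this point would it be safe to join the STS.
}

// A GC worker that needs the freed regions waits for the flag to clear.
void wait_while_free_regions_coming(ToyHeap& h) {
    std::unique_lock<std::mutex> g(h.lock);
    h.cv.wait(g, [&h] { return !h.free_regions_coming; });
    assert(!h.free_regions_coming);
    h.events.push_back("gc proceeds");
}

// Runs the pause, then a GC worker and the concurrent cleanup in parallel.
std::vector<std::string> run_scenario() {
    ToyHeap h;
    cleanup_pause(h);
    std::thread gc(wait_while_free_regions_coming, std::ref(h));
    std::thread cm(concurrent_cleanup, std::ref(h));
    gc.join();
    cm.join();
    return h.events;
}
```

In this model the GC worker always makes progress, because the thread that clears the flag never waits on anything the worker holds. If concurrent_cleanup() first had to wait for the GC to finish (the analogue of joining the STS for logging), neither thread could proceed.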
I left the timing (logged as part of the “GC concurrent-cleanup-end”
entry) unchanged. This means that there could be a slight mismatch
between the timestamps for “concurrent-cleanup-start” and
“concurrent-cleanup-end” and the time logged by
“concurrent-cleanup-end”. I hope the simplicity of the change outweighs
the disadvantage of this mismatch.
Thanks,
Bengt