RFR: JDK-8133706: Kitchensink hanged

Mon Sep 21 18:08:34 UTC 2015

Bengt,

It's correct that if the else is taken

  178       } else {
  179         // We don't want to update the marking status if a GC pause
  180         // is already underway.
  181         SuspendibleThreadSetJoiner sts_join;
  182         g1h->collector_state()->set_mark_in_progress(false);
  183       }

then free_regions_coming() will always return false?
If yes, then the change looks good.

Also if you choose to move the block
of code (lines 185-214)

  185       // Check if cleanup set the free_regions_coming flag. If it
  186       // hasn't, we can just skip the next step.
  187       if (g1h->free_regions_coming()) {
...

  214       }

after

  177         VMThread::execute(&op);

I'll gladly code review the move of that code (but it is not
necessary).

Jon

On 09/19/2015 04:19 AM, Bengt Rutisson wrote:
>
> Hi everyone,
>
> Could I have a couple of reviews for this change?
>
> http://cr.openjdk.java.net/~brutisso/8133706/webrev.00/
> https://bugs.openjdk.java.net/browse/JDK-8133706
>
> The bug report contains some good analysis from Eric Caspole, David 
> Holmes and Yumin Qi.
>
> Thanks for this valuable investigation! Basically the problem is that 
> the heap regions that can be reclaimed by the cleanup phase are not 
> made available until they have been cleaned up in a concurrent phase. 
> If a GC happens while we are doing the concurrent cleaning up of the 
> free regions, the GC will wait for the concurrent cleaning to finish, 
> either by calling new_region_try_secondary_free_list() or 
> wait_while_free_regions_coming(). But since the logging before we 
> start the concurrent cleaning is now joining the STS, that cleaning is 
> getting blocked by the GC safepoint. So, we have a deadlock.
>
> This was actually documented in the ConcurrentMarkThread::run() 
> method, but I missed that when I added the cm_log() calls.
>
>         // Notify anyone who's waiting that there are no more free
>         // regions coming. We have to do this before we join the STS
>         // (in fact, we should not attempt to join the STS in the
>         // interval between finishing the cleanup pause and clearing
>         // the free_regions_coming flag) otherwise we might deadlock:
>         // a GC worker could be blocked waiting for the notification
>         // whereas this thread will be blocked for the pause to finish
>         // while it's trying to join the STS, which is conditional on
>         // the GC workers finishing.
>
>
> The simplest fix that I could come up with was to move the logging 
> from the  ConcurrentMarkThread::run() method in to the very end of the 
> stopped part of the cleanup phase. This ensures that we don’t mix the 
> log output with any logging that a GC does but id does not require 
> joining the STS since we are already at a safepoint.
>
> I left the timing (logged as part of the “GC concurrent-cleanup-end” 
> entry) unchanged. This means that there could be a slight mismatch 
> between the timestamps for “concurrent-cleanup-start” and 
> “concurrent-cleanup-end” and the time logged by 
> “concurrent-cleanup-end”. I hope the simplicity of the change 
> outweighs the disadvantage of this mismatch.
>
> Thanks,
> Bengt