VM deadlock between VM shutdown and G1

Thu Sep 13 12:00:16 UTC 2018

Hi Kris,

Okay got it now.

So basically:

Universe::heap()->stop();

has to be written so that it can safely be called even while a
GC-related VMop is executing.

Cheers,
David

On 13/09/2018 8:14 PM, Krystal Mok wrote:
> Hi David,
> 
> Comments inline:
> 
> On Thu, Sep 13, 2018 at 2:51 AM, David Holmes <david.holmes at oracle.com> wrote:
> 
>     Hi Kris,
> 
>     I didn't quite follow the analysis (see below)
> 
>     On 13/09/2018 6:57 PM, Krystal Mok wrote:
> 
>         1. A Java application thread at an allocation site triggering a G1
>         incremental collection
>         2. A thread that called System.exit(), initiating the VM
>         shutdown sequence.
>         It's in VM's native code so it doesn't block a safepoint.
> 
> 
>     VM code is not "native" in the sense of being safepoint-safe. If
>     it's still in the System.c code trying to call the VM then it is
>     native but as soon as it tries to enter the VM it will block if a
>     safepoint is in progress. In addition the exit requires that the VM
>     go to a safepoint before terminating.
> 
> The time window was extremely narrow, but it did happen in practice.
> The Java application thread called System.exit() -> JVM_Halt(). At that
> point the safepoint was probably not yet active, so the thread went past
> the safepoint check upon entry and got into the _thread_in_vm state. When
> it then tried to post the VM death event, it transitioned to the
> _thread_in_native state (through JvmtiJavaThreadEventTransition), which
> is safepoint-safe. Probably at around this time the safepoint
> synchronization started and then stopped the world.
> 
> There's nothing really special about what cbVMDeath does otherwise. This
> process has the JDWP agent enabled, and that's part of the story of how
> this thread got into the _thread_in_native state.
> 
>         3. The VM thread, already inside a safepoint and running G1's
>         incremental collection.
>         (4. "The world" is at a safepoint, so all other Java threads are just
>         waiting.)
> 
>         The problem is, Thread 2 has already run half way into
>         before_exit(), and
> 
> 
>     The problem seems to be an event callback, cbVMDeath, which seems to
>     have taken the thread from _thread_in_vm (which is not a
>     safepoint-safe state) to presumably _thread_in_native, which is
>     safepoint-safe. The callback then blocks on a RawMonitorWait for
>     something and that would seem to be where the problem arises. What
>     is the callback trying to do?
> 
> 
> The cbVMDeath callback is just waiting for other active callbacks to 
> check in (line 1273):
> 
> jdk/src/share/back/eventHandler.c
> 
> 1267     debugMonitorEnter(callbackBlock); {
> 1268         debugMonitorEnter(callbackLock); {
> 1269             vm_death_callback_active = JNI_TRUE;
> 1270             (void)threadControl_resumeAll();
> 1271             while (active_callbacks > 0) {
> 1272                 /* wait for active CALLBACKs to check in (and block) */
> 1273                 debugMonitorWait(callbackLock);
> 1274             }
> 1275         } debugMonitorExit(callbackLock);
> 
> The real deadlock in this case should still be that the VM has reached
> a safepoint at this point, and G1 is waiting for the concurrent marker
> to check in, yet the marker is already gone.
> 
>     Cheers,
>     David
> 
> 
> Thanks,
> Kris
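
For illustration only, the hang described in this thread (a pause waiting
for a concurrent worker to check in after shutdown has already stopped that
worker) can be modelled in a few lines of standalone C++. This is a minimal
sketch and not HotSpot code: ConcurrentWorkerModel, check_in(), stop() and
wait_for_check_in() are hypothetical names, and the assumption is simply
that the waiter must treat "worker terminated" as a valid wake-up condition
alongside "worker checked in".

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical model of a concurrent GC worker that a pausing thread waits on.
// None of these names correspond to real HotSpot classes or functions.
class ConcurrentWorkerModel {
 public:
  // Called by the worker when it reaches its yield point.
  void check_in() {
    std::lock_guard<std::mutex> guard(lock_);
    checked_in_ = true;
    cv_.notify_all();
  }

  // Called by the shutdown path; models stopping the worker for good.
  void stop() {
    std::lock_guard<std::mutex> guard(lock_);
    terminated_ = true;
    cv_.notify_all();
  }

  // Called by the pausing thread. Returns true if the worker checked in,
  // false if the worker had terminated and never checked in.
  // Waiting on checked_in_ alone would hang forever once stop() has run.
  bool wait_for_check_in() {
    std::unique_lock<std::mutex> guard(lock_);
    cv_.wait(guard, [this] { return checked_in_ || terminated_; });
    return checked_in_;
  }

 private:
  std::mutex lock_;
  std::condition_variable cv_;
  bool checked_in_ = false;
  bool terminated_ = false;
};

// Scenario from the thread: shutdown stops the worker first, then the pause
// starts waiting. The wait must return instead of blocking forever.
bool pause_after_shutdown_returns() {
  ConcurrentWorkerModel worker;
  std::thread shutdown([&worker] { worker.stop(); });
  shutdown.join();
  bool checked_in = worker.wait_for_check_in();
  return !checked_in;
}

// Normal scenario: the worker is alive and checks in while the pause waits.
bool pause_with_live_worker_sees_check_in() {
  ConcurrentWorkerModel worker;
  std::thread marker([&worker] { worker.check_in(); });
  bool checked_in = worker.wait_for_check_in();
  marker.join();
  return checked_in;
}
```

Both helper functions return true when the wait completes as expected. The
design point is only that the wait predicate and the stop path share one
lock and one condition, so a stop that happens before or during the wait
cannot be missed.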