RFR: 8273608: Deadlock when jcmd of OnError attaches to itself [v5]

Wed Oct 6 13:28:10 UTC 2021

On Wed, 6 Oct 2021 06:30:36 GMT, Xin Liu <xliu at openjdk.org> wrote:

>> This patch allows the custom commands of OnError to attach to HotSpot itself. 
>> It sets the thread of report_and_die() to Native before os::fork_and_exec(cmd). 
>> This prevents cmds which require safepoint synchronization from deadlock.
>> eg. OnError='jcmd %p Thread.print'.
>> 
>> Without this patch, we will encounter a deadlock at safepoint synchronization. 
>> `"main" #1`  is the very thread which executes `os::fork_and_exec(cmd)`.  
>> 
>> 
>> Aborting due to java.lang.OutOfMemoryError: Java heap space
>> #
>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> #  Internal Error (debug.cpp:364), pid=94632, tid=94633
>> #  fatal error: OutOfMemory encountered: Java heap space
>> #
>> # JRE version: OpenJDK Runtime Environment (18.0) (build 18-internal+0-adhoc.xxinliu.jdk)
>> # Java VM: OpenJDK 64-Bit Server VM (18-internal+0-adhoc.xxinliu.jdk, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
>> # No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
>> #
>> # An error report file with more information is saved as:
>> # /local/home/xxinliu/JDK-2085/hs_err_pid94632.log
>> #
>> # -XX:OnError="jcmd %p Thread.print"
>> #   Executing /bin/sh -c "jcmd 94632 Thread.print" ...
>> 94632:
>> [10.616s][warning][safepoint]
>> [10.616s][warning][safepoint] # SafepointSynchronize::begin: Timeout detected:
>> [10.616s][warning][safepoint] # SafepointSynchronize::begin: Timed out while spinning to reach a safepoint.
>> [10.616s][warning][safepoint] # SafepointSynchronize::begin: Threads which did not reach the safepoint:
>> [10.616s][warning][safepoint] # "main" #1 prio=5 os_prio=0 cpu=236.97ms elapsed=10.61s tid=0x00007f01b00232f0 nid=94633 runnable  [0x00007f01b7a08000]
>> [10.616s][warning][safepoint]    java.lang.Thread.State: RUNNABLE
>> [10.616s][warning][safepoint]
>> [10.616s][warning][safepoint] # SafepointSynchronize::begin: (End of list)
>
> Xin Liu has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Update per reviewers' feedbacks.
>   
>   1. rename the test to TestOnErrorWithSelfAttachingJCmd.java.
>   2. rename java_current_transition_into_native() and add the static qualifier.
>   3. make unlock_locks_on_error more general. ensure formal name is consistent
>   with argument name.

> _Mailing list message from [David Holmes](mailto:david.holmes at oracle.com) on [hotspot-dev](mailto:hotspot-dev at mail.openjdk.java.net):_
> 
> On 6/10/2021 4:42 pm, Thomas Stuefe wrote:
> 
> > On Wed, 6 Oct 2021 00:37:41 GMT, Xin Liu <xliu at openjdk.org> wrote:
> > > I don't like workaround. I intend to do thing right. In my understanding, fork_and_exec() is going to step into Native, isn't it?
> > 
> > 
> > The problem I have with this is that force-unlocking all mutexes changes the state and possibly destroys evidence in core files.
> > For instance, let's say you have a crash in a thread owning a piece of memory, and unlocking all mutexes unlocks the mutex guarding this memory, lets other threads access and change the memory. The core file will reflect this. And this will be almost impossible to prove too.
> > I really would prefer to limit this workaround to the specific use case where it is needed.
> 
> I admit that the more I think about this the more I think unlocking the mutexes is the wrong thing to do. It introduces non-determinism to the crash and as Thomas says the state can change radically by the time the core dump is taken - thus defeating the whole point of the core dump being useful to analyse the crash.
> 
> That said I don't see how this can be limited to the specific usecase of "jcmd %p" ??
> 
> I think a different tack is needed here. We want the thread calling fork_and_exec to be seen as safepoint-safe, so perhaps the simplest hack is to switch to _thread_blocked (in a safe manner).
> 
> David -----

A simple compromise would have been to just limit the workaround to OOM scenarios. It's a one-line change in VMError and also would reduce scenarios we need to maintain. And it makes sense, since with OOMs one usually uses heap dumps or thread dumps, which you need jcmd for (*). In crash scenarios OTOH one usually needs cores. In crash scenarios, it makes rarely sense to self-attach with jcmd because god knows what would happen.

Further down the line, a better improvement may be to not spawn jcmd at all but to provide specialized support for running jcmd commands on OOM inside the VM, without the help of an intermediate child process. Sort of like `-XX:CommandOnOOM=VM.info; GC.heap_dump my_dump; Thread.print"` . 

(*) just occurred to me that we have already HeapDumpOnOutOfMemoryError, so attaching jcmd to the VM to do a heap dump is not really needed anyway.

-------------

PR: https://git.openjdk.java.net/jdk/pull/5590