RFC: call report_java_out_of_memory_error() for -XX:AbortVMOnException=java.lang.OutOfMemoryError

Thu Sep 9 10:00:17 UTC 2021

hi, Volker and David,

Thanks for the comments. I tried the idea and it works, at least with
'-XX:AbortVMOnException=java.lang.OutOfMemoryError'. I can use
-XX:OnError='jcmd %p' to dump threads or the heap, and it can be done
without a regression:
https://github.com/navyxliu/jdk/actions/runs/1216082580
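
For anyone who wants to try the flag combination from a test harness, here
is a minimal sketch of how the command line could be assembled in Java. The
class and method names are hypothetical, and OomDumpExample is just the test
program from my earlier mail. I am also assuming that a product build needs
-XX:+UnlockDiagnosticVMOptions for AbortVMOnException; my runs used a
slowdebug build, so please verify that on your build.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the JDK: it only assembles the arguments.
public class OnErrorLaunchSketch {

    static List<String> buildCommand(String javaBinary, String mainClass,
                                     String jcmdSubcommand, boolean unlockDiagnostic) {
        List<String> cmd = new ArrayList<>();
        cmd.add(javaBinary);
        if (unlockDiagnostic) {
            // Assumption: needed on product builds, not on debug builds.
            cmd.add("-XX:+UnlockDiagnosticVMOptions");
        }
        cmd.add("-XX:AbortVMOnException=java.lang.OutOfMemoryError");
        // HotSpot expands %p to the pid of the dying JVM before running the command.
        cmd.add("-XX:OnError=jcmd %p " + jcmdSubcommand);
        cmd.add(mainClass);
        return cmd;
    }

    public static void main(String[] args) {
        // Print the command instead of running it; pass the list to
        // ProcessBuilder to actually launch the child JVM.
        System.out.println(String.join(" ",
                buildCommand("java", "OomDumpExample", "Thread.print", false)));
    }
}
```

Passing the list to ProcessBuilder avoids shell quoting problems, because the
whole OnError value, spaces included, is a single argument.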

I found that I am not the only person trying to do this. JFR does a
similar thing in an even harder situation:
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/jfr/recorder/repository/jfrEmergencyDump.cpp#L543

A core dump works, but it produces a very large file that is hard to
transfer and may contain sensitive data. Furthermore, not all Java
developers know how to analyze a core dump, and doing so requires both
the executable and its debuginfo.

fatal() doesn't do anything disruptive, so all other threads should be in
good shape when it executes the OnError callbacks. I think we can use
OnError='jcmd %p' as a complementary approach. Today it simply deadlocks,
so we have nothing to lose.
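
To make the cycle concrete, here is a toy model in plain Java. It is only an
analogy under my own assumptions, not HotSpot code: the class name, the
latches and the timeouts are all made up. One thread stands for the VM
operation that serves jcmd and needs the "main" thread to be safepoint-safe,
and the caller stands for the main thread waiting for its child in the
OnError handler.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Toy model only: latches stand in for the safepoint and for jcmd finishing.
public class OnErrorDeadlockModel {

    /**
     * Returns true if the simulated jcmd request is served before the timeout.
     * mainIsSafepointSafe models whether the thread blocked in the OnError
     * handler counts as safepoint-safe while it waits for its child.
     */
    static boolean jcmdCompletes(boolean mainIsSafepointSafe, long timeoutMillis) {
        CountDownLatch mainAtSafepoint = new CountDownLatch(1);
        CountDownLatch jcmdServed = new CountDownLatch(1);

        // Stands for the VM operation behind jcmd: it can answer only after
        // every Java thread, including "main", has reached the safepoint.
        Thread vmOperation = new Thread(() -> {
            try {
                if (mainAtSafepoint.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                    jcmdServed.countDown();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        vmOperation.start();

        // "main" is about to block waiting for the child. If it is marked
        // safepoint-safe first, the VM operation can proceed without it.
        if (mainIsSafepointSafe) {
            mainAtSafepoint.countDown();
        }
        try {
            // Stands for waiting on the forked jcmd process.
            boolean served = jcmdServed.await(2 * timeoutMillis, TimeUnit.MILLISECONDS);
            vmOperation.join();
            return served;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("main stays in VM state: served=" + jcmdCompletes(false, 200));
        System.out.println("main is safepoint-safe: served=" + jcmdCompletes(true, 200));
    }
}
```

The second case corresponds to David's suggestion of moving the thread to
_thread_in_native before it runs the OnError command. The model ignores the
constraints he mentioned, such as held locks or an error in the VMThread.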

thanks,
--lx

On 9/8/21 3:08 PM, David Holmes wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
> 
> 
> 
> On 9/09/2021 4:48 am, Liu, Xin wrote:
>> hi, Volker,
>>
>> I think it's possible to allow OnError=jcmd to attach to the parent
>> process. HotSpot defines OnError as "Run user-defined commands on fatal
>> error; see VMError.cpp for examples". It's a callback for fatal errors.
>> fatal() means HotSpot starts aborting and does not notify other threads.
>> In other words, the other threads are in normal states. I observed this
>> in gdb: all other threads successfully enter a safepoint-safe state,
>> except the main Java thread.
>>
>> I am not advocating using jcmd %p in OnError. I am exploring a
>> possibility. Currently, we end up with a deadlock if you do so. If we
>> made it work, we would at least get more information from HotSpot. We
>> would fail fast if something bad happens. I think fail-fast is still better
>> than a hanging process.
> 
> It may be possible to fix the safepoint deadlock by transitioning the
> thread executing the OnError command to _thread_in_native beforehand -
> but there are constraints on doing that e.g. no locks can be held. And
> if the error is processed in the VMThread then there is nothing that can
> be done.
> 
> But I agree with Volker that given a fatal error has been encountered,
> trying to report other information about the VM in a live manner is
> fraught with peril. Using a core dump for post mortem analysis would
> probably be better.
> 
> Cheers,
> David
> 
>> thanks,
>> --lx
>>
>>
>>
>>
>> On 9/8/21 10:35 AM, Volker Simonis wrote:
>>> I'm not sure if running a jcmd process which attaches to the dying VM
>>> as part of the OnError scripts is a use case we really want to
>>> support?
>>>
>>> There's a reason why the VM is crashing and attaching to this dying VM
>>> will most probably only cause other follow-up errors.
>>>
>>> On Wed, Sep 8, 2021 at 7:22 PM Liu, Xin <xxinliu at amazon.com> wrote:
>>>>
>>>> Hi, David,
>>>>
>>>> Thanks for the heads-up. Yes, it works for me.
>>>>
>>>> There's one more thing. One drawback is that the script provided to
>>>> OnError can't attach back to HotSpot itself, or we end up with a deadlock.
>>>>
>>>>
>>>> If we use 'jcmd %p Thread.print' or 'jcmd %p GC.heap_dump <file>' in
>>>> OnError= (%p means the Java process itself), the main Java thread, which
>>>> is waiting in os::fork_and_exec(cmd), prevents HotSpot from reaching a
>>>> safepoint. It's a deadlock because without a safepoint the jcmd request
>>>> can't be served, so fork_and_exec can't complete.
>>>>
>>>> eg.
>>>> $java -Xmx50m -XX:AbortVMOnException=java.lang.OutOfMemoryError
>>>> -XX:OnError='jcmd %p Thread.print' -XX:+SafepointTimeout OomDumpExample
>>>> direct
>>>> # To suppress the following error report, specify this argument
>>>> # after -XX: or in .hotspotrc:  SuppressErrorAt=/exceptions.cpp:541
>>>> #
>>>> # A fatal error has been detected by the Java Runtime Environment:
>>>> #
>>>> #  Internal Error
>>>> (/home/xxinliu/Devel/jdk/src/hotspot/share/utilities/exceptions.cpp:541), pid=107552,
>>>> tid=107553
>>>> #  fatal error: Saw java.lang.OutOfMemoryError, aborting
>>>> #
>>>> # JRE version: OpenJDK Runtime Environment (18.0) (slowdebug build
>>>> 18-internal+0-adhoc.xxinliu.jdk)
>>>> # Java VM: OpenJDK 64-Bit Server VM (slowdebug
>>>> 18-internal+0-adhoc.xxinliu.jdk, mixed mode, tiered, compressed oops,
>>>> compressed class ptrs, g1 gc, linux-amd64)
>>>> # Problematic frame:
>>>> # V  [libjvm.so+0x924e8c]  Exceptions::debug_check_abort(char const*,
>>>> char const*)+0x8a
>>>> #
>>>> # No core dump will be written. Core dumps have been disabled. To enable
>>>> core dumping, try "ulimit -c unlimited" before starting Java again
>>>> #
>>>> # An error report file with more information is saved as:
>>>> # /local/home/xxinliu/JDK-2085/hs_err_pid107552.log
>>>> #
>>>> # If you would like to submit a bug report, please visit:
>>>> #   https://bugreport.java.com/bugreport/crash.jsp
>>>> #
>>>> #
>>>> # -XX:OnError="jcmd %p Thread.print"
>>>> #   Executing /bin/sh -c "jcmd 107552 Thread.print" ...
>>>> 107552:
>>>> [13.045s][warning][safepoint]
>>>> [13.045s][warning][safepoint] # SafepointSynchronize::begin: Timeout
>>>> detected:
>>>> [13.045s][warning][safepoint] # SafepointSynchronize::begin: Timed out
>>>> while spinning to reach a safepoint.
>>>> [13.045s][warning][safepoint] # SafepointSynchronize::begin: Threads
>>>> which did not reach the safepoint:
>>>> [13.045s][warning][safepoint] # "main" #1 prio=5 os_prio=0 cpu=1552.12ms
>>>> elapsed=13.04s tid=0x00007f43600278e0 nid=107553 runnable
>>>> [0x00007f4369d9f000]
>>>> [13.045s][warning][safepoint]    java.lang.Thread.State: RUNNABLE
>>>> [13.045s][warning][safepoint] Thread: 0x00007f43600278e0  [0x1a421]
>>>> State: _running _at_poll_safepoint 0
>>>> [13.045s][warning][safepoint]    JavaThread state: _thread_in_vm
>>>> [13.045s][warning][safepoint]
>>>> [13.045s][warning][safepoint] # SafepointSynchronize::begin: (End of list)
>>>>
>>>>
>>>> I haven't figured out how yet, but I think I can lift this constraint.
>>>> Once I do, OnError would have more freedom to dump threads or the heap
>>>> before dying. Can I file a bug about this?
>>>>
>>>> thanks,
>>>> --lx
>>>>
>>>>
>>>> On 8/30/21 9:26 PM, David Holmes wrote:
>>>>> Hi,
>>>>>
>>>>> On 28/08/2021 4:54 am, Liu, Xin wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Recently I revisited JDK-8155004/JDK-8257790 because a new team tripped
>>>>>> over it. -XX:AbortVMOnException=java.lang.OutOfMemoryError works. I wonder
>>>>>> whether it is a good idea to call report_java_out_of_memory() when an
>>>>>> OOME is trapped. That way, HotSpot will trigger the OnOutOfMemoryError
>>>>>> callbacks.
>>>>>
>>>>> Why not just use AbortVMOnException together with OnError to get the
>>>>> callbacks?
>>>>>
>>>>> Cheers,
>>>>> David
>>>>>
>>>>>> I understand JDK-8257790 is not a bug. I don't want to overturn that
>>>>>> conclusion. I just wonder if we can handle it better in the presence of
>>>>>> -XX:AbortVMOnException=java.lang.OutOfMemoryError.
>>>>>>
>>>>>> For Java webservers, an OOME may lead to a zombie process. We may have a
>>>>>> bug in the code or may really run out of memory. The OOME is suppressed or
>>>>>> terminates the thread, but it doesn't terminate the Java process. E.g.:
>>>>>>
>>>>>> public class Main {
>>>>>>       volatile static boolean done = false;
>>>>>>
>>>>>>       public static void main(String[] args) {
>>>>>>           String msg = "a long long message.";
>>>>>>           // The worker thread dies with an OOME, but the process keeps running.
>>>>>>           Runnable runnable = () -> {
>>>>>>               int cnt = Integer.MAX_VALUE / msg.length() + 1;
>>>>>>               // This throws an OutOfMemoryError.
>>>>>>               msg.repeat(cnt);
>>>>>>               done = true;
>>>>>>           };
>>>>>>
>>>>>>           Thread thread = new Thread(runnable);
>>>>>>           thread.start();
>>>>>>           while(!done) {
>>>>>>           } // this simulates the main loop of event handling
>>>>>>       }
>>>>>> }
>>>>>>
>>>>>> Java developers can use
>>>>>> -XX:AbortVMOnException=java.lang.OutOfMemoryError to apply the fail-fast
>>>>>> principle. Java web applications that handle traffic are usually
>>>>>> distributed across a cluster. The failure of a single host is usually not a
>>>>>> big deal. As long as java exits, it's easy to restart and backfill it.
>>>>>>
>>>>>> My proposed change is very simple. Just call
>>>>>> report_java_out_of_memory() if value_string is OOME. It's a no-op if users
>>>>>> never specify anything. If they do specify flags like
>>>>>> Crash/ExitOnOutOfMemoryError, OnOutOfMemoryError or
>>>>>> HeapDumpOnOutOfMemoryError, HotSpot will let report_java_out_of_memory
>>>>>> do the cleanup job. fatal() works but is too brutal. I think we should
>>>>>> let java exit with an error code.
>>>>>>
>>>>>>
>>>>>> diff --git a/src/hotspot/share/utilities/exceptions.cpp
>>>>>> b/src/hotspot/share/utilities/exceptions.cpp
>>>>>> index bd95b8306be..fd8a83deaf3 100644
>>>>>> --- a/src/hotspot/share/utilities/exceptions.cpp
>>>>>> +++ b/src/hotspot/share/utilities/exceptions.cpp
>>>>>> @@ -538,6 +538,9 @@ void Exceptions::debug_check_abort(const char
>>>>>> *value_string, const char* message
>>>>>>          strstr(value_string, AbortVMOnException)) {
>>>>>>        if (AbortVMOnExceptionMessage == NULL || (message != NULL &&
>>>>>>            strstr(message, AbortVMOnExceptionMessage))) {
>>>>>> +      if(!strcmp(value_string, "java.lang.OutOfMemoryError")) {
>>>>>> +        report_java_out_of_memory(message);
>>>>>> +      }
>>>>>>          fatal("Saw %s, aborting", value_string);
>>>>>>        }
>>>>>>      }
>>>>>>
>>>>>>
>>>>>> thanks,
>>>>>> --lx
>>>>>>