RFR: 8302073: Specifying OnError handler prevents WatcherThread to break a deadlock in report_and_die()

Wed Mar 8 18:14:07 UTC 2023

On Wed, 8 Mar 2023 14:05:44 GMT, Alexey Pavlyutkin <duke at openjdk.org> wrote:

> The patch fixes error reporting to check timeout in the case when a user specifies OnError hander. Before VMError:check_timeout() ignored timeout in this case, and so didn't break malloc() deadlock.
> 
> Verification (amd64/20.04LTS): the idea of the test is to crash JVM running with error hander of 3 successive `sleep` commands for 1s, 10s, and 60s with and without specified timeout
> 
> 
> 16:52:17 at alex@alex-VirtualBox>( echo "
> public class C {
>   public static void main(String[] args) throws Throwable {
>>     while (true) Thread.sleep(1000);
>>   }
>> }
>> " >> C.java )
> 16:57:35 at alex@alex-VirtualBox>./build/linux-x86_64-server-release/images/jdk/bin/java -XX:OnError='sleep 1;sleep 10;sleep 60' ./C.java &
> [2] 179574
> 17:00:19 at alex@alex-VirtualBox>kill -s SIGSEGV 179574
> 17:00:27 at alex@alex-VirtualBox>#
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x00007f7b1701ecd5 (sent by kill), pid=179574, tid=179574
> #
> # JRE version: OpenJDK Runtime Environment (21.0) (build 21-internal-adhoc.alex.jdk)
> # Java VM: OpenJDK 64-Bit Server VM (21-internal-adhoc.alex.jdk, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
> # Problematic frame:
> # C  [libpthread.so.0+0x9cd5]  __pthread_clockjoin_ex+0x255
> #
> # Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /home/alex/jdk/core.179574)
> #
> # An error report file with more information is saved as:
> # /home/alex/jdk/hs_err_pid179574.log
> #
> # If you would like to submit a bug report, please visit:
> #   https://bugreport.java.com/bugreport/crash.jsp
> #
> #
> # -XX:OnError="sleep 1;sleep 10;sleep 60"
> #   Executing /bin/sh -c "sleep 1" ...
> #   Executing /bin/sh -c "sleep 10" ...
> #   Executing /bin/sh -c "sleep 60" ...
> 
> [2]+  Aborted                 (core dumped) ./build/linux-x86_64-server-release/images/jdk/bin/java -XX:OnError='sleep 1;sleep 10;sleep 60' ./C.java
> 17:02:03 at alex@alex-VirtualBox>./build/linux-x86_64-server-release/images/jdk/bin/java -XX:ErrorLogTimeout=5 -XX:OnError='sleep 1;sleep 10;sleep 60' ./C.java &
> [2] 179602
> 17:02:32 at alex@alex-VirtualBox>kill -s SIGSEGV 179602
> 17:02:41 at alex@alex-VirtualBox>#
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x00007f9d71b18cd5 (sent by kill), pid=179602, tid=179602
> #
> # JRE version: OpenJDK Runtime Environment (21.0) (build 21-internal-adhoc.alex.jdk)
> # Java VM: OpenJDK 64-Bit Server VM (21-internal-adhoc.alex.jdk, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
> # Problematic frame:
> # C  [libpthread.so.0+0x9cd5]  __pthread_clockjoin_ex+0x255
> #
> # Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /home/alex/jdk/core.179602)
> #
> # An error report file with more information is saved as:
> # /home/alex/jdk/hs_err_pid179602.log
> #
> # If you would like to submit a bug report, please visit:
> #   https://bugreport.java.com/bugreport/crash.jsp
> #
> #
> # -XX:OnError="sleep 1;sleep 10;sleep 60"
> #   Executing /bin/sh -c "sleep 1" ...
> #   Executing /bin/sh -c "sleep 10" ...
> 
> ------ Timeout during error reporting after 11 s. ------
> 
> 17:02:54 at alex@alex-VirtualBox>
> 
> 
> Regression (amd64/20.04LTS): `test/hotspot/jtreg/runtime/ErrorHandling` with different combinations of `-vmoption:-XX:ErrorLogTimeout=10` and `-vmoption:-XX:OnError='sleep 10'`

> Thinking this through some more, I'm starting to doubt we do the right thing here. It is certainly convoluted: we _are_ the reporting thread here, so what happens is that:
> 
> * after fork_and_exec, we call check_timeout
> * check_timeout will signal ourself
> * we receive the signal
> * we enter the secondary signal handler recursively
> * we then re-enter VMError::report_and_die
> * we then note that this is a timeout and print an error message and call os::die.
> 
> That is too complicated for my taste, and I don't know if there are any hidden issues with VM::check_timeout(). Since now, we call it from two threads, possibly, the reporting thread and the watcher thread. That function was not intended for concurrent usage.
> 
> And before thinking about the correct behavior, we need to clarify if the protection we grant an OnError invocation extends to the whole chain of error scripts. Right now we say OnError scripts should not be interrupted. Okay, but what about the next OnError script? If the user specifies several OnError scripts, should they all get a chance to run to finish?
> 
> Because denying the follow-up error scripts a chance to run feels weirdly arbitrary. Either those scripts are essential, or they aren't. If they are, all should run. If they are not, it should be okay to kill the JVM _while it is waiting for the child process to finish_ - this would make the patch simpler, and people argue that this would be the correct behavior anyway.
> 
> Personally, I think that killing the JVM while it is in waitpid waiting for the child is probably benign. Child would be reparented, possibly zombified on badly set up systems, but that's it. It would probably run to completion.

ok, let's replace check_timeout() with something like

jlong expiration = get_reporting_start_time() + ( jlong )ErrorLogTimeout * TIMESTAMP_TO_SECONDS_FACTOR;
if ( get_current_timestamp() > expiration ) break;

-------------

PR: https://git.openjdk.org/jdk/pull/12925