RFR: 8302073: Specifying OnError handler prevents WatcherThread to break a deadlock in report_and_die()
jsolomon8080
duke at openjdk.org
Wed Mar 8 21:10:16 UTC 2023
On Wed, 8 Mar 2023 14:05:44 GMT, Alexey Pavlyutkin <duke at openjdk.org> wrote:
> The patch fixes error reporting to check the timeout when a user specifies an OnError handler. Previously, VMError::check_timeout() ignored the timeout in this case, and so didn't break the malloc() deadlock.
>
> Verification (amd64/20.04LTS): the idea of the test is to crash a JVM running with an error handler of 3 successive `sleep` commands (1s, 10s, and 60s), with and without a specified timeout
>
>
> 16:52:17 at alex@alex-VirtualBox>( echo "
> public class C {
>     public static void main(String[] args) throws Throwable {
>         while (true) Thread.sleep(1000);
>     }
> }
> " >> C.java )
> 16:57:35 at alex@alex-VirtualBox>./build/linux-x86_64-server-release/images/jdk/bin/java -XX:OnError='sleep 1;sleep 10;sleep 60' ./C.java &
> [2] 179574
> 17:00:19 at alex@alex-VirtualBox>kill -s SIGSEGV 179574
> 17:00:27 at alex@alex-VirtualBox>#
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x00007f7b1701ecd5 (sent by kill), pid=179574, tid=179574
> #
> # JRE version: OpenJDK Runtime Environment (21.0) (build 21-internal-adhoc.alex.jdk)
> # Java VM: OpenJDK 64-Bit Server VM (21-internal-adhoc.alex.jdk, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
> # Problematic frame:
> # C [libpthread.so.0+0x9cd5] __pthread_clockjoin_ex+0x255
> #
> # Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /home/alex/jdk/core.179574)
> #
> # An error report file with more information is saved as:
> # /home/alex/jdk/hs_err_pid179574.log
> #
> # If you would like to submit a bug report, please visit:
> # https://bugreport.java.com/bugreport/crash.jsp
> #
> #
> # -XX:OnError="sleep 1;sleep 10;sleep 60"
> # Executing /bin/sh -c "sleep 1" ...
> # Executing /bin/sh -c "sleep 10" ...
> # Executing /bin/sh -c "sleep 60" ...
>
> [2]+ Aborted (core dumped) ./build/linux-x86_64-server-release/images/jdk/bin/java -XX:OnError='sleep 1;sleep 10;sleep 60' ./C.java
> 17:02:03 at alex@alex-VirtualBox>./build/linux-x86_64-server-release/images/jdk/bin/java -XX:ErrorLogTimeout=5 -XX:OnError='sleep 1;sleep 10;sleep 60' ./C.java &
> [2] 179602
> 17:02:32 at alex@alex-VirtualBox>kill -s SIGSEGV 179602
> 17:02:41 at alex@alex-VirtualBox>#
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x00007f9d71b18cd5 (sent by kill), pid=179602, tid=179602
> #
> # JRE version: OpenJDK Runtime Environment (21.0) (build 21-internal-adhoc.alex.jdk)
> # Java VM: OpenJDK 64-Bit Server VM (21-internal-adhoc.alex.jdk, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
> # Problematic frame:
> # C [libpthread.so.0+0x9cd5] __pthread_clockjoin_ex+0x255
> #
> # Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /home/alex/jdk/core.179602)
> #
> # An error report file with more information is saved as:
> # /home/alex/jdk/hs_err_pid179602.log
> #
> # If you would like to submit a bug report, please visit:
> # https://bugreport.java.com/bugreport/crash.jsp
> #
> #
> # -XX:OnError="sleep 1;sleep 10;sleep 60"
> # Executing /bin/sh -c "sleep 1" ...
> # Executing /bin/sh -c "sleep 10" ...
>
> ------ Timeout during error reporting after 11 s. ------
>
> 17:02:54 at alex@alex-VirtualBox>
>
>
> Regression (amd64/20.04LTS): `test/hotspot/jtreg/runtime/ErrorHandling` with different combinations of `-vmoption:-XX:ErrorLogTimeout=10` and `-vmoption:-XX:OnError='sleep 10'`
Hi - I'm the originator of this bug report. I'm glad this is getting fixed, and I will ultimately defer to the Java experts since I have never contributed code to this code base, but if it were me, I'd do the simplest thing possible. The solution presented here seems overly complex. I don't understand why an OnError script or an abort_hook is so special. I think it's up to the user to ensure that neither takes longer than ErrorLogTimeout, which defaults to 2 minutes. That's an eternity. Do Java users really expect to wait 2 minutes for their processes to exit?
If this were my code, I would remove any guarantee about OnError and make it the responsibility of the user to set ErrorLogTimeout appropriately.
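To make the difference concrete, here is a toy model in Java. It is not HotSpot source: the class, the method names, and the boolean flag are made up for illustration, and the only behavior it encodes is what the quoted summary says (the timeout was ignored whenever an OnError handler was set) versus the simpler rule where the timeout always applies.

```java
// Toy model only: NOT HotSpot source. All names here are hypothetical.
final class TimeoutRuleSketch {

    // Behavior described in the quoted summary before the patch:
    // once an OnError handler is configured, a timeout is never reported.
    static boolean timedOutOld(boolean onErrorSet, long elapsedSec, long timeoutSec) {
        if (onErrorSet) {
            return false;
        }
        return elapsedSec >= timeoutSec;
    }

    // The simpler rule: the timeout applies regardless of OnError,
    // so keeping the handler short is the user's responsibility.
    static boolean timedOutSimple(long elapsedSec, long timeoutSec) {
        return elapsedSec >= timeoutSec;
    }

    public static void main(String[] args) {
        // Numbers taken from the quoted run: ErrorLogTimeout=5,
        // and the handler had been running for 11 s when it was cut off.
        System.out.println(timedOutOld(true, 11, 5));  // prints "false"
        System.out.println(timedOutSimple(11, 5));     // prints "true"
    }
}
```

With the old rule, the 60 s `sleep` in the first quoted run is allowed to finish; with the simple rule, the run is cut off as in the second quoted transcript.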
I understand that there may be many users who have counted on this guarantee and you can't break them now. I also understand that I'm not aware of all the ways that OnError is used. I'm sure we will get to some solution that will fix the real problem, which is what I care about the most. Thank you.
-------------
PR: https://git.openjdk.org/jdk/pull/12925