RFR(M): 8219584: Try to dump error file by thread which causes safepoint timeout
Thomas Stüfe
thomas.stuefe at gmail.com
Fri Feb 22 19:54:34 UTC 2019
Hi Martin,
this is certainly valuable.
Not a full review, just some remarks. I think one could make this quite a
bit simpler: The whole notion of storing a reason string and the sender TID
etc in the Thread class only serves diagnostic purposes - to output a clear
message in the hs-err file, right? I am not sure this is worth the added
complexity though, since we already have most of that information in the
hs-err file today:
"siginfo: si_signo: 8 (SIGFPE), si_code: -6 (SI_TKILL), si_addr:
0x0000040300007866 "
See "SI_TKILL" which means this signal was sent by another thread. The
"si_addr" info is bogus in this case. With a tiny patch in
os::print_siginfo() to treat SI_TKILL - if defined - like SI_USER, we could
change this to:
"siginfo: si_signo: 4 (SIGILL), si_code: -6 (SI_TKILL), si_pid: 3929
(current process), si_uid: 1027"
which would make more sense.
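
To illustrate the idea, here is a minimal standalone sketch. It is not the
actual os::print_siginfo() code; the helper names sent_by_user_or_thread()
and describe_siginfo() are made up for this example. It assumes a POSIX
system and only treats SI_TKILL specially where the platform defines it:

```cpp
#include <signal.h>
#include <sys/types.h>
#include <cstdio>
#include <string>

// Hypothetical helper (not part of HotSpot): true if the signal was sent
// explicitly, either via kill() (SI_USER) or, where the platform defines
// it, via tkill()/pthread_kill() (SI_TKILL).
static bool sent_by_user_or_thread(int si_code) {
  if (si_code == SI_USER) return true;
#ifdef SI_TKILL
  if (si_code == SI_TKILL) return true;
#endif
  return false;
}

// Hypothetical formatter: for explicitly sent signals si_addr carries no
// meaning, so print the sender's pid and uid instead.
std::string describe_siginfo(const siginfo_t& si) {
  char buf[160];
  if (sent_by_user_or_thread(si.si_code)) {
    std::snprintf(buf, sizeof(buf),
                  "siginfo: si_signo: %d, si_code: %d, si_pid: %ld, si_uid: %ld",
                  si.si_signo, si.si_code,
                  (long) si.si_pid, (long) si.si_uid);
  } else {
    std::snprintf(buf, sizeof(buf),
                  "siginfo: si_signo: %d, si_code: %d, si_addr: %p",
                  si.si_signo, si.si_code, si.si_addr);
  }
  return std::string(buf);
}
```

The "#ifdef SI_TKILL" guard keeps the sketch compiling on platforms which do
not know that code.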
So, from the hs-err file we already know if a signal was sent by another
thread. Granted, the sending thread id is missing, as is the explicit
reason string for diagnostics. However, since the sending thread announces
itself in the event log ("I have sent this signal to that thread"), this
information should be there too.
So, for the sake of simplicity, my suggestion would be to leave out all
this communication of reason, sender tid etc. to the target thread. That
would also mean you can implement this independently from the Thread class.
You do not need a valid Thread object to send a signal to a thread id.
--
A second thing, we have similar coding already in error reporting, see
VMError::interrupt_reporting_thread() in vmError_posix.cpp. Since this is
basically the same, we could consolidate and move that functionality to
os_posix.cpp, basically as a generic wrapper for pthread_kill. E.g.
os::Posix::interrupt_thread(pthread_t target).
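
A rough sketch of what such a wrapper might look like, written as a free
function outside of HotSpot. The name interrupt_thread() follows the
suggestion above; the extra "sig" parameter and the bool return value are
assumptions made for this example, not existing HotSpot API:

```cpp
#include <pthread.h>
#include <signal.h>

// Hypothetical stand-in for the suggested os::Posix::interrupt_thread().
// Sends signal 'sig' to the thread identified by 'target'. Only the
// pthread_t is needed, no Thread object.
// Returns true if pthread_kill() accepted the request.
bool interrupt_thread(pthread_t target, int sig) {
  // pthread_kill() returns 0 on success and an error number otherwise
  // (e.g. EINVAL for an invalid signal number).
  return pthread_kill(target, sig) == 0;
}
```

A caller in the safepoint timeout path could then do something like
interrupt_thread(tid, SIGILL), and error reporting could reuse the same
function.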
--
Just my 5 cents. Let's see what others think.
Cheers, Thomas
On Fri, Feb 22, 2019 at 4:36 PM Doerr, Martin <martin.doerr at sap.com> wrote:
> Hi all,
>
> the VM supports diagnostic flags -XX:+SafepointTimeout and
> -XX:+AbortVMOnSafepointTimeout to detect safepoint synchronization timeouts
> and to exit with an error message.
> However, we usually don't see what the thread was doing which didn't reach
> the safepoint.
> We can get a more helpful hs_err file if we kill that thread and let it
> dump the hs_err file.
>
> My following proposal does:
>
> 1. Introduce a function for sending a signal to another thread (not for
> Windows).
> 2. If possible, send a SIGILL to thread which didn't reach safepoint.
> 3. Make SafepointALot diagnostic instead of develop in order to make it
> usable together with SafepointTimeout.
> 4. Extend error reporting to make it easy to recognize if the thread
> was killed by another thread.
> 5. Add a jtreg test.
>
> Webrev:
>
> http://cr.openjdk.java.net/~mdoerr/8219584_kill_thread_on_safepoint_timeout/webrev.00/
>
>
> The test contains a long running loop without safepoint compiled by C2.
> The new enhancement leads to an hs_err output (excerpt):
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGILL (0x4) at pc=0x00003be1001f5fd5, pid=15329, tid=15330
> #
> # Signal was sent by thread with id 15339
> # Reason: "blocking a safepoint"
> #
> ...
> # J 29 c2 TestAbortVMOnSafepointTimeout.test_loop(I)I (31 bytes) @
> 0x000003ff7ae6d508 [0x000003ff7ae6d3c0+0x0000000000000148]
> ...
> --------------- T H R E A D ---------------
>
> Current thread (0x0000000080039000): JavaThread "main" [_thread_in_Java,
> id=15330, stack(0x000003ff7e000000,0x000003ff7e100000)]
>
> Stack: [0x000003ff7e000000,0x000003ff7e100000], sp=0x000003ff7e0fe778,
> free space=1017k
> Native frames: (J=compiled Java code, A=aot compiled Java code,
> j=interpreted, Vv=VM code, C=native code)
> J 29 c2 TestAbortVMOnSafepointTimeout.test_loop(I)I (31 bytes) @
> 0x000003ff7ae6d508 [0x000003ff7ae6d3c0+0x0000000000000148]
> j TestAbortVMOnSafepointTimeout.main([Ljava/lang/String;)V+6
> v ~StubRoutines::call_stub
> V [libjvm.so+0xb0957a] JavaCalls::call_helper(JavaValue*, methodHandle
> const&, JavaCallArguments*, Thread*)+0x6b2
> V [libjvm.so+0xb08614] JavaCalls::call(JavaValue*, methodHandle const&,
> JavaCallArguments*, Thread*)+0x8c
> ...
> Event: 1.558 Thread 0x00000000808a4000 sent signal 4 to Thread
> 0x0000000080039000 because blocking a safepoint.
>
>
> Please review.
>
> Best regards,
> Martin
>
>