RFR(M): 8219584: Try to dump error file by thread which causes safepoint timeout
Doerr, Martin
martin.doerr at sap.com
Wed Feb 27 14:13:10 UTC 2019
Hi David and Thomas,
thanks for your valuable feedback and sorry for the delay. A more critical issue had kept me busy.
I like getting rid of the code in the Thread class.
Here's the new version:
http://cr.openjdk.java.net/~mdoerr/8219584_kill_thread_on_safepoint_timeout/webrev.01/
It reports (on Linux):
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGILL (0x4) at pc=... (SI_TKILL), ...
#
...
# J 29 c2 TestAbortVMOnSafepointTimeout.test_loop(I)I ...
...
siginfo: si_signo: 4 (SIGILL), si_code: -6 (SI_TKILL), si_pid: ... (current process), si_uid: ...
...
Event: ... Thread .. sent signal 4 to Thread ... because blocking a safepoint.
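For readers who want to see the si_code distinction in isolation, here is a minimal Linux-only sketch. It is not the code from the webrev: the helper names are hypothetical, and it uses SIGUSR1 instead of SIGILL so that it can run harmlessly outside the VM. It assumes that a signal sent via pthread_kill arrives with si_code == SI_TKILL and si_pid set to the sender's process id, which is what the siginfo line above shows.

```cpp
#include <signal.h>
#include <pthread.h>
#include <string.h>
#include <unistd.h>

// Written by the handler, read by the caller afterwards.
static volatile sig_atomic_t g_handled = 0;
static volatile sig_atomic_t g_sent_by_own_thread = 0;

// Hypothetical helper: true if the siginfo describes a signal that a thread
// of this process sent to a specific thread (SI_TKILL is Linux-specific and
// may be undefined elsewhere).
static bool sent_by_thread_of_this_process(const siginfo_t* info) {
#ifdef SI_TKILL
  return info->si_code == SI_TKILL && info->si_pid == getpid();
#else
  (void)info;
  return false;
#endif
}

static void handler(int, siginfo_t* info, void*) {
  g_sent_by_own_thread = sent_by_thread_of_this_process(info) ? 1 : 0;
  g_handled = 1;
}

// Installs the handler, signals the calling thread, and reports whether the
// handler classified the signal as thread-directed from our own process.
bool demo_self_signal() {
  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_sigaction = handler;
  sa.sa_flags = SA_SIGINFO;
  sigemptyset(&sa.sa_mask);
  if (sigaction(SIGUSR1, &sa, nullptr) != 0) return false;

  // Make sure the signal is not blocked in this thread.
  sigset_t set;
  sigemptyset(&set);
  sigaddset(&set, SIGUSR1);
  pthread_sigmask(SIG_UNBLOCK, &set, nullptr);

  // A thread signalling itself with an unblocked signal runs the handler
  // before pthread_kill returns.
  if (pthread_kill(pthread_self(), SIGUSR1) != 0) return false;
  return g_handled == 1 && g_sent_by_own_thread == 1;
}
```

A real error reporter would of course print si_pid and si_uid rather than set flags; the flags only keep the handler async-signal-safe in this sketch.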
Best regards,
Martin
-----Original Message-----
From: David Holmes <david.holmes at oracle.com>
Sent: Monday, 25 February 2019 02:19
To: Thomas Stüfe <thomas.stuefe at gmail.com>; Doerr, Martin <martin.doerr at sap.com>
Cc: hotspot-runtime-dev at openjdk.java.net
Subject: Re: RFR(M): 8219584: Try to dump error file by thread which causes safepoint timeout
Hi Martin,
On 23/02/2019 5:54 am, Thomas Stüfe wrote:
> Hi Martin,
>
> this is certainly valuable.
>
> Not a full review, just some remarks. I think one could make this quite a
> bit simpler: The whole notion of storing a reason string and the sender TID
> etc in the Thread class only serves diagnostic purposes - to output a clear
> message in the hs-err file, right? I am not sure this is worth the added
> complexity though, since we already have most of that information in the
> hs-err file today:
>
> "siginfo: si_signo: 8 (SIGFPE), si_code: -6 (SI_TKILL), si_addr:
> 0x0000040300007866 "
>
> See "SI_TKILL" which means this signal was sent by another thread. The
> "si_addr" info is bogus in this case. With a tiny patch in
> os::print_siginfo() to treat SI_TKILL - if defined - like SI_USER, we could
> change this to:
>
> "siginfo: si_signo: 4 (SIGILL), si_code: -6 (SI_TKILL), si_pid: 3929
> (current process), si_uid: 1027"
>
> which would make more sense.
>
> So, from the hs-err file we already know if a signal was sent by another
> thread. Granted, the sending thread id is missing, as is the explicit
> reason string for diagnostics. However, since the sending thread announces
> itself in the event log "I have sent this signal to that thread" this
> information should be there too.
>
> So, my suggestion would be for the sake of simplicity to leave all this
> communication of reason, sender tid etc to the target thread out. That
> would also mean you can implement this independently from the Thread class.
> You do not need a valid thread class to send a signal to a thread id.
I agree with Thomas - let's get this out of the Thread class! I do not
like seeing all that platform-specific code there.
> --
>
> A second thing, we have similar coding already in error reporting, see
> VMError::interrupt_reporting_thread() in vmError_posix.cpp. Since this is
> basically the same, we could consolidate and move that functionality to
> os_posix.cpp, basically as a generic wrapper for pthread_kill. E.g.
> os::Posix::interrupt_thread(pthread_t target).
I'd prefer "signal_thread" rather than "interrupt_thread" - though I see
the error reporter already uses "interrupt". :( (Causes confusion with
Java thread 'interrupt' functionality.)
Promote it to os class, return boolean to indicate success, have win32
just return false. Then we can get rid of all the _win32 ifdefs.
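As a rough illustration of that shape (a sketch only, not HotSpot code): the function name signal_thread and the native_thread_t alias below are hypothetical, and the only real API assumed is POSIX pthread_kill, which returns 0 on success.

```cpp
#include <signal.h>

#if !defined(_WIN32)
#include <pthread.h>
typedef pthread_t native_thread_t;      // hypothetical alias for this sketch
#else
typedef unsigned long native_thread_t;  // placeholder type; nothing is sent
#endif

// Hypothetical helper (not HotSpot's actual API): send `sig` to one thread of
// the current process. Returns true only if the signal was sent.
bool signal_thread(native_thread_t target, int sig) {
#if !defined(_WIN32)
  // pthread_kill returns 0 on success and an error number otherwise.
  return pthread_kill(target, sig) == 0;
#else
  (void)target;
  (void)sig;
  return false;  // unsupported here; the caller keeps its existing behavior
#endif
}
```

With a boolean result, shared code can call the function unconditionally and fall back to the plain abort path when it returns false, so no platform ifdefs are needed at the call site.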
Thanks,
David
-----
>
> --
>
> Just my 5 cents. Let's see what others think.
>
> Cheers, Thomas
>
>
>
>
>
>
> On Fri, Feb 22, 2019 at 4:36 PM Doerr, Martin <martin.doerr at sap.com> wrote:
>
>> Hi all,
>>
>> the VM supports diagnostic flags -XX:+SafepointTimeout and
>> -XX:+AbortVMOnSafepointTimeout to detect safepoint synchronization timeouts
>> and to exit with an error message.
>> However, we usually don't see what the thread that failed to reach the
>> safepoint was doing.
>> We can get a more helpful hs_err file if we kill that thread and let it
>> dump the hs_err file.
>>
>> My proposal does the following:
>>
>> 1. Introduce a function for sending a signal to another thread (not for
>> Windows).
>> 2. If possible, send a SIGILL to the thread which didn't reach the safepoint.
>> 3. Make SafepointALot diagnostic instead of develop in order to make it
>> usable together with SafepointTimeout.
>> 4. Extend error reporting to make it easy to recognize if the thread
>> was killed by another thread.
>> 5. Add a jtreg test.
>>
>> Webrev:
>>
>> http://cr.openjdk.java.net/~mdoerr/8219584_kill_thread_on_safepoint_timeout/webrev.00/
>>
>>
>> The test contains a long-running loop without a safepoint, compiled by C2.
>> The new enhancement leads to an hs_err output (excerpt):
>> #
>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> # SIGILL (0x4) at pc=0x00003be1001f5fd5, pid=15329, tid=15330
>> #
>> # Signal was sent by thread with id 15339
>> # Reason: "blocking a safepoint"
>> #
>> ...
>> # J 29 c2 TestAbortVMOnSafepointTimeout.test_loop(I)I (31 bytes) @
>> 0x000003ff7ae6d508 [0x000003ff7ae6d3c0+0x0000000000000148]
>> ...
>> --------------- T H R E A D ---------------
>>
>> Current thread (0x0000000080039000): JavaThread "main" [_thread_in_Java,
>> id=15330, stack(0x000003ff7e000000,0x000003ff7e100000)]
>>
>> Stack: [0x000003ff7e000000,0x000003ff7e100000], sp=0x000003ff7e0fe778,
>> free space=1017k
>> Native frames: (J=compiled Java code, A=aot compiled Java code,
>> j=interpreted, Vv=VM code, C=native code)
>> J 29 c2 TestAbortVMOnSafepointTimeout.test_loop(I)I (31 bytes) @
>> 0x000003ff7ae6d508 [0x000003ff7ae6d3c0+0x0000000000000148]
>> j TestAbortVMOnSafepointTimeout.main([Ljava/lang/String;)V+6
>> v ~StubRoutines::call_stub
>> V [libjvm.so+0xb0957a] JavaCalls::call_helper(JavaValue*, methodHandle
>> const&, JavaCallArguments*, Thread*)+0x6b2
>> V [libjvm.so+0xb08614] JavaCalls::call(JavaValue*, methodHandle const&,
>> JavaCallArguments*, Thread*)+0x8c
>> ...
>> Event: 1.558 Thread 0x00000000808a4000 sent signal 4 to Thread
>> 0x0000000080039000 because blocking a safepoint.
>>
>>
>> Please review.
>>
>> Best regards,
>> Martin
>>
>>