RFR(M): 8219584: Try to dump error file by thread which causes safepoint timeout

Doerr, Martin martin.doerr at sap.com
Wed Feb 27 14:13:10 UTC 2019


Hi David and Thomas,

thanks for your valuable feedback and sorry for the delay. A more critical issue had kept me busy.
I like getting rid of the stuff in Thread class.

Here's the new version:
http://cr.openjdk.java.net/~mdoerr/8219584_kill_thread_on_safepoint_timeout/webrev.01/

It reports (on Linux):
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGILL (0x4) at pc=... (SI_TKILL), ...
#
...
# J 29 c2 TestAbortVMOnSafepointTimeout.test_loop(I)I ...
...
siginfo: si_signo: 4 (SIGILL), si_code: -6 (SI_TKILL), si_pid: ... (current process), si_uid: ...
...
Event: ... Thread .. sent signal 4 to Thread ... because blocking a safepoint.
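
As a reading aid for the siginfo line in the excerpt above, here is a minimal
sketch of how a signal sent by kill/tkill could be told apart from a fault when
formatting siginfo. It is not the code from the webrev and not
os::print_siginfo(); describe_siginfo is a hypothetical name and the output
format only loosely follows the excerpt.

```cpp
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <string>

// Hypothetical formatter (illustration only, not HotSpot code).
// For signals sent via kill()/tkill(), si_pid and si_uid are meaningful and
// si_addr is not, so the sender information is printed instead of an address.
std::string describe_siginfo(const siginfo_t* si) {
  bool sent_by_kill = (si->si_code == SI_USER);
#ifdef SI_TKILL
  // SI_TKILL is not available on every platform, hence the guard.
  sent_by_kill = sent_by_kill || (si->si_code == SI_TKILL);
#endif
  char buf[256];
  if (sent_by_kill) {
    const char* note = (si->si_pid == getpid()) ? " (current process)" : "";
    snprintf(buf, sizeof(buf),
             "si_signo: %d, si_code: %d, si_pid: %ld%s, si_uid: %ld",
             si->si_signo, si->si_code, (long) si->si_pid, note,
             (long) si->si_uid);
  } else {
    snprintf(buf, sizeof(buf), "si_signo: %d, si_code: %d, si_addr: %p",
             si->si_signo, si->si_code, si->si_addr);
  }
  return std::string(buf);
}
```

A signal handler installed with SA_SIGINFO receives the siginfo_t pointer that
would be passed to such a function.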

Best regards,
Martin


-----Original Message-----
From: David Holmes <david.holmes at oracle.com> 
Sent: Montag, 25. Februar 2019 02:19
To: Thomas Stüfe <thomas.stuefe at gmail.com>; Doerr, Martin <martin.doerr at sap.com>
Cc: hotspot-runtime-dev at openjdk.java.net
Subject: Re: RFR(M): 8219584: Try to dump error file by thread which causes safepoint timeout

Hi Martin,

On 23/02/2019 5:54 am, Thomas Stüfe wrote:
> Hi Martin,
> 
> this is certainly valuable.
> 
> Not a full review, just some remarks. I think one could make this quite a
> bit simpler: The whole notion of storing a reason string and the sender TID
> etc in the Thread class only serves diagnostic purposes - to output a clear
> message in the hs-err file, right? I am not sure this is worth the added
> complexity though, since we already have most of that information in the
> hs-err file today:
> 
> "siginfo: si_signo: 8 (SIGFPE), si_code: -6 (SI_TKILL), si_addr:
> 0x0000040300007866 "
> 
> See "SI_TKILL" which means this signal was sent by another thread. The
> "si_addr" info is bogus in this case. With a tiny patch in
> os::print_siginfo() to treat SI_TKILL - if defined - like SI_USER, we could
> change this to:
> 
> "siginfo: si_signo: 4 (SIGILL), si_code: -6 (SI_TKILL), si_pid: 3929
> (current process), si_uid: 1027"
> 
> which would make more sense.
> 
> So, from the hs-err file we already know if a signal was sent by another
> thread. Granted, the sending thread id is missing, as is the explicit
> reason string for diagnostics. However, since the sending thread announces
> itself in the event log "I have sent this signal to that thread" this
> information should be there too.
> 
> So, my suggestion would be for the sake of simplicity to leave all this
> communication of reason, sender tid etc to the target thread out. That
> would also mean you can implement this independently from the Thread class.
> You do not need a valid thread class to send a signal to a thread id.

I agree with Thomas - let's get this out of the Thread class! I do not 
like seeing all that platform-specific code there.

> --
> 
> A second thing, we have similar coding already in error reporting, see
> VMError::interrupt_reporting_thread() in vmError_posix.cpp. Since this is
> basically the same, we could consolidate and move that functionality to
> os_posix.cpp, basically as a generic wrapper for pthread_kill. E.g.
> os::Posix::interrupt_thread(pthread_t target).

I'd prefer "signal_thread" rather than "interrupt_thread" - though I see 
the error reporter already uses "interrupt". :( (Causes confusion with 
Java thread 'interrupt' functionality.)

Promote it to os class, return boolean to indicate success, have win32 
just return false. Then we can get rid of all the _win32 ifdefs.
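
A minimal sketch of the shape suggested here, assuming a thin wrapper around
pthread_kill that reports success as a boolean and simply returns false on
Windows. The names signal_thread and thread_handle_t are hypothetical and
this is not the actual os class code:

```cpp
#if defined(_WIN32)

// No per-thread signals on Windows: report failure so that callers need no
// platform ifdefs of their own.
typedef void* thread_handle_t;  // hypothetical placeholder type

bool signal_thread(thread_handle_t /*target*/, int /*sig*/) {
  return false;
}

#else

#include <pthread.h>
#include <signal.h>

typedef pthread_t thread_handle_t;  // hypothetical alias

// Sends sig to the given thread. pthread_kill returns 0 on success and an
// error number (e.g. EINVAL for an invalid signal) otherwise.
bool signal_thread(thread_handle_t target, int sig) {
  return pthread_kill(target, sig) == 0;
}

#endif
```

For example, signal_thread(pthread_self(), 0) only performs the error
checking of pthread_kill without delivering a signal, which makes it a cheap
way to try the wrapper.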

Thanks,
David
-----

> 
> --
> 
> Just my 5 cents. Let's see what others think.
> 
> Cheers, Thomas
> 
> 
> On Fri, Feb 22, 2019 at 4:36 PM Doerr, Martin <martin.doerr at sap.com> wrote:
> 
>> Hi all,
>>
>> the VM supports diagnostic flags -XX:+SafepointTimeout and
>> -XX:+AbortVMOnSafepointTimeout to detect safepoint synchronization timeouts
>> and to exit with an error message.
>> However, we usually don't see what the thread which didn't reach the
>> safepoint was doing.
>> We can get a more helpful hs_err file if we kill that thread and let it
>> dump the hs_err file.
>>
>> My proposal does the following:
>>
>>    1.  Introduce a function for sending a signal to another thread (not for
>> Windows).
>>    2.  If possible, send a SIGILL to the thread which didn't reach the safepoint.
>>    3.  Make SafepointALot diagnostic instead of develop in order to make it
>> usable together with SafepointTimeout.
>>    4.  Extend error reporting to make it easy to recognize if the thread
>> was killed by another thread.
>>    5.  Add a jtreg test.
>>
>> Webrev:
>>
>> http://cr.openjdk.java.net/~mdoerr/8219584_kill_thread_on_safepoint_timeout/webrev.00/
>>
>>
>> The test contains a long-running loop without a safepoint, compiled by C2.
>> The new enhancement leads to an hs_err output (excerpt):
>> #
>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> #  SIGILL (0x4) at pc=0x00003be1001f5fd5, pid=15329, tid=15330
>> #
>> # Signal was sent by thread with id 15339
>> # Reason: "blocking a safepoint"
>> #
>> ...
>> # J 29 c2 TestAbortVMOnSafepointTimeout.test_loop(I)I (31 bytes) @
>> 0x000003ff7ae6d508 [0x000003ff7ae6d3c0+0x0000000000000148]
>> ...
>> ---------------  T H R E A D  ---------------
>>
>> Current thread (0x0000000080039000):  JavaThread "main" [_thread_in_Java,
>> id=15330, stack(0x000003ff7e000000,0x000003ff7e100000)]
>>
>> Stack: [0x000003ff7e000000,0x000003ff7e100000],  sp=0x000003ff7e0fe778,
>> free space=1017k
>> Native frames: (J=compiled Java code, A=aot compiled Java code,
>> j=interpreted, Vv=VM code, C=native code)
>> J 29 c2 TestAbortVMOnSafepointTimeout.test_loop(I)I (31 bytes) @
>> 0x000003ff7ae6d508 [0x000003ff7ae6d3c0+0x0000000000000148]
>> j  TestAbortVMOnSafepointTimeout.main([Ljava/lang/String;)V+6
>> v  ~StubRoutines::call_stub
>> V  [libjvm.so+0xb0957a]  JavaCalls::call_helper(JavaValue*, methodHandle
>> const&, JavaCallArguments*, Thread*)+0x6b2
>> V  [libjvm.so+0xb08614]  JavaCalls::call(JavaValue*, methodHandle const&,
>> JavaCallArguments*, Thread*)+0x8c
>> ...
>> Event: 1.558 Thread 0x00000000808a4000 sent signal 4 to Thread
>> 0x0000000080039000 because blocking a safepoint.
>>
>>
>> Please review.
>>
>> Best regards,
>> Martin
>>
>>

