RFR(M): 8219584: Try to dump error file by thread which causes safepoint timeout

Wed Feb 27 14:57:04 UTC 2019

Hi Martin,

I like this. I have minor points, most of them a matter of taste:

¹) I actually do not like much having os::signal_thread(int sig). Signals
are a Posix concept, there is nothing on windows like it. That is why we
still need +#ifndef _WINDOWS in safepoint.cpp, even though
os::signal_thread() is available on windows too.

I would prefer one of thise two things:

a) make the generic os::signal_thread() to a Posix-specific
os::Posix::signal_thread(int signal). Leave the #ifdef WINDOWS in
safepoint.cpp.
b) make os::signal_thread() something different which does not require
handing in a signal number. E.g. os::interrupt_thread() or similar. Then
you could loose the #ifdef in safepoint.cpp

But I can live with this too, I just do not find it very good abstraction.

2) in vmError.cpp, you want to have the signal info more prominently, which
I like. But I would just call print_siginfo() there. It is defined to print
a single line without /cr, so it should fit nicely into the "#.." section.
If you do not like the idea, you would have to not only handle TKILL but
also SI_KILL and SI_QUEUE (I think), duplicating the work in print_siginfo.
Note that TKILL is, I think, only available on Linux.

In both cases, lets see what others think.

Cheersm Thomas

On Wed, Feb 27, 2019 at 3:13 PM Doerr, Martin <martin.doerr at sap.com> wrote:

> Hi David and Thomas,
>
> thanks for your valuable feedback and sorry for the delay. A more critical
> issue had kept me busy.
> I like getting rid of the stuff in Thread class.
>
> Here's the new version:
>
> http://cr.openjdk.java.net/~mdoerr/8219584_kill_thread_on_safepoint_timeout/webrev.01/
>
> It reports (on linux):
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGILL (0x4) at pc=... (SI_TKILL), ...
> #
> ...
> # J 29 c2 TestAbortVMOnSafepointTimeout.test_loop(I)I ...
> ...
> siginfo: si_signo: 4 (SIGILL), si_code: -6 (SI_TKILL), si_pid: ...
> (current process), si_uid: ...
> ...
> Event: ... Thread .. sent signal 4 to Thread ... because blocking a
> safepoint.
>
> Best regards,
> Martin
>
>
> -----Original Message-----
> From: David Holmes <david.holmes at oracle.com>
> Sent: Montag, 25. Februar 2019 02:19
> To: Thomas Stüfe <thomas.stuefe at gmail.com>; Doerr, Martin <
> martin.doerr at sap.com>
> Cc: hotspot-runtime-dev at openjdk.java.net
> Subject: Re: RFR(M): 8219584: Try to dump error file by thread which
> causes safepoint timeout
>
> Hi Martin,
>
> On 23/02/2019 5:54 am, Thomas Stüfe wrote:
> > Hi Martin,
> >
> > this is certainly valuable.
> >
> > Not a full review, just some remarks. I think one could make this quite a
> > bit simpler: The whole notion of storing a reason string and the sender
> TID
> > etc in the Thread class only serves diagnostic purposes - to output a
> clear
> > message in the hs-err file, right? I am not sure this is worth the added
> > complexity though, since we already have most of that information in the
> > hs-err file today:
> >
> > "siginfo: si_signo: 8 (SIGFPE), si_code: -6 (SI_TKILL), si_addr:
> > 0x0000040300007866 "
> >
> > See "SI_TKILL" which means this signal was sent by another thread. The
> > "si_addr" info is bogus in this case. With a tiny patch in
> > os::print_siginfo() to tread SI_TKILL - if defined - like SI_USER, we
> could
> > change this to:
> >
> > "siginfo: si_signo: 4 (SIGILL), si_code: -6 (SI_TKILL), si_pid: 3929
> > (current process), si_uid: 1027"
> >
> > which would make more sense.
> >
> > So, from the hs-err file we already know if a signal was sent by another
> > thread. Granted, the sending thread id is missing, as is the explicit
> > reason string for diagnostics. However, since the sending thread
> announces
> > itself in the event log "I have sent this signal to that thread" this
> > information should be there too.
> >
> > So, my suggestion would be for the sake of simplicity to leave all this
> > communication of reason, sender tid etc to the target thread out. That
> > would also mean you can implement this independently from the Thread
> class.
> > You do not need a valid thread class to send a signal to a thread id.
>
> I agree with Thomas - lets gets this out of the Thread class! I do not
> like seeing all that platform specific code there.
>
> > --
> >
> > A second thing, we have similar coding already in error reporting, see
> > VMError::interrupt_reporting_thread() in vmError_posix.cpp. Since this is
> > basically the same, we could consolidate and move that functionality to
> > os_posix.cpp, basically as a generic wrapper for pthread_kill. E.g.
> > os::Posix::interrupt_thread(pthread_t target).
>
> I'd prefer "signal_thread" rather than "interrupt_thread" - though I see
> the error reporter already uses "interrupt". :( (Causes confusion with
> Java thread 'interrupt' functionality.)
>
> Promote it to os class, return boolean to indicate success, have win32
> just return false. Then we can get rid of all the _win32 ifdefs.
>
> Thanks,
> David
> -----
>
> >
> > --
> >
> > Just my 5 cent. Lets see what others think.
> >
> > Cheers, Thomas
> >
> >
> >
> >
> >
> >
> > On Fri, Feb 22, 2019 at 4:36 PM Doerr, Martin <martin.doerr at sap.com>
> wrote:
> >
> >> Hi all,
> >>
> >> the VM supports diagnostic flags -XX:+SafepointTimeout and
> >> -XX:+AbortVMOnSafepointTimeout to detect safepoint synchronization
> timeouts
> >> and to exit with an error message.
> >> However, we usually don't see what the thread was doing which didn't
> reach
> >> the safepoint.
> >> We can get a more helpful hs_err file if we kill that thread and let it
> >> dump the hs_err file.
> >>
> >> My following proposal does:
> >>
> >>    1.  Introduce a function for sending a signal to another thread (not
> for
> >> Windows).
> >>    2.  If possible, send a SIGILL to thread which didn't reach
> safepoint.
> >>    3.  Make SafepointALot diagnostic instead of develop in order to
> make it
> >> usable together with SafepointTimeout.
> >>    4.  Extend error reporting to make it easy to recognize if the thread
> >> was killed by another thread.
> >>    5.  Add a jtreg test.
> >>
> >> Webrev:
> >>
> >>
> http://cr.openjdk.java.net/~mdoerr/8219584_kill_thread_on_safepoint_timeout/webrev.00/
> >>
> >>
> >> The test contains a long running loop without safepoint compiled by C2.
> >> The new enhancement leads to an hs_err output (excerpt):
> >> #
> >> # A fatal error has been detected by the Java Runtime Environment:
> >> #
> >> #  SIGILL (0x4) at pc=0x00003be1001f5fd5, pid=15329, tid=15330
> >> #
> >> # Signal was sent by thread with id 15339
> >> # Reason: "blocking a safepoint"
> >> #
> >> ...
> >> # J 29 c2 TestAbortVMOnSafepointTimeout.test_loop(I)I (31 bytes) @
> >> 0x000003ff7ae6d508 [0x000003ff7ae6d3c0+0x0000000000000148]
> >> ...
> >> ---------------  T H R E A D  ---------------
> >>
> >> Current thread (0x0000000080039000):  JavaThread "main"
> [_thread_in_Java,
> >> id=15330, stack(0x000003ff7e000000,0x000003ff7e100000)]
> >>
> >> Stack: [0x000003ff7e000000,0x000003ff7e100000],  sp=0x000003ff7e0fe778,
> >> free space=1017k
> >> Native frames: (J=compiled Java code, A=aot compiled Java code,
> >> j=interpreted, Vv=VM code, C=native code)
> >> J 29 c2 TestAbortVMOnSafepointTimeout.test_loop(I)I (31 bytes) @
> >> 0x000003ff7ae6d508 [0x000003ff7ae6d3c0+0x0000000000000148]
> >> j  TestAbortVMOnSafepointTimeout.main([Ljava/lang/String;)V+6
> >> v  ~StubRoutines::call_stub
> >> V  [libjvm.so+0xb0957a]  JavaCalls::call_helper(JavaValue*, methodHandle
> >> const&, JavaCallArguments*, Thread*)+0x6b2
> >> V  [libjvm.so+0xb08614]  JavaCalls::call(JavaValue*, methodHandle
> const&,
> >> JavaCallArguments*, Thread*)+0x8c
> >> ...
> >> Event: 1.558 Thread 0x00000000808a4000 sent signal 4 to Thread
> >> 0x0000000080039000 because blocking a safepoint.
> >>
> >>
> >> Please review.
> >>
> >> Best regads,
> >> Martin
> >>
> >>
>