RFR(s): 8166944: Hanging Error Reporting steps may lead to torn error logs.

Thu Oct 13 04:55:55 UTC 2016

Dear all,

please take a look at the following fix:

Bug: https://bugs.openjdk.java.net/browse/JDK-8166944
webrev:
http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/webrev.00/webrev/index.html

---

In short, this fix provides the ability to cancel hanging error reporting
steps. This uses the same code paths secondary error handling uses during
error reporting. With this patch, steps which take too long will be
canceled after 1/2 ErrorLogTimeout. In the log file, it will look like this:

[timeout occurred during error reporting in step "<stepname>"] after xxxx ms.

We now also get a final message in the hs_err file if we hit
ErrorLogTimeout and error reporting stops altogether:

------ Timeout during error reporting after xxx ms. ------

(in addition to the "time expired, abort" message the WatcherThread writes
to stderr)

---

This is something which bugged us for a long time, because we rely heavily
on the hs_err files for error analysis at customer sites, and there are a
number of reasons why one step may hang and prevent the follow-up steps
from running.

It works like this:

Before, when error reporting started, the WatcherThread waited for
ErrorLogTimeout seconds and then stopped the VM.

Now, the WatcherThread periodically pings error reporting, which checks
whether the current step has timed out. If it has, a signal is sent to the
reporting thread, and the thread continues with the next step. This follows
the same path as secondary crash handling.
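
To make the timing rule concrete, here is a minimal sketch of the decision
the periodic check has to make. It is not the code from the webrev: the
names WatchAction, ReportingClock and check_timeouts are made up for
illustration, and it assumes the step limit is exactly half of
ErrorLogTimeout, as described above.

```cpp
#include <cstdint>

// Hypothetical names for illustration only; not the identifiers in the webrev.
enum class WatchAction { None, InterruptStep, AbortReporting };

struct ReportingClock {
    int64_t reporting_start_ms;  // when error reporting began
    int64_t step_start_ms;       // when the currently running step began
};

// Decision made on each periodic ping from the watcher. Kept as a pure
// function of timestamps so it can be checked without any threads.
WatchAction check_timeouts(const ReportingClock& c, int64_t now_ms,
                           int64_t error_log_timeout_s) {
    const int64_t total_limit_ms = error_log_timeout_s * 1000;
    // Assumption: a single step may use at most half of ErrorLogTimeout.
    const int64_t step_limit_ms = total_limit_ms / 2;

    if (now_ms - c.reporting_start_ms >= total_limit_ms) {
        return WatchAction::AbortReporting;  // overall timeout: stop the VM
    }
    if (now_ms - c.step_start_ms >= step_limit_ms) {
        return WatchAction::InterruptStep;   // cancel only the hanging step
    }
    return WatchAction::None;
}
```

The overall limit is checked first, so a step that happens to exceed its
own limit right at the end of the budget still ends error reporting rather
than starting yet another step.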

Some implementation details:

On Posix platforms, to interrupt the thread, I use pthread_kill. This means
I must know the pthread id of the reporting thread, which I now store at
the beginning of error reporting. We already store the reporting thread id
in first_error_tid, but I cannot use that, because it gets set by
os::current_thread_id(), which is not always the pthread id. Should we ever
switch to using only the pthread id on Posix platforms, this code can be
simplified.
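
As a rough illustration of the Posix mechanism (again, not the webrev code),
the sketch below stores pthread_self() when "reporting" starts and later
interrupts that thread with pthread_kill. All names are hypothetical. It
also assumes a harmless signal (SIGUSR1) with a handler that only sets a
flag, whereas the actual patch routes the interruption through secondary
error handling.

```cpp
#include <pthread.h>
#include <signal.h>
#include <unistd.h>
#include <atomic>

// Hypothetical globals; the patch keeps its own record of the pthread id.
static pthread_t g_reporter_tid;
static std::atomic<bool> g_reporter_tid_set{false};
static volatile sig_atomic_t g_interrupted = 0;

// Assumption: a benign handler that just records the interruption.
static void on_interrupt(int) { g_interrupted = 1; }

static void* reporting_thread(void*) {
    // Store the pthread id at the beginning of error reporting.
    g_reporter_tid = pthread_self();
    g_reporter_tid_set.store(true, std::memory_order_release);
    // Stand-in for a hanging step: loops until the signal handler has run.
    while (!g_interrupted) {
        usleep(1000);
    }
    return nullptr;
}

// Plays the watcher's role: waits for the id, then interrupts the reporter.
bool interrupt_hanging_reporter() {
    struct sigaction sa{};
    sa.sa_handler = on_interrupt;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, nullptr) != 0) return false;

    pthread_t t;
    if (pthread_create(&t, nullptr, reporting_thread, nullptr) != 0) return false;

    while (!g_reporter_tid_set.load(std::memory_order_acquire)) {
        usleep(1000);
    }
    // pthread_kill needs a pthread_t, which is why the pthread id is stored.
    const int rc = pthread_kill(g_reporter_tid, SIGUSR1);
    pthread_join(t, nullptr);
    return rc == 0 && g_interrupted == 1;
}
```

pthread_kill takes a pthread_t, so an id obtained from some other source
(such as a kernel thread id) would not be usable here; that is the reason
for storing the pthread id separately.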

On Windows, there is unfortunately no easy way to interrupt a
non-cooperative thread. I would need a way to cause a SEH inside the target
thread, which then would get handled by secondary error handling like on
Posix platforms, but that is not easy. It is doable: one can suspend the
thread and modify its context so that it crashes upon resume. But that
felt a bit heavyweight for this problem. So on Windows, timeout handling
still works (after ErrorLogTimeout the VM gets shut down), but error
reporting steps are not interruptible. If we feel this is important, it
can be added later.

Kind Regards, Thomas