RFR(s): 8166944: Hanging Error Reporting steps may lead to torn error logs.

Tue Oct 18 06:22:08 UTC 2016

Ping.

On Thu, Oct 13, 2016 at 6:55 AM, Thomas Stüfe <thomas.stuefe at gmail.com>
wrote:

> Dear all,
>
> please take a look at the following fix:
>
> Bug: https://bugs.openjdk.java.net/browse/JDK-8166944
> webrev: http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/webrev.00/webrev/index.html
>
> ---
>
> In short, this fix provides the ability to cancel hanging error reporting
> steps. This uses the same code paths secondary error handling uses during
> error reporting. With this patch, steps which take too long will be
> canceled after 1/2 ErrorLogTimeout. In the log file, it will look like this:
>
> [timeout occurred during error reporting in step "<stepname>"] after xxxx ms.
>
> We now also get a final message in the hs_err file if we hit
> ErrorLogTimeout and error reporting stops altogether:
>
> ------ Timeout during error reporting after xxx ms. ------
>
> (in addition to the "time expired, abort" message the WatcherThread writes
> to stderr)
>
> ---
>
> This is something which bugged us for a long time, because we rely heavily
> on the hs_err files for error analysis at customer sites, and there are a
> number of reasons why one step may hang and prevent the follow-up steps
> from running.
>
> It works like this:
>
> Previously, once error reporting started, the WatcherThread waited for
> ErrorLogTimeout seconds and then stopped the VM.
>
> Now, the WatcherThread periodically pings error reporting, which checks
> whether the most recently started step has timed out. If it has, it sends a
> signal to the reporting thread, and the thread continues with the next
> step. This follows the same path as secondary crash handling.
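
To illustrate the two budgets described above (half of ErrorLogTimeout per
step, the full ErrorLogTimeout overall), here is a minimal sketch of the check
a watcher could run on each ping. All names (TimeoutAction, ReportingClock,
check_timeout) are hypothetical and are not taken from the webrev; the real
patch may structure this differently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration; not the HotSpot implementation.
enum class TimeoutAction { None, InterruptStep, AbortReporting };

struct ReportingClock {
    int64_t reporting_start_ms;  // when error reporting began
    int64_t step_start_ms;       // when the current step began
};

// Intended to be called periodically by the watcher.
// error_log_timeout_s mirrors the ErrorLogTimeout flag (in seconds);
// a single step is given half of that budget.
inline TimeoutAction check_timeout(const ReportingClock& c, int64_t now_ms,
                                   int64_t error_log_timeout_s) {
    assert(error_log_timeout_s > 0);
    const int64_t global_limit_ms = error_log_timeout_s * 1000;
    const int64_t step_limit_ms = global_limit_ms / 2;

    // The overall limit wins: reporting stops altogether.
    if (now_ms - c.reporting_start_ms >= global_limit_ms) {
        return TimeoutAction::AbortReporting;
    }
    // Otherwise only the hanging step is cancelled.
    if (now_ms - c.step_start_ms >= step_limit_ms) {
        return TimeoutAction::InterruptStep;
    }
    return TimeoutAction::None;
}
```

With a 120 second ErrorLogTimeout, a step that has been running for 60
seconds would be interrupted, and reporting as a whole would be aborted at
120 seconds.
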
>
> Some implementation details:
>
> On POSIX platforms, to interrupt the thread, I use pthread_kill. This
> means I must know the pthread id of the reporting thread, which I now store
> at the beginning of error reporting. We already store the reporting thread
> id in first_error_tid, but I cannot use that, because it gets set by
> os::current_thread_id(), which is not always the pthread id. Should we ever
> switch to using only the pthread id on POSIX platforms, this code can be
> simplified.
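
A minimal sketch of the POSIX part, assuming the reporting thread records its
own pthread_t when reporting starts and another thread later signals it with
pthread_kill. The names below (record_reporting_thread,
interrupt_reporting_thread, and so on) are hypothetical. The handler here only
sets a flag, whereas the patch routes the signal into the existing secondary
error handling.

```cpp
#include <cassert>
#include <pthread.h>
#include <signal.h>
#include <atomic>

// Hypothetical names for illustration; not the HotSpot implementation.
static pthread_t g_reporter_tid;
static std::atomic<bool> g_reporter_tid_set{false};
static volatile sig_atomic_t g_interrupted = 0;

// Async-signal-safe handler: it only sets a flag.
extern "C" void on_interrupt(int) { g_interrupted = 1; }

void install_interrupt_handler(int signo) {
    struct sigaction sa{};
    sa.sa_handler = on_interrupt;
    sigemptyset(&sa.sa_mask);
    int rc = sigaction(signo, &sa, nullptr);
    assert(rc == 0);
    (void)rc;
}

// Called by the reporting thread at the beginning of error reporting.
void record_reporting_thread() {
    g_reporter_tid = pthread_self();
    g_reporter_tid_set.store(true, std::memory_order_release);
}

// Called by the watcher; returns true if the signal was sent.
bool interrupt_reporting_thread(int signo) {
    if (!g_reporter_tid_set.load(std::memory_order_acquire)) {
        return false;  // reporting has not started, nothing to interrupt
    }
    return pthread_kill(g_reporter_tid, signo) == 0;
}
```

Storing the pthread_t separately is what makes pthread_kill usable here,
since a generic thread id is not guaranteed to be a valid pthread_kill target.
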
>
> On Windows, there is unfortunately no easy way to interrupt a
> non-cooperative thread. I would need a way to cause a SEH inside the target
> thread, which then would get handled by secondary error handling like on
> POSIX platforms, but that is not easy. It is doable: one can suspend the
> thread and modify its thread context so that it crashes upon resume.
> But that felt a bit heavyweight for this problem. So on Windows, timeout
> handling still works (after ErrorLogTimeout the VM gets shut down), but
> error reporting steps are not interruptible. If we feel this is important,
> this can be added later.
>
> Kind Regards, Thomas