RFR(s): 8166944: Hanging Error Reporting steps may lead to torn error logs.
Chris Plummer
chris.plummer at oracle.com
Wed Oct 26 07:00:20 UTC 2016
Hi Tomas,
See JDK-8156821. I'm curious as to how your changes will impact it,
since David says you can't interrupt a thread blocked trying to acquire
mutex. I suspect that means this enhancement won't help in this case,
and presumably in general you are not fixing the issue of error
reporting getting deadlocked, or maybe I'm misinterpreting what David
said in JDK-8156821.
Otherwise overall your changes look good, but I have a few comments.
Also, since this is an enhancement, it needs to wait for JDK 10.
I think your test will fail for product builds. You should add
"@requires vm.debug == true". Also, java files use 4 char indentation,
not 2 like we use in hotspot C/C++ code. Lastly, it should only have a
2016 copyright.
A couple of files need the copyright updated to 2016.
Why do set_to_now() and get_timestamp() need to be atomic, and what are
the consequences of cx8 not being supported?
1282 st->print_raw_cr(buffer);
1283 st->cr();
The old code had an additional st->cr() before the above lines. I assume
you removed it intentionally.
Is there a reason why you decided to only allow one step to timeout.
What if the cause of a timeout in a step also impacts other steps, or is
that not common when we see timeouts?
It's not clear to me why you changed a couple of os::sleep() calls to
os::naked_short_sleep(), and the rationale for the sleep periods. Can
you please explain?
thanks,
Chris
On 10/12/16 9:55 PM, Thomas Stüfe wrote:
> Dear all,
>
> please take a look at the following fix:
>
> Bug: https://bugs.openjdk.java.net/browse/JDK-8166944
> webrev:
> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/webrev.00/webrev/index.html
>
> ---
>
> In short, this fix provides the ability to cancel hanging error reporting
> steps. This uses the same code paths secondary error handling uses during
> error reporting. With this patch, steps which take too long will be
> canceled after 1/2 ErrorLogTimeout. In the log file, it will look like this:
>
> 4 [timeout occurred during error reporting in step "<stepname>"] after xxxx
> ms.
> 5
>
> and we now also get a finish message in the hs-err file if we hit the
> ErrorLogTimeout and error reporting will stop altogether:
>
> 6 ------ Timout during error reporting after xxx ms. ------
>
> (in addition to the "time expired, abort" message the WatcherThread writes
> to stderr)
>
> ---
>
> This is something which bugged us for a long time, because we rely heavily
> on the hs_err files for error analysis at customer sites, and there are a
> number of reasons why one step may hang and prevent the follow-up steps
> from running.
>
> It works like this:
>
> Before, when error reporting started, the WatcherThread was waiting for
> ErrorLogTimeout seconds, then would stop the VM.
>
> Now, the WatcherThread periodically pings error reporting, which checks if
> the last step did timeout. If it does, it sends a signal to the reporting
> thread, and the thread will continue with the next step. This follows the
> same path as secondary crash handling.
>
> Some implementation details:
>
> On Posix platforms, to interrupt the thread, I use pthread_kill. This means
> I must know the pthread id of the reporting thread, which I now store at
> the beginning of error reporting. We already store the reporting thread id
> in first_error_tid, but that I cannot use, because it gets set by
> os::current_thread_id(), which is not always the pthread id. Should we ever
> switch to only using pthread id for posix platforms, this coding can be
> simplified.
>
> On Windows, there is unfortunately no easy way to interrupt a
> non-cooperative thread. I would need a way to cause a SEH inside the target
> thread, which then would get handled by secondary error handling like on
> Posix platforms, but that is not easy. It is doable - one can suspend the
> thread, modify the thread context in a way that it will crash upon resume.
> But that felt a bit heavyweight for this problem. So on windows, timeout
> handling still works (after ErrorLogTimeout the VM gets shut down), but
> error reporting steps are not interruptable. If we feel this is important,
> this can be added later.
>
> Kind Regards, Thomas
More information about the hotspot-runtime-dev
mailing list