RFR(s): 8166944: Hanging Error Reporting steps may lead to torn error logs.

Chris Plummer chris.plummer at oracle.com
Wed Oct 26 07:00:20 UTC 2016


Hi Tomas,

See JDK-8156821. I'm curious as to how your changes will impact it, 
since David says you can't interrupt a thread blocked trying to acquire 
mutex. I suspect that means this enhancement won't help in this case, 
and presumably in general you are not fixing the issue of error 
reporting getting deadlocked, or maybe I'm misinterpreting what David 
said in JDK-8156821.

Otherwise overall your changes look good, but I have a few comments. 
Also, since this is an enhancement, it needs to wait for JDK 10.

I think your test will fail for product builds. You should add 
"@requires vm.debug == true". Also, java files use 4 char indentation, 
not 2 like we use in hotspot C/C++ code. Lastly, it should only have a 
2016 copyright.

A couple of files need the copyright updated to 2016.

Why do set_to_now() and get_timestamp() need to be atomic, and what are 
the consequences of cx8 not being supported?

1282         st->print_raw_cr(buffer);
1283         st->cr();

The old code had an additional st->cr() before the above lines. I assume 
you removed it intentionally.

Is there a reason why you decided to only allow one step to timeout. 
What if the cause of a timeout in a step also impacts other steps, or is 
that not common when we see timeouts?

It's not clear to me why you changed a couple of os::sleep() calls to 
os::naked_short_sleep(), and the rationale for the sleep periods. Can 
you please explain?

thanks,

Chris

On 10/12/16 9:55 PM, Thomas Stüfe wrote:
> Dear all,
>
> please take a look at the following fix:
>
> Bug: https://bugs.openjdk.java.net/browse/JDK-8166944
> webrev:
> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/webrev.00/webrev/index.html
>
> ---
>
> In short, this fix provides the ability to cancel hanging error reporting
> steps. This uses the same code paths secondary error handling uses during
> error reporting. With this patch, steps which take too long will be
> canceled after 1/2 ErrorLogTimeout. In the log file, it will look like this:
>
> 4 [timeout occurred during error reporting in step "<stepname>"] after xxxx
> ms.
> 5
>
> and we now also get a finish message in the hs-err file if we hit the
> ErrorLogTimeout and error reporting will stop altogether:
>
> 6 ------ Timout during error reporting after xxx ms. ------
>
> (in addition to the "time expired, abort" message the WatcherThread writes
> to stderr)
>
> ---
>
> This is something which bugged us for a long time, because we rely heavily
> on the hs_err files for error analysis at customer sites, and there are a
> number of reasons why one step may hang and prevent the follow-up steps
> from running.
>
> It works like this:
>
> Before, when error reporting started, the WatcherThread was waiting for
> ErrorLogTimeout seconds, then would stop the VM.
>
> Now, the WatcherThread periodically pings error reporting, which checks if
> the last step did timeout. If it does, it sends a signal to the reporting
> thread, and the thread will continue with the next step. This follows the
> same path as secondary crash handling.
>
> Some implementation details:
>
> On Posix platforms, to interrupt the thread, I use pthread_kill. This means
> I must know the pthread id of the reporting thread, which I now store at
> the beginning of error reporting. We already store the reporting thread id
> in first_error_tid, but that I cannot use, because it gets set by
> os::current_thread_id(), which is not always the pthread id. Should we ever
> switch to only using pthread id for posix platforms, this coding can be
> simplified.
>
> On Windows, there is unfortunately no easy way to interrupt a
> non-cooperative thread. I would need a way to cause a SEH inside the target
> thread, which then would get handled by secondary error handling like on
> Posix platforms, but that is not easy. It is doable - one can suspend the
> thread, modify the thread context in a way that it will crash upon resume.
> But that felt a bit heavyweight for this problem. So on windows, timeout
> handling still works (after ErrorLogTimeout the VM gets shut down), but
> error reporting steps are not interruptable. If we feel this is important,
> this can be added later.
>
> Kind Regards, Thomas





More information about the hotspot-runtime-dev mailing list