RFR(s): 8166944: Hanging Error Reporting steps may lead to torn error logs.

Tue Oct 18 07:16:19 UTC 2016

Hi Thomas,

I took an initial look but am still mulling over things.

Note that as an enhancement this will need to wait for Java 10 repos to 
open - unless you go through the FC extension process.

Thanks,
David

On 18/10/2016 4:22 PM, Thomas Stüfe wrote:
> Ping.
>
> On Thu, Oct 13, 2016 at 6:55 AM, Thomas Stüfe <thomas.stuefe at gmail.com>
> wrote:
>
>> Dear all,
>>
>> please take a look at the following fix:
>>
>> Bug: https://bugs.openjdk.java.net/browse/JDK-8166944
>> webrev: http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/webrev.00/webrev/index.html
>>
>> ---
>>
>> In short, this fix provides the ability to cancel hanging error reporting
>> steps. This uses the same code paths secondary error handling uses during
>> error reporting. With this patch, steps which take too long will be
>> canceled after 1/2 ErrorLogTimeout. In the log file, it will look like this:
>>
>> [timeout occurred during error reporting in step "<stepname>"] after xxxx ms.
>>
>> and we now also get a finish message in the hs-err file if we hit the
>> ErrorLogTimeout and error reporting will stop altogether:
>>
>> ------ Timeout during error reporting after xxx ms. ------
>>
>> (in addition to the "time expired, abort" message the WatcherThread writes
>> to stderr)
>>
>> ---
>>
>> This is something which bugged us for a long time, because we rely heavily
>> on the hs_err files for error analysis at customer sites, and there are a
>> number of reasons why one step may hang and prevent the follow-up steps
>> from running.
>>
>> It works like this:
>>
>> Before, when error reporting started, the WatcherThread would wait for
>> ErrorLogTimeout seconds and then stop the VM.
>>
>> Now, the WatcherThread periodically pings error reporting, which checks
>> whether the current step has timed out. If it has, it sends a signal to the
>> reporting thread, and the thread continues with the next step. This follows
>> the same path as secondary crash handling.
>>
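
The two timeouts described above (a per-step limit of 1/2 ErrorLogTimeout and an overall limit of ErrorLogTimeout) can be sketched as a small decision function that a watcher thread might call on every ping. This is only an illustration: the names WatchdogAction, ReportingClock and check_timeouts are hypothetical and are not taken from the webrev, and the sketch assumes ErrorLogTimeout is given in seconds.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration; not the code from the webrev.
enum class WatchdogAction {
    None,            // nothing to do yet
    InterruptStep,   // current step exceeded 1/2 ErrorLogTimeout
    AbortReporting   // whole report exceeded ErrorLogTimeout
};

struct ReportingClock {
    int64_t reporting_start_ms;  // when error reporting began
    int64_t step_start_ms;       // when the current step began
};

// Decide what the watcher should do at time now_ms.
// Assumption: error_log_timeout_s is ErrorLogTimeout in seconds.
WatchdogAction check_timeouts(const ReportingClock& clock,
                              int64_t now_ms,
                              int64_t error_log_timeout_s) {
    assert(error_log_timeout_s > 0);
    const int64_t total_limit_ms = error_log_timeout_s * 1000;
    const int64_t step_limit_ms = total_limit_ms / 2;

    // The overall limit takes precedence: reporting stops altogether.
    if (now_ms - clock.reporting_start_ms >= total_limit_ms) {
        return WatchdogAction::AbortReporting;
    }
    // Otherwise only the current step is canceled.
    if (now_ms - clock.step_start_ms >= step_limit_ms) {
        return WatchdogAction::InterruptStep;
    }
    return WatchdogAction::None;
}
```

With ErrorLogTimeout set to 120, for example, a step that has been running for 60 seconds would be interrupted, and reporting as a whole would be abandoned 120 seconds after it started.
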
>> Some implementation details:
>>
>> On Posix platforms, to interrupt the thread, I use pthread_kill. This
>> means I must know the pthread id of the reporting thread, which I now store
>> at the beginning of error reporting. We already store the reporting thread
>> id in first_error_tid, but I cannot use that, because it is set from
>> os::current_thread_id(), which is not always the pthread id. Should we ever
>> switch to using only the pthread id on Posix platforms, this code can be
>> simplified.
>>
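
A minimal, self-contained sketch of the Posix approach described above: the reporting thread records its own pthread id, and a watcher interrupts a hanging step with pthread_kill so that reporting continues with the next step. Several things here are assumptions made for the sake of a runnable example and are not taken from the webrev: the use of SIGUSR1, the sigsetjmp/siglongjmp recovery (HotSpot instead re-enters its error handler, as in secondary crash handling), and every identifier in the code.

```cpp
#include <atomic>
#include <cassert>
#include <csetjmp>
#include <csignal>
#include <pthread.h>
#include <unistd.h>

// All names are hypothetical; this is not the HotSpot implementation.
static pthread_t g_reporter;                 // pthread id stored when reporting starts
static sigjmp_buf g_step_env;                // recovery point for the current step
static std::atomic<int> g_running_step{-1};  // index of the running step, -1 if none
static std::atomic<int> g_next_step{0};
static std::atomic<int> g_completed{0};
static std::atomic<int> g_interrupted{0};
static const int kSteps = 3;

// Runs in the reporting thread because pthread_kill targets that thread.
extern "C" void on_interrupt(int) {
    siglongjmp(g_step_env, 1);
}

static void* reporter_main(void*) {
    g_reporter = pthread_self();  // remember who to signal
    while (g_next_step.load() < kSteps) {
        if (sigsetjmp(g_step_env, 1) == 0) {
            g_running_step.store(g_next_step.load());
            if (g_next_step.load() == 1) {
                for (;;) pause();  // step 1 simulates a hanging step
            }
            g_completed++;
        } else {
            g_interrupted++;       // step was canceled by the watcher
        }
        g_running_step.store(-1);
        g_next_step++;
    }
    return nullptr;
}

// Plays the watcher role. Returns true if the hanging step was
// interrupted and the remaining steps still ran.
bool run_demo() {
    struct sigaction sa = {};
    sa.sa_handler = on_interrupt;
    sigemptyset(&sa.sa_mask);
    int rc = sigaction(SIGUSR1, &sa, nullptr);
    assert(rc == 0);
    (void)rc;

    pthread_t t;
    if (pthread_create(&t, nullptr, reporter_main, nullptr) != 0) return false;

    // Wait until the hanging step is running, then let it "time out".
    while (g_running_step.load() != 1) usleep(1000);
    usleep(50 * 1000);  // stand-in for the real step timeout
    pthread_kill(g_reporter, SIGUSR1);

    pthread_join(t, nullptr);
    return g_interrupted.load() == 1 && g_completed.load() == 2;
}
```

In this sketch the watcher only signals once it has seen the hanging step start, so the jump buffer is always valid when the handler runs. A real implementation would have to handle the case where a step finishes just as the signal is sent.
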
>> On Windows, there is unfortunately no easy way to interrupt a
>> non-cooperative thread. I would need a way to cause a SEH inside the target
>> thread, which then would get handled by secondary error handling like on
>> Posix platforms, but that is not easy. It is doable: one can suspend the
>> thread and modify the thread context so that it crashes upon resume.
>> But that felt a bit heavyweight for this problem. So on Windows, timeout
>> handling still works (after ErrorLogTimeout the VM gets shut down), but
>> error reporting steps are not interruptible. If we feel this is important,
>> this can be added later.
>>
>> Kind Regards, Thomas