RFR(s): 8166944: Hanging Error Reporting steps may lead to torn error logs.

David Holmes david.holmes at oracle.com
Tue Oct 25 05:46:21 UTC 2016


On 18/10/2016 5:16 PM, David Holmes wrote:
> Hi Thomas,
>
> I took an initial look but am still mulling over things.

Sorry Thomas, I haven't had a chance to get back to this. It's hard to find
time for future features/enhancements at the moment. :)

Others should feel free to chime in on this. :)

David

> Note that as an enhancement this will need to wait for Java 10 repos to
> open - unless you go through the FC extension process.
>
> Thanks,
> David
>
> On 18/10/2016 4:22 PM, Thomas Stüfe wrote:
>> Ping.
>>
>> On Thu, Oct 13, 2016 at 6:55 AM, Thomas Stüfe <thomas.stuefe at gmail.com>
>> wrote:
>>
>>> Dear all,
>>>
>>> please take a look at the following fix:
>>>
>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8166944
>>> webrev: http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/webrev.00/webrev/index.html
>>>
>>> ---
>>>
>>> In short, this fix provides the ability to cancel hanging error reporting
>>> steps. This uses the same code paths secondary error handling uses during
>>> error reporting. With this patch, steps which take too long will be
>>> canceled after 1/2 ErrorLogTimeout. In the log file, it will look like this:
>>>
>>> [timeout occurred during error reporting in step "<stepname>"] after xxxx ms.
>>>
>>> and we now also get a finish message in the hs-err file if we hit the
>>> ErrorLogTimeout and error reporting will stop altogether:
>>>
>>> ------ Timout during error reporting after xxx ms. ------
>>>
>>> (in addition to the "time expired, abort" message the WatcherThread writes
>>> to stderr)
>>>
>>> ---
>>>
>>> This is something which bugged us for a long time, because we rely heavily
>>> on the hs_err files for error analysis at customer sites, and there are a
>>> number of reasons why one step may hang and prevent the follow-up steps
>>> from running.
>>>
>>> It works like this:
>>>
>>> Before, when error reporting started, the WatcherThread waited for
>>> ErrorLogTimeout seconds and then stopped the VM.
>>>
>>> Now, the WatcherThread periodically pings error reporting, which checks
>>> whether the last step has timed out. If it has, it sends a signal to the
>>> reporting thread, and the thread continues with the next step. This follows
>>> the same path as secondary crash handling.
>>>
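
The two-level timeout described above (cancel a single step after 1/2 ErrorLogTimeout, stop reporting altogether after the full ErrorLogTimeout) can be sketched as a small decision function that a watcher thread would call on each ping. This is only an illustration, not the code from the webrev: `WatchdogAction`, `ReportingClock` and `check_timeouts` are hypothetical names, and treating a timeout of 0 as "disabled" is an assumption.

```cpp
#include <cstdint>

// Hypothetical names throughout; this is not the HotSpot implementation.
enum class WatchdogAction { None, InterruptStep, AbortReporting };

struct ReportingClock {
    int64_t reporting_start_ms;  // when error reporting began
    int64_t step_start_ms;       // when the currently running step began
};

// Called periodically by the watcher. total_timeout_ms plays the role of
// ErrorLogTimeout, converted to milliseconds.
inline WatchdogAction check_timeouts(const ReportingClock& clock,
                                     int64_t now_ms,
                                     int64_t total_timeout_ms) {
    if (total_timeout_ms <= 0) {
        return WatchdogAction::None;  // assumption: 0 means "no timeout"
    }
    // The global limit wins: reporting as a whole has taken too long.
    if (now_ms - clock.reporting_start_ms >= total_timeout_ms) {
        return WatchdogAction::AbortReporting;
    }
    // A single step is cancelled after half of the global limit.
    if (now_ms - clock.step_start_ms >= total_timeout_ms / 2) {
        return WatchdogAction::InterruptStep;
    }
    return WatchdogAction::None;
}
```

On `InterruptStep` the watcher would signal the reporting thread so that it moves on to the next step; on `AbortReporting` it would write the final timeout message and stop the VM.
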
>>> Some implementation details:
>>>
>>> On Posix platforms, to interrupt the thread, I use pthread_kill. This
>>> means I must know the pthread id of the reporting thread, which I now store
>>> at the beginning of error reporting. We already store the reporting thread
>>> id in first_error_tid, but I cannot use that, because it gets set by
>>> os::current_thread_id(), which is not always the pthread id. Should we ever
>>> switch to only using the pthread id on Posix platforms, this code can be
>>> simplified.
>>>
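
As a rough illustration of the Posix approach, the sketch below stores the reporting thread's pthread id when reporting starts and lets a watcher interrupt a hanging step with `pthread_kill`. Everything except the POSIX calls is hypothetical: the step functions, the global names, the choice of SIGUSR1, and the use of `sigsetjmp`/`siglongjmp` to resume with the next step are assumptions made for this example. The actual patch routes the signal through the VM's secondary error handling instead.

```cpp
#include <pthread.h>
#include <setjmp.h>
#include <signal.h>
#include <unistd.h>
#include <atomic>

// Hypothetical names; a simplified stand-in for the mechanism described above.
static pthread_t g_reporter_thread;            // stored when reporting starts
static sigjmp_buf g_resume_point;              // where a cancelled step resumes
static int g_next_step = 0;                    // index of the next step to run
static std::atomic<int> g_current_step{-1};    // published for the watcher
static std::atomic<int> g_completed{0};
static std::atomic<int> g_cancelled{0};

// Runs on the reporting thread: abandon the current step.
static void on_interrupt(int) { siglongjmp(g_resume_point, 1); }

static void quick_step() {}
static void hanging_step() { for (;;) pause(); }  // never returns on its own

static void* reporter_main(void*) {
    static void (*const steps[])() = { quick_step, hanging_step, quick_step };
    g_reporter_thread = pthread_self();        // remember the pthread id
    while (g_next_step < 3) {
        int step = g_next_step++;
        // savemask=1 so the signal is unblocked again after siglongjmp.
        if (sigsetjmp(g_resume_point, 1) == 0) {
            g_current_step.store(step);        // only now may the watcher signal
            steps[step]();
            g_completed.fetch_add(1);
        } else {
            g_cancelled.fetch_add(1);          // step was interrupted
        }
    }
    return nullptr;
}

// Plays the watcher: waits for the hanging step, then interrupts it.
// Returns the number of cancelled steps.
static int run_demo() {
    g_next_step = 0;
    g_current_step.store(-1);
    g_completed.store(0);
    g_cancelled.store(0);

    struct sigaction sa{};
    sa.sa_handler = on_interrupt;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, nullptr);

    pthread_t t;
    pthread_create(&t, nullptr, reporter_main, nullptr);
    while (g_current_step.load() != 1) usleep(1000);  // step 1 is the hang
    pthread_kill(g_reporter_thread, SIGUSR1);         // needs the pthread id
    pthread_join(t, nullptr);
    return g_cancelled.load();
}
```

Calling `run_demo()` once should report one cancelled step and two completed ones. The sketch also shows why an arbitrary thread id is not enough: `pthread_kill` takes a `pthread_t`, so the id has to come from `pthread_self()` (or `pthread_create`) rather than from an OS-level thread id.
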
>>> On Windows, there is unfortunately no easy way to interrupt a
>>> non-cooperative thread. I would need a way to cause an SEH exception inside
>>> the target thread, which would then get handled by secondary error handling
>>> like on Posix platforms, but that is not easy. It is doable: one can suspend
>>> the thread and modify the thread context in a way that it will crash upon
>>> resume. But that felt a bit heavyweight for this problem. So on Windows,
>>> timeout handling still works (after ErrorLogTimeout the VM gets shut down),
>>> but error reporting steps are not interruptible. If we feel this is
>>> important, this can be added later.
>>>
>>> Kind Regards, Thomas


More information about the hotspot-runtime-dev mailing list