RFR(s): 8166944: Hanging Error Reporting steps may lead to torn error logs.

Thu Oct 27 07:16:39 UTC 2016

Hi David,

On Thu, Oct 27, 2016 at 2:40 AM, David Holmes <david.holmes at oracle.com>
wrote:

> On 27/10/2016 12:45 AM, Thomas Stüfe wrote:
>
>> Hi Chris,
>>
>> Thanks for the review!
>>
>> New
>> webrev: http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>> Error-Reporting/webrev.01/webrev/
>>
>
> Have not looked at this yet.
>
> Comments inline.
>>
>> On Wed, Oct 26, 2016 at 9:00 AM, Chris Plummer <chris.plummer at oracle.com
>> <mailto:chris.plummer at oracle.com>> wrote:
>>
>>     Hi Tomas,
>>
>>     See JDK-8156821. I'm curious as to how your changes will impact it,
>>     since David says you can't interrupt a thread blocked trying to
>>     acquire mutex. I suspect that means this enhancement won't help in
>>     this case, and presumably in general you are not fixing the issue of
>>     error reporting getting deadlocked, or maybe I'm misinterpreting
>>     what David said in JDK-8156821.
>>
>
> That should be 8156823
>
>
>>
>> Not sure what exactly David meant with "You can't "interrupt" a thread
>> that is blocked trying to acquire a mutex." Maybe he can elaborate :)
>>
>> My guesses:
>>
>> 1) If he meant "you cannot interrupt a thread blocking in
>> pthread_mutex_lock()" - not true, you can and my patch works just fine
>> in this situation. Just tested again, to be sure. This covers crashes in
>> sections guarded by pthread_mutex, which then try to reaquire the lock
>> in the error handler.
>>
>
> There is no specified, portable way to get a thread blocked acquiring a
> mutex to stop waiting for the mutex. That is what I meant.
> pthread_mutex_lock is not a cancellation point, nor will it return EINTR in
> response to a signal.
>
> However, if a signal is received by the thread while waiting then POSIX
> semantics indicate that the signal handler will run and then return the
> thread to the waiting state. In our case the crash handler does not return
> so we are into undefined territory there - but our crash handler is already
> not a well-defined signal handler as it is not restricted to
> async-signal-safe functions, so we already run a risk when executing it.
>
>
That was what I meant. Syscalls have not to be interruptible per design,
they just have to call user signal handlers for asynchronous signals.

And I think my patch does not make matters worse or more unsafe. There is
no new concept here - I use the pre-existing secondary signal handling and
that only in situations in which otherwise the error handling would very
probably hang forever, not producing any error log at all.

> I had not considered this aspect in relation to 8156823, so the proposed
> approach here would also attempt to address that issue.
>
> 2) If he meant "you cannot interrupt malloc if it is executing a system
>> call in the linux kernel" - that may be. I am not a linux kernel expert
>> but would have thought that syscalls may block if interrupts are
>> disabled for certain lengths by the syscall author. But in that case i
>> would have expected the process to hang too and to be not killable?
>> Again, I am no expert.
>>
>
> Note "interrupt" here is a logical concept not related to hardware level
> interrupts. I don't know at what point going into malloc you will no longer
> get signal handlers run - malloc doesn't use pthread level mutexes, but
> direct futuxes, so the same signal responsiveness may not be present.
>

I will have to take a closer look at what the glibc does. I always thought
that any locks it takes in user space are interruptible by signals, and
that libc calls only become uninterruptible when it calls kernel syscalls -
and those kernel syscalls cannot be interrupted (that was what I meant with
interrupts disabled). It may be wrong. Ill have to rethink this.

But whatever the outcome, there may be situations where a thread cannot be
interrupted by pthread_kill, but I think those cases are rare. More often
we just wait in very ordinary situations, be it a pthread mutex deadlock or
a slow file system. e.g. one of the more common scenarios is when you want
to print a stack trace and try to load symbol information to resolve an pc
to a name.

Thanks, Thomas

>
> Thanks,
> David
> -----
>
>
>
>>
>>     Otherwise overall your changes look good, but I have a few comments.
>>     Also, since this is an enhancement, it needs to wait for JDK 10.
>>
>>     I think your test will fail for product builds. You should add
>>     "@requires vm.debug == true". Also, java files use 4 char
>>     indentation, not 2 like we use in hotspot C/C++ code. Lastly, it
>>     should only have a 2016 copyright.
>>
>>
>> Thank you for the hints. Did fix all that. Note that I had disabled the
>> test for product builds in the code (!Platform.isDebugBuild()) but I
>> added the vm.debug tag as well as you suggested.
>>
>>
>>     A couple of files need the copyright updated to 2016.
>>
>>     Why do set_to_now() and get_timestamp() need to be atomic, and what
>>     are the consequences of cx8 not being supported?
>>
>>
>> The error reporting thread sets the timestamp on each STEP start, and
>> the timestamp is read from another thread, the WatcherThread. Timestamp
>> is 64bit. I wanted to make sure the 64bit value is written and read
>> atomically, especially on 32bit platforms.
>>
>> But then, I had to check whether 64bit atomic stores/loads are even
>> supported by this platform (I actually did not find a 32bit platform
>> whithout 64bit atomics, but the comment in atomic.hpp is pretty
>> insistent and I did not want to risk regressions for other platforms).
>>
>> Well, if no cx8 support was available, I pretty much just give up and
>> read and write timestamps directly. As I said, I am not sure if this
>> code path gets ever executed.
>>
>> Maybe I was overthinking all this and just reading and writing the (C++
>> volatile) jlongs would have been enough, but I wanted to prevent
>> sporadic test errors because of incompletely read 64bit values.
>>
>>
>>     1282         st->print_raw_cr(buffer);
>>     1283         st->cr();
>>
>>     The old code had an additional st->cr() before the above lines. I
>>     assume you removed it intentionally.
>>
>>
>> I hope I preserved the numbers of cr(). At least that was my intention:
>>
>> 1260       outputStream* const st = log.is_open() ? &log : &out;
>> 1261       st->cr();
>>
>> ...
>>
>> and then on every path, a cr (or print_raw_cr) at the end. Where do you
>> see the missing cr()?
>>
>>
>>
>>     Is there a reason why you decided to only allow one step to timeout.
>>     What if the cause of a timeout in a step also impacts other steps,
>>     or is that not common when we see timeouts?
>>
>>
>> That is mostly guesswork. In our (SAP) code we allow for four steps (so
>> ErrorLogTimeout/4 as step timeout) and additionally allow for "steps
>> known to be long" where timeouts are disabled altogether. But we also
>> have more complicated error reporting steps, so when porting the patch
>> to OpenJDK, I felt the complexity was unneeded.
>>
>> I think in general you will only have one misbehaving step, but you are
>> right, more than one step may timeout if e.g. the file system is slow.
>> I'm open for suggestions: the timeout value should be large enough not
>> to be hit for "normal slow steps" while still leave room enough for
>> other steps to finish. What do you think a reasonable timeout value
>> would be? ErrorLogTimeout/4?
>>
>>
>>
>>     It's not clear to me why you changed a couple of os::sleep() calls
>>     to os::naked_short_sleep(), and the rationale for the sleep periods.
>>     Can you please explain?
>>
>>
>> Because os::sleep() does a lot of work under the hood and relies on a
>> bit of VM infrastructure. I think that is not a good idea in error
>> situations where potentially everything may be broken already. You want
>> to step lightly and really only do a naked system sleep. About the sleep
>> periods, os::naked_sleep has an inbuilt maximum value of 1000ms, which I
>> have to stay below to not hit the assert. I did use 999ms as the longest
>> interval I am allowed to sleep nakedly. And after the timeout hit and
>> before the WatcherThread calls os::abort, I again sleep 200ms to give
>> the error reporter thread time to write the "error log aborted due to
>> timeout" into the error log and to flush the error log. Those 200ms are
>> just guesswork.
>>
>>
>>     thanks,
>>
>>     Chris
>>
>>
>> Thanks for the review!
>>
>> Kind Regards, Thomas
>>
>>
>>
>>
>>     On 10/12/16 9:55 PM, Thomas Stüfe wrote:
>>
>>         Dear all,
>>
>>         please take a look at the following fix:
>>
>>         Bug: https://bugs.openjdk.java.net/browse/JDK-8166944
>>         <https://bugs.openjdk.java.net/browse/JDK-8166944>
>>         webrev:
>>         http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>> Error-Reporting/webrev.00/webrev/index.html
>>         <http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>> Error-Reporting/webrev.00/webrev/index.html>
>>
>>         ---
>>
>>         In short, this fix provides the ability to cancel hanging error
>>         reporting
>>         steps. This uses the same code paths secondary error handling
>>         uses during
>>         error reporting. With this patch, steps which take too long will
>> be
>>         canceled after 1/2 ErrorLogTimeout. In the log file, it will
>>         look like this:
>>
>>         4 [timeout occurred during error reporting in step "<stepname>"]
>>         after xxxx
>>         ms.
>>         5
>>
>>         and we now also get a finish message in the hs-err file if we
>>         hit the
>>         ErrorLogTimeout and error reporting will stop altogether:
>>
>>         6 ------ Timout during error reporting after xxx ms. ------
>>
>>         (in addition to the "time expired, abort" message the
>>         WatcherThread writes
>>         to stderr)
>>
>>         ---
>>
>>         This is something which bugged us for a long time, because we
>>         rely heavily
>>         on the hs_err files for error analysis at customer sites, and
>>         there are a
>>         number of reasons why one step may hang and prevent the
>>         follow-up steps
>>         from running.
>>
>>         It works like this:
>>
>>         Before, when error reporting started, the WatcherThread was
>>         waiting for
>>         ErrorLogTimeout seconds, then would stop the VM.
>>
>>         Now, the WatcherThread periodically pings error reporting, which
>>         checks if
>>         the last step did timeout. If it does, it sends a signal to the
>>         reporting
>>         thread, and the thread will continue with the next step. This
>>         follows the
>>         same path as secondary crash handling.
>>
>>         Some implementation details:
>>
>>         On Posix platforms, to interrupt the thread, I use pthread_kill.
>>         This means
>>         I must know the pthread id of the reporting thread, which I now
>>         store at
>>         the beginning of error reporting. We already store the reporting
>>         thread id
>>         in first_error_tid, but that I cannot use, because it gets set by
>>         os::current_thread_id(), which is not always the pthread id.
>>         Should we ever
>>         switch to only using pthread id for posix platforms, this coding
>>         can be
>>         simplified.
>>
>>         On Windows, there is unfortunately no easy way to interrupt a
>>         non-cooperative thread. I would need a way to cause a SEH inside
>>         the target
>>         thread, which then would get handled by secondary error handling
>>         like on
>>         Posix platforms, but that is not easy. It is doable - one can
>>         suspend the
>>         thread, modify the thread context in a way that it will crash
>>         upon resume.
>>         But that felt a bit heavyweight for this problem. So on windows,
>>         timeout
>>         handling still works (after ErrorLogTimeout the VM gets shut
>>         down), but
>>         error reporting steps are not interruptable. If we feel this is
>>         important,
>>         this can be added later.
>>
>>         Kind Regards, Thomas
>>
>>
>>
>>
>>
>>