RFR(s): 8166944: Hanging Error Reporting steps may lead to torn error logs.
Thomas Stüfe
thomas.stuefe at gmail.com
Thu Oct 27 07:16:39 UTC 2016
Hi David,
On Thu, Oct 27, 2016 at 2:40 AM, David Holmes <david.holmes at oracle.com>
wrote:
> On 27/10/2016 12:45 AM, Thomas Stüfe wrote:
>
>> Hi Chris,
>>
>> Thanks for the review!
>>
>> New
>> webrev: http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>> Error-Reporting/webrev.01/webrev/
>>
>
> Have not looked at this yet.
>
> Comments inline.
>>
>> On Wed, Oct 26, 2016 at 9:00 AM, Chris Plummer <chris.plummer at oracle.com
>> <mailto:chris.plummer at oracle.com>> wrote:
>>
>> Hi Tomas,
>>
>> See JDK-8156821. I'm curious as to how your changes will impact it,
>> since David says you can't interrupt a thread blocked trying to
>> acquire mutex. I suspect that means this enhancement won't help in
>> this case, and presumably in general you are not fixing the issue of
>> error reporting getting deadlocked, or maybe I'm misinterpreting
>> what David said in JDK-8156821.
>>
>
> That should be 8156823
>
>
>>
>> Not sure what exactly David meant with "You can't "interrupt" a thread
>> that is blocked trying to acquire a mutex." Maybe he can elaborate :)
>>
>> My guesses:
>>
>> 1) If he meant "you cannot interrupt a thread blocking in
>> pthread_mutex_lock()" - not true, you can and my patch works just fine
>> in this situation. Just tested again, to be sure. This covers crashes in
>> sections guarded by pthread_mutex, which then try to reaquire the lock
>> in the error handler.
>>
>
> There is no specified, portable way to get a thread blocked acquiring a
> mutex to stop waiting for the mutex. That is what I meant.
> pthread_mutex_lock is not a cancellation point, nor will it return EINTR in
> response to a signal.
>
> However, if a signal is received by the thread while waiting then POSIX
> semantics indicate that the signal handler will run and then return the
> thread to the waiting state. In our case the crash handler does not return
> so we are into undefined territory there - but our crash handler is already
> not a well-defined signal handler as it is not restricted to
> async-signal-safe functions, so we already run a risk when executing it.
>
>
That was what I meant. Syscalls have not to be interruptible per design,
they just have to call user signal handlers for asynchronous signals.
And I think my patch does not make matters worse or more unsafe. There is
no new concept here - I use the pre-existing secondary signal handling and
that only in situations in which otherwise the error handling would very
probably hang forever, not producing any error log at all.
> I had not considered this aspect in relation to 8156823, so the proposed
> approach here would also attempt to address that issue.
>
> 2) If he meant "you cannot interrupt malloc if it is executing a system
>> call in the linux kernel" - that may be. I am not a linux kernel expert
>> but would have thought that syscalls may block if interrupts are
>> disabled for certain lengths by the syscall author. But in that case i
>> would have expected the process to hang too and to be not killable?
>> Again, I am no expert.
>>
>
> Note "interrupt" here is a logical concept not related to hardware level
> interrupts. I don't know at what point going into malloc you will no longer
> get signal handlers run - malloc doesn't use pthread level mutexes, but
> direct futuxes, so the same signal responsiveness may not be present.
>
I will have to take a closer look at what the glibc does. I always thought
that any locks it takes in user space are interruptible by signals, and
that libc calls only become uninterruptible when it calls kernel syscalls -
and those kernel syscalls cannot be interrupted (that was what I meant with
interrupts disabled). It may be wrong. Ill have to rethink this.
But whatever the outcome, there may be situations where a thread cannot be
interrupted by pthread_kill, but I think those cases are rare. More often
we just wait in very ordinary situations, be it a pthread mutex deadlock or
a slow file system. e.g. one of the more common scenarios is when you want
to print a stack trace and try to load symbol information to resolve an pc
to a name.
Thanks, Thomas
>
> Thanks,
> David
> -----
>
>
>
>>
>> Otherwise overall your changes look good, but I have a few comments.
>> Also, since this is an enhancement, it needs to wait for JDK 10.
>>
>> I think your test will fail for product builds. You should add
>> "@requires vm.debug == true". Also, java files use 4 char
>> indentation, not 2 like we use in hotspot C/C++ code. Lastly, it
>> should only have a 2016 copyright.
>>
>>
>> Thank you for the hints. Did fix all that. Note that I had disabled the
>> test for product builds in the code (!Platform.isDebugBuild()) but I
>> added the vm.debug tag as well as you suggested.
>>
>>
>> A couple of files need the copyright updated to 2016.
>>
>> Why do set_to_now() and get_timestamp() need to be atomic, and what
>> are the consequences of cx8 not being supported?
>>
>>
>> The error reporting thread sets the timestamp on each STEP start, and
>> the timestamp is read from another thread, the WatcherThread. Timestamp
>> is 64bit. I wanted to make sure the 64bit value is written and read
>> atomically, especially on 32bit platforms.
>>
>> But then, I had to check whether 64bit atomic stores/loads are even
>> supported by this platform (I actually did not find a 32bit platform
>> whithout 64bit atomics, but the comment in atomic.hpp is pretty
>> insistent and I did not want to risk regressions for other platforms).
>>
>> Well, if no cx8 support was available, I pretty much just give up and
>> read and write timestamps directly. As I said, I am not sure if this
>> code path gets ever executed.
>>
>> Maybe I was overthinking all this and just reading and writing the (C++
>> volatile) jlongs would have been enough, but I wanted to prevent
>> sporadic test errors because of incompletely read 64bit values.
>>
>>
>> 1282 st->print_raw_cr(buffer);
>> 1283 st->cr();
>>
>> The old code had an additional st->cr() before the above lines. I
>> assume you removed it intentionally.
>>
>>
>> I hope I preserved the numbers of cr(). At least that was my intention:
>>
>> 1260 outputStream* const st = log.is_open() ? &log : &out;
>> 1261 st->cr();
>>
>> ...
>>
>> and then on every path, a cr (or print_raw_cr) at the end. Where do you
>> see the missing cr()?
>>
>>
>>
>> Is there a reason why you decided to only allow one step to timeout.
>> What if the cause of a timeout in a step also impacts other steps,
>> or is that not common when we see timeouts?
>>
>>
>> That is mostly guesswork. In our (SAP) code we allow for four steps (so
>> ErrorLogTimeout/4 as step timeout) and additionally allow for "steps
>> known to be long" where timeouts are disabled altogether. But we also
>> have more complicated error reporting steps, so when porting the patch
>> to OpenJDK, I felt the complexity was unneeded.
>>
>> I think in general you will only have one misbehaving step, but you are
>> right, more than one step may timeout if e.g. the file system is slow.
>> I'm open for suggestions: the timeout value should be large enough not
>> to be hit for "normal slow steps" while still leave room enough for
>> other steps to finish. What do you think a reasonable timeout value
>> would be? ErrorLogTimeout/4?
>>
>>
>>
>> It's not clear to me why you changed a couple of os::sleep() calls
>> to os::naked_short_sleep(), and the rationale for the sleep periods.
>> Can you please explain?
>>
>>
>> Because os::sleep() does a lot of work under the hood and relies on a
>> bit of VM infrastructure. I think that is not a good idea in error
>> situations where potentially everything may be broken already. You want
>> to step lightly and really only do a naked system sleep. About the sleep
>> periods, os::naked_sleep has an inbuilt maximum value of 1000ms, which I
>> have to stay below to not hit the assert. I did use 999ms as the longest
>> interval I am allowed to sleep nakedly. And after the timeout hit and
>> before the WatcherThread calls os::abort, I again sleep 200ms to give
>> the error reporter thread time to write the "error log aborted due to
>> timeout" into the error log and to flush the error log. Those 200ms are
>> just guesswork.
>>
>>
>> thanks,
>>
>> Chris
>>
>>
>> Thanks for the review!
>>
>> Kind Regards, Thomas
>>
>>
>>
>>
>> On 10/12/16 9:55 PM, Thomas Stüfe wrote:
>>
>> Dear all,
>>
>> please take a look at the following fix:
>>
>> Bug: https://bugs.openjdk.java.net/browse/JDK-8166944
>> <https://bugs.openjdk.java.net/browse/JDK-8166944>
>> webrev:
>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>> Error-Reporting/webrev.00/webrev/index.html
>> <http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>> Error-Reporting/webrev.00/webrev/index.html>
>>
>> ---
>>
>> In short, this fix provides the ability to cancel hanging error
>> reporting
>> steps. This uses the same code paths secondary error handling
>> uses during
>> error reporting. With this patch, steps which take too long will
>> be
>> canceled after 1/2 ErrorLogTimeout. In the log file, it will
>> look like this:
>>
>> 4 [timeout occurred during error reporting in step "<stepname>"]
>> after xxxx
>> ms.
>> 5
>>
>> and we now also get a finish message in the hs-err file if we
>> hit the
>> ErrorLogTimeout and error reporting will stop altogether:
>>
>> 6 ------ Timout during error reporting after xxx ms. ------
>>
>> (in addition to the "time expired, abort" message the
>> WatcherThread writes
>> to stderr)
>>
>> ---
>>
>> This is something which bugged us for a long time, because we
>> rely heavily
>> on the hs_err files for error analysis at customer sites, and
>> there are a
>> number of reasons why one step may hang and prevent the
>> follow-up steps
>> from running.
>>
>> It works like this:
>>
>> Before, when error reporting started, the WatcherThread was
>> waiting for
>> ErrorLogTimeout seconds, then would stop the VM.
>>
>> Now, the WatcherThread periodically pings error reporting, which
>> checks if
>> the last step did timeout. If it does, it sends a signal to the
>> reporting
>> thread, and the thread will continue with the next step. This
>> follows the
>> same path as secondary crash handling.
>>
>> Some implementation details:
>>
>> On Posix platforms, to interrupt the thread, I use pthread_kill.
>> This means
>> I must know the pthread id of the reporting thread, which I now
>> store at
>> the beginning of error reporting. We already store the reporting
>> thread id
>> in first_error_tid, but that I cannot use, because it gets set by
>> os::current_thread_id(), which is not always the pthread id.
>> Should we ever
>> switch to only using pthread id for posix platforms, this coding
>> can be
>> simplified.
>>
>> On Windows, there is unfortunately no easy way to interrupt a
>> non-cooperative thread. I would need a way to cause a SEH inside
>> the target
>> thread, which then would get handled by secondary error handling
>> like on
>> Posix platforms, but that is not easy. It is doable - one can
>> suspend the
>> thread, modify the thread context in a way that it will crash
>> upon resume.
>> But that felt a bit heavyweight for this problem. So on windows,
>> timeout
>> handling still works (after ErrorLogTimeout the VM gets shut
>> down), but
>> error reporting steps are not interruptable. If we feel this is
>> important,
>> this can be added later.
>>
>> Kind Regards, Thomas
>>
>>
>>
>>
>>
>>
More information about the hotspot-runtime-dev
mailing list