RFR(s): 8166944: Hanging Error Reporting steps may lead to torn error logs.

Thu Oct 27 03:22:35 UTC 2016

On 10/26/16 5:40 PM, David Holmes wrote:
> On 27/10/2016 12:45 AM, Thomas Stüfe wrote:
>> Hi Chris,
>>
>> Thanks for the review!
>>
>> New
>> webrev: 
>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/webrev.01/webrev/
>
> Have not looked at this yet.
>
>> Comments inline.
>>
>> On Wed, Oct 26, 2016 at 9:00 AM, Chris Plummer <chris.plummer at oracle.com
>> <mailto:chris.plummer at oracle.com>> wrote:
>>
>>     Hi Tomas,
>>
>>     See JDK-8156821. I'm curious as to how your changes will impact it,
>>     since David says you can't interrupt a thread blocked trying to
>>     acquire mutex. I suspect that means this enhancement won't help in
>>     this case, and presumably in general you are not fixing the issue of
>>     error reporting getting deadlocked, or maybe I'm misinterpreting
>>     what David said in JDK-8156821.
>
> That should be 8156823
>
>>
>>
>> Not sure what exactly David meant with "You can't "interrupt" a thread
>> that is blocked trying to acquire a mutex." Maybe he can elaborate :)
>>
>> My guesses:
>>
>> 1) If he meant "you cannot interrupt a thread blocking in
>> pthread_mutex_lock()" - not true, you can and my patch works just fine
>> in this situation. Just tested again, to be sure. This covers crashes in
>> sections guarded by pthread_mutex, which then try to reaquire the lock
>> in the error handler.
>
> There is no specified, portable way to get a thread blocked acquiring 
> a mutex to stop waiting for the mutex. That is what I meant. 
> pthread_mutex_lock is not a cancellation point, nor will it return 
> EINTR in response to a signal.
>
> However, if a signal is received by the thread while waiting then 
> POSIX semantics indicate that the signal handler will run and then 
> return the thread to the waiting state. In our case the crash handler 
> does not return so we are into undefined territory there - but our 
> crash handler is already not a well-defined signal handler as it is 
> not restricted to async-signal-safe functions, so we already run a 
> risk when executing it.
>
> I had not considered this aspect in relation to 8156823, so the 
> proposed approach here would also attempt to address that issue.
>
>> 2) If he meant "you cannot interrupt malloc if it is executing a system
>> call in the linux kernel" - that may be. I am not a linux kernel expert
>> but would have thought that syscalls may block if interrupts are
>> disabled for certain lengths by the syscall author. But in that case i
>> would have expected the process to hang too and to be not killable?
>> Again, I am no expert.
>
> Note "interrupt" here is a logical concept not related to hardware 
> level interrupts. I don't know at what point going into malloc you 
> will no longer get signal handlers run - malloc doesn't use pthread 
> level mutexes, but direct futuxes, so the same signal responsiveness 
> may not be present.
It probably would not be all that hard to create the malloc crash from 
8156823 and then see how Thomas' changes impact how VMError handles it. 
Just find an appropriate place in the VM to malloc a chunk of memory, 
step all over the bytes before and after it, and then call free.

Chris
>
>
> Thanks,
> David
> -----
>
>>
>>
>>     Otherwise overall your changes look good, but I have a few comments.
>>     Also, since this is an enhancement, it needs to wait for JDK 10.
>>
>>     I think your test will fail for product builds. You should add
>>     "@requires vm.debug == true". Also, java files use 4 char
>>     indentation, not 2 like we use in hotspot C/C++ code. Lastly, it
>>     should only have a 2016 copyright.
>>
>>
>> Thank you for the hints. Did fix all that. Note that I had disabled the
>> test for product builds in the code (!Platform.isDebugBuild()) but I
>> added the vm.debug tag as well as you suggested.
>>
>>
>>     A couple of files need the copyright updated to 2016.
>>
>>     Why do set_to_now() and get_timestamp() need to be atomic, and what
>>     are the consequences of cx8 not being supported?
>>
>>
>> The error reporting thread sets the timestamp on each STEP start, and
>> the timestamp is read from another thread, the WatcherThread. Timestamp
>> is 64bit. I wanted to make sure the 64bit value is written and read
>> atomically, especially on 32bit platforms.
>>
>> But then, I had to check whether 64bit atomic stores/loads are even
>> supported by this platform (I actually did not find a 32bit platform
>> whithout 64bit atomics, but the comment in atomic.hpp is pretty
>> insistent and I did not want to risk regressions for other platforms).
>>
>> Well, if no cx8 support was available, I pretty much just give up and
>> read and write timestamps directly. As I said, I am not sure if this
>> code path gets ever executed.
>>
>> Maybe I was overthinking all this and just reading and writing the (C++
>> volatile) jlongs would have been enough, but I wanted to prevent
>> sporadic test errors because of incompletely read 64bit values.
>>
>>
>>     1282         st->print_raw_cr(buffer);
>>     1283         st->cr();
>>
>>     The old code had an additional st->cr() before the above lines. I
>>     assume you removed it intentionally.
>>
>>
>> I hope I preserved the numbers of cr(). At least that was my intention:
>>
>> 1260       outputStream* const st = log.is_open() ? &log : &out;
>> 1261       st->cr();
>>
>> ...
>>
>> and then on every path, a cr (or print_raw_cr) at the end. Where do you
>> see the missing cr()?
>>
>>
>>
>>     Is there a reason why you decided to only allow one step to timeout.
>>     What if the cause of a timeout in a step also impacts other steps,
>>     or is that not common when we see timeouts?
>>
>>
>> That is mostly guesswork. In our (SAP) code we allow for four steps (so
>> ErrorLogTimeout/4 as step timeout) and additionally allow for "steps
>> known to be long" where timeouts are disabled altogether. But we also
>> have more complicated error reporting steps, so when porting the patch
>> to OpenJDK, I felt the complexity was unneeded.
>>
>> I think in general you will only have one misbehaving step, but you are
>> right, more than one step may timeout if e.g. the file system is slow.
>> I'm open for suggestions: the timeout value should be large enough not
>> to be hit for "normal slow steps" while still leave room enough for
>> other steps to finish. What do you think a reasonable timeout value
>> would be? ErrorLogTimeout/4?
>>
>>
>>
>>     It's not clear to me why you changed a couple of os::sleep() calls
>>     to os::naked_short_sleep(), and the rationale for the sleep periods.
>>     Can you please explain?
>>
>>
>> Because os::sleep() does a lot of work under the hood and relies on a
>> bit of VM infrastructure. I think that is not a good idea in error
>> situations where potentially everything may be broken already. You want
>> to step lightly and really only do a naked system sleep. About the sleep
>> periods, os::naked_sleep has an inbuilt maximum value of 1000ms, which I
>> have to stay below to not hit the assert. I did use 999ms as the longest
>> interval I am allowed to sleep nakedly. And after the timeout hit and
>> before the WatcherThread calls os::abort, I again sleep 200ms to give
>> the error reporter thread time to write the "error log aborted due to
>> timeout" into the error log and to flush the error log. Those 200ms are
>> just guesswork.
>>
>>
>>     thanks,
>>
>>     Chris
>>
>>
>> Thanks for the review!
>>
>> Kind Regards, Thomas
>>
>>
>>
>>
>>     On 10/12/16 9:55 PM, Thomas Stüfe wrote:
>>
>>         Dear all,
>>
>>         please take a look at the following fix:
>>
>>         Bug: https://bugs.openjdk.java.net/browse/JDK-8166944
>>         <https://bugs.openjdk.java.net/browse/JDK-8166944>
>>         webrev:
>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/webrev.00/webrev/index.html
>> <http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/webrev.00/webrev/index.html>
>>
>>         ---
>>
>>         In short, this fix provides the ability to cancel hanging error
>>         reporting
>>         steps. This uses the same code paths secondary error handling
>>         uses during
>>         error reporting. With this patch, steps which take too long 
>> will be
>>         canceled after 1/2 ErrorLogTimeout. In the log file, it will
>>         look like this:
>>
>>         4 [timeout occurred during error reporting in step "<stepname>"]
>>         after xxxx
>>         ms.
>>         5
>>
>>         and we now also get a finish message in the hs-err file if we
>>         hit the
>>         ErrorLogTimeout and error reporting will stop altogether:
>>
>>         6 ------ Timout during error reporting after xxx ms. ------
>>
>>         (in addition to the "time expired, abort" message the
>>         WatcherThread writes
>>         to stderr)
>>
>>         ---
>>
>>         This is something which bugged us for a long time, because we
>>         rely heavily
>>         on the hs_err files for error analysis at customer sites, and
>>         there are a
>>         number of reasons why one step may hang and prevent the
>>         follow-up steps
>>         from running.
>>
>>         It works like this:
>>
>>         Before, when error reporting started, the WatcherThread was
>>         waiting for
>>         ErrorLogTimeout seconds, then would stop the VM.
>>
>>         Now, the WatcherThread periodically pings error reporting, which
>>         checks if
>>         the last step did timeout. If it does, it sends a signal to the
>>         reporting
>>         thread, and the thread will continue with the next step. This
>>         follows the
>>         same path as secondary crash handling.
>>
>>         Some implementation details:
>>
>>         On Posix platforms, to interrupt the thread, I use pthread_kill.
>>         This means
>>         I must know the pthread id of the reporting thread, which I now
>>         store at
>>         the beginning of error reporting. We already store the reporting
>>         thread id
>>         in first_error_tid, but that I cannot use, because it gets 
>> set by
>>         os::current_thread_id(), which is not always the pthread id.
>>         Should we ever
>>         switch to only using pthread id for posix platforms, this coding
>>         can be
>>         simplified.
>>
>>         On Windows, there is unfortunately no easy way to interrupt a
>>         non-cooperative thread. I would need a way to cause a SEH inside
>>         the target
>>         thread, which then would get handled by secondary error handling
>>         like on
>>         Posix platforms, but that is not easy. It is doable - one can
>>         suspend the
>>         thread, modify the thread context in a way that it will crash
>>         upon resume.
>>         But that felt a bit heavyweight for this problem. So on windows,
>>         timeout
>>         handling still works (after ErrorLogTimeout the VM gets shut
>>         down), but
>>         error reporting steps are not interruptable. If we feel this is
>>         important,
>>         this can be added later.
>>
>>         Kind Regards, Thomas
>>
>>
>>
>>
>>