RFR(s): 8166944: Hanging Error Reporting steps may lead to torn error logs.

David Holmes david.holmes at oracle.com
Thu Oct 27 17:45:17 UTC 2016


Hi Thomas,

Totally agree your proposal makes things no better nor worse when it 
comes to what we do from the signal handler, and it may help with those 
deadlock situations. I'm not concerned about digging too deep into 
malloc to see whether it may or may not help in that particular case - 
it either will or it won't.

Overall I think this is looking quite good. I hope we get the JDK10 repo 
very soon ... once the JDK10 project officially takes off 9and probably 
once the repo consolidation project has settled on a final repo layout).

Thanks,
David

On 27/10/2016 5:16 PM, Thomas Stüfe wrote:
> Hi David,
>
> On Thu, Oct 27, 2016 at 2:40 AM, David Holmes <david.holmes at oracle.com
> <mailto:david.holmes at oracle.com>> wrote:
>
>     On 27/10/2016 12:45 AM, Thomas Stüfe wrote:
>
>         Hi Chris,
>
>         Thanks for the review!
>
>         New
>         webrev:
>         http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/webrev.01/webrev/
>         <http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/webrev.01/webrev/>
>
>
>     Have not looked at this yet.
>
>         Comments inline.
>
>         On Wed, Oct 26, 2016 at 9:00 AM, Chris Plummer
>         <chris.plummer at oracle.com <mailto:chris.plummer at oracle.com>
>         <mailto:chris.plummer at oracle.com
>         <mailto:chris.plummer at oracle.com>>> wrote:
>
>             Hi Tomas,
>
>             See JDK-8156821. I'm curious as to how your changes will
>         impact it,
>             since David says you can't interrupt a thread blocked trying to
>             acquire mutex. I suspect that means this enhancement won't
>         help in
>             this case, and presumably in general you are not fixing the
>         issue of
>             error reporting getting deadlocked, or maybe I'm misinterpreting
>             what David said in JDK-8156821.
>
>
>     That should be 8156823
>
>
>
>         Not sure what exactly David meant with "You can't "interrupt" a
>         thread
>         that is blocked trying to acquire a mutex." Maybe he can
>         elaborate :)
>
>         My guesses:
>
>         1) If he meant "you cannot interrupt a thread blocking in
>         pthread_mutex_lock()" - not true, you can and my patch works
>         just fine
>         in this situation. Just tested again, to be sure. This covers
>         crashes in
>         sections guarded by pthread_mutex, which then try to reaquire
>         the lock
>         in the error handler.
>
>
>     There is no specified, portable way to get a thread blocked
>     acquiring a mutex to stop waiting for the mutex. That is what I
>     meant. pthread_mutex_lock is not a cancellation point, nor will it
>     return EINTR in response to a signal.
>
>     However, if a signal is received by the thread while waiting then
>     POSIX semantics indicate that the signal handler will run and then
>     return the thread to the waiting state. In our case the crash
>     handler does not return so we are into undefined territory there -
>     but our crash handler is already not a well-defined signal handler
>     as it is not restricted to async-signal-safe functions, so we
>     already run a risk when executing it.
>
>
> That was what I meant. Syscalls have not to be interruptible per design,
> they just have to call user signal handlers for asynchronous signals.
>
> And I think my patch does not make matters worse or more unsafe. There
> is no new concept here - I use the pre-existing secondary signal
> handling and that only in situations in which otherwise the error
> handling would very probably hang forever, not producing any error log
> at all.
>
>
>     I had not considered this aspect in relation to 8156823, so the
>     proposed approach here would also attempt to address that issue.
>
>         2) If he meant "you cannot interrupt malloc if it is executing a
>         system
>         call in the linux kernel" - that may be. I am not a linux kernel
>         expert
>         but would have thought that syscalls may block if interrupts are
>         disabled for certain lengths by the syscall author. But in that
>         case i
>         would have expected the process to hang too and to be not killable?
>         Again, I am no expert.
>
>
>     Note "interrupt" here is a logical concept not related to hardware
>     level interrupts. I don't know at what point going into malloc you
>     will no longer get signal handlers run - malloc doesn't use pthread
>     level mutexes, but direct futuxes, so the same signal responsiveness
>     may not be present.
>
>
> I will have to take a closer look at what the glibc does. I always
> thought that any locks it takes in user space are interruptible by
> signals, and that libc calls only become uninterruptible when it calls
> kernel syscalls - and those kernel syscalls cannot be interrupted (that
> was what I meant with interrupts disabled). It may be wrong. Ill have to
> rethink this.
>
> But whatever the outcome, there may be situations where a thread cannot
> be interrupted by pthread_kill, but I think those cases are rare. More
> often we just wait in very ordinary situations, be it a pthread mutex
> deadlock or a slow file system. e.g. one of the more common scenarios is
> when you want to print a stack trace and try to load symbol information
> to resolve an pc to a name.
>
> Thanks, Thomas
>
>
>
>     Thanks,
>     David
>     -----
>
>
>
>
>             Otherwise overall your changes look good, but I have a few
>         comments.
>             Also, since this is an enhancement, it needs to wait for JDK 10.
>
>             I think your test will fail for product builds. You should add
>             "@requires vm.debug == true". Also, java files use 4 char
>             indentation, not 2 like we use in hotspot C/C++ code. Lastly, it
>             should only have a 2016 copyright.
>
>
>         Thank you for the hints. Did fix all that. Note that I had
>         disabled the
>         test for product builds in the code (!Platform.isDebugBuild()) but I
>         added the vm.debug tag as well as you suggested.
>
>
>             A couple of files need the copyright updated to 2016.
>
>             Why do set_to_now() and get_timestamp() need to be atomic,
>         and what
>             are the consequences of cx8 not being supported?
>
>
>         The error reporting thread sets the timestamp on each STEP
>         start, and
>         the timestamp is read from another thread, the WatcherThread.
>         Timestamp
>         is 64bit. I wanted to make sure the 64bit value is written and read
>         atomically, especially on 32bit platforms.
>
>         But then, I had to check whether 64bit atomic stores/loads are even
>         supported by this platform (I actually did not find a 32bit platform
>         whithout 64bit atomics, but the comment in atomic.hpp is pretty
>         insistent and I did not want to risk regressions for other
>         platforms).
>
>         Well, if no cx8 support was available, I pretty much just give
>         up and
>         read and write timestamps directly. As I said, I am not sure if this
>         code path gets ever executed.
>
>         Maybe I was overthinking all this and just reading and writing
>         the (C++
>         volatile) jlongs would have been enough, but I wanted to prevent
>         sporadic test errors because of incompletely read 64bit values.
>
>
>             1282         st->print_raw_cr(buffer);
>             1283         st->cr();
>
>             The old code had an additional st->cr() before the above
>         lines. I
>             assume you removed it intentionally.
>
>
>         I hope I preserved the numbers of cr(). At least that was my
>         intention:
>
>         1260       outputStream* const st = log.is_open() ? &log : &out;
>         1261       st->cr();
>
>         ...
>
>         and then on every path, a cr (or print_raw_cr) at the end. Where
>         do you
>         see the missing cr()?
>
>
>
>             Is there a reason why you decided to only allow one step to
>         timeout.
>             What if the cause of a timeout in a step also impacts other
>         steps,
>             or is that not common when we see timeouts?
>
>
>         That is mostly guesswork. In our (SAP) code we allow for four
>         steps (so
>         ErrorLogTimeout/4 as step timeout) and additionally allow for "steps
>         known to be long" where timeouts are disabled altogether. But we
>         also
>         have more complicated error reporting steps, so when porting the
>         patch
>         to OpenJDK, I felt the complexity was unneeded.
>
>         I think in general you will only have one misbehaving step, but
>         you are
>         right, more than one step may timeout if e.g. the file system is
>         slow.
>         I'm open for suggestions: the timeout value should be large
>         enough not
>         to be hit for "normal slow steps" while still leave room enough for
>         other steps to finish. What do you think a reasonable timeout value
>         would be? ErrorLogTimeout/4?
>
>
>
>             It's not clear to me why you changed a couple of os::sleep()
>         calls
>             to os::naked_short_sleep(), and the rationale for the sleep
>         periods.
>             Can you please explain?
>
>
>         Because os::sleep() does a lot of work under the hood and relies
>         on a
>         bit of VM infrastructure. I think that is not a good idea in error
>         situations where potentially everything may be broken already.
>         You want
>         to step lightly and really only do a naked system sleep. About
>         the sleep
>         periods, os::naked_sleep has an inbuilt maximum value of 1000ms,
>         which I
>         have to stay below to not hit the assert. I did use 999ms as the
>         longest
>         interval I am allowed to sleep nakedly. And after the timeout
>         hit and
>         before the WatcherThread calls os::abort, I again sleep 200ms to
>         give
>         the error reporter thread time to write the "error log aborted
>         due to
>         timeout" into the error log and to flush the error log. Those
>         200ms are
>         just guesswork.
>
>
>             thanks,
>
>             Chris
>
>
>         Thanks for the review!
>
>         Kind Regards, Thomas
>
>
>
>
>             On 10/12/16 9:55 PM, Thomas Stüfe wrote:
>
>                 Dear all,
>
>                 please take a look at the following fix:
>
>                 Bug: https://bugs.openjdk.java.net/browse/JDK-8166944
>         <https://bugs.openjdk.java.net/browse/JDK-8166944>
>                 <https://bugs.openjdk.java.net/browse/JDK-8166944
>         <https://bugs.openjdk.java.net/browse/JDK-8166944>>
>                 webrev:
>
>         http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/webrev.00/webrev/index.html
>         <http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/webrev.00/webrev/index.html>
>
>         <http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/webrev.00/webrev/index.html
>         <http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/webrev.00/webrev/index.html>>
>
>                 ---
>
>                 In short, this fix provides the ability to cancel
>         hanging error
>                 reporting
>                 steps. This uses the same code paths secondary error
>         handling
>                 uses during
>                 error reporting. With this patch, steps which take too
>         long will be
>                 canceled after 1/2 ErrorLogTimeout. In the log file, it will
>                 look like this:
>
>                 4 [timeout occurred during error reporting in step
>         "<stepname>"]
>                 after xxxx
>                 ms.
>                 5
>
>                 and we now also get a finish message in the hs-err file
>         if we
>                 hit the
>                 ErrorLogTimeout and error reporting will stop altogether:
>
>                 6 ------ Timout during error reporting after xxx ms. ------
>
>                 (in addition to the "time expired, abort" message the
>                 WatcherThread writes
>                 to stderr)
>
>                 ---
>
>                 This is something which bugged us for a long time,
>         because we
>                 rely heavily
>                 on the hs_err files for error analysis at customer
>         sites, and
>                 there are a
>                 number of reasons why one step may hang and prevent the
>                 follow-up steps
>                 from running.
>
>                 It works like this:
>
>                 Before, when error reporting started, the WatcherThread was
>                 waiting for
>                 ErrorLogTimeout seconds, then would stop the VM.
>
>                 Now, the WatcherThread periodically pings error
>         reporting, which
>                 checks if
>                 the last step did timeout. If it does, it sends a signal
>         to the
>                 reporting
>                 thread, and the thread will continue with the next step.
>         This
>                 follows the
>                 same path as secondary crash handling.
>
>                 Some implementation details:
>
>                 On Posix platforms, to interrupt the thread, I use
>         pthread_kill.
>                 This means
>                 I must know the pthread id of the reporting thread,
>         which I now
>                 store at
>                 the beginning of error reporting. We already store the
>         reporting
>                 thread id
>                 in first_error_tid, but that I cannot use, because it
>         gets set by
>                 os::current_thread_id(), which is not always the pthread id.
>                 Should we ever
>                 switch to only using pthread id for posix platforms,
>         this coding
>                 can be
>                 simplified.
>
>                 On Windows, there is unfortunately no easy way to
>         interrupt a
>                 non-cooperative thread. I would need a way to cause a
>         SEH inside
>                 the target
>                 thread, which then would get handled by secondary error
>         handling
>                 like on
>                 Posix platforms, but that is not easy. It is doable -
>         one can
>                 suspend the
>                 thread, modify the thread context in a way that it will
>         crash
>                 upon resume.
>                 But that felt a bit heavyweight for this problem. So on
>         windows,
>                 timeout
>                 handling still works (after ErrorLogTimeout the VM gets shut
>                 down), but
>                 error reporting steps are not interruptable. If we feel
>         this is
>                 important,
>                 this can be added later.
>
>                 Kind Regards, Thomas
>
>
>
>
>
>


More information about the hotspot-runtime-dev mailing list