RFR(s): 8166944: Hanging Error Reporting steps may lead to torn error logs.

Wed Oct 26 14:45:40 UTC 2016

Hi Chris,

Thanks for the review!

New webrev:
http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-Error-Reporting/webrev.01/webrev/

Comments inline.

On Wed, Oct 26, 2016 at 9:00 AM, Chris Plummer <chris.plummer at oracle.com>
wrote:

> Hi Tomas,
>
> See JDK-8156821. I'm curious as to how your changes will impact it, since
> David says you can't interrupt a thread blocked trying to acquire mutex. I
> suspect that means this enhancement won't help in this case, and presumably
> in general you are not fixing the issue of error reporting getting
> deadlocked, or maybe I'm misinterpreting what David said in JDK-8156821.
>

Not sure what exactly David meant with "You can't "interrupt" a thread that
is blocked trying to acquire a mutex." Maybe he can elaborate :)

My guesses:

1) If he meant "you cannot interrupt a thread blocking in
pthread_mutex_lock()" - not true, you can and my patch works just fine in
this situation. Just tested again, to be sure. This covers crashes in
sections guarded by pthread_mutex, which then try to reaquire the lock in
the error handler.

2) If he meant "you cannot interrupt malloc if it is executing a system
call in the linux kernel" - that may be. I am not a linux kernel expert but
would have thought that syscalls may block if interrupts are disabled for
certain lengths by the syscall author. But in that case i would have
expected the process to hang too and to be not killable? Again, I am no
expert.

>
> Otherwise overall your changes look good, but I have a few comments. Also,
> since this is an enhancement, it needs to wait for JDK 10.
>
> I think your test will fail for product builds. You should add "@requires
> vm.debug == true". Also, java files use 4 char indentation, not 2 like we
> use in hotspot C/C++ code. Lastly, it should only have a 2016 copyright.
>
>
Thank you for the hints. Did fix all that. Note that I had disabled the
test for product builds in the code (!Platform.isDebugBuild()) but I added
the vm.debug tag as well as you suggested.

> A couple of files need the copyright updated to 2016.
>
> Why do set_to_now() and get_timestamp() need to be atomic, and what are
> the consequences of cx8 not being supported?
>
>
The error reporting thread sets the timestamp on each STEP start, and the
timestamp is read from another thread, the WatcherThread. Timestamp is
64bit. I wanted to make sure the 64bit value is written and read
atomically, especially on 32bit platforms.

But then, I had to check whether 64bit atomic stores/loads are even
supported by this platform (I actually did not find a 32bit platform
whithout 64bit atomics, but the comment in atomic.hpp is pretty insistent
and I did not want to risk regressions for other platforms).

Well, if no cx8 support was available, I pretty much just give up and read
and write timestamps directly. As I said, I am not sure if this code path
gets ever executed.

Maybe I was overthinking all this and just reading and writing the (C++
volatile) jlongs would have been enough, but I wanted to prevent sporadic
test errors because of incompletely read 64bit values.

> 1282         st->print_raw_cr(buffer);
> 1283         st->cr();
>
> The old code had an additional st->cr() before the above lines. I assume
> you removed it intentionally.
>
>
I hope I preserved the numbers of cr(). At least that was my intention:

1260       outputStream* const st = log.is_open() ? &log : &out;
1261       st->cr();

...

and then on every path, a cr (or print_raw_cr) at the end. Where do you see
the missing cr()?

> Is there a reason why you decided to only allow one step to timeout. What
> if the cause of a timeout in a step also impacts other steps, or is that
> not common when we see timeouts?
>
>
That is mostly guesswork. In our (SAP) code we allow for four steps (so
ErrorLogTimeout/4 as step timeout) and additionally allow for "steps known
to be long" where timeouts are disabled altogether. But we also have more
complicated error reporting steps, so when porting the patch to OpenJDK, I
felt the complexity was unneeded.

I think in general you will only have one misbehaving step, but you are
right, more than one step may timeout if e.g. the file system is slow. I'm
open for suggestions: the timeout value should be large enough not to be
hit for "normal slow steps" while still leave room enough for other steps
to finish. What do you think a reasonable timeout value would be?
ErrorLogTimeout/4?

> It's not clear to me why you changed a couple of os::sleep() calls to
> os::naked_short_sleep(), and the rationale for the sleep periods. Can you
> please explain?
>
>
Because os::sleep() does a lot of work under the hood and relies on a bit
of VM infrastructure. I think that is not a good idea in error situations
where potentially everything may be broken already. You want to step
lightly and really only do a naked system sleep. About the sleep periods,
os::naked_sleep has an inbuilt maximum value of 1000ms, which I have to
stay below to not hit the assert. I did use 999ms as the longest interval I
am allowed to sleep nakedly. And after the timeout hit and before the
WatcherThread calls os::abort, I again sleep 200ms to give the error
reporter thread time to write the "error log aborted due to timeout" into
the error log and to flush the error log. Those 200ms are just guesswork.

> thanks,
>
> Chris
>
>
Thanks for the review!

Kind Regards, Thomas

>
> On 10/12/16 9:55 PM, Thomas Stüfe wrote:
>
>> Dear all,
>>
>> please take a look at the following fix:
>>
>> Bug: https://bugs.openjdk.java.net/browse/JDK-8166944
>> webrev:
>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-
>> Error-Reporting/webrev.00/webrev/index.html
>>
>> ---
>>
>> In short, this fix provides the ability to cancel hanging error reporting
>> steps. This uses the same code paths secondary error handling uses during
>> error reporting. With this patch, steps which take too long will be
>> canceled after 1/2 ErrorLogTimeout. In the log file, it will look like
>> this:
>>
>> 4 [timeout occurred during error reporting in step "<stepname>"] after
>> xxxx
>> ms.
>> 5
>>
>> and we now also get a finish message in the hs-err file if we hit the
>> ErrorLogTimeout and error reporting will stop altogether:
>>
>> 6 ------ Timout during error reporting after xxx ms. ------
>>
>> (in addition to the "time expired, abort" message the WatcherThread writes
>> to stderr)
>>
>> ---
>>
>> This is something which bugged us for a long time, because we rely heavily
>> on the hs_err files for error analysis at customer sites, and there are a
>> number of reasons why one step may hang and prevent the follow-up steps
>> from running.
>>
>> It works like this:
>>
>> Before, when error reporting started, the WatcherThread was waiting for
>> ErrorLogTimeout seconds, then would stop the VM.
>>
>> Now, the WatcherThread periodically pings error reporting, which checks if
>> the last step did timeout. If it does, it sends a signal to the reporting
>> thread, and the thread will continue with the next step. This follows the
>> same path as secondary crash handling.
>>
>> Some implementation details:
>>
>> On Posix platforms, to interrupt the thread, I use pthread_kill. This
>> means
>> I must know the pthread id of the reporting thread, which I now store at
>> the beginning of error reporting. We already store the reporting thread id
>> in first_error_tid, but that I cannot use, because it gets set by
>> os::current_thread_id(), which is not always the pthread id. Should we
>> ever
>> switch to only using pthread id for posix platforms, this coding can be
>> simplified.
>>
>> On Windows, there is unfortunately no easy way to interrupt a
>> non-cooperative thread. I would need a way to cause a SEH inside the
>> target
>> thread, which then would get handled by secondary error handling like on
>> Posix platforms, but that is not easy. It is doable - one can suspend the
>> thread, modify the thread context in a way that it will crash upon resume.
>> But that felt a bit heavyweight for this problem. So on windows, timeout
>> handling still works (after ErrorLogTimeout the VM gets shut down), but
>> error reporting steps are not interruptable. If we feel this is important,
>> this can be added later.
>>
>> Kind Regards, Thomas
>>
>
>
>
>