RFR(s): 8166944: Hanging Error Reporting steps may lead to torn error logs.

Thu Oct 27 06:09:23 UTC 2016

Hi Chris,

On Wed, Oct 26, 2016 at 9:27 PM, Chris Plummer <chris.plummer at oracle.com>
wrote:

> Hi Thomas,
>
> On 10/26/16 7:45 AM, Thomas Stüfe wrote:
>
> Hi Chris,
>
> Thanks for the review!
>
> New webrev: http://cr.openjdk.java.net/~stuefe/webrevs/
> 8166944-Hanging-Error-Reporting/webrev.01/webrev/
>
> Comments inline.
>
> On Wed, Oct 26, 2016 at 9:00 AM, Chris Plummer <chris.plummer at oracle.com>
> wrote:
>
>> Hi Tomas,
>>
>> See JDK-8156821. I'm curious as to how your changes will impact it, since
>> David says you can't interrupt a thread blocked trying to acquire mutex. I
>> suspect that means this enhancement won't help in this case, and presumably
>> in general you are not fixing the issue of error reporting getting
>> deadlocked, or maybe I'm misinterpreting what David said in JDK-8156821.
>>
>
> Not sure what exactly David meant with "You can't "interrupt" a thread
> that is blocked trying to acquire a mutex." Maybe he can elaborate :)
>
> My guesses:
>
> 1) If he meant "you cannot interrupt a thread blocking in
> pthread_mutex_lock()" - not true, you can and my patch works just fine in
> this situation. Just tested again, to be sure. This covers crashes in
> sections guarded by pthread_mutex, which then try to reaquire the lock in
> the error handler.
>
> 2) If he meant "you cannot interrupt malloc if it is executing a system
> call in the linux kernel" - that may be. I am not a linux kernel expert
> but would have thought that syscalls may block if interrupts are disabled
> for certain lengths by the syscall author. But in that case i would have
> expected the process to hang too and to be not killable? Again, I am no
> expert.
>
> Ok. I'll let David explain once he's available.
>
>
>
>>
>> Otherwise overall your changes look good, but I have a few comments.
>> Also, since this is an enhancement, it needs to wait for JDK 10.
>>
>> I think your test will fail for product builds. You should add "@requires
>> vm.debug == true". Also, java files use 4 char indentation, not 2 like we
>> use in hotspot C/C++ code. Lastly, it should only have a 2016 copyright.
>>
>>
> Thank you for the hints. Did fix all that. Note that I had disabled the
> test for product builds in the code (!Platform.isDebugBuild()) but I
> added the vm.debug tag as well as you suggested.
>
> Ah, sorry I missed that, but IMHO the Platform checks should only be used
> to alter test behavior, not completely disable the entire test. @requires
> is best for disabling a test for certain platforms and builds. You should
> probably remove the Platform checks and also add 'os.family != "windows"'
> to the @requires line.
>

Good point, thank you! Will adjust the test accordingly.

>
>
>
>> A couple of files need the copyright updated to 2016.
>>
>> Why do set_to_now() and get_timestamp() need to be atomic, and what are
>> the consequences of cx8 not being supported?
>>
>>
> The error reporting thread sets the timestamp on each STEP start, and the
> timestamp is read from another thread, the WatcherThread. Timestamp is
> 64bit. I wanted to make sure the 64bit value is written and read
> atomically, especially on 32bit platforms.
>
> But then, I had to check whether 64bit atomic stores/loads are even
> supported by this platform (I actually did not find a 32bit platform
> whithout 64bit atomics, but the comment in atomic.hpp is pretty insistent
> and I did not want to risk regressions for other platforms).
>
> Well, if no cx8 support was available, I pretty much just give up and read
> and write timestamps directly. As I said, I am not sure if this code path
> gets ever executed.
>
> Maybe I was overthinking all this and just reading and writing the (C++
> volatile) jlongs would have been enough, but I wanted to prevent sporadic
> test errors because of incompletely read 64bit values.
>
> Closed ports may not have cx8 support, although I don't believe any are
> being released with JDK9. Since you just have one writer and one reader, I
> think the only concern is word tearing on the read. For this reason you
> likely need the cx8 support. David would know, so hopefully he can comment
> on this.
>
> Assuming you need cx8 support, theoretically there are platforms where
> your code could fail due to not having cx8 support. You could argue that
> the risk of word tearing is minimal, both in likelihood of it happening
> (race condition on a platform we aren't currently officially supporting),
> and the possible negative behavior if it does (premature timeout or
> possibly no timeout, but only with debug builds after a crash).
>
> The other choice here is to just disable the whole timeout mechanism if
> cx8 is not supported. In fact simply making set_to_now() and
> get_timestamp() no-ops when cx8 is not supported would accomplish that,
> although I'd suggest also adding some more explicit disabling of the code
> wherever the timestamps are referenced.
>

Another thing I thought of would be to change the timestamp to 32bit - I
only need second resolution - and handle somehow the year 2038 overflow.
But the most easy and pragmatic way is to either ignore the problem
completely for non-cx8 or to do anything.

>
> BTW, the statics you added should probably all be made fields of VMError
> rather than in the global scope.
>
>
>
>> 1282         st->print_raw_cr(buffer);
>> 1283         st->cr();
>>
>> The old code had an additional st->cr() before the above lines. I assume
>> you removed it intentionally.
>>
>>
> I hope I preserved the numbers of cr(). At least that was my intention:
>
> 1260       outputStream* const st = log.is_open() ? &log : &out;
> 1261       st->cr();
>
> ...
>
> and then on every path, a cr (or print_raw_cr) at the end. Where do you
> see the missing cr()?
>
> Ok. It's just moved up about 20 lines of code now so I missed it.
>
>
>
>
>> Is there a reason why you decided to only allow one step to timeout. What
>> if the cause of a timeout in a step also impacts other steps, or is that
>> not common when we see timeouts?
>>
>>
> That is mostly guesswork. In our (SAP) code we allow for four steps (so
> ErrorLogTimeout/4 as step timeout) and additionally allow for "steps known
> to be long" where timeouts are disabled altogether. But we also have more
> complicated error reporting steps, so when porting the patch to OpenJDK, I
> felt the complexity was unneeded.
>
> I think in general you will only have one misbehaving step, but you are
> right, more than one step may timeout if e.g. the file system is slow. I'm
> open for suggestions: the timeout value should be large enough not to be
> hit for "normal slow steps" while still leave room enough for other steps
> to finish. What do you think a reasonable timeout value would be?
> ErrorLogTimeout/4?
>
> I don't think we've run into the "slow steps" case causing timeout, just
> the deadlocks, so I don't really have any data to give you. If you are
> primarily concerned about deadlocks, then you want ErrorLogTimeout div a
> fairly large number. If you are mostly addressing slow steps, then you div
> with a smallish number. I think I'd prefer /4 over /2, maybe even bigger.
>

I'll think about it. Maybe 1/4 is a good compromise.

>
>
>
>
>> It's not clear to me why you changed a couple of os::sleep() calls to
>> os::naked_short_sleep(), and the rationale for the sleep periods. Can you
>> please explain?
>>
>>
> Because os::sleep() does a lot of work under the hood and relies on a bit
> of VM infrastructure. I think that is not a good idea in error situations
> where potentially everything may be broken already. You want to step
> lightly and really only do a naked system sleep. About the sleep periods,
> os::naked_sleep has an inbuilt maximum value of 1000ms, which I have to
> stay below to not hit the assert. I did use 999ms as the longest interval I
> am allowed to sleep nakedly. And after the timeout hit and before the
> WatcherThread calls os::abort, I again sleep 200ms to give the error
> reporter thread time to write the "error log aborted due to timeout" into
> the error log and to flush the error log. Those 200ms are just guesswork.
>
> Ok. I'd like to see someone else comment on the os::naked_sleep() use
> since it's not something I'm familiar enough with.
>
> thanks,
>
> Chris
>

Thanks, Chris.

 Thomas

>
>
>
>> thanks,
>>
>> Chris
>>
>>
> Thanks for the review!
>
> Kind Regards, Thomas
>
>
>
>>
>> On 10/12/16 9:55 PM, Thomas Stüfe wrote:
>>
>>> Dear all,
>>>
>>> please take a look at the following fix:
>>>
>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8166944
>>> webrev:
>>> http://cr.openjdk.java.net/~stuefe/webrevs/8166944-Hanging-E
>>> rror-Reporting/webrev.00/webrev/index.html
>>>
>>> ---
>>>
>>> In short, this fix provides the ability to cancel hanging error reporting
>>> steps. This uses the same code paths secondary error handling uses during
>>> error reporting. With this patch, steps which take too long will be
>>> canceled after 1/2 ErrorLogTimeout. In the log file, it will look like
>>> this:
>>>
>>> 4 [timeout occurred during error reporting in step "<stepname>"] after
>>> xxxx
>>> ms.
>>> 5
>>>
>>> and we now also get a finish message in the hs-err file if we hit the
>>> ErrorLogTimeout and error reporting will stop altogether:
>>>
>>> 6 ------ Timout during error reporting after xxx ms. ------
>>>
>>> (in addition to the "time expired, abort" message the WatcherThread
>>> writes
>>> to stderr)
>>>
>>> ---
>>>
>>> This is something which bugged us for a long time, because we rely
>>> heavily
>>> on the hs_err files for error analysis at customer sites, and there are a
>>> number of reasons why one step may hang and prevent the follow-up steps
>>> from running.
>>>
>>> It works like this:
>>>
>>> Before, when error reporting started, the WatcherThread was waiting for
>>> ErrorLogTimeout seconds, then would stop the VM.
>>>
>>> Now, the WatcherThread periodically pings error reporting, which checks
>>> if
>>> the last step did timeout. If it does, it sends a signal to the reporting
>>> thread, and the thread will continue with the next step. This follows the
>>> same path as secondary crash handling.
>>>
>>> Some implementation details:
>>>
>>> On Posix platforms, to interrupt the thread, I use pthread_kill. This
>>> means
>>> I must know the pthread id of the reporting thread, which I now store at
>>> the beginning of error reporting. We already store the reporting thread
>>> id
>>> in first_error_tid, but that I cannot use, because it gets set by
>>> os::current_thread_id(), which is not always the pthread id. Should we
>>> ever
>>> switch to only using pthread id for posix platforms, this coding can be
>>> simplified.
>>>
>>> On Windows, there is unfortunately no easy way to interrupt a
>>> non-cooperative thread. I would need a way to cause a SEH inside the
>>> target
>>> thread, which then would get handled by secondary error handling like on
>>> Posix platforms, but that is not easy. It is doable - one can suspend the
>>> thread, modify the thread context in a way that it will crash upon
>>> resume.
>>> But that felt a bit heavyweight for this problem. So on windows, timeout
>>> handling still works (after ErrorLogTimeout the VM gets shut down), but
>>> error reporting steps are not interruptable. If we feel this is
>>> important,
>>> this can be added later.
>>>
>>> Kind Regards, Thomas
>>>
>>
>>
>>
>>
>
>