RFR: JDK-8293402: hs-err file printer should reattempt stack trace printing if it fails

Thomas Stuefe stuefe at openjdk.org
Wed Nov 9 05:12:55 UTC 2022


On Tue, 6 Sep 2022 08:05:23 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:

> Hi,
> 
> may I have reviews for this small improvement.
> 
> The call stack may be the most important part of an hs-err file. We recently introduced printing of source information (https://bugs.openjdk.org/browse/JDK-8242181) which is nice but makes stack printing more vulnerable for two reasons:
> - we may crash due to a programmer error (e.g. https://bugs.openjdk.org/browse/JDK-8293344)
> - we may time out on very slow machines/file systems when the source information is parsed from the debug info (we have seen such problems in the past)
> 
> Therefore, VMError should retry stack printing without source information if the first attempt to print failed.
> 
> Examples:
> 
> Step timeouts while retrieving source info:
> 
> 
> ---------------  T H R E A D  ---------------
> 
> Current thread (0x00007f70ac028bd0):  JavaThread "main" [_thread_in_vm, id=565259, stack(0x00007f70b0587000,0x00007f70b0688000)]
> 
> Stack: [0x00007f70b0587000,0x00007f70b0688000],  sp=0x00007f70b0686cf0,  free space=1023k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> V  [libjvm.so+0x1cd41c1]  VMError::controlled_crash(int)+0x241
> [timeout occurred during error reporting in step "printing native stack (with source info)"] after 30 s.
> 
> Retrying call stack printing without source information...
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> V  [libjvm.so+0x1cd41c1]  VMError::controlled_crash(int)+0x241
> V  [libjvm.so+0x11cbe45]  JNI_CreateJavaVM+0x5b5
> C  [libjli.so+0x4013]  JavaMain+0x93
> C  [libjli.so+0x800d]  ThreadJavaMain+0xd
>  39 
> 
> 
> 
> Step crashes while retrieving source info:
> 
> 
> ---------------  T H R E A D  ---------------
> 
> Current thread (0x00007fc000028bd0):  JavaThread "main" [_thread_in_vm, id=569254, stack(0x00007fc00573c000,0x00007fc00583d000)]
> 
> Stack: [0x00007fc00573c000,0x00007fc00583d000],  sp=0x00007fc00583bcf0,  free space=1023k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> V  [libjvm.so+0x1cd41e1]  VMError::controlled_crash(int)+0x241
> [error occurred during error reporting (printing native stack (with source info)), id 0xb, SIGSEGV (0xb) at pc=0x00007fc006694d78]
> 
> 
> Retrying call stack printing without source information...
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> V  [libjvm.so+0x1cd41e1]  VMError::controlled_crash(int)+0x241
> V  [libjvm.so+0x11cbe65]  JNI_CreateJavaVM+0x5b5
> C  [libjli.so+0x4013]  JavaMain+0x93
> C  [libjli.so+0x800d]  ThreadJavaMain+0xd
> 
> 
> 
> Thanks, Thomas
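The retry behavior proposed in the quoted description can be sketched in isolation like this (a self-contained toy; `print_stack` and its boolean failure signaling are hypothetical stand-ins for the real VMError STEP machinery, which detects crashes and timeouts via its signal handling, not a return code):

```cpp
#include <string>

// Hypothetical stand-in for the native stack printer: returns false to
// simulate a crash/timeout while resolving source information.
static bool print_stack(bool with_source_info, std::string& out) {
  out += "V  [libjvm.so+0x1cd41c1]  VMError::controlled_crash(int)+0x241\n";
  if (with_source_info) {
    return false;  // simulate the step failing mid-print
  }
  out += "V  [libjvm.so+0x11cbe45]  JNI_CreateJavaVM+0x5b5\n";
  return true;
}

// First try with source info; if that attempt fails, retry without it,
// mirroring the fallback behavior proposed in the PR.
static std::string print_stack_with_retry() {
  std::string out;
  if (!print_stack(/*with_source_info=*/true, out)) {
    out += "Retrying call stack printing without source information...\n";
    print_stack(/*with_source_info=*/false, out);
  }
  return out;
}
```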

Hi Axel,

> > Each time we crash out in VMError, we build up the stack. We never ever unwind that stack, eg. via longjmp, since that would introduce other errors (e.g. abandoned locks). There is a natural limit to how many recursive crashes we can handle since the stack is not endless. Each secondary crash increases the risk of running into guard pages and spoiling the game for follow-up STEPs.
> 
> I have no experience with stack depth being the problem in crashes (with the caveat that I have only run with this patch for a few weeks), but I have experienced cases where only an hs_err file was available and the register print_location bailed out early, and missing the rest of the printout was unfavourable.

I don't doubt your experience, but for us the picture is different. We (SAP) have been working with this code for ~20 years, and hs-err files are one of our main support tools. That is why we contributed so much hardening code, and why I am so interested. For us, torn hs-err files - torn by running out of time or stack space, or by hitting the recursive limit - are a frequent annoyance. Note that with stack space you won't notice unless you explicitly investigate - the hs-err file is just abridged or missing altogether, without a clear message.

I believe the problem you want to solve is real, but maybe we can find a better solution than introducing a mass of new STEPs, since doing that effectively reduces the chance of later STEPs running successfully. Stack and time are valuable resources all STEPs share, even if the recursive limit is a bit arbitrary. And at least for stack space, I don't see a solution - each crash will build up stack, and there is nothing to prevent that.
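The recursive-crash budget under discussion can be sketched roughly like this (the names and the limit value are illustrative; HotSpot's actual bookkeeping lives inside VMError and its signal handler):

```cpp
// Hypothetical recursion budget: each secondary crash during error
// reporting bumps a counter; past the limit we give up on further
// STEPs rather than risk running into the stack guard pages.
static int g_recursive_crashes = 0;
static const int kMaxRecursiveCrashes = 60;  // the "~60" mentioned in the thread

// Record that a STEP crashed while reporting was already in progress.
static void on_secondary_crash() {
  ++g_recursive_crashes;
}

// Only attempt the next STEP while we are under the crash budget.
static bool may_attempt_next_step() {
  return g_recursive_crashes < kMaxRecursiveCrashes;
}
```

The design tension the thread describes follows directly from this: every STEP that can crash independently draws on the same shared counter, so adding many small crash-prone STEPs exhausts the budget sooner.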

> 
> > Therefore we limit the number of allowed recursive crashes. This limit also serves a second purpose: if we crash that often, maybe we should just stop already and let the process die.
> 
> Fair, though I am curious how we want to decide this limit, and why ~60 is fine but ~90 would be too much (I am guessing that most steps have no, or only a very small, possibility of crashing).

No clue. This is quite old code from Sun times, I believe.

> Maybe this should instead be solved with a general solution which stops the reporting if some retry limit is reached.

How would this be different from what we are doing now?

> 
> Also the common case is that it does not crash repeatedly, and if it does, that is the scenario where I personally really would want the information, because something is seriously wrong.

I agree, but my conclusion differs from yours: if a STEP does X, and X crashes out repeatedly, it makes sense at some point to stop doing X, since the probability that it will crash again is quite high. It is better to stop and proceed with the next STEPs, since their output may be more valuable than just another failed printout in the crashing STEP.

> But maybe not at the cost of stack overflows; if it is a problem, maybe some stack address limit can be used to disable reentry in reentrant steps.
> 
> > That brings me to the second problem, which is time. When we crash, we want to go down as fast as possible, e.g. to allow a server to restart the node. OTOH we want a nice hs-err file. Therefore, the time error handling is allowed to take is carefully limited. See `ErrorLogTimeout`: by default 2 minutes, though our customers usually lower this to 30 seconds or even less.
> > Each STEP has a timeout, set to a fraction of that total limit (a quarter). A quarter gives us room for 2-3 hanging STEPs and still leaves enough breathing room for the remainder of the STEPs.
> > If you now increase the number of STEPs, all these calculations are off. We may hit the recursive error limit much sooner, since every individual register printout may crash. And if they hang, they may eat up the ErrorLogTimeout much sooner. So we will get more torn hs-err files with "recursive limit reached, giving up" or "timeout reached, giving up".
> 
> The timeout problem was something I thought about as well, and I think you are correct, and that we should treat the whole reentrant step as one timeout. (Same behaviour as before).
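The budget arithmetic described above can be sketched as follows (a simplified illustration; the actual computation in HotSpot's VMError code may differ in detail):

```cpp
// The total error-reporting budget (ErrorLogTimeout, in seconds) is
// split so that each STEP may use at most a quarter of it. With the
// default of 120 s a single hanging STEP costs 30 s, which leaves
// room for 2-3 hanging STEPs before the whole report runs out of time.
static int step_timeout(int error_log_timeout_sec) {
  return error_log_timeout_sec / 4;
}
```

This also shows why lowering `ErrorLogTimeout` (as some deployments do, to 30 s) shrinks the per-STEP budget proportionally.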
> 
> > Note that one particularly fragile information is the printing of debug info, e.g. function name, etc. Since that relies on some parsing of debugging information. In our experience that can crash out or hang often, especially if the debug info has to be read from file or network.
> 
> Alright, I see this as an argument for reentrant steps with one timeout for all iterations of the inner loop combined.
> 
> I've heard suggestions of something similar to reentrant steps in other parts of the hs_err printing, like stack frame printing: iterative stages where each stage builds up more detailed information until it crashes, and then prints whatever information it got so far.
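The staged approach sketched in the quoted paragraph might look roughly like this (a toy illustration; the stage names and their outputs are invented):

```cpp
#include <string>

// Hypothetical stage functions: each appends more detail to the frame
// description and returns false if it "crashed".
static bool stage_raw_pc(std::string& out)  { out += "pc=0x1cd41c1\n"; return true; }
static bool stage_symbol(std::string& out)  { out += "VMError::controlled_crash(int)+0x241\n"; return true; }
static bool stage_source_info(std::string&) { return false; }  // simulate a crash

// Run the stages in order; on the first failure, keep whatever the
// earlier stages already produced instead of losing the whole frame.
static std::string print_frame_in_stages() {
  std::string out;
  bool (*stages[])(std::string&) = { stage_raw_pc, stage_symbol, stage_source_info };
  for (auto* stage : stages) {
    if (!stage(out)) break;
  }
  return out;
}
```

The trade-off debated in the thread applies directly here: each stage that can crash independently consumes stack and time, which is why Thomas argues for one timeout across all stages rather than one per stage.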

I remember, we had this discussion before, and I brought up the same arguments.

I agree with @dean-long, it would be good to know what is going wrong first before changing the base mechanism, and to harden the printing code instead, e.g. via SafeFetch/os::is_readable_pointer. SafeFetch, in contrast to this STEP mechanism, does not build up stack. Or, if applying SafeFetch is not a reasonable option, retry the whole STEP with some of the riskier options disabled, as I did here: https://github.com/openjdk/jdk/pull/10179.
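For illustration, here is a simplified, single-threaded sketch of the SafeFetch idea using POSIX signal handling: read a word from a possibly-unmapped address and return an error value instead of crashing. HotSpot's real SafeFetch32/SafeFetchN are implemented differently (via stubs recognized by the VM's signal handler) and are safe to use during crash reporting; this sketch is not, and is only meant to convey the concept:

```cpp
#include <csetjmp>
#include <csignal>
#include <cstdint>

static sigjmp_buf g_jmp;

static void segv_handler(int) {
  // Jump back to the sigsetjmp point; the saved signal mask is restored.
  siglongjmp(g_jmp, 1);
}

// Read *adr, but return err_value instead of crashing if the address
// is unmapped (SIGSEGV) or misaligned/unbacked (SIGBUS).
static intptr_t safe_fetch(const intptr_t* adr, intptr_t err_value) {
  struct sigaction sa{}, old_segv{}, old_bus{};
  sa.sa_handler = segv_handler;
  sigemptyset(&sa.sa_mask);
  sigaction(SIGSEGV, &sa, &old_segv);
  sigaction(SIGBUS, &sa, &old_bus);
  intptr_t result = err_value;
  if (sigsetjmp(g_jmp, 1) == 0) {
    // volatile forces an actual load, so a faulting address traps here.
    result = *(volatile const intptr_t*)adr;
  }
  sigaction(SIGSEGV, &old_segv, nullptr);
  sigaction(SIGBUS, &old_bus, nullptr);
  return result;
}
```

The key property, as noted above, is that probing memory this way consumes no extra stack per failed access, unlike letting a STEP crash and recovering through the STEP machinery.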

Cheers, Thomas

-------------

PR: https://git.openjdk.org/jdk/pull/10179

