RFR: 8296469: Instrument VMError::report with reentrant iteration step for register and stack printing
Thomas Stuefe
stuefe at openjdk.org
Wed Nov 9 07:27:08 UTC 2022
On Mon, 7 Nov 2022 13:24:26 GMT, Axel Boldt-Christmas <aboldtch at openjdk.org> wrote:
> Add reentrant step logic to VMError::report, with an inner loop that enables the logic to recover at every step of the iteration.
>
> Before this change, if printing one register/stack position crashes then no more registers/stack positions will be printed.
>
> After this change even if the VM is unstable and some registers print_location crashes the hs_err printing will recover and keep attempting to print the rest of the registers or stack values.
>
> Enables the following
> ```C++
> REENTRANT_STEP_IF("printing register info", _verbose && _context && _thread && Universe::is_fully_initialized())
> os::print_register_info_header(st, _context);
>
> REENTRANT_LOOP_START(os::print_nth_register_info_max_index())
> // decode register contents if possible
> ResourceMark rm(_thread);
> os::print_nth_register_info(st, REENTRANT_ITERATION_STEP, _context);
> REENTRANT_LOOP_END
>
> st->cr();
> ```
>
> Testing: tier 1 and compiled Linux-x64/aarch64, MacOS-x64/aarch64, Windows x64 and cross-compiled Linux-x86/riscv/arm/ppc/s390x (GHA and some local)
Hi Axel,
sorry for the delay, I believe I posted the answer into the wrong PR.
> > Each time we crash out in VMError, we build up the stack. We never ever unwind that stack, e.g. via longjmp, since that would introduce other errors (e.g. abandoned locks). There is a natural limit to how many recursive crashes we can handle since the stack is not endless. Each secondary crash increases the risk of running into guard pages and spoiling the game for follow-up STEPs.
>
> I have no experience with stack depth being the problem in crashes (with the caveat that I have only run with this patch for a few weeks), but I have experienced cases where the hs_err file was the only thing available and the register print_location bailed out early, and missing the printout of the remaining registers was unfavourable.
I don't doubt your experience, but for us the picture is different. We (SAP) have been working with this code for ~20 years, and hs-err files are one of our main support tools. That is why we contributed so much hardening code, and why I am so interested. For us, torn hs-err files - caused by running out of time, out of stack space, or against the recursive limit - are a frequent annoyance. Note that with stack space, you won't notice unless you explicitly investigate - the hs-err file is just abridged or not there at all, but you won't get a clear message.
I believe the problem you want to solve is real, but maybe we can find a better solution than introducing a load of new STEPs, since by doing that you effectively reduce the chance of later STEPs running successfully. Stack and time are valuable resources that all STEPs share, even if the recursive limit is a bit arbitrary. And at least for stack space, I don't see a solution - each crash will build up stack, and there is nothing to prevent this.
>
> > Therefore we limit the number of allowed recursive crashes. This limit also serves a second purpose: if we crash that often, maybe we should just stop already and let the process die.
>
> Fair, though I am curious how we want to decide this limit, and why ~60 is fine but ~90 would be too much (I am guessing that most steps have no, or only a very small, possibility of crashing).
No clue. This is quite old code from Sun times, I believe.
>Maybe this should instead be solved with a general solution which stops the reporting if some retry limit is reached.
How would this be different from what we are doing now?
>
> Also the common case is that it does not crash repeatedly, and if it does, that is the scenario where I personally really would want the information, because something is seriously wrong.
I agree, but my conclusion differs from yours: if a STEP does X, and X crashes repeatedly, it makes sense at some point to stop doing X, since the probability that it will crash again is quite high. It is better to stop and proceed with the next STEPs, since their output may be more valuable than just another failed printout in the crashing STEP.
> But maybe not at the cost of stack overflows; if that is a problem, maybe some stack address limit can be used to disable reentry in reentrant steps.
>
> > That brings me to the second problem, which is time. When we crash, we want to go down as fast as possible, e.g. allow a server to restart the node. OTOH we want a nice hs-err file. Therefore the time error handling is allowed to take is carefully limited. See `ErrorLogTimeout`: by default 2 minutes, though our customers usually lower this to 30 seconds or even lower.
> > Each STEP has a timeout, set to a fraction of that total limit (a quarter). A quarter gives us room for 2-3 hanging STEPs and still leaves enough breathing room for the remainder of the STEPs.
> > If you now increase the number of STEPs, all these calculations are off. We may hit the recursive error limit much sooner, since every individual register printout may crash. And if they hang, they may eat up the ErrorLogTimeout much sooner. So we will get more torn hs-err files with "recursive limit reached, giving up" or "timeout reached, giving up".
>
> The timeout problem was something I thought about as well, and I think you are correct, and that we should treat the whole reentrant step as one timeout. (Same behaviour as before).
>
> > Note that one particularly fragile information is the printing of debug info, e.g. function name, etc. Since that relies on some parsing of debugging information. In our experience that can crash out or hang often, especially if the debug info has to be read from file or network.
>
> Alright, I see this as an argument for reentrant steps with one timeout for all iterations of the inner loop combined.
>
> I've heard suggestions for something similar to reentrant steps in other parts of the hs_err printing, like stack frame printing: iterative stages where each stage builds up more detailed information until it crashes, and then prints whatever information it has gathered so far.
I remember this discussion.
> It would be nice to know why register printing crashes, and if there is something we could do to make it more robust, like checking os::is_readable_pointer().
I agree with @dean-long, it would be good to know what is going wrong first, and to harden the printing code with os::is_readable_pointer(), because SafeFetch, in contrast to this STEP mechanism, does not build up stack. Or, if applying SafeFetch is not a reasonable option, retry the whole STEP with risky options disabled, as I did here: https://github.com/openjdk/jdk/pull/10179.
Cheers, Thomas
-------------
PR: https://git.openjdk.org/jdk/pull/11017