RFR: 8296469: Instrument VMError::report with reentrant iteration step for register and stack printing

Tue Nov 8 07:19:26 UTC 2022

On Mon, 7 Nov 2022 13:24:26 GMT, Axel Boldt-Christmas <aboldtch at openjdk.org> wrote:

> Add reentrant step logic to VMError::report with an inner loop which enables the logic to recover at every step of the iteration.
> 
> Before this change, if printing one register/stack position crashes then no more registers/stack positions will be printed.
> 
> After this change, even if the VM is unstable and print_location crashes for some registers, the hs_err printing will recover and keep attempting to print the rest of the registers or stack values.
> 
> Enables the following:
> ```C++
> REENTRANT_STEP_IF("printing register info", _verbose && _context && _thread && Universe::is_fully_initialized())
>   os::print_register_info_header(st, _context);
> 
>   REENTRANT_LOOP_START(os::print_nth_register_info_max_index())
>     // decode register contents if possible
>     ResourceMark rm(_thread);
>     os::print_nth_register_info(st, REENTRANT_ITERATION_STEP, _context);
>   REENTRANT_LOOP_END
> 
>   st->cr();
> ```
> 
> Testing: tier 1 and compiled Linux-x64/aarch64, MacOS-x64/aarch64, Windows x64 and cross-compiled Linux-x86/riscv/arm/ppc/s390x (GHA and some local)

Hi Axel,

I am not sure this is a good idea tbh, for two reasons:

Each time we crash out in VMError, we build up the stack. We never unwind that stack, e.g. via longjmp, since that would introduce other errors (e.g. abandoned locks). There is a natural limit to how many recursive crashes we can handle, since the stack is not endless. Each secondary crash increases the risk of running into guard pages and spoiling the game for follow-up STEPs.

Therefore we limit the number of allowed recursive crashes. This limit also serves a second purpose: if we crash that often, maybe we should just stop already and let the process die.

That brings me to the second problem, which is time. When we crash, we want to go down as fast as possible, e.g. to allow a server to restart the node. OTOH we want a nice hs-err file. Therefore the time that error handling may take is carefully limited. See `ErrorLogTimeout`: by default 2 minutes, though our customers usually reduce this to 30 seconds or less.

Each STEP has a timeout, set to a fraction of that total limit (a quarter). A quarter gives us room for 2-3 hanging STEPs and still leaves enough breathing room for the remainder of the STEPs.

If you now increase the number of STEPs, all these calculations are off. We may hit the recursive error limit much sooner, since every individual register printout may crash. And if they hang, they may eat up the ErrorLogTimeout much sooner. So we will get more torn hs-err files with "recursive limit reached, giving up" or "timeout reached, giving up".
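
To put rough numbers on this, here is a toy model of the two budgets. It is not HotSpot code: `Budget`, `simulate` and the crash cap of 10 are names and values I made up for illustration. Only the 2-minute default and the quarter share are taken from the above. It assumes the first `failing_steps` STEPs either all hang until their per-STEP timeout or all crash once.

```cpp
// Toy model only, NOT HotSpot code. Names and the crash cap are
// illustrative assumptions; the 2-minute default and the quarter
// share per STEP are the values mentioned in this mail.
enum class Result { Finished, TimeoutReached, RecursiveLimitReached };
enum class Failure { Hang, Crash };

struct Budget {
  double total_timeout_s;        // stands in for ErrorLogTimeout
  double step_share;             // per-STEP share of the total timeout
  int    max_secondary_crashes;  // hypothetical cap on recursive crashes
};

struct Outcome {
  Result result;
  int    steps_printed;  // STEPs that produced output before we stopped
};

// Walks num_steps STEPs; the first failing_steps of them fail in the given way.
Outcome simulate(const Budget& b, int num_steps, int failing_steps, Failure kind) {
  const double step_timeout_s = b.total_timeout_s * b.step_share;
  double elapsed_s = 0.0;
  int crashes = 0;
  int printed = 0;
  for (int i = 0; i < num_steps; i++) {
    if (i < failing_steps) {
      if (kind == Failure::Hang) {
        // A hanging STEP burns its whole per-STEP timeout.
        elapsed_s += step_timeout_s;
        if (elapsed_s >= b.total_timeout_s) {
          return {Result::TimeoutReached, printed};
        }
      } else {
        // A crashing STEP costs one secondary crash (and stack we never unwind).
        crashes++;
        if (crashes > b.max_secondary_crashes) {
          return {Result::RecursiveLimitReached, printed};
        }
      }
      continue;  // nothing printed for this STEP
    }
    printed++;
  }
  return {Result::Finished, printed};
}
```

With `Budget{120.0, 0.25, 10}`, three hanging STEPs out of 20 still let the remaining 17 print, while a fourth hang exhausts the whole timeout. If the register printout is split into, say, 40 small STEPs and 11 of them crash, the model gives up on the assumed crash cap before any later STEP runs.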

Note that one particularly fragile piece of information is the printing of debug info, e.g. function names, since that relies on parsing debugging information. In our experience that can crash out or hang often, especially if the debug info has to be read from file or network.

Cheers, Thomas

-------------

PR: https://git.openjdk.org/jdk/pull/11017