RFR: 8296469: Instrument VMError::report with reentrant iteration step for register and stack printing
Axel Boldt-Christmas
aboldtch at openjdk.org
Tue Nov 8 09:30:28 UTC 2022
On Tue, 8 Nov 2022 07:17:12 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:
> Each time we crash out in VMError, we build up the stack. We never ever unwind that stack, e.g. via longjmp, since that would introduce other errors (e.g. abandoned locks). There is a natural limit to how many recursive crashes we can handle, since the stack is not endless. Each secondary crash increases the risk of running into guard pages and spoiling the game for follow-up STEPs.
I have no experience with stack depth being the problem in crashes (with the caveat that I have only run with this patch for a few weeks), but I have experienced cases where only a hs_err file was available and the register print_location bailed out early; missing the rest of that printout was unfavourable.
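For illustration, here is a self-contained C++ toy of the mechanism as I understand it (purely a sketch; the names and structure are mine, not HotSpot's actual STEP machinery). A secondary crash re-enters the reporter without unwinding the stack, and a persisted index makes it resume at the next register instead of abandoning the whole section:

```c++
#include <csignal>
#include <cstdio>
#include <cstdlib>

static volatile std::sig_atomic_t g_next = 0;  // persists across re-entry

static void report() {
  static const char* const regs[] = { "rax", "rbx", "rcx", "rdx" };
  while (g_next < 4) {
    int i = g_next++;                       // advance first: a crash skips entry i
    if (i == 1) *(volatile int*)0 = 0;      // simulated secondary crash
    std::fprintf(stderr, "%s = <location info>\n", regs[i]);
  }
  std::fprintf(stderr, "report complete\n");
  std::_Exit(0);                            // go down without ever unwinding
}

static void on_crash(int) { report(); }     // re-enter; the stack keeps growing

int main() {
  std::signal(SIGSEGV, on_crash);
  report();
}
```

This prints rax, takes the simulated crash, then continues with rcx and rdx; only the crashing entry is lost. (Being a toy, it happily calls stdio from a signal handler, which the real reporter has to be more careful about.)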
> Therefore we limit the number of allowed recursive crashes. This limit also serves a second purpose: if we crash that often, maybe we should just stop already and let the process die.
Fair, though I am curious how we want to decide this limit, and why ~60 is fine but ~90 would be too much (I am guessing that most steps have no, or only a very small, possibility of crashing). Maybe this should instead be solved with a general mechanism that stops the reporting once some retry limit is reached.
Also, the common case is that it does not crash repeatedly, and if it does, that is exactly the scenario where I would really want the information, because something is seriously wrong. But maybe not at the cost of stack overflows; if that is a problem, maybe some stack address limit could be used to disable reentry in reentrant steps, as in the sketch below.
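Something along these lines is what I mean (hypothetical helper names; no such check exists in VMError today): before permitting another reentrant iteration, compare the current stack pointer against the thread's stack limit plus some headroom.

```c++
#include <cstdint>

// Assumed safety margin to stay clear of the guard pages.
static const std::uintptr_t REENTRY_STACK_HEADROOM = 64 * 1024;

// Approximate the current stack pointer via a local's address.
static std::uintptr_t current_sp() {
  char probe;
  return reinterpret_cast<std::uintptr_t>(&probe);
}

// 'stack_limit' would be the low end of the crashing thread's stack
// (assuming a downward-growing stack).
static bool reentry_allowed(std::uintptr_t stack_limit) {
  return current_sp() > stack_limit + REENTRY_STACK_HEADROOM;
}
```

The reentrant step would fall back to the old single-shot behaviour once reentry_allowed() turns false, so deep recursion can never walk into the guard pages.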
> That brings me to the second problem, which is time. When we crash, we want to go down as fast as possible, e.g. to allow a server to restart the node. OTOH we want a nice hs-err file. Therefore the time error handling is allowed to take is carefully limited. See `ErrorLogTimeout`: by default 2 minutes, though our customers usually lower this to 30 seconds or even lower.
>
> Each STEP has a timeout, set to a fraction of that total limit (a quarter). A quarter gives us room for 2-3 hanging STEPs and still leaves enough breathing room for the remainder of the STEPs.
>
> If you now increase the number of STEPS, all these calculations are off. We may hit the recursive error limit much sooner, since every individual register printout may crash. And if they hang, they may eat up the ErrorLogTimeout much sooner. So we will get more torn hs-err files with "recursive limit reached, giving up" or "timeout reached, giving up".
The timeout problem was something I thought about as well, and I think you are correct: we should treat the whole reentrant step as one timeout (the same behaviour as before); see the sketch below.
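Concretely, I am imagining something like this (illustrative names, not the actual implementation): the deadline is fixed once when the reentrant step starts, and every inner iteration, including every re-entry after a secondary crash, checks that same deadline, so the step consumes only one step's share of `ErrorLogTimeout` no matter how many iterations it runs.

```c++
#include <chrono>

using Clock = std::chrono::steady_clock;

// ErrorLogTimeout defaults to 2 minutes; each STEP gets a quarter of it.
static const auto kErrorLogTimeout = std::chrono::seconds(120);
static const auto kStepTimeout     = kErrorLogTimeout / 4;  // 30 seconds

static Clock::time_point step_deadline;

// Called once when the reentrant step begins.
static void begin_reentrant_step() { step_deadline = Clock::now() + kStepTimeout; }

// Checked by every inner iteration and after every re-entry.
static bool step_timed_out() { return Clock::now() >= step_deadline; }
```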
> Note that one particularly fragile piece of information is the printing of debug info, e.g. function names, since that relies on some parsing of debugging information. In our experience that can crash out or hang often, especially if the debug info has to be read from file or network.
>
Alright, I see this as an argument for reentrant steps with one timeout for all iterations of the inner loop combined.
I've heard suggestions for something similar to reentrant steps in other parts of the hs_err printing, like stack frame printing: you could have iterative stages where each stage builds up more detailed information until it crashes, and then print whatever information was gathered so far. A rough sketch follows.
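That staged idea could look roughly like this (a hypothetical design, not existing code), reusing the advance-first trick so a crash in a fragile stage still leaves the cheaper stages' output in the hs_err file:

```c++
#include <cstdio>

static volatile int g_frame_stage = 0;  // persists across reporter re-entry

static void print_frame_staged(const void* pc) {
  while (g_frame_stage < 3) {
    int stage = g_frame_stage++;        // advance first: a crash skips this stage
    switch (stage) {
      case 0: std::fprintf(stderr, "pc=%p\n", pc);                 break;  // cheap, safe
      case 1: std::fprintf(stderr, "  lib=<shared library>\n");    break;  // dladdr-style lookup
      case 2: std::fprintf(stderr, "  sym=<function + offset>\n"); break;  // debug info: fragile
    }
  }
  g_frame_stage = 0;                    // reset for the next frame
}
```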
-------------
PR: https://git.openjdk.org/jdk/pull/11017