RFR: 8296469: Instrument VMError::report with reentrant iteration step for register and stack printing [v9]
Thomas Stuefe
stuefe at openjdk.org
Fri May 5 14:30:34 UTC 2023
On Mon, 20 Feb 2023 07:15:23 GMT, Axel Boldt-Christmas <aboldtch at openjdk.org> wrote:
>> Add reentrant step logic to VMError::report with an inner loop which enable the logic to recover at every step of the iteration.
>>
>> Before this change, if printing one register/stack position crashes then no more registers/stack positions will be printed.
>>
>> After this change even if the VM is unstable and some registers print_location crashes the hs_err printing will recover and keep attempting to print the rest of the registers or stack values.
>>
>> Enables the following
>> ```C++
>> REENTRANT_STEP_IF("printing register info", _verbose && _context && _thread && Universe::is_fully_initialized())
>> os::print_register_info_header(st, _context);
>>
>> REENTRANT_LOOP_START(os::print_nth_register_info_max_index())
>> // decode register contents if possible
>> ResourceMark rm(_thread);
>> os::print_nth_register_info(st, REENTRANT_ITERATION_STEP, _context);
>> REENTRANT_LOOP_END
>>
>> st->cr();
>>
>>
>> Testing: tier 1 and compiled Linux-x64/aarch64, MacOS-x64/aarch64, Windows x64 and cross-compiled Linux-x86/riscv/arm/ppc/s390x (GHA and some local)
>
> Axel Boldt-Christmas has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 15 commits:
>
> - Merge remote-tracking branch 'upstream_jdk/master' into vmerror_report_register_stack_reentrant
> - Add test
> - Fix and strengthen print_stack_location
> - Missed variable rename
> - Copyright
> - Rework logic and use continuation state for reattempts
> - Merge remote-tracking branch 'upstream_jdk/master' into vmerror_report_register_stack_reentrant
> - Restructure os::print_register_info interface
> - Code syle and line length
> - Merge Fix
> - ... and 5 more: https://git.openjdk.org/jdk/compare/2009dc2b...2e12b4a5
src/hotspot/share/utilities/vmError.cpp line 173:
> 171: }
> 172:
> 173: static bool check_stack_headroom(Thread* thread,
Could you please write a short comment what the return means? From the code, I assume true means "not enough headroom"? Maybe rename function to "stack_has_headroom"?
src/hotspot/share/utilities/vmError.cpp line 180:
> 178: static const size_t stack_size = thread != nullptr
> 179: ? thread->stack_size()
> 180: : os::current_stack_size();
Why even bother with thread? If you want this to work without Thread*, you may just as well just use the os::current_stack_xxx() functions.
OTOH I think it would also be perfectly acceptable to just use Thread*, since we have that in 99% of cases, and if Thread is null to assume that we have enough headroom.
src/hotspot/share/utilities/vmError.cpp line 187:
> 185: const ptrdiff_t stack_headroom = stack_pointer - stack_bottom;
> 186: return (stack_pointer < stack_bottom || stack_headroom < 0 ||
> 187: static_cast<size_t>(stack_headroom) < headroom);
Could be shortened. E.g. `return stack_pointer - headroom < stack_bottom` ?
src/hotspot/share/utilities/vmError.cpp line 194:
> 192: if (!check_stack_headroom(_thread, _reattempt_required_stack_headroom)) {
> 193: char stack_buffer[_reattempt_required_stack_headroom / 2];
> 194: static_cast<void>(stack_buffer[sizeof(stack_buffer) - 1] = '\0');
I would alloca() here instead of the array. I assume the touch at the end is to prevent the compiler from optimizing this away? With alloca you don't need that. No need for recursion either then, you can do that in a loop.
src/hotspot/share/utilities/vmError.cpp line 201:
> 199: #endif // ASSERT
> 200:
> 201: bool VMError::should_stop_reattempt_step(const char* &reason) {
I had to read this twice to see the "stop" in the name :-)
I would prefer the logic to be inverse and this function to be named "can_reattempt_step". But since this is a matter of taste, I leave it up to you.
src/hotspot/share/utilities/vmError.cpp line 476:
> 474: continuation = i + 1;
> 475: const frame fr = os::fetch_frame_from_context(context);
> 476: while (i < 8) {
Can we name this constant (function scope const is fine, something like "number_of_stack_slots" or so).
src/hotspot/share/utilities/vmError.cpp line 643:
> 641: # define REATTEMPT_STEP_WITH_NEW_TIMEOUT_IF(s, cond) \
> 642: REATTEMPT_STEP_IF_IMPL(s, cond, true)
> 643:
I'm doubtful about the reset-timeout feature. If something timeouts, the chance is very high it will timeout again. Either because we have a deadlock, or because what we do is simply very slow. One example for very slow is printing callstacks - decoding debug info can be very slow if debug info is loaded e.g. from network share, but it will not get any faster by repeating the attempt.
With crashes related to printing registers and stack slots, I can see the sense and usefulness of reattempts. But timeouts are both more "sticky" (high chance of happening again) as well as worse than crashes. Customers want the crashing VM to be down quickly, to release all locks and files, so that the replacement VM can start up.
So maybe we should scrap the new timeout feature. Would also simplify coding a bit.
-------------
PR Review: https://git.openjdk.org/jdk/pull/11017#pullrequestreview-1414850044
PR Review Comment: https://git.openjdk.org/jdk/pull/11017#discussion_r1186142343
PR Review Comment: https://git.openjdk.org/jdk/pull/11017#discussion_r1186137457
PR Review Comment: https://git.openjdk.org/jdk/pull/11017#discussion_r1186143933
PR Review Comment: https://git.openjdk.org/jdk/pull/11017#discussion_r1186130553
PR Review Comment: https://git.openjdk.org/jdk/pull/11017#discussion_r1186152558
PR Review Comment: https://git.openjdk.org/jdk/pull/11017#discussion_r1186154265
PR Review Comment: https://git.openjdk.org/jdk/pull/11017#discussion_r1186163554
More information about the hotspot-dev
mailing list