RFR: 8373128: Stack overflow handling for native stack overflows
Thomas Stuefe
stuefe at openjdk.org
Sat Feb 28 13:25:52 UTC 2026
On Wed, 4 Feb 2026 07:19:03 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:
> Still Draft, pls ignore for now. Patch is not done yet.
>
> This patch enables hs-err file generation for native out-of-stack cases. It is an optional analysis feature one can use when JVMs mysteriously vanish - typically, vanishing JVMs are either native stack overflows or OOM kills.
>
> This was motivated by the analysis difficulties of bugs like https://bugs.openjdk.org/browse/JDK-8371630. There are many more examples.
>
> ### Motivation
>
> Today, when a native stack overflows, the JVM dies immediately without an hs-err file. This is because C++-compiled code does not bang the stack: if the stack is too small, we walk right into whatever caps it. That might be our own yellow/red guard pages, native guard pages placed by libc or the kernel, or possibly an unmapped area after the end of the stack.
>
> Since we don't have a stack left to run the signal handler on, we cannot produce the hs-err file. If one is very lucky, the libc writes a short "Stack overflow" to stderr. But usually not: if it is a JavaThread and we run into our own yellow/red pages, it counts as a simple segmentation fault from the OS's point of view, since the fault address is inside of what it thinks is a valid pthread stack. So, typically, you just see "Segmentation fault" on stderr.
>
> ***Why do we need this patch? Don't we bang enough space for native code we call?***
>
> We bang when entering a native function from Java. The maximum stack size we assume at that time might not be enough; moreover, the native code may be buggy or just too deeply or infinitely recursive.
>
> ***We could just increase `StackShadowPages`, right?***
>
> Sure, but the point is we have no hs-err file, so we don't even know it was a stack overflow. One would have to start debugging, which is work-intensive and may not even be possible in a customer scenario. And for buggy recursive code, any `StackShadowPages` value might be too small. The code would need to be fixed.
>
> ### Implementation
>
> The patch uses alternative signal stacks. That is a simple, robust solution with few moving parts. It works out of the box for all cases:
> - Stack overflows inside native JNI code from Java
> - Stack overflows inside HotSpot-internal JavaThread children (e.g. CompilerThread, AttachListenerThread, etc.)
> - Stack overflows in non-Java threads (e.g. VMThread, ConcurrentGCThread)
> - Stack overflows in outside threads that are attached to the JVM, e.g. third-party JVMTI threads
>
> The drawback of this simplicity is that it is not suitable for always-on production use. That is du...
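For readers unfamiliar with the mechanism named in the Implementation section above, here is a minimal POSIX sketch of an alternate signal stack. It is not the code from the patch. The names `on_fault`, `install_alt_stack_handler`, `recurse` and `run_overflow_in_child`, the 64 KB stack size, and the exit code 42 are all assumptions made for illustration. The handler only writes a short message and exits; a real error reporter does far more work and needs a correspondingly larger alternate stack.

```cpp
#include <csignal>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical exit code this sketch uses to signal "overflow was caught".
static const int kOverflowExitCode = 42;

// Runs on the alternate stack. Only async-signal-safe calls are used.
static void on_fault(int, siginfo_t*, void*) {
  static const char msg[] = "fault caught on alternate signal stack\n";
  ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
  (void)ignored;
  _exit(kOverflowExitCode);
}

// Registers an alternate stack for the calling thread and asks the kernel
// to run the SIGSEGV/SIGBUS handler on it (SA_ONSTACK).
static bool install_alt_stack_handler() {
  const size_t size = 64 * 1024;  // assumed to be ample for this tiny handler
  stack_t ss = {};
  ss.ss_sp = malloc(size);
  if (ss.ss_sp == nullptr) return false;
  ss.ss_size = size;
  ss.ss_flags = 0;
  if (sigaltstack(&ss, nullptr) != 0) return false;

  struct sigaction sa = {};
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_SIGINFO | SA_ONSTACK;
  sa.sa_sigaction = on_fault;
  return sigaction(SIGSEGV, &sa, nullptr) == 0 &&
         sigaction(SIGBUS, &sa, nullptr) == 0;
}

// Deliberately unbounded recursion. The volatile buffer is read after the
// call returns, so the compiler cannot turn the recursion into a loop.
// (noinline is a GCC/Clang attribute.)
__attribute__((noinline)) static int recurse(int depth) {
  volatile char pad[1024];
  pad[0] = static_cast<char>(depth);
  if (depth < 0) return 0;  // never true for the call below
  int r = recurse(depth + 1);
  return r + pad[0];
}

// Forks a child that overflows its stack. Returns the child's exit code,
// or -1 if the child did not exit normally (e.g. was killed by the signal).
int run_overflow_in_child() {
  pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) {
    if (!install_alt_stack_handler()) _exit(1);
    recurse(0);
    _exit(2);  // not reached
  }
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) return -1;
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `run_overflow_in_child()` should return the sketch's exit code when the handler ran. Without `SA_ONSTACK` the kernel has no room to set up the handler frame, so the child would instead be killed by the signal and the function would return -1.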
src/hotspot/os/posix/os_posix.cpp line 1326:
> 1324:
> 1325: // There is no point in continuing.
> 1326: VMError::report_and_die(thread, info->si_signo, pc, info, ucVoid, "irrecoverable stack overflow");
Reviewer info: the signal handling flow (all "xxx_handle" functions) only allows a handler to return true ("stop signal handling and return from handler") or false ("continue signal handling and see who else could handle this signal"). I did not feel like expanding the fix here to change that, hence the direct error reporting call.
When a handler function (like here) finds a problem that cannot be fixed, we want to prevent the error handling from attempting to match other signal handling methods onto this problem and possibly produce a confusing mismatch. We want a clean error report at this exact point.
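To illustrate the control flow described above, here is a small self-contained model. It is not HotSpot code: the handler names, the made-up signal numbers 1 to 3, and `report_and_die_stub` are all hypothetical. The stub throws an exception where the real report call would end the process, purely so that the flow can be observed.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for a fatal error report. The real call terminates the process;
// this sketch throws instead so the caller can observe what happened.
struct FatalReport : std::runtime_error {
  using std::runtime_error::runtime_error;
};

[[noreturn]] static void report_and_die_stub(const std::string& msg) {
  throw FatalReport(msg);
}

// Hypothetical handler signature: true = handled, false = try the next one.
using Handler = bool (*)(int sig, std::vector<std::string>& trace);

static bool handle_recoverable(int sig, std::vector<std::string>& trace) {
  trace.push_back("recoverable");
  return sig == 1;  // made-up signal number 1 is "fixable"
}

static bool handle_stack_overflow(int sig, std::vector<std::string>& trace) {
  trace.push_back("stack_overflow");
  if (sig == 2) {
    // The problem is identified but cannot be fixed. Returning false would
    // let later handlers try to match it, so report right here instead.
    report_and_die_stub("irrecoverable stack overflow");
  }
  return false;
}

static bool handle_generic(int sig, std::vector<std::string>& trace) {
  (void)sig;
  trace.push_back("generic");
  return true;  // catch-all at the end of the chain
}

// Walks the chain until one handler claims the signal.
bool dispatch(int sig, std::vector<std::string>& trace) {
  static const Handler chain[] = {handle_recoverable, handle_stack_overflow,
                                  handle_generic};
  for (Handler h : chain) {
    if (h(sig, trace)) return true;
  }
  return false;
}
```

With the made-up signal 2, `dispatch` never reaches `handle_generic`, which is the point of reporting directly: the report is produced where the problem was identified, not by a later handler that happens to match.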
src/hotspot/share/utilities/vmError.cpp line 1635:
> 1633: fprintf(stderr, "signaled: %s", os::exception_name(sig, tmp, sizeof(tmp)));
> 1634: }
> 1635:
Reviewer info: code moved here so that it also applies to gtests.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/29559#discussion_r2762592564
PR Review Comment: https://git.openjdk.org/jdk/pull/29559#discussion_r2762577177