RFR: 8373128: Stack overflow handling for native stack overflows

Mon Mar 2 05:40:19 UTC 2026

On Wed, 4 Feb 2026 07:19:03 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:

> Still a draft, please ignore for now. The patch is not done yet.
> 
> This patch enables hs-err file generation for native out-of-stack cases. It is an optional analysis feature one can use when JVMs mysteriously vanish. A vanishing JVM is typically caused by either a native stack overflow or an OOM kill.
> 
> This was motivated by the analysis difficulties of bugs like https://bugs.openjdk.org/browse/JDK-8371630. There are many more examples.
> 
> ### Motivation
> 
> Today, when a native stack overflows, the JVM dies immediately without an hs-err file. This is because C++-compiled code does not bang the stack: if the stack is too small, we walk right into whatever caps the stack. That might be our own yellow/red guard pages, native guard pages placed by libc or the kernel, or possibly an unmapped area past the end of the stack.
> 
> Since we don't have a stack left to run the signal handler on, we cannot produce the hs-err file. If one is very lucky, the libc writes a short "Stack overflow" to stderr. But usually not: if it is a JavaThread and we run into our own yellow/red pages, it counts as a simple segmentation fault from the OS's point of view, since the fault address is inside of what it thinks is a valid pthread stack. So, typically, you just see "Segmentation fault" on stderr.
> 
> ***Why do we need this patch? Don't we bang enough space for native code we call?***
> 
> We bang when entering a native function from Java. The maximum stack size we assume at that time might not be enough; moreover, the native code may be buggy or just too deeply or infinitely recursive. 
> 
> ***We could just increase `ShadowPages`, right?***
> 
> Sure, but the point is we have no hs-err file, so we don't even know it was a stack overflow. One would have to start debugging, which is work-intensive and may not even be possible in a customer scenario. And for buggy recursive code, any `ShadowPages` value might be too small. The code would need to be fixed.
> 
> ### Implementation
> 
> The patch uses alternative signal stacks. That is a simple, robust solution with few moving parts. It works out of the box for all cases: 
> - Stack overflows inside native JNI code from Java 
> - Stack overflows inside Hotspot-internal JavaThread children (e.g. CompilerThread, AttachListenerThread, etc.)
> - Stack overflows in non-Java threads (e.g. VMThread, ConcurrentGCThread)
> - Stack overflows in outside threads that are attached to the JVM, e.g. third-party JVMTI threads
> 
> The drawback of this simplicity is that it is not suitable for always-on production use. That is du...
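
For reference, the alternative-signal-stack mechanism the description relies on can be sketched in plain POSIX C roughly as follows. This is only a minimal illustration, not the code from the patch: the names `install_alt_stack_handler`, `handler_ran_on_alt_stack` and `ALT_STACK_SIZE` are made up for this example, the 64K size is an arbitrary assumption, and the handler is triggered with `raise()` on a harmless signal instead of a real stack overflow.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* ask for sigaltstack/SA_ONSTACK declarations */
#endif
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical size for this sketch; a real VM would size this carefully. */
#define ALT_STACK_SIZE (64 * 1024)

static char *alt_stack_base = NULL;
static volatile sig_atomic_t handler_on_alt_stack = 0;

/* Handler: records whether it is executing on the alternate stack by
 * comparing the address of a local variable against the alt-stack range. */
static void on_signal(int sig, siginfo_t *info, void *ucontext) {
    (void)sig; (void)info; (void)ucontext;
    char marker;
    uintptr_t sp = (uintptr_t)&marker;
    uintptr_t lo = (uintptr_t)alt_stack_base;
    handler_on_alt_stack = (sp >= lo && sp < lo + ALT_STACK_SIZE);
}

/* Registers an alternate signal stack for the calling thread and installs
 * a handler for `signo` that runs on it (SA_ONSTACK). Returns 0 on success. */
int install_alt_stack_handler(int signo) {
    alt_stack_base = malloc(ALT_STACK_SIZE);
    if (alt_stack_base == NULL) return -1;

    stack_t ss;
    memset(&ss, 0, sizeof ss);
    ss.ss_sp = alt_stack_base;
    ss.ss_size = ALT_STACK_SIZE;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0) return -1;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_signal;
    sa.sa_flags = SA_SIGINFO | SA_ONSTACK;  /* deliver on the alt stack */
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Raises `signo` in the calling thread and reports whether the handler
 * observed itself running on the alternate stack (1) or not (0). */
int handler_ran_on_alt_stack(int signo) {
    handler_on_alt_stack = 0;
    raise(signo);
    return handler_on_alt_stack;
}
```

Calling `install_alt_stack_handler(SIGUSR1)` followed by `handler_ran_on_alt_stack(SIGUSR1)` should show the handler running on the separately allocated stack. With a real overflow the signal would be SIGSEGV (or SIGBUS on some platforms), and the point is the same: the handler still has stack to run on even though the thread stack is exhausted. Note that `sigaltstack` is per-thread, so every thread that should be covered needs its own registration.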

This also seems to break SA core dump analysis when the feature is enabled. For the main thread (the one that triggers the crash for core dump analysis), SA reports:

sun.jvm.hotspot.debugger.DebuggerException: get_thread_regs failed for a lwp
	at jdk.hotspot.agent/sun.jvm.hotspot.debugger.bsd.BsdDebuggerLocal.getThreadIntegerRegisterSet0(Native Method)
	at jdk.hotspot.agent/sun.jvm.hotspot.debugger.bsd.BsdDebuggerLocal.getThreadIntegerRegisterSet(BsdDebuggerLocal.java:472)
	at jdk.hotspot.agent/sun.jvm.hotspot.debugger.bsd.BsdThread.getContext(BsdThread.java:68)
	at jdk.hotspot.agent/sun.jvm.hotspot.debugger.bsd.BsdCDebugger.topFrameForThread(BsdCDebugger.java:88)
	at jdk.hotspot.agent/sun.jvm.hotspot.tools.PStack.run(PStack.java:103)

And if a crash does occur while we are on the alt-stack, we can't report it properly:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/System/Volumes/Data/mesos/work_dir/slaves/da1065b5-7b94-4f0d-85e9-a3a252b9a32e-S104193/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/c9064ba7-6869-444d-8711-6d351e4aef68/runs/13a2783a-fcf1-4587-9b41-fa976ba25ed0/workspace/open/src/hotspot/share/runtime/thread.hpp:469), pid=61163, tid=25603
#  assert(stack_base() > limit && limit >= stack_end()) failed: limit is outside of stack
#
...
---------------  T H R E A D  ---------------

Current thread (0x000000013d057e10):  
[error occurred during error reporting (printing current thread), id 0xe0000000, Internal Error (/System/Volumes/Data/mesos/work_dir/slaves/da1065b5-7b94-4f0d-85e9-a3a252b9a32e-S104193/frameworks/1735e8a2-a1db-478c-8104-60c8b0af87dd-0196/executors/c9064ba7-6869-444d-8711-6d351e4aef68/runs/13a2783a-fcf1-4587-9b41-fa976ba25ed0/workspace/open/src/hotspot/share/runtime/thread.hpp:469)]
Stack: [0x000000016ce30000,0x000000016ceb3000],  sp=0x0000000105d89920 **OUTSIDE STACK**.
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.dylib+0x1250500]  VMError::report(outputStream*, bool)+0x1d1c  (thread.hpp:469)
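
To illustrate why the assert fires: the check only considers the thread's regular stack range, and the reported sp lies nowhere near it. The following is a minimal sketch of that check plus one conceivable relaxation that also accepts an alternate stack range. It is not HotSpot code and not a proposed fix; the function names are invented for this example, and any alt-stack bounds passed to it are hypothetical since the log above does not show them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the shape of the failing assert:
 *   stack_base() > limit && limit >= stack_end()
 * with the stack occupying [stack_end, stack_base). */
bool in_stack(uint64_t stack_end, uint64_t stack_base, uint64_t addr) {
    return stack_base > addr && addr >= stack_end;
}

/* Hypothetical relaxation: treat an address as valid if it lies either on
 * the regular thread stack or on the thread's alternate signal stack. */
bool on_thread_or_alt_stack(uint64_t stack_end, uint64_t stack_base,
                            uint64_t alt_end, uint64_t alt_base,
                            uint64_t addr) {
    return in_stack(stack_end, stack_base, addr) ||
           in_stack(alt_end, alt_base, addr);
}
```

With the values from the report, `in_stack(0x16ce30000, 0x16ceb3000, 0x105d89920)` is false, which matches the "OUTSIDE STACK" annotation. The second function would only return true if the alt-stack bounds recorded for the thread actually enclose that sp.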

Interestingly, a few tests seem to fail only on macOS.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/29559#issuecomment-3982244026
PR Comment: https://git.openjdk.org/jdk/pull/29559#issuecomment-3982248264