RFR(S): JDK-8137035 tests got EXCEPTION_STACK_OVERFLOW on Windows 64 bit
Frederic Parain
frederic.parain at oracle.com
Fri Aug 26 20:00:53 UTC 2016
Hi,
Please review this fix for bug JDK-8137035.
The bug is confidential, but it is related to several VM crashes
that occurred on the 64-bit Windows platform in stack overflow
conditions. I've copied the analysis of the bug and the
description of the fix below.
Webrev:
http://cr.openjdk.java.net/~fparain/8137035/webrev.00/
Testing: JPRT (testset hotspot) and nsk.stress
Thanks,
Fred
---------
All these crashes related to stack overflows on Windows presumably have
the same causes:
- an undersized StackShadowPages parameter
- the behavior of guard pages on Windows
- a flaw in Yellow Pages management
These three factors combined together can lead to sporadic crashes of
the JVM when stack overflow conditions are encountered.
All the crashes listed in this CR and in the related CR are almost
impossible to reproduce, which indicates that the issue only shows up in
some extreme or uncommon conditions. By design, the JVM crashes on stack
overflow only if the Red Zone (the last one in the execution stack) is
hit. Before the Red Zone, there's the Yellow Zone which is here to
detect and handle stack overflows in a nicer way (throwing a
StackOverflowError instead of crashing the process). If the Red Zone is
hit, it means that the Yellow Zone was disabled, and there are only two
cases where the Yellow Zone is disabled:
1 - when a potential stack overflow is detected in Java code; in this
case the Yellow Zone is disabled during the generation of the
StackOverflowError and restored during the propagation of the
StackOverflowError
2 - when a stack overflow occurs either in native code or in JVM
code, because there's nothing else the JVM can do.
In several crashes, the call stack doesn't show any special recursive
Java calls that could suggest the JVM is in case 1. But they show
relatively complex code paths inside JVM code (de-optimization or
class/symbol resolution), which suggests that case 2 occurred.
The case of stack overflow in native code is straightforward: if the
Yellow Zone is hit, it is disabled, but when a JavaThread returns from
native code to Java code, the Yellow Zone is systematically re-enabled
(this is part of the native call wrapper generated by the JVM).
The case of stack overflow in JVM code is more problematic. The JVM
tries to avoid the case of stack overflow in VM code with the Shadow
Pages mechanism. Whenever a Java method is invoked, the JVM tries to
ensure that there's enough free stack space to execute the Java method
and *any call to the JVM code (or JDK native code) that could occur
during the execution of this method*. This check is performed by banging
(touching) n pages ahead on the execution stack, and n is set to
StackShadowPages. If the Yellow Zone is hit during the stack banging, a
StackOverflowError is thrown before the execution of the first bytecode
of the Java method. But this mechanism assumes that StackShadowPages
pages is big enough to cover *any call to the JVM*. If this assumption
is wrong, bad things happen.
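
The shadow page check described above can be illustrated with a small toy model. This is only a sketch, not HotSpot code: PAGE_SIZE, bang_stack, invoke_java_method and yellow_top are hypothetical names, the stack is reduced to plain integer addresses growing downwards, and every address below yellow_top is treated as protected.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


class StackOverflowSignal(Exception):
    """Stands in for the fault raised when a protected page is touched."""


def bang_stack(sp, shadow_pages, yellow_top):
    """Touch `shadow_pages` pages below `sp`.

    The stack grows downwards, so any touched address below `yellow_top`
    is considered to be inside the protected zone in this toy model.
    """
    for i in range(1, shadow_pages + 1):
        addr = sp - i * PAGE_SIZE
        if addr < yellow_top:
            raise StackOverflowSignal(addr)


def invoke_java_method(sp, shadow_pages, yellow_top):
    """Return "run" if the method may start, "StackOverflowError" otherwise."""
    try:
        bang_stack(sp, shadow_pages, yellow_top)
    except StackOverflowSignal:
        # The overflow is reported before the first bytecode executes.
        return "StackOverflowError"
    return "run"
```

In this model, the check only protects VM code that needs at most shadow_pages pages of stack; a deeper VM code path would reach the protected zone after the check has already passed.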
I ran experiments with tests for which stack overflow related crashes
were reported. I ran them with a JVM where the StackShadowPages value
was decreased by only 1 compared to the usual default value. It was very
easy to reproduce stack overflow crashes. By instrumenting the JVM, it
appeared that some threads hit the Yellow Zone while having thread state
_thread_in_vm. This means that in many cases, the margin between the
stack space provided by StackShadowPages and the real stack usage while
executing VM code is less than one page. And because knowing the biggest
stack requirement to execute any JVM code is an undecidable problem,
there's a high probability that some paths require more stack space than
StackShadowPages ensures. It is important to note that Windows is the
platform with the smallest default value for StackShadowPages.
So, an undersized StackShadowPages could cause the Yellow Zone to be hit
while executing JVM code. On Unices (Solaris, Linux, MacOSX), the
consequence is immediate: a SIGSEGV signal is sent, but because there's
no more free space on the execution stack, the signal handler cannot be
executed and the JVM process is killed. It's a crash without hs_error
file generation.
On Windows, the story is different. Yellow Pages are marked with the
"Guard" bit. When a page with the Guard bit set is touched, the current
thread receives an exception, but before the exception handler is
executed, the OS removes the Guard bit from the page, so the page that
triggered the fault can be used to execute the signal handler. So on
Windows, when the Yellow Zone is hit while executing JVM code, the JVM
doesn't die like on Unix systems, but the signal handler is executed.
The logic in the signal handler looks like this (simplified version):
    if thread touches the yellow zone:
        if thread_in_java:
            disable yellow pages
            jump to code throwing StackOverflowError
            // note: yellow pages will be re-enabled
            // while unwinding the stack
        else:
            // thread_in_vm or thread_in_native
            disable yellow pages
            resume execution
    else:
        // Fatal red zone violation.
        disable red pages
        generate VM crash
So, when the thread is executing JVM code, the signal handler disables
the protection of the Yellow Pages and resumes JVM code execution.
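
The simplified handler logic above can be written as a small runnable model. This is a sketch under stated assumptions, not the actual HotSpot handler: the Thread class, its fields and the string return values are hypothetical stand-ins for the real thread state and control flow.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    state: str                   # "in_java", "in_vm" or "in_native"
    yellow_enabled: bool = True  # protection of the Yellow Pages
    red_enabled: bool = True     # protection of the Red Pages


def handle_stack_fault(thread, zone):
    """Return the action taken for a fault in `zone` ("yellow" or "red")."""
    if zone == "yellow":
        # In both branches the Yellow Pages lose their protection.
        thread.yellow_enabled = False
        if thread.state == "in_java":
            # Re-enabled later, while unwinding the stack.
            return "throw StackOverflowError"
        # in_vm or in_native: nothing re-enables the pages on this path.
        return "resume"
    # Fatal Red Zone violation.
    thread.red_enabled = False
    return "crash"
```

The model makes the asymmetry visible: for a thread in VM code the handler returns "resume" and leaves yellow_enabled set to False, with no later step in this path that would turn it back on.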
Eventually, the thread will return from the VM and will continue
executing Java code. But at this point, the Yellow Pages are still
disabled and there's no systematic check to ensure that Yellow Pages are
re-enabled when returning to Java. The only places where the JVM checks
if Yellow Pages need to be re-activated are when returning from native
code and in the exception propagation code (but not all paths reactivate
the Yellow Zone).
Once the execution of Java code has resumed with the Yellow Zone
disabled, the thread is not protected any more against stack overflows.
The only remaining protection is the Red Zone, and if it is hit, the VM
will generate a crash report and die. Note that having the Yellow Zone
de-activated makes the stack banging of StackShadowPages ineffective.
Stack banging relies on the Yellow Pages being activated, so that
touching them triggers a signal. If Yellow Pages are de-activated
(unprotected), no signal is sent, unless the stack banging hits the Red
Page, which triggers a VM crash with hs_error file generation.
To summarize: an undersized StackShadowPages on Windows can lead to a
JavaThread executing Java code with Yellow Pages disabled, which means
without any stack overflow protection except the Red Zone which is the
one triggering VM crashes with hs_error file generation.
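
The difference between the two situations can be reproduced with a toy simulation. This is only an illustrative sketch: run_recursion is a hypothetical function, the stack is modelled as numbered pages (page 0 at the lowest address), each simulated Java call consumes exactly one page, and the page counts used as defaults are arbitrary values, not the real JVM defaults.

```python
def run_recursion(yellow_enabled, depth_pages, shadow_pages=6,
                  yellow_pages=3, red_pages=1, stack_pages=64):
    """Simulate `depth_pages` nested Java calls, one page per call.

    Pages [0, red_pages) form the Red Zone and the next `yellow_pages`
    pages form the Yellow Zone. Before each call, `shadow_pages` pages
    below the current stack pointer are banged (touched).
    """
    sp = stack_pages  # page index of the current stack pointer
    for _ in range(depth_pages):
        for i in range(1, shadow_pages + 1):
            page = sp - i
            if page < red_pages:
                # Touching the Red Zone is always fatal.
                return "VM crash (hs_error file)"
            if page < red_pages + yellow_pages and yellow_enabled:
                # A protected Yellow Page turns the overflow into an error.
                return "StackOverflowError"
        sp -= 1  # the call itself consumes one page
    return "completed"
```

With yellow_enabled=True, a deep recursion ends with "StackOverflowError"; with yellow_enabled=False, the same recursion bangs through the unprotected Yellow Pages without any signal and ends with a VM crash once the Red Zone is touched. A shallow recursion completes in both cases, which matches the observation that the problem only shows up in uncommon conditions.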
Note that the Yellow Pages can be "incidentally" re-activated by a call
to native code or by throwing an exception. This could explain why
stack overflow crashes are not so frequent: the time window during which
Java code is executed without stack overflow protection might be small
for some applications.
Proposed fixes for this issue:
- increase StackShadowPages for the Windows platform
- add an assertion in the signal handler to detect threads hitting the
Yellow Zone while executing JVM code (to detect undersized
StackShadowPages during our testing)
- ensure Yellow Pages are activated when transitioning from
_thread_in_vm to _thread_in_java
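
The last two items can be sketched with the same kind of toy model. This is not the code in the webrev: JavaThread, on_yellow_fault and transition_vm_to_java are hypothetical names, and the debug_build flag is an assumed stand-in for an assertion that is only active in debug builds.

```python
class JavaThread:
    def __init__(self, debug_build=False):
        self.state = "in_java"
        self.yellow_enabled = True
        self.debug_build = debug_build

    def on_yellow_fault(self):
        """Model of the handler path for a Yellow Zone hit."""
        if self.debug_build:
            # Proposed assertion: a hit while in VM code indicates an
            # undersized StackShadowPages value.
            assert self.state != "in_vm", "Yellow Zone hit in VM code"
        self.yellow_enabled = False

    def transition_vm_to_java(self, reenable=True):
        """Model of the _thread_in_vm to _thread_in_java transition."""
        if reenable and not self.yellow_enabled:
            # Proposed fix: never resume Java code with the Yellow
            # Pages left unprotected.
            self.yellow_enabled = True
        self.state = "in_java"
```

Calling transition_vm_to_java(reenable=False) reproduces the behavior described in the analysis (Java code resumes with yellow_enabled still False), while the default restores the protection on every return to Java code.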