RFR(S): JDK-8137035 tests got EXCEPTION_STACK_OVERFLOW on Windows 64 bit
David Holmes
david.holmes at oracle.com
Mon Aug 29 01:36:56 UTC 2016
Hi Fred,
On 27/08/2016 6:00 AM, Frederic Parain wrote:
> Hi,
>
> Please review this fix for bug JDK-8137035
> The bug is confidential but it is related to several VM crashes
> that occurred on the 64-bit Windows platform in stack overflow
> conditions. I've copied/pasted the analysis of the bug and the
> description of the fix below.
The analysis and solution all seem reasonable, though I do have to
wonder why the failure to re-enable the yellow zone when returning to
Java does not cause far more problems, on all platforms.
>
> Webrev:
> http://cr.openjdk.java.net/~fparain/8137035/webrev.00/
src/os/windows/vm/os_windows.cpp
While examining the thread state logic in the exception handler I
noticed some pre-existing bugs:
2506 if (exception_code == EXCEPTION_ACCESS_VIOLATION) {
2507 JavaThread* thread = (JavaThread*) t;
there is no check that t is in fact a JavaThread, or even that t is
non-NULL. Such checks occur slightly later:
2523 if (t != NULL && t->is_Java_thread()) {
2524 JavaThread* thread = (JavaThread*) t;
This bug seems significant:
2566 if (thread->stack_guards_enabled()) {
2567 if (_thread_in_Java) {
_thread_in_Java is an enum value not a variable so we will always
execute this block! This code should be testing the local in_java variable.
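
To illustrate the point, here is a minimal sketch with a stand-in enum. The
namespace, the function names and the numeric values are hypothetical and only
chosen for the example; how the real handler computes in_java is not shown
here, so the comparison below is an assumption.

```cpp
#include <cassert>

namespace toy {

// Hypothetical stand-in for the VM's thread-state enum; the numeric values
// are illustrative only.
enum JavaThreadState {
  _thread_in_native = 4,
  _thread_in_vm     = 6,
  _thread_in_Java   = 8
};

// Shape of the test as quoted above: the condition is the enumerator itself,
// a non-zero constant, so the branch is taken whatever the thread is doing.
inline bool takes_java_branch_buggy(JavaThreadState /*actual_state*/) {
  if (_thread_in_Java) {
    return true;
  }
  return false;
}

// Shape of the intended test: compare the thread's actual state
// (assumed here to be what a local in_java flag would hold).
inline bool takes_java_branch_fixed(JavaThreadState actual_state) {
  const bool in_java = (actual_state == _thread_in_Java);
  return in_java;
}

// Small self-check showing where the two versions disagree.
inline void demo() {
  assert(takes_java_branch_buggy(_thread_in_vm));    // wrongly taken
  assert(!takes_java_branch_fixed(_thread_in_vm));   // correctly skipped
  assert(takes_java_branch_fixed(_thread_in_Java));  // taken when in Java
}

}  // namespace toy
```

Most compilers can flag a constant condition like this one when warnings are
turned up, which may be a cheap way to catch similar cases elsewhere.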
Your changes seem fine in themselves.
Thanks,
David
> Testing: JPRT (testset hotspot) and nsk.stress
>
> Thanks,
>
> Fred
>
> ---------
>
> All these crashes related to stack overflows on Windows have presumably
> the same causes:
> - an undersized StackShadowPages parameter
> - the behavior of guard pages on Windows
> - a flaw in Yellow Pages management
>
> These three factors combined together can lead to sporadic crashes of
> the JVM when stack overflow conditions are encountered.
>
> All the crashes listed in this CR and in the related CR are almost
> impossible to reproduce, which indicates that the issue only shows up in
> some extreme or uncommon conditions. By design, the JVM crashes on stack
> overflow only if the Red Zone (the last zone of the execution stack) is
> hit. Before the Red Zone, there's the Yellow Zone, which is there to
> detect and handle stack overflows in a nicer way (throwing a
> StackOverflowError instead of crashing the process). If the Red Zone is
> hit, it means that the Yellow Zone was disabled, and there are only two
> cases where the Yellow Zone is disabled:
>
> 1 - when a potential stack overflow is detected in Java code; in this
>     case the Yellow Zone is disabled during the generation of the
>     StackOverflowError and restored during the propagation of the
>     StackOverflowError
> 2 - when a stack overflow occurs either in native code or in JVM code,
>     because there's nothing else the JVM can do.
>
> In several crashes, the call stack doesn't show any special recursive
> Java calls that could suggest the JVM is in case 1. But they show
> relatively complex code paths inside JVM code (de-optimization or
> class/symbol resolution), which suggests that case 2 occurred.
>
> The case of stack overflow in native code is straightforward: if the
> Yellow Zone is hit, it is disabled, but when a JavaThread returns from
> native code to Java code, the Yellow Zone is systematically re-enabled
> (this is part of the native call wrapper generated by the JVM).
>
> The case of stack overflow in JVM code is more problematic. The JVM
> tries to avoid the case of stack overflow in VM code with the Shadow
> Pages mechanism. Whenever a Java method is invoked, the JVM tries to
> ensure that there's enough free stack space to execute the Java method
> and *any call to the JVM code (or JDK native code) that could occur
> during the execution of this method*. This check is performed by banging
> (touching) n pages ahead on the execution stack, and n is set to
> StackShadowPages. If the Yellow Zone is hit during the stack banging, a
> StackOverflowError is thrown before the execution of the first bytecode
> of the Java method. But this mechanism assumes that StackShadowPages
> pages is big enough to cover *any call to the JVM*. If this assumption
> is wrong, bad things happen.
>
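
The shadow-page check described above can be sketched with a toy model. This is
not HotSpot code: the stack is reduced to page indices, and every name here
(ToyStack, invoke, the Outcome values) is hypothetical.

```cpp
#include <cassert>

namespace toy_bang {

// Toy stack: pages are numbered in the direction of stack growth, and the
// Yellow Zone starts at page index yellow_start. Sizes are arbitrary.
struct ToyStack {
  int yellow_start;  // index of the first yellow page
  int current_page;  // page holding the current stack pointer
};

enum Outcome { METHOD_RUNS, STACK_OVERFLOW_ERROR, YELLOW_HIT_IN_VM };

// On method entry, touch shadow_pages pages ahead of the stack pointer.
// vm_pages_needed models the deepest VM/native call made by the method.
inline Outcome invoke(const ToyStack& s, int shadow_pages, int vm_pages_needed) {
  for (int i = 1; i <= shadow_pages; i++) {
    if (s.current_page + i >= s.yellow_start) {
      // Banging touched a yellow page: error is raised before any bytecode.
      return STACK_OVERFLOW_ERROR;
    }
  }
  // Banging succeeded, so the method runs and later calls into the VM.
  if (s.current_page + vm_pages_needed >= s.yellow_start) {
    // The VM call needed more stack than the shadow pages reserved.
    return YELLOW_HIT_IN_VM;
  }
  return METHOD_RUNS;
}

// Self-check covering the three outcomes.
inline void demo() {
  assert(invoke(ToyStack{100, 50}, 6, 6) == METHOD_RUNS);
  assert(invoke(ToyStack{100, 95}, 6, 6) == STACK_OVERFLOW_ERROR);
  assert(invoke(ToyStack{100, 93}, 6, 7) == YELLOW_HIT_IN_VM);
}

}  // namespace toy_bang
```

The third case is the problematic one: banging passes with 6 shadow pages, but
a VM path needing 7 pages reaches the Yellow Zone while in VM code.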
> I ran experiments with tests for which stack overflow related crashes
> were reported. I ran them with a JVM where the StackShadowPages value
> was decreased by only 1 compared to the usual default value. It was very
> easy to reproduce stack overflow crashes. By instrumenting the JVM, it
> appeared that some threads hit the Yellow Zone while having thread state
> _thread_in_vm. Which means that in many cases, the margin between the
> stack space provided by StackShadowPages and the real stack usage while
> executing VM code is less than one page. And because knowing the biggest
> stack requirement to execute any JVM code is an undecidable problem,
> there's a high probability that some paths require more stack space than
> StackShadowPages ensures. It is important to notice that Windows is the
> platform with the smallest default value for StackShadowPages.
>
> So, an undersized StackShadowPages could cause the Yellow Zone to be hit
> while executing JVM code. On Unices (Solaris, Linux, MacOSX), the
> sanction is immediate: a SIGSEGV signal is sent, but because there's no
> more free space on the execution stack, the signal handler cannot be
> executed and the JVM process is killed. It's a crash without hs_error
> file generation.
>
> On Windows, the story is different. Yellow Pages are marked with the
> "Guard" bit. When a page with a Guard bit set is touched, the current
> thread receives an exception, but before the exception handler is
> executed, the OS removes the Guard bit from the page, so the page that
> triggered the fault can be used to execute the signal handler. So on
> Windows, when the Yellow Zone is hit while executing JVM code, the JVM
> doesn't die like on Unix systems, but the signal handler is executed.
>
> The logic in the signal handler looks like this (simplified version):
>
> if thread touches the yellow zone:
>     if thread_in_java:
>         disable yellow pages
>         jump to code throwing StackOverflowError
>         // note: yellow pages will be re-enabled
>         // while unwinding the stack
>     else:
>         // thread_in_vm or thread_in_native
>         disable yellow pages
>         resume execution
> else:
>     // Fatal red zone violation.
>     disable red pages
>     generate VM crash
>
> So, the signal handler disables the protection of the Yellow Pages and
> resumes JVM code execution.
>
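
The simplified handler logic above can be written as a small decision function.
This is a sketch over a toy thread record, not the actual handler in
os_windows.cpp; the types and names (ToyThread, handle_guard_fault, the enum
values) are hypothetical.

```cpp
#include <cassert>

namespace toy_handler {

enum ThreadState { IN_JAVA, IN_VM, IN_NATIVE };
enum Zone { YELLOW, RED };
enum Action { THROW_STACK_OVERFLOW_ERROR, RESUME_EXECUTION, VM_CRASH };

// Toy thread record: only the state and the guard-zone flags are modelled.
struct ToyThread {
  ThreadState state;
  bool yellow_enabled;
  bool red_enabled;
};

// Mirrors the pseudocode: the yellow pages are disabled in both yellow-zone
// branches; only the Java branch leads to code that re-enables them later.
inline Action handle_guard_fault(ToyThread& t, Zone touched) {
  if (touched == YELLOW) {
    t.yellow_enabled = false;
    if (t.state == IN_JAVA) {
      return THROW_STACK_OVERFLOW_ERROR;  // re-enabled while unwinding
    }
    return RESUME_EXECUTION;              // in VM or native: nothing re-enables
  }
  t.red_enabled = false;                  // fatal red zone violation
  return VM_CRASH;
}

// Self-check: a yellow-zone hit in VM code resumes with the zone disabled.
inline void demo() {
  ToyThread t{IN_VM, true, true};
  assert(handle_guard_fault(t, YELLOW) == RESUME_EXECUTION);
  assert(!t.yellow_enabled);
  assert(t.red_enabled);
}

}  // namespace toy_handler
```

After the IN_VM case returns, the toy thread keeps running with
yellow_enabled == false, which is the state the rest of the analysis is
concerned with.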
> Eventually, the thread will return from the VM and will continue
> executing Java code. But at this point, the yellow pages are still
> disabled and there's no systematic check to ensure that Yellow Pages are
> re-enabled when returning to Java. The only places where the JVM checks
> whether Yellow Pages need to be re-activated are the return from native
> code and the exception propagation code (but not all paths reactivate
> the Yellow Zone).
>
> Once the execution of Java code has resumed with the yellow zone
> disabled, the thread is not protected any more against stack overflows.
> The only remaining protection is the red zone, and if it is hit, the VM
> will generate a crash report and die. Note that having the Yellow Zone
> de-activated makes the stack banging of StackShadowPages ineffective.
> Stack banging relies on the Yellow Pages to be activated, so touching
> them triggers a signal. If Yellow Pages are de-activated (unprotected)
> no signal is sent, unless the stack banging hits the Red Page, which
> triggers a VM crash with hs_error file generation.
>
>
> To summarize: an undersized StackShadowPages on Windows can lead to a
> JavaThread executing Java code with Yellow Pages disabled, which means
> without any stack overflow protection except the Red Zone which is the
> one triggering VM crashes with hs_error file generation.
>
> Note that the Yellow Pages can be "incidentally" re-activated by a call
> to native code or by throwing an exception. This could explain why
> stack overflow crashes are not so frequent: the time window during which
> Java code is executed without stack overflow protection might be small
> for some applications.
>
>
> Proposed fixes for this issue:
> - increase StackShadowPages for the Windows platform
> - add an assertion in the signal handler to detect threads hitting the Yellow
> Zone while executing JVM code (to detect undersized StackShadowPages
> during our testing)
> - ensure Yellow Pages are activated when transitioning from
> _thread_in_vm to _thread_in_java
>
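
The third proposed fix can be sketched on a toy thread record. This is only an
illustration of the idea, not the code in the webrev: the names are
hypothetical, and the sp_clear_of_yellow condition is an assumption of this
sketch (re-arming the zone while the stack pointer is still inside it would
fault again immediately).

```cpp
#include <cassert>

namespace toy_fix {

enum ThreadState { IN_VM, IN_JAVA };

// Toy thread record for the transition back to Java code.
struct ToyThread {
  ThreadState state;
  bool yellow_enabled;
  bool sp_clear_of_yellow;  // assumption: stack has unwound out of the zone
};

// Systematic check on the VM -> Java transition: if an earlier fault left
// the Yellow Zone disabled, re-arm it before Java code resumes.
inline void transition_vm_to_java(ToyThread& t) {
  if (!t.yellow_enabled && t.sp_clear_of_yellow) {
    t.yellow_enabled = true;
  }
  t.state = IN_JAVA;
}

// Self-check: a thread that lost its Yellow Zone in the VM gets it back.
inline void demo() {
  ToyThread t{IN_VM, false, true};
  transition_vm_to_java(t);
  assert(t.state == IN_JAVA);
  assert(t.yellow_enabled);
}

}  // namespace toy_fix
```

With such a check on every return to Java, the window during which Java code
runs with only the Red Zone as protection closes at the transition instead of
lasting until the next native call or exception.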