RFR(S): JDK-8137035 tests got EXCEPTION_STACK_OVERFLOW on Windows 64 bit

Mon Aug 29 01:36:56 UTC 2016

Hi Fred,

On 27/08/2016 6:00 AM, Frederic Parain wrote:
> Hi,
>
> Please review this fix for bug JDK-8137035
> The bug is confidential but it is related to several VM crashes
> that occurred on the 64-bit Windows platform in stack overflow
> conditions. I've copied/pasted the analysis of the bug and the
> description of the fix below.

The analysis and solution all seem reasonable, though I do have to 
wonder how the failure to re-enable the yellow zone when returning to 
Java would not cause far more problems, on all platforms.
>
> Webrev:
> http://cr.openjdk.java.net/~fparain/8137035/webrev.00/

src/os/windows/vm/os_windows.cpp

While examining the thread state logic in the exception handler I 
noticed some pre-existing bugs:

2506   if (exception_code == EXCEPTION_ACCESS_VIOLATION) {
2507     JavaThread* thread = (JavaThread*) t;

there is no check that t is in fact a JavaThread, or even that t is 
non-NULL. Such checks occur slightly later:

2523   if (t != NULL && t->is_Java_thread()) {
2524     JavaThread* thread = (JavaThread*) t;

This bug seems significant:

2566       if (thread->stack_guards_enabled()) {
2567         if (_thread_in_Java) {

_thread_in_Java is a (nonzero) enum value, not a variable, so we will 
always execute this block! This code should be testing the local in_java 
variable.
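
To illustrate the point, here is a minimal standalone sketch, not the actual os_windows.cpp code. The enum, its numeric values and both function names are hypothetical; the only assumption that matters is that the enumerator is nonzero.

```cpp
#include <cassert>

// Hypothetical stand-in for the VM's thread-state enum. The numeric values
// are assumptions; what matters is that the Java enumerator is nonzero.
enum ThreadStateSketch {
  sketch_thread_in_native = 4,
  sketch_thread_in_vm     = 6,
  sketch_thread_in_Java   = 8
};

// Buggy form: the condition tests the enumerator itself, a nonzero
// constant, so the branch is taken regardless of the thread's real state.
inline bool buggy_takes_java_branch(ThreadStateSketch /*actual_state*/) {
  if (sketch_thread_in_Java) {
    return true;
  }
  return false;
}

// Intended form: test a local flag computed from the thread's real state.
inline bool fixed_takes_java_branch(ThreadStateSketch actual_state) {
  bool in_java = (actual_state == sketch_thread_in_Java);
  return in_java;
}
```

With a thread that is really in VM code, the buggy form still takes the "in Java" branch while the fixed form does not.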

Your changes seem fine in themselves.

Thanks,
David

> Testing: JPRT (testset hotspot) and nsk.stress
>
> Thanks,
>
> Fred
>
> ---------
>
> All these crashes related to stack overflows on Windows have presumably
> the same causes:
>     - an undersized StackShadowPages parameter
>     - the behavior of guard pages on Windows
>     - a flaw in Yellow Pages management
>
> These three factors combined together can lead to sporadic crashes of
> the JVM when stack overflow conditions are encountered.
>
> All the crashes listed in this CR and in the related CR are almost
> impossible to reproduce, which indicates that the issue only shows up in
> some extreme or uncommon conditions. By design, the JVM crashes on stack
> overflow only if the Red Zone (the last one in the execution stack) is
> hit. Before the Red Zone, there's the Yellow Zone, which is there to
> detect and handle stack overflows in a nicer way (throwing a
> StackOverflowError instead of crashing the process). If the Red Zone is
> hit, it means that the Yellow Zone was disabled, and there are only two
> cases where the Yellow Zone is disabled:
>
>   1 - when a potential stack overflow is detected in Java code; in this
> case the Yellow Zone is disabled during the generation of the
> StackOverflowError and restored during the propagation of the
> StackOverflowError
>   2 - when a stack overflow occurs either in native code or in JVM code,
> because there's nothing else the JVM can do.
>
> In several crashes, the call stack doesn't show any special recursive
> Java calls that could suggest the JVM is in case 1. But they show
> relatively complex code paths inside JVM code (de-optimization or
> class/symbol resolution), which suggests that case 2 occurred.
>
> The case of stack overflow in native code is straightforward: if the
> Yellow Zone is hit, it is disabled, but when a JavaThread returns from
> native code to Java code, the Yellow Zone is systematically re-enabled
> (this is part of the native call wrapper generated by the JVM).
>
> The case of stack overflow in JVM code is more problematic. The JVM
> tries to avoid the case of stack overflow in VM code with the Shadow
> Pages mechanism. Whenever a Java method is invoked, the JVM tries to
> ensure that there's enough free stack space to execute the Java method
> and *any call to the JVM code (or JDK native code) that could occur
> during the execution of this method*. This check is performed by banging
> (touching) n pages ahead on the execution stack, and n is set to
> StackShadowPages. If the Yellow Zone is hit during the stack banging, a
> StackOverflowError is thrown before the execution of the first bytecode
> of the Java method. But this mechanism assumes that StackShadowPages
> pages are enough to cover *any call to the JVM*. If this assumption
> is wrong, bad things happen.
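
A small model may help make the shadow-page assumption concrete. This is only a sketch under stated assumptions, not HotSpot code: the page numbering, the struct and all function names are hypothetical. Pages are counted upward from the stack limit, with the red zone lowest and the yellow zone just above it.

```cpp
#include <cassert>

// Hypothetical layout: pages [0, red_pages) are the red zone,
// [red_pages, red_pages + yellow_pages) the yellow zone, and everything
// above is ordinary usable stack. The stack grows toward page 0.
struct StackLayoutSketch {
  int red_pages;
  int yellow_pages;
};

// True if `page` lies in a guard zone (yellow or red).
inline bool in_guard_zone(const StackLayoutSketch& s, int page) {
  return page < s.red_pages + s.yellow_pages;
}

// Bang (touch) `shadow_pages` pages below the current page at method entry.
// Returns false if a guard page is touched, which models throwing
// StackOverflowError before the first bytecode of the method runs.
inline bool bang_ok(const StackLayoutSketch& s, int current_page,
                    int shadow_pages) {
  for (int i = 1; i <= shadow_pages; ++i) {
    if (in_guard_zone(s, current_page - i)) return false;
  }
  return true;
}

// After a successful bang, VM code that needs `vm_pages_needed` pages can
// still reach the guard zone if it needs more than was banged.
inline bool vm_call_hits_guard(const StackLayoutSketch& s, int current_page,
                               int vm_pages_needed) {
  return in_guard_zone(s, current_page - vm_pages_needed);
}
```

In this model a bang of 6 pages from page 10 succeeds with 1 red and 3 yellow pages, yet a VM path needing 7 pages from the same point lands in the yellow zone, which is the undersized case described above.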
>
> I ran experiments with tests for which stack overflow related crashes
> were reported. I ran them with a JVM where the StackShadowPages value
> was decreased by only 1 compared to the usual default value. It was very
> easy to reproduce stack overflow crashes. By instrumenting the JVM, it
> appeared that some threads hit the Yellow Zone while having thread state
> _thread_in_vm. This means that in many cases, the margin between the
> stack space provided by StackShadowPages and the real stack usage while
> executing VM code is less than one page. And because knowing the biggest
> stack requirement to execute any JVM code is an undecidable problem,
> there's a high probability that some paths require more stack space than
> StackShadowPages ensures. It is important to notice that Windows is the
> platform with the smallest default value for StackShadowPages.
>
> So, an undersized StackShadowPages could cause the Yellow Zone to be hit
> while executing JVM code. On Unices (Solaris, Linux, MacOSX), the
> sanction is immediate: a SIGSEGV signal is sent, but because there's no
> more free space on the execution stack, the signal handler cannot be
> executed and the JVM process is killed. It's a crash without hs_error
> file generation.
>
> On Windows, the story is different. Yellow Pages are marked with the
> "Guard" bit. When a page with the Guard bit set is touched, the current
> thread receives an exception, but before the exception handler is
> executed, the OS removes the Guard bit from the page, so the page that
> triggered the fault can be used to execute the signal handler. So on
> Windows, when the Yellow Zone is hit while executing JVM code, the JVM
> doesn't die as it does on Unix systems; instead the signal handler is
> executed.
>
> The logic in the signal handler looks like this (simplified version):
>
>    if thread touches the yellow zone:
>       if thread_in_java:
>           disable yellow pages
>           jump to code throwing StackOverflowError
>           // note: yellow pages will be re-enabled
>           // while unwinding the stack
>       else:
>           // thread_in_vm or thread_in_native
>           disable yellow pages
>           resume execution
>    else:
>        // Fatal red zone violation.
>        disable red pages
>        generate VM crash
>
> So, the signal handler disables the protection of the Yellow Pages and
> resumes JVM code execution.
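
The simplified handler logic above can be modelled as follows. This is a hedged sketch only: the types, field names and the function `handle_guard_fault` are hypothetical and stand in for the real exception filter.

```cpp
#include <cassert>

// Hypothetical thread states and handler outcomes for the model.
enum class StateSketch { in_Java, in_vm, in_native };
enum class Outcome { ThrowStackOverflowError, ResumeUnprotected, FatalCrash };

struct ThreadSketch {
  StateSketch state;
  bool yellow_enabled;
  bool red_enabled;
  explicit ThreadSketch(StateSketch s)
      : state(s), yellow_enabled(true), red_enabled(true) {}
};

// Mirrors the pseudocode: a yellow-zone hit always disables the yellow
// pages; only the in-Java path leads to code that re-enables them later
// (while unwinding for the StackOverflowError).
inline Outcome handle_guard_fault(ThreadSketch& t, bool fault_in_yellow_zone) {
  if (fault_in_yellow_zone) {
    t.yellow_enabled = false;
    if (t.state == StateSketch::in_Java) {
      return Outcome::ThrowStackOverflowError;
    }
    // in_vm or in_native: execution simply resumes, with the yellow
    // zone left disabled.
    return Outcome::ResumeUnprotected;
  }
  // Fatal red zone violation.
  t.red_enabled = false;
  return Outcome::FatalCrash;
}
```

The case of interest is a thread in VM state: the handler returns `ResumeUnprotected` and nothing in this path turns the yellow zone back on.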
>
> Eventually, the thread will return from the VM and will continue
> executing Java code.  But at this point, the yellow pages are still
> disabled and there's no systematic check to ensure that Yellow Pages are
> re-enabled when returning to Java. The only places where the JVM checks
> if Yellow Pages need to be re-activated are when returning from native
> code or in the exception propagation code (but not all paths reactivate
> the Yellow Zone).
>
> Once the execution of Java code has resumed with the yellow zone
> disabled, the thread is not protected any more against stack overflows.
> The only remaining protection is the red zone, and if it is hit, the VM
> will generate a crash report and die. Note that having the Yellow Zone
> de-activated makes the stack banging of StackShadowPages ineffective.
> Stack banging relies on the Yellow Pages being activated, so that touching
> them triggers a signal. If Yellow Pages are de-activated (unprotected),
> no signal is sent unless the stack banging hits the Red Page, which
> triggers a VM crash with hs_error file generation.
>
>
> To summarize: an undersized StackShadowPages on Windows can lead to a
> JavaThread executing Java code with Yellow Pages disabled, which means
> without any stack overflow protection except the Red Zone which is the
> one triggering VM crashes with hs_error file generation.
>
> Note that the Yellow Pages can be "incidentally" re-activated by a call
> to native code or by throwing an exception. This could explain why
> stack overflow crashes are not so frequent: the time window during which
> Java code is executed without stack overflow protection might be small
> for some applications.
>
>
> Proposed fixes for this issue:
>   - increase StackShadowPages for the Windows platform
>   - add an assertion in the signal handler to detect a thread hitting the
> Yellow Zone while executing JVM code (to detect undersized
> StackShadowPages during our testing)
>   - ensure Yellow Pages are activated when transitioning from
> _thread_in_vm to _thread_in_Java
>
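
The third proposed fix can be sketched roughly as below. This is not the code in the webrev; the struct, its fields and both functions are hypothetical, and the counter merely stands in for the proposed debug assertion.

```cpp
#include <cassert>

// Hypothetical per-thread bookkeeping for the model.
struct JavaThreadSketch {
  bool in_vm;
  bool yellow_enabled;
  int  yellow_hits_in_vm;  // stands in for the proposed debug assertion
};

// What the handler does today when VM code touches the yellow zone:
// disable the yellow pages and let execution continue.
inline void on_yellow_hit_in_vm(JavaThreadSketch& t) {
  t.yellow_enabled = false;
  ++t.yellow_hits_in_vm;
}

// Proposed behaviour: on the transition from VM code back to Java code,
// make sure the yellow zone is active again before Java code runs.
inline void transition_vm_to_java(JavaThreadSketch& t) {
  if (!t.yellow_enabled) {
    t.yellow_enabled = true;
  }
  t.in_vm = false;
}
```

With such a check on the transition, a yellow-zone hit in VM code no longer leaves the thread running Java code with only the red zone as protection.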