RFR(S): JDK-8137035 tests got EXCEPTION_STACK_OVERFLOW on Windows 64 bit

Fri Aug 26 20:00:53 UTC 2016

Hi,

Please review this fix for bug JDK-8137035.
The bug is confidential, but it is related to several VM crashes
that occurred on the Windows 64-bit platform in stack overflow
conditions. I've copied/pasted the analysis of the bug and the
description of the fix below.

Webrev:
http://cr.openjdk.java.net/~fparain/8137035/webrev.00/

Testing: JPRT (testset hotspot) and nsk.stress

Thanks,

Fred

---------

All these crashes related to stack overflows on Windows have presumably 
the same causes:
     - an undersized StackShadowPages parameter
     - the behavior of guard pages on Windows
     - a flaw in Yellow Pages management

These three factors combined together can lead to sporadic crashes of 
the JVM when stack overflow conditions are encountered.

All the crashes listed in this CR and in the related CR are almost
impossible to reproduce, which indicates that the issue only shows up in
some extreme or uncommon conditions. By design, the JVM crashes on stack
overflow only if the Red Zone (the last zone of the execution stack) is
hit. Before the Red Zone, there's the Yellow Zone, which is there to
detect and handle stack overflows in a nicer way (throwing a
StackOverflowError instead of crashing the process). If the Red Zone is
hit, it means that the Yellow Zone was disabled, and there are only two
cases where the Yellow Zone is disabled:

   1 - when a potential stack overflow is detected in Java code; in this
case the Yellow Zone is disabled during the generation of the
StackOverflowError and restored during the propagation of the
StackOverflowError
   2 - when a stack overflow occurs either in native code or in JVM
code, because there's nothing else the JVM can do.

In several crashes, the call stack doesn't show any special recursive 
Java calls that could suggest the JVM is in case 1. But they show 
relatively complex code paths inside JVM code (de-optimization or 
class/symbol resolution), which suggests that case 2 occurred.

The case of stack overflow in native code is straightforward: if the
Yellow Zone is hit, it is disabled, but when a JavaThread returns from
native code to Java code, the Yellow Zone is systematically re-enabled
(this is part of the native call wrapper generated by the JVM).

The case of stack overflow in JVM code is more problematic. The JVM 
tries to avoid the case of stack overflow in VM code with the Shadow 
Pages mechanism. Whenever a Java method is invoked, the JVM tries to 
ensure that there's enough free stack space to execute the Java method 
and *any call to the JVM code (or JDK native code) that could occur 
during the execution of this method*. This check is performed by banging 
(touching) n pages ahead on the execution stack, and n is set to 
StackShadowPages. If the Yellow Zone is hit during the stack banging, a 
StackOverflowError is thrown before the execution of the first bytecode 
of the Java method. But this mechanism assumes that StackShadowPages 
pages is big enough to cover *any call to the JVM*. If this assumption
is wrong, bad things happen.
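
To make the banging check concrete, here is a minimal sketch in C++. It
is not HotSpot code: StackModel and bang_hits_yellow_zone are
hypothetical names, the stack is assumed to grow toward lower addresses,
and the check simply touches one address per page below the stack
pointer, which is a simplification of what the generated code does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of a thread stack that grows toward lower addresses.
struct StackModel {
    std::uintptr_t yellow_low;   // lowest address of the Yellow Zone (inclusive)
    std::uintptr_t yellow_high;  // end of the Yellow Zone (exclusive)
    std::size_t page_size;       // size of one page in bytes
};

// Models stack banging: touch one address in each of the shadow_pages
// pages below sp and report whether any touch lands in the Yellow Zone.
// A true result corresponds to throwing StackOverflowError before the
// first bytecode of the method runs.
bool bang_hits_yellow_zone(const StackModel& s, std::uintptr_t sp,
                           std::size_t shadow_pages) {
    assert(s.page_size > 0);
    assert(sp >= shadow_pages * s.page_size);  // avoid unsigned wrap-around
    for (std::size_t i = 1; i <= shadow_pages; ++i) {
        std::uintptr_t touched = sp - i * s.page_size;
        if (touched >= s.yellow_low && touched < s.yellow_high) {
            return true;
        }
    }
    return false;
}
```

In this model, a shadow_pages value that is too small lets the method
start even though the VM code it calls later will run into the Yellow
Zone.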

I ran experiments with tests for which stack overflow related crashes
were reported. I ran them with a JVM where the StackShadowPages value
was decreased by only 1 compared to the usual default value. It was very
easy to reproduce stack overflow crashes. By instrumenting the JVM, it
appeared that some threads hit the Yellow Zone while having thread state
_thread_in_vm. This means that in many cases, the margin between the
stack space provided by StackShadowPages and the real stack usage while
executing VM code is less than one page. And because knowing the biggest
stack requirement to execute any JVM code is an undecidable problem,
there's a high probability that some paths require more stack space than
StackShadowPages ensures. It is important to note that Windows is the
platform with the smallest default value for StackShadowPages.

So, an undersized StackShadowPages could cause the Yellow Zone to be hit
while executing JVM code. On Unices (Solaris, Linux, MacOSX), the
consequence is immediate: a SIGSEGV signal is sent, but because there's
no more free space on the execution stack, the signal handler cannot be
executed and the JVM process is killed. It's a crash without hs_error
file generation.

On Windows, the story is different. Yellow Pages are marked with the
"Guard" bit. When a page with the Guard bit set is touched, the current
thread receives an exception, but before the exception handler is
executed, the OS removes the Guard bit from the page, so the page that
triggered the fault can be used to execute the signal handler. So on
Windows, when the Yellow Zone is hit while executing JVM code, the JVM
doesn't die as it does on Unices; instead, the signal handler is
executed.

The logic in the signal handler looks like this (simplified version):

    if thread touches the yellow zone:
       if thread_in_java:
           disable yellow pages
           jump to code throwing StackOverflowError
           // note: yellow pages will be re-enabled
           // while unwinding the stack
       else:
           // thread_in_vm or thread_in_native
           disable yellow pages
           resume execution
    else:
       // Fatal red zone violation.
       disable red pages
       generate VM crash

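The same decision logic can be written as a small C++ model. This is
only a sketch of the simplified pseudocode above, not the actual
handler: GuardState, ThreadState, Action and handle_stack_fault are
hypothetical names, and the real handler does much more.

```cpp
// Hypothetical model of the thread states relevant to the handler.
enum class ThreadState { InJava, InVM, InNative };

// What the handler decides to do after a guard page fault.
enum class Action { ThrowStackOverflowError, ResumeExecution, CrashVM };

// Protection status of the guard zones of one thread.
struct GuardState {
    bool yellow_enabled = true;
    bool red_enabled = true;
};

// Mirrors the simplified handler: a Yellow Zone hit always disables the
// yellow pages; only a thread in Java gets a StackOverflowError, a thread
// in VM or native code simply resumes. Any other guard fault is treated
// as a fatal Red Zone violation.
Action handle_stack_fault(GuardState& g, ThreadState state,
                          bool in_yellow_zone) {
    if (in_yellow_zone) {
        g.yellow_enabled = false;
        if (state == ThreadState::InJava) {
            // Yellow pages are re-enabled later, while unwinding the stack.
            return Action::ThrowStackOverflowError;
        }
        return Action::ResumeExecution;
    }
    g.red_enabled = false;
    return Action::CrashVM;
}
```

The point to notice is the ResumeExecution branch: it leaves
yellow_enabled set to false and nothing in this function restores it.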
So, the signal handler disables the protection of the Yellow Pages and
resumes JVM code execution.

Eventually, the thread will return from the VM and will continue
executing Java code. But at this point, the Yellow Pages are still
disabled and there's no systematic check to ensure that Yellow Pages are
re-enabled when returning to Java. The only places where the JVM checks
whether Yellow Pages need to be re-activated are the return from native
code and the exception propagation code (but not all paths reactivate
the Yellow Zone).

Once the execution of Java code has resumed with the Yellow Zone
disabled, the thread is not protected any more against stack overflows.
The only remaining protection is the Red Zone, and if it is hit, the VM
will generate a crash report and die. Note that having the Yellow Zone
de-activated makes the stack banging of StackShadowPages ineffective.
Stack banging relies on the Yellow Pages being activated, so that
touching them triggers a signal. If Yellow Pages are de-activated
(unprotected), no signal is sent unless the stack banging hits the Red
Page, which triggers a VM crash with hs_error file generation.
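
The effect of a de-activated Yellow Zone on stack banging can be
illustrated with a small C++ model. It is a sketch under assumptions:
pages are numbered from the end of the stack (page 0 is the last one),
the Red Zone occupies the lowest page indices with the Yellow Zone right
above it, and Guards, BangResult and bang are hypothetical names.

```cpp
#include <cstddef>

enum class BangResult { NoFault, YellowFault, RedFault };

// Hypothetical guard layout: pages [0, red_pages) are red, pages
// [red_pages, red_pages + yellow_pages) are yellow.
struct Guards {
    std::size_t red_pages;
    std::size_t yellow_pages;
    bool yellow_enabled;
};

// Touches the shadow_pages pages below the page holding the stack
// pointer, nearest page first, and reports the first fault encountered.
BangResult bang(const Guards& g, std::size_t sp_page,
                std::size_t shadow_pages) {
    for (std::size_t i = 1; i <= shadow_pages && i <= sp_page; ++i) {
        std::size_t page = sp_page - i;
        if (page < g.red_pages) {
            return BangResult::RedFault;     // fatal: VM crash
        }
        if (g.yellow_enabled && page < g.red_pages + g.yellow_pages) {
            return BangResult::YellowFault;  // recoverable: StackOverflowError
        }
    }
    return BangResult::NoFault;
}
```

With yellow_enabled set to false, the same call that would have produced
a recoverable YellowFault either reports nothing or, closer to the end
of the stack, goes straight to RedFault.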

To summarize: an undersized StackShadowPages on Windows can lead to a
JavaThread executing Java code with Yellow Pages disabled, which means
without any stack overflow protection except the Red Zone, which is the
one triggering VM crashes with hs_error file generation.

Note that the Yellow Pages can be "incidentally" re-activated by a call
to native code or by throwing an exception. This could explain why
stack overflow crashes are not so frequent: the time window during which
Java code is executed without stack overflow protection might be small
for some applications.

Proposed fixes for this issue:
   - increase StackShadowPages for the Windows platform
   - add an assertion in the signal handler to detect threads hitting the
Yellow Zone while executing JVM code (to detect undersized
StackShadowPages during our testing)
   - ensure Yellow Pages are activated when transitioning from
_thread_in_vm to _thread_in_java
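
As an illustration of the last item only, here is a minimal C++ model of
such a transition. It is a sketch and not the code in the webrev:
ThreadModel and transition_from_vm_to_java are hypothetical names, and
the conditions the real code has to check before re-protecting the pages
are left out.

```cpp
#include <cassert>

// Hypothetical model of the thread states involved in the transition.
enum class ThreadState { InJava, InVM, InNative };

// Hypothetical per-thread state: current thread state and whether the
// Yellow Zone is currently protected.
struct ThreadModel {
    ThreadState state;
    bool yellow_enabled;
};

// Models the proposed behavior: when a thread leaves VM code to go back
// to Java code, make sure the Yellow Zone is protected again so that
// stack banging can detect overflows.
void transition_from_vm_to_java(ThreadModel& t) {
    assert(t.state == ThreadState::InVM);
    if (!t.yellow_enabled) {
        t.yellow_enabled = true;  // assumption: re-protection always succeeds
    }
    t.state = ThreadState::InJava;
}
```

With a check like this on every return to Java, the window during which
Java code runs without Yellow Zone protection no longer depends on an
incidental native call or exception.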