RFR(S): JDK-8137035 tests got EXCEPTION_STACK_OVERFLOW on Windows 64 bit

Mon Aug 29 13:25:40 UTC 2016

Hi Fred,

This is the clearest writeup of stack overflow handling that I've seen 
so far, and your fix seems good.  I'm a bit worried about the assertion in

http://cr.openjdk.java.net/~fparain/8137035/webrev.00/src/os/windows/vm/os_windows.cpp.udiff.html

But I think it would be better to hit this in our testing rather than 
see spurious stack overflow exceptions.

http://cr.openjdk.java.net/~fparain/8137035/webrev.00/src/cpu/x86/vm/globals_x86.hpp.udiff.html

Should the Windows 32-bit stack shadow pages be increased also?  Have we 
seen these crashes on 32-bit also?

http://cr.openjdk.java.net/~fparain/8137035/webrev.00/src/share/vm/runtime/interfaceSupport.hpp.udiff.html

Lastly, I really prefer this code here rather than buried in the stack 
overflow throwing logic in the interpreter and compiler.   As we talked 
about, I think this code should be cleaned up.  I don't know why it was 
done this way other than that's where it was historically and we were 
afraid to change it without the extensive analysis that you've done.   
Maybe someone from the compiler team remembers?  (cc'ed)

Thanks,
Coleen

On 8/26/16 4:00 PM, Frederic Parain wrote:
> Hi,
>
> Please review this fix for bug JDK-8137035
> The bug is confidential but it is related to several VM crashes
> that occurred on the Windows 64-bit platform in stack overflow
> conditions. I've copied/pasted the analysis of the bug and the
> description of the fix below.
>
> Webrev:
> http://cr.openjdk.java.net/~fparain/8137035/webrev.00/
>
> Testing: JPRT (testset hotspot) and nsk.stress
>
> Thanks,
>
> Fred
>
> ---------
>
> All these crashes related to stack overflows on Windows 
> presumably have the same causes:
>     - an undersized StackShadowPages parameter
>     - the behavior of guard pages on Windows
>     - a flaw in Yellow Pages management
>
> These three factors combined together can lead to sporadic crashes of 
> the JVM when stack overflow conditions are encountered.
>
> All the crashes listed in this CR and in the related CR are almost 
> impossible to reproduce, which indicates that the issue only shows up 
> in some extreme or uncommon conditions. By design, the JVM crashes on 
> stack overflow only if the Red Zone (the last one in the execution 
> stack) is hit. Before the Red Zone, there's the Yellow Zone which is 
> here to detect and handle stack overflows in a nicer way (throwing a 
> StackOverflowError instead of crashing the process). If the Red Zone 
> is hit, it means that the Yellow Zone was disabled, and there are 
> only two cases where the Yellow Zone is disabled:
>
>   1 - when a potential stack overflow is detected in Java code; in 
> this case the Yellow Zone is disabled during the generation of the 
> StackOverflowError and restored during the propagation of the 
> StackOverflowError
>   2 - when a stack overflow occurs either in native code or in JVM 
> code, because there's nothing else the JVM can do.
>
> In several crashes, the call stack doesn't show any special recursive 
> Java calls that could suggest the JVM is in case 1. But they show 
> relatively complex code paths inside JVM code (de-optimization or 
> class/symbol resolution), which suggests that case 2 occurred.
>
> The case of stack overflow in native code is straightforward: if the 
> Yellow Zone is hit, it is disabled, but when a JavaThread returns from 
> native code to Java code, the Yellow Zone is systematically re-enabled 
> (this is part of the native call wrapper generated by the JVM).
>
> The case of stack overflow in JVM code is more problematic. The JVM 
> tries to avoid the case of stack overflow in VM code with the Shadow 
> Pages mechanism. Whenever a Java method is invoked, the JVM tries to 
> ensure that there's enough free stack space to execute the Java method 
> and *any call to the JVM code (or JDK native code) that could occur 
> during the execution of this method*. This check is performed by 
> banging (touching) n pages ahead on the execution stack, and n is set 
> to StackShadowPages. If the Yellow Zone is hit during the stack 
> banging, a StackOverflowError is thrown before the execution of the 
> first bytecode of the Java method. But this mechanism assumes that 
> StackShadowPages pages is big enough to cover *any call to the JVM*. 
> If this assumption is wrong, bad things happen.
>
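A toy model of the banging check described above may help. This is only a sketch, not HotSpot code: StackModel and bang_hits_yellow_zone are hypothetical names, addresses are plain integers, and the stack is assumed to grow toward lower addresses.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the protected part of one thread's stack.
struct StackModel {
    std::uint64_t page_size;      // bytes per page
    std::uint64_t yellow_bottom;  // lowest address of the yellow zone (inclusive)
    std::uint64_t yellow_top;     // first address above the yellow zone (exclusive)
};

// Touch one address in each of the next `shadow_pages` pages below `sp`
// and report whether any touch lands inside the yellow zone. In the real
// VM such a touch raises a fault that becomes a StackOverflowError; here
// it is reduced to a boolean.
bool bang_hits_yellow_zone(const StackModel& s, std::uint64_t sp,
                           unsigned shadow_pages) {
    assert(s.page_size > 0);
    for (unsigned i = 1; i <= shadow_pages; ++i) {
        const std::uint64_t offset = static_cast<std::uint64_t>(i) * s.page_size;
        if (offset > sp) {
            return true;  // ran off the modeled address space: treat as overflow
        }
        const std::uint64_t touched = sp - offset;
        if (touched >= s.yellow_bottom && touched < s.yellow_top) {
            return true;
        }
    }
    return false;
}
```

In this model, with sp three pages above the top of the yellow zone, banging 3 pages passes the check while banging 4 pages reports an overflow. Any VM path that then uses more stack than the banged pages would run into the yellow zone while in VM code, which is the situation analyzed in the rest of this mail.
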
> I ran experiments with tests for which stack overflow related crashes 
> were reported. I ran them with a JVM where the StackShadowPages value 
> was decreased by only 1 compared to the usual default value. It was very 
> easy to reproduce stack overflow crashes. By instrumenting the JVM, it 
> appeared that some threads hit the Yellow Zone while having thread 
> state _thread_in_vm, which means that in many cases the margin 
> between the stack space provided by StackShadowPages and the real 
> stack usage while executing VM code is less than one page. And because 
> knowing the biggest stack requirement to execute any JVM code is an 
> undecidable problem, there's a high probability that some paths 
> require more stack space than StackShadowPages ensures. It is 
> important to notice that Windows is the platform with the smallest 
> default value for StackShadowPages.
>
> So, an undersized StackShadowPages could cause the Yellow Zone to be 
> hit while executing JVM code. On Unices (Solaris, Linux, MacOSX), the 
> sanction is immediate: a SIGSEGV signal is sent, but because there's 
> no more free space on the execution stack, the signal handler cannot 
> be executed and the JVM process is killed. It's a crash without 
> hs_error file generation.
>
> On Windows, the story is different. Yellow Pages are marked with the 
> "Guard" bit. When a page with the Guard bit set is touched, the current 
> thread receives an exception, but before the exception handler is 
> executed, the OS removes the Guard bit from the page, so the page that 
> triggered the fault can be used to execute the signal handler. So on 
> Windows, when the Yellow Zone is hit while executing JVM code, the JVM 
> doesn't die as it does on Unix systems; instead, the signal handler is 
> executed.
>
> The logic in the signal handler looks like this (simplified version):
>
>    if thread touches the yellow zone:
>       if thread_in_java:
>           disable yellow pages
>           jump to code throwing StackOverflowError
>           // note: yellow pages will be re-enabled
>           // while unwinding the stack
>       else:
>           // thread_in_vm or thread_in_native
>           disable yellow pages
>           resume execution
>    else:
>       // Fatal red zone violation.
>       disable red pages
>       generate VM crash
>
> So, the signal handler disables the protection of the Yellow Pages and 
> resumes JVM code execution.
>
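The simplified handler logic above can be restated as a small state model. This is a sketch for discussion only: ThreadModel, Zone, Action and handle_guard_fault are hypothetical names, and the real handler in os_windows.cpp does considerably more.

```cpp
#include <cassert>

enum class ThreadState { InJava, InVM, InNative };
enum class Zone { Yellow, Red };
enum class Action { ThrowStackOverflowError, ResumeExecution, CrashVM };

// Hypothetical per-thread view of the guard zones.
struct ThreadModel {
    ThreadState state;
    bool yellow_enabled;
    bool red_enabled;
};

// Mirrors the pseudocode: a yellow zone hit always disables the yellow
// pages; only a thread in Java code is redirected to the code throwing
// StackOverflowError, while VM and native code simply resume.
Action handle_guard_fault(ThreadModel& t, Zone hit) {
    if (hit == Zone::Yellow) {
        assert(t.yellow_enabled);  // a disabled zone cannot fault
        t.yellow_enabled = false;
        if (t.state == ThreadState::InJava) {
            // The yellow pages get re-enabled later, during stack unwinding.
            return Action::ThrowStackOverflowError;
        }
        return Action::ResumeExecution;
    }
    // Fatal red zone violation.
    t.red_enabled = false;
    return Action::CrashVM;
}
```

The case that matters for this bug is a thread in the InVM state: the model returns ResumeExecution and leaves yellow_enabled false, with nothing in the handler itself scheduling a re-enable.
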
> Eventually, the thread will return from the VM and will continue 
> executing Java code.  But at this point, the yellow pages are still 
> disabled and there's no systematic check to ensure that Yellow Pages 
> are re-enabled when returning to Java. The only places where the JVM 
> checks if Yellow Pages need to be re-activated are when returning from 
> native code and in the exception propagation code (but not all paths 
> reactivate the Yellow Zone).
>
> Once the execution of Java code has resumed with the yellow zone 
> disabled, the thread is not protected any more against stack 
> overflows. The only remaining protection is the red zone, and if it is 
> hit, the VM will generate a crash report and die. Note that having 
> the Yellow Zone de-activated makes the stack banging of StackShadowPages 
> ineffective. Stack banging relies on the Yellow Pages being activated, 
> so that touching them triggers a signal. If Yellow Pages are de-activated 
> (unprotected), no signal is sent unless the stack banging hits the Red 
> Page, which triggers a VM crash with hs_error file generation.
>
>
> To summarize: an undersized StackShadowPages on Windows can lead to a 
> JavaThread executing Java code with Yellow Pages disabled, which means 
> without any stack overflow protection except the Red Zone, which is the 
> one triggering VM crashes with hs_error file generation.
>
> Note that the Yellow Pages can be "incidentally" re-activated by a 
> call to native code or by throwing an exception, which could explain 
> why stack overflow crashes are not so frequent: the time window during 
> which Java code is executed without stack overflow protection might be 
> small for some applications.
>
>
> Proposed fixes for this issue:
>   - increase StackShadowPages for the Windows platform
>   - add an assertion in the signal handler to detect threads hitting the 
> Yellow Zone while executing JVM code (to detect an undersized 
> StackShadowPages during our testing)
>   - ensure Yellow Pages are activated when transitioning from 
> _thread_in_vm to _thread_in_java
>
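
The third proposed fix can be pictured with the following sketch. The names (JavaThreadModel, transition_vm_to_java) are hypothetical and this is not the code in the webrev; it only illustrates the invariant that a thread should never enter the in-Java state with its yellow zone disabled.

```cpp
#include <cassert>

// Hypothetical, minimal view of a Java thread for this illustration.
struct JavaThreadModel {
    bool in_vm;                // true while executing VM code
    bool yellow_zone_enabled;  // false after the handler disabled the zone
};

// Model of the _thread_in_vm -> _thread_in_java transition with the
// proposed check: re-arm the yellow zone before Java code resumes.
void transition_vm_to_java(JavaThreadModel& t) {
    assert(t.in_vm);  // only valid when leaving VM code
    if (!t.yellow_zone_enabled) {
        // Assumption: re-protecting the pages is safe at this point
        // because the stack has unwound back out of the VM frames.
        t.yellow_zone_enabled = true;
    }
    t.in_vm = false;
}
```

With such a check in place, a yellow zone hit in VM code would no longer leave the following Java code running with only the Red Zone as protection.
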