RFR(S): JDK-8137035 tests got EXCEPTION_STACK_OVERFLOW on Windows 64 bit

Fri Sep 2 15:14:51 UTC 2016

Hi Dan,

Thank you for the review.
Answers in-lined below.

On 09/02/2016 01:56 AM, Daniel D. Daugherty wrote:
>> Updated webrev:
>> http://cr.openjdk.java.net/~fparain/8137035/webrev.01/index.html
>
> src/cpu/x86/vm/globals_x86.hpp
>     No comments.
>
> src/os/windows/vm/os_windows.cpp
>     No comments.
>
> src/share/vm/runtime/interfaceSupport.hpp
>     L311:   ~ThreadInVMfromJavaNoAsyncException()  {
>         Does this destructor also need to potentially reenable
>         the yellow zone?

Good catch, I've fixed that:
http://cr.openjdk.java.net/~fparain/8137035/webrev.02/

> Thumbs up!
>
> Just curious, if you reduce the shadow page count by one (from
> the current value), you said that you see more yellow zone hits.
> Does that mean, this assertion fires:
>
> +        assert(thread->thread_state() != _thread_in_vm, "Undersized StackShadowPages");

Yes, I used a fatal rather than an assert because I did the experiment
with both debug and product builds, but I turned it into an assert
for the final fix.

Thank you,

Fred

> Outstanding hunt for this elusive bug. Wonderful write up!
>
> Dan
>
>
> On 8/29/16 8:37 AM, Frederic Parain wrote:
>> Hi David,
>>
>> Thank you for the review.
>>
>> A few comments in-lined below.
>>
>> On 08/28/2016 09:36 PM, David Holmes wrote:
>>> Hi Fred,
>>>
>>> On 27/08/2016 6:00 AM, Frederic Parain wrote:
>>>> Hi,
>>>>
>>>> Please review this fix for bug JDK-8137035.
>>>> The bug is confidential but it is related to several VM crashes
>>>> that occurred on the 64-bit Windows platform under stack overflow
>>>> conditions. I've copied/pasted the analysis of the bug and the
>>>> description of the fix below.
>>>
>>> The analysis and solution all seem reasonable. Though I do have to
>>> wonder how the failure to reenable the yellow zone when returning to
>>> Java would not cause far more problems, on all platforms.
>>
>> Running with Yellow Pages disabled clearly opens the door to random
>> crashes. Making the mechanism simpler and more robust would benefit
>> all platforms.
>>
>>>>
>>>> Webrev:
>>>> http://cr.openjdk.java.net/~fparain/8137035/webrev.00/
>>>
>>> src/os/windows/vm/os_windows.cpp
>>>
>>> While examining the thread state logic in the exception handler I
>>> noticed some pre-existing bugs:
>>>
>>> 2506   if (exception_code == EXCEPTION_ACCESS_VIOLATION) {
>>> 2507     JavaThread* thread = (JavaThread*) t;
>>>
>>> there is no check that t is in fact a JavaThread, or even that t is
>>> non-NULL. Such checks occur slightly later:
>>
>> I've investigated this issue, and it is currently harmless.
>> The cast pointer is only used to call a method requiring
>> a JavaThread* pointer, and the only use of its argument is
>> a NULL check. Unfortunately, fixing this issue would require
>> modifying the prototype of os::is_memory_serialize_page()
>> and propagating the change across all platforms using it.
>> It's a wider scope fix than JDK-8137035.
>>
>> I've added a comment on the unsafe cast in the os_windows.cpp file,
>> highlighting the fact that it is unsafe, and explaining why it
>> is currently harmless.
>>
>>>
>>> 2523   if (t != NULL && t->is_Java_thread()) {
>>> 2524     JavaThread* thread = (JavaThread*) t;
>>>
>>> This bug seems significant:
>>>
>>> 2566       if (thread->stack_guards_enabled()) {
>>> 2567         if (_thread_in_Java) {
>>>
>>> _thread_in_Java is an enum value not a variable so we will always
>>> execute this block! This code should be testing the local in_java
>>> variable.
>>
>> Good catch! Fixed.
>>
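To make the enum-versus-variable mistake above concrete, here is a minimal standalone sketch. It is not the os_windows.cpp code: the enum is a hypothetical stand-in, its numeric values are illustrative, and the two function names are invented for this example. The only point is that testing a non-zero enumerator is always true, while testing a local flag computed from the thread state is not.

```cpp
#include <cassert>

// Hypothetical stand-in for the thread-state enum. The numeric values are
// illustrative; what matters is that _thread_in_Java is non-zero.
enum JavaThreadState {
  _thread_in_native = 4,
  _thread_in_vm     = 6,
  _thread_in_Java   = 8
};

// Buggy form: the condition tests the enumerator itself, a non-zero
// constant, so the branch is taken whatever the thread's actual state is.
bool takes_java_path_buggy(JavaThreadState /*state*/) {
  if (_thread_in_Java) {
    return true;
  }
  return false;
}

// Corrected form: derive a local flag from the actual state and test that.
bool takes_java_path_fixed(JavaThreadState state) {
  bool in_java = (state == _thread_in_Java);
  return in_java;
}
```

With a thread in `_thread_in_vm`, the buggy form still takes the Java path, while the corrected form does not.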
>> Updated webrev:
>> http://cr.openjdk.java.net/~fparain/8137035/webrev.01/index.html
>>
>> Thank you,
>>
>> Fred
>>
>>> Your changes seem fine in themselves.
>>>
>>> Thanks,
>>> David
>>>
>>>
>>>> Testing: JPRT (testset hotspot) and nsk.stress
>>>>
>>>> Thanks,
>>>>
>>>> Fred
>>>>
>>>> ---------
>>>>
>>>> All these crashes related to stack overflows on Windows have presumably
>>>> the same causes:
>>>>     - an undersized StackShadowPages parameter
>>>>     - the behavior of guard pages on Windows
>>>>     - a flaw in Yellow Pages management
>>>>
>>>> These three factors combined together can lead to sporadic crashes of
>>>> the JVM when stack overflow conditions are encountered.
>>>>
>>>> All the crashes listed in this CR and in the related CR are almost
>>>> impossible to reproduce, which indicates that the issue only shows
>>>> up in some extreme or uncommon conditions. By design, the JVM crashes
>>>> on stack overflow only if the Red Zone (the last one in the execution
>>>> stack) is hit. Before the Red Zone, there's the Yellow Zone, which is
>>>> there to detect and handle stack overflows in a nicer way (throwing a
>>>> StackOverflowError instead of crashing the process). If the Red Zone
>>>> is hit, it means that the Yellow Zone was disabled, and there are
>>>> only two cases where the Yellow Zone is disabled:
>>>>
>>>>   1 - when a potential stack overflow is detected in Java code; in this
>>>> case the Yellow Zone is disabled during the generation of the
>>>> StackOverflowError and restored during the propagation of the
>>>> StackOverflowError
>>>>   2 - when a stack overflow occurs either in native code or in JVM
>>>> code, because there's nothing else the JVM can do.
>>>>
>>>> In several crashes, the call stack doesn't show any special recursive
>>>> Java calls that could suggest the JVM is in case 1. But they show
>>>> relatively complex code paths inside JVM code (de-optimization or
>>>> class/symbol resolution), which suggests that case 2 occurred.
>>>>
>>>> The case of stack overflow in native code is straightforward: if the
>>>> Yellow Zone is hit, it is disabled, but when a JavaThread returns from
>>>> native code to Java code, the Yellow Zone is systematically re-enabled
>>>> (this is part of the native call wrapper generated by the JVM).
>>>>
>>>> The case of stack overflow in JVM code is more problematic. The JVM
>>>> tries to avoid the case of stack overflow in VM code with the Shadow
>>>> Pages mechanism. Whenever a Java method is invoked, the JVM tries to
>>>> ensure that there's enough free stack space to execute the Java method
>>>> and *any call to the JVM code (or JDK native code) that could occur
>>>> during the execution of this method*. This check is performed by
>>>> banging (touching) n pages ahead on the execution stack, where n is
>>>> set to StackShadowPages. If the Yellow Zone is hit during the stack
>>>> banging, a StackOverflowError is thrown before the execution of the
>>>> first bytecode of the Java method. But this mechanism assumes that
>>>> StackShadowPages is big enough to cover *any call to the JVM*. If
>>>> this assumption is wrong, bad things happen.
>>>>
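The shadow page check described above can be pictured with a toy model. This is only a sketch under simplifying assumptions: the stack is a row of numbered pages, nothing is really touched, and every name (StackModel, bang_hits_yellow_zone, vm_call_hits_yellow_zone) is invented for illustration and does not exist in HotSpot. It shows why a check that passes with an undersized StackShadowPages does not prevent a deeper VM call from reaching the Yellow Zone.

```cpp
#include <cassert>
#include <cstddef>

// Toy model: page 0 is the deepest page of the stack. Pages
// [0, red_pages) form the Red Zone, the next yellow_pages pages form the
// Yellow Zone, and everything above them is ordinary usable stack.
struct StackModel {
  std::size_t red_pages;
  std::size_t yellow_pages;
  std::size_t current_page;  // page holding the current stack pointer
};

// True if the given page lies inside the Yellow Zone.
bool in_yellow_zone(const StackModel& s, std::size_t page) {
  return page >= s.red_pages && page < s.red_pages + s.yellow_pages;
}

// Models banging shadow_pages pages below the current one. Returns true
// if any touched page is in the Yellow Zone, which in the scheme above
// means a StackOverflowError is raised before the method's first bytecode.
bool bang_hits_yellow_zone(const StackModel& s, std::size_t shadow_pages) {
  for (std::size_t i = 1; i <= shadow_pages; ++i) {
    if (i > s.current_page) break;  // ran past the bottom of the model
    if (in_yellow_zone(s, s.current_page - i)) return true;
  }
  return false;
}

// Models a VM call that really consumes vm_pages_needed pages below the
// current one. It touches pages exactly as banging does, so it reuses the
// same walk; the separate name only keeps the two roles distinct.
bool vm_call_hits_yellow_zone(const StackModel& s, std::size_t vm_pages_needed) {
  return bang_hits_yellow_zone(s, vm_pages_needed);
}
```

With 1 red page, 2 yellow pages and the stack pointer on page 10, banging 6 pages passes, yet a VM call that needs 8 pages lands in the Yellow Zone; banging 8 pages would have caught it at method entry.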
>>>> I ran experiments with tests for which stack overflow related crashes
>>>> were reported. I ran them with a JVM where the StackShadowPages value
>>>> was decreased by only 1 compared to the usual default value. It was
>>>> very easy to reproduce stack overflow crashes. Instrumenting the JVM
>>>> showed that some threads hit the Yellow Zone while having thread
>>>> state _thread_in_vm, which means that in many cases, the margin
>>>> between the stack space provided by StackShadowPages and the real
>>>> stack usage while executing VM code is less than one page. And because
>>>> knowing the biggest stack requirement to execute any JVM code is an
>>>> undecidable problem, there's a high probability that some paths
>>>> require more stack space than StackShadowPages ensures. It is
>>>> important to note that Windows is the platform with the smallest
>>>> default value for StackShadowPages.
>>>>
>>>> So, an undersized StackShadowPages could cause the Yellow Zone to be
>>>> hit while executing JVM code. On Unices (Solaris, Linux, MacOSX), the
>>>> sanction is immediate: a SIGSEGV signal is sent, but because there's no
>>>> more free space on the execution stack, the signal handler cannot be
>>>> executed and the JVM process is killed. It's a crash without hs_error
>>>> file generation.
>>>>
>>>> On Windows, the story is different. Yellow Pages are marked with the
>>>> "Guard" bit. When a page with the Guard bit set is touched, the current
>>>> thread receives an exception, but before the exception handler is
>>>> executed, the OS removes the Guard bit from the page, so the page that
>>>> triggered the fault can be used to execute the signal handler. So on
>>>> Windows, when the Yellow Zone is hit while executing JVM code, the JVM
>>>> doesn't die as it does on Unix systems; the signal handler is executed.
>>>>
>>>> The logic in the signal handler looks like this (simplified version):
>>>>
>>>>    if thread touches the yellow zone:
>>>>       if thread_in_java:
>>>>           disable yellow pages
>>>>           jump to code throwing StackOverflowError
>>>>           // note: yellow pages will be re-enabled
>>>>           // while unwinding the stack
>>>>       else:
>>>>           // thread_in_vm or thread_in_native
>>>>           disable yellow pages
>>>>           resume execution
>>>>    else:
>>>>        // Fatal red zone violation.
>>>>        disable red pages
>>>>        generate VM crash
>>>>
>>>> So, the signal handler disables the protection of the Yellow Pages and
>>>> resumes JVM code execution.
>>>>
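The simplified handler logic above can be written as a small state model. This is a sketch, not the handler in os_windows.cpp: ThreadModel, FaultZone, Outcome and handle_guard_fault are names made up for this illustration, and the zones are reduced to two booleans. It highlights the problematic branch, where a thread in VM or native code resumes with the Yellow Zone left disabled.

```cpp
#include <cassert>

// Hypothetical, reduced model of a thread and its guard zones.
enum class ThreadState { InJava, InVM, InNative };
enum class FaultZone { Yellow, Red };
enum class Outcome { ThrowStackOverflowError, ResumeExecution, VMCrash };

struct ThreadModel {
  ThreadState state;
  bool yellow_enabled;
  bool red_enabled;
};

// Mirrors the simplified pseudocode: what the handler does when a guard
// zone is touched, and what state it leaves the thread in.
Outcome handle_guard_fault(ThreadModel& t, FaultZone zone) {
  if (zone == FaultZone::Yellow) {
    // Both yellow-zone branches start by disabling the yellow pages.
    t.yellow_enabled = false;
    if (t.state == ThreadState::InJava) {
      // The zone is expected to be re-enabled later, while the stack
      // unwinds during StackOverflowError propagation.
      return Outcome::ThrowStackOverflowError;
    }
    // In VM or native code: execution simply resumes, and nothing on
    // this path re-enables the zone.
    return Outcome::ResumeExecution;
  }
  // Red zone violation: fatal.
  t.red_enabled = false;
  return Outcome::VMCrash;
}
```

For a thread in VM state, the handler returns ResumeExecution and leaves yellow_enabled false, which is the state the rest of the analysis is concerned with.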
>>>> Eventually, the thread will return from the VM and will continue
>>>> executing Java code. But at this point, the Yellow Pages are still
>>>> disabled and there's no systematic check to ensure that Yellow Pages
>>>> are re-enabled when returning to Java. The only places where the JVM
>>>> checks whether Yellow Pages need to be re-activated are when returning
>>>> from native code and in the exception propagation code (but not all
>>>> paths reactivate the Yellow Zone).
>>>>
>>>> Once the execution of Java code has resumed with the yellow zone
>>>> disabled, the thread is not protected any more against stack overflows.
>>>> The only remaining protection is the red zone, and if it is hit, the VM
>>>> will generate a crash report and die. Note that having the Yellow Zone
>>>> de-activated makes the stack banging of StackShadowPages ineffective.
>>>> Stack banging relies on the Yellow Pages being activated, so that
>>>> touching them triggers a signal. If Yellow Pages are de-activated
>>>> (unprotected), no signal is sent unless the stack banging hits the Red
>>>> Page, which triggers a VM crash with hs_error file generation.
>>>>
>>>>
>>>> To summarize: an undersized StackShadowPages on Windows can lead to a
>>>> JavaThread executing Java code with Yellow Pages disabled, which means
>>>> without any stack overflow protection except the Red Zone which is the
>>>> one triggering VM crashes with hs_error file generation.
>>>>
>>>> Note that the Yellow Pages can be "incidentally" re-activated by a call
>>>> to native code or by throwing an exception. This could explain why
>>>> stack overflow crashes are not so frequent: the time window during
>>>> which Java code is executed without stack overflow protection might be
>>>> small for some applications.
>>>>
>>>>
>>>> Proposed fixes for this issue:
>>>>   - increase StackShadowPages for the Windows platform
>>>>   - add an assertion in the signal handler to detect a thread hitting
>>>> the Yellow Zone while executing JVM code (to detect undersized
>>>> StackShadowPages during our testing)
>>>>   - ensure Yellow Pages are activated when transitioning from
>>>> _thread_in_vm to _thread_in_Java
>>>>
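The third proposed fix, re-activating the Yellow Pages on the transition back to Java, can be sketched as a scope guard in the spirit of the transition-class destructors in interfaceSupport.hpp. This is an assumption-laden illustration and not the code in the webrev: ThreadModel, VMEntryScope and yellow_enabled_after_vm_call are hypothetical names, and the guard zone is reduced to a boolean.

```cpp
#include <cassert>

// Hypothetical, reduced model of a thread.
enum class ThreadState { InJava, InVM };

struct ThreadModel {
  ThreadState state = ThreadState::InJava;
  bool yellow_enabled = true;
};

// Scope guard modelling a Java -> VM -> Java transition. The destructor
// runs on every exit path from the VM call, so the re-enable check cannot
// be skipped by an early return.
class VMEntryScope {
 public:
  explicit VMEntryScope(ThreadModel& t) : thread_(t) {
    thread_.state = ThreadState::InVM;
  }
  ~VMEntryScope() {
    // Restore the protection before Java code runs again.
    if (!thread_.yellow_enabled) {
      thread_.yellow_enabled = true;
    }
    thread_.state = ThreadState::InJava;
  }
  VMEntryScope(const VMEntryScope&) = delete;
  VMEntryScope& operator=(const VMEntryScope&) = delete;

 private:
  ThreadModel& thread_;
};

// Simulates a VM call during which the handler disabled the Yellow Zone,
// then reports whether the zone is enabled once the thread is back in Java.
bool yellow_enabled_after_vm_call(ThreadModel& t) {
  {
    VMEntryScope scope(t);
    t.yellow_enabled = false;  // what the handler does on a yellow-zone hit
  }
  return t.yellow_enabled;
}
```

Putting the check in a destructor is one way to make the re-enable systematic; whether every transition class needs it is exactly the question raised about ~ThreadInVMfromJavaNoAsyncException().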
>