RFR(S): JDK-8137035 tests got EXCEPTION_STACK_OVERFLOW on Windows 64 bit

Daniel D. Daugherty daniel.daugherty at oracle.com
Fri Sep 2 05:56:53 UTC 2016


 > Updated webrev:
 > http://cr.openjdk.java.net/~fparain/8137035/webrev.01/index.html

src/cpu/x86/vm/globals_x86.hpp
     No comments.

src/os/windows/vm/os_windows.cpp
     No comments.

src/share/vm/runtime/interfaceSupport.hpp
     L311:   ~ThreadInVMfromJavaNoAsyncException()  {
         Does this destructor also potentially need to reenable
         the yellow zone?

Thumbs up!

Just curious, if you reduce the shadow page count by one (from
the current value), you said that you see more yellow zone hits.
Does that mean this assertion fires:

+        assert(thread->thread_state() != _thread_in_vm, "Undersized StackShadowPages");

Outstanding hunt for this elusive bug. Wonderful write up!

Dan


On 8/29/16 8:37 AM, Frederic Parain wrote:
> Hi David,
>
> Thank you for the review.
>
> A few comments in-lined below.
>
> On 08/28/2016 09:36 PM, David Holmes wrote:
>> Hi Fred,
>>
>> On 27/08/2016 6:00 AM, Frederic Parain wrote:
>>> Hi,
>>>
>>> Please review this fix for bug JDK-8137035.
>>> The bug is confidential but it is related to several VM crashes
>>> that occurred on the 64-bit Windows platform in stack overflow
>>> conditions. I've copied/pasted the analysis of the bug and the
>>> description of the fix below.
>>
>> The analysis and solution all seem reasonable. Though I do have to
>> wonder how the failure to reenable the yellow zone when returning to
>> Java would not cause far more problems, on all platforms.
>
> Running with Yellow Pages disabled clearly opens the door to random
> crashes. Making the mechanism simpler and more robust would benefit
> all platforms.
>
>>>
>>> Webrev:
>>> http://cr.openjdk.java.net/~fparain/8137035/webrev.00/
>>
>> src/os/windows/vm/os_windows.cpp
>>
>> While examining the thread state logic in the exception handler I
>> noticed some pre-existing bugs:
>>
>> 2506   if (exception_code == EXCEPTION_ACCESS_VIOLATION) {
>> 2507     JavaThread* thread = (JavaThread*) t;
>>
>> there is no check that t is in fact a JavaThread, or even that t is
>> non-NULL. Such checks occur slightly later:
>
> I've investigated this issue, and it is currently harmless.
> The cast pointer is only used to call a method requiring
> a JavaThread* pointer, and the only use of its argument is
> a NULL check. Unfortunately, fixing this issue would require
> modifying the prototype of os::is_memory_serialize_page()
> and propagating the change across all platforms using it.
> It's a wider scope fix than JDK-8137035.
>
> I've added a comment on the unsafe cast in the os_windows.cpp file,
> highlighting the fact that it is unsafe, and explaining why it
> is currently harmless.
>
>>
>> 2523   if (t != NULL && t->is_Java_thread()) {
>> 2524     JavaThread* thread = (JavaThread*) t;
>>
>> This bug seems significant:
>>
>> 2566       if (thread->stack_guards_enabled()) {
>> 2567         if (_thread_in_Java) {
>>
>> _thread_in_Java is an enum value, not a variable, so we will always
>> execute this block! This code should be testing the local in_java
>> variable.
>
> Good catch! Fixed.
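
The enum-constant pitfall discussed above can be reproduced in isolation. The following is a minimal sketch, not the os_windows.cpp code: the enum, its values, and both function names are hypothetical stand-ins (leading underscores are dropped to avoid reserved identifiers). It only assumes that the constant being tested is non-zero.

```cpp
#include <cassert>

// Hypothetical stand-in for the thread state enum; the values are
// illustrative. What matters is that the tested constant is non-zero.
enum ThreadState {
  thread_in_native = 4,
  thread_in_vm     = 6,
  thread_in_Java   = 8
};

// Buggy form: the condition tests the enum constant itself, so the
// branch is taken whatever state the thread is really in.
bool takes_java_path_buggy(ThreadState current) {
  (void)current;           // the actual state is never consulted
  if (thread_in_Java) {    // constant, non-zero: always true
    return true;
  }
  return false;
}

// Fixed form: derive a local flag from the actual state and test that.
bool takes_java_path_fixed(ThreadState current) {
  bool in_java = (current == thread_in_Java);
  return in_java;
}
```

With a thread in the VM state, the buggy form still reports the Java path while the fixed form does not. Many compilers can flag the constant condition when warnings are enabled.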
>
> Updated webrev:
> http://cr.openjdk.java.net/~fparain/8137035/webrev.01/index.html
>
> Thank you,
>
> Fred
>
>> Your changes seem fine in themselves.
>>
>> Thanks,
>> David
>>
>>
>>> Testing: JPRT (testset hotspot) and nsk.stress
>>>
>>> Thanks,
>>>
>>> Fred
>>>
>>> ---------
>>>
>>> All these crashes related to stack overflows on Windows have presumably
>>> the same causes:
>>>     - an undersized StackShadowPages parameter
>>>     - the behavior of guard pages on Windows
>>>     - a flaw in Yellow Pages management
>>>
>>> These three factors combined together can lead to sporadic crashes of
>>> the JVM when stack overflow conditions are encountered.
>>>
>>> All the crashes listed in this CR and in the related CR are almost
>>> impossible to reproduce, which indicates that the issue only shows
>>> up in some extreme or uncommon conditions. By design, the JVM crashes
>>> on stack overflow only if the Red Zone (the last one in the execution
>>> stack) is hit. Before the Red Zone, there's the Yellow Zone, which is
>>> there to detect and handle stack overflows in a nicer way (throwing a
>>> StackOverflowError instead of crashing the process). If the Red Zone
>>> is hit, it means that the Yellow Zone was disabled, and there are only
>>> two cases where the Yellow Zone is disabled:
>>>
>>>   1 - when a potential stack overflow is detected in Java code; in this
>>> case the Yellow Zone is disabled during the generation of the
>>> StackOverflowError and restored during the propagation of the
>>> StackOverflowError
>>>   2 - when a stack overflow occurs either in native code or in JVM
>>> code, because there's nothing else the JVM can do.
>>>
>>> In several crashes, the call stack doesn't show any special recursive
>>> Java calls that could suggest the JVM is in case 1. But they show
>>> relatively complex code paths inside JVM code (de-optimization or
>>> class/symbol resolution), which suggests that case 2 occurred.
>>>
>>> The case of stack overflow in native code is straightforward: if the
>>> Yellow Zone is hit, it is disabled, but when a JavaThread returns from
>>> native code to Java code, the Yellow Zone is systematically re-enabled
>>> (this is part of the native call wrapper generated by the JVM).
>>>
>>> The case of stack overflow in JVM code is more problematic. The JVM
>>> tries to avoid the case of stack overflow in VM code with the Shadow
>>> Pages mechanism. Whenever a Java method is invoked, the JVM tries to
>>> ensure that there's enough free stack space to execute the Java method
>>> and *any call to the JVM code (or JDK native code) that could occur
>>> during the execution of this method*. This check is performed by
>>> banging (touching) n pages ahead on the execution stack, where n is
>>> set to StackShadowPages. If the Yellow Zone is hit during the stack
>>> banging, a StackOverflowError is thrown before the execution of the
>>> first bytecode of the Java method. But this mechanism assumes that
>>> StackShadowPages pages is big enough to cover *any call to the JVM*.
>>> If this assumption is wrong, bad things happen.
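
The stack banging check described above can be modelled without touching real memory. This is a rough sketch under stated assumptions: the page numbering, the zone sizes, and the names `StackModel`, `BangResult` and `bang_shadow_pages` are all hypothetical and are not HotSpot code or HotSpot defaults.

```cpp
#include <cassert>

// Outcome of banging the shadow pages in this toy model.
enum class BangResult { Ok, StackOverflowError, RedZoneCrash };

// Toy stack layout: pages are numbered from 0 at the stack limit upward.
// Pages [0, red_pages) form the Red Zone; the next yellow_pages pages
// form the Yellow Zone. All sizes are illustrative, not HotSpot defaults.
struct StackModel {
  int red_pages = 1;
  int yellow_pages = 2;
  bool yellow_enabled = true;

  bool is_red(int page) const { return page < red_pages; }
  bool is_yellow(int page) const {
    return page >= red_pages && page < red_pages + yellow_pages;
  }
};

// Touch shadow_pages pages below the page holding the stack pointer,
// from nearest to farthest, as a method entry would.
BangResult bang_shadow_pages(const StackModel& stack, int sp_page,
                             int shadow_pages) {
  assert(sp_page >= 0 && shadow_pages >= 0);
  for (int i = 1; i <= shadow_pages; ++i) {
    int page = sp_page - i;
    if (page < 0 || stack.is_red(page)) {
      return BangResult::RedZoneCrash;        // fatal: VM crash
    }
    if (stack.is_yellow(page) && stack.yellow_enabled) {
      return BangResult::StackOverflowError;  // thrown before first bytecode
    }
  }
  return BangResult::Ok;  // enough headroom, the method may run
}
```

In this model, banging 6 pages from page 10 succeeds, while banging 6 pages from page 8 reaches an enabled yellow page and yields a StackOverflowError. The check only covers the pages it touches: if VM code later needs more than the banged pages, the overflow happens inside the VM instead.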
>>>
>>> I ran experiments with tests for which stack overflow related crashes
>>> were reported. I ran them with a JVM where the StackShadowPages value
>>> was decreased by only 1 compared to the usual default value. It was
>>> very easy to reproduce stack overflow crashes. Instrumenting the JVM
>>> showed that some threads hit the Yellow Zone while having thread
>>> state _thread_in_vm, which means that in many cases, the margin
>>> between the stack space provided by StackShadowPages and the real
>>> stack usage while executing VM code is less than one page. And because
>>> knowing the biggest stack requirement to execute any JVM code is an
>>> undecidable problem, there's a high probability that some paths
>>> require more stack space than StackShadowPages ensures. It is
>>> important to notice that Windows is the platform with the smallest
>>> default value for StackShadowPages.
>>>
>>> So, an undersized StackShadowPages could cause the Yellow Zone to be
>>> hit while executing JVM code. On Unices (Solaris, Linux, MacOSX), the
>>> sanction is immediate: a SIGSEGV signal is sent, but because there's no
>>> more free space on the execution stack, the signal handler cannot be
>>> executed and the JVM process is killed. It's a crash without hs_error
>>> file generation.
>>>
>>> On Windows, the story is different. Yellow Pages are marked with the
>>> "Guard" bit. When a page with the Guard bit set is touched, the current
>>> thread receives an exception, but before the exception handler is
>>> executed, the OS removes the Guard bit from the page, so the page that
>>> triggered the fault can be used to execute the signal handler. So on
>>> Windows, when the Yellow Zone is hit while executing JVM code, the JVM
>>> doesn't die as it does on Unices, but the signal handler is executed.
>>>
>>> The logic in the signal handler looks like this (simplified version):
>>>
>>>    if thread touches the yellow zone:
>>>       if thread_in_java:
>>>           disable yellow pages
>>>           jump to code throwing StackOverflowError
>>>           // note: yellow pages will be re-enabled
>>>           // while unwinding the stack
>>>       else:
>>>           // thread_in_vm or thread_in_native
>>>           disable yellow pages
>>>           resume execution
>>>    else:
>>>        // Fatal red zone violation.
>>>        disable red pages
>>>        generate VM crash
>>>
>>> So, the signal handler disables the protection of the Yellow Pages and
>>> resumes JVM code execution.
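
The simplified handler logic above translates into a small state model. The sketch below is illustrative only: `ThreadModel`, `Zone`, `Action` and `handle_guard_fault` are hypothetical names, and the real handler works on OS exception records rather than on an enum argument.

```cpp
#include <cassert>

enum class ThreadState { InJava, InVM, InNative };
enum class Zone { Yellow, Red };
enum class Action { ThrowStackOverflowError, ResumeExecution, CrashVM };

// Hypothetical per-thread view of the guard zones.
struct ThreadModel {
  ThreadState state = ThreadState::InJava;
  bool yellow_enabled = true;
  bool red_enabled = true;
};

// Mirrors the simplified pseudocode: decide what to do when a guard
// page is touched, and record which zone gets disabled.
Action handle_guard_fault(ThreadModel& thread, Zone touched) {
  if (touched == Zone::Yellow) {
    thread.yellow_enabled = false;  // disabled on both branches
    if (thread.state == ThreadState::InJava) {
      // Re-enabled later, while the exception unwinds the stack.
      return Action::ThrowStackOverflowError;
    }
    // In the VM or in native code: nothing re-enables the zone here.
    return Action::ResumeExecution;
  }
  // Fatal red zone violation.
  thread.red_enabled = false;
  return Action::CrashVM;
}
```

For a thread in the VM state, the model returns `ResumeExecution` and leaves `yellow_enabled` false, which is the state the rest of the analysis is concerned with.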
>>>
>>> Eventually, the thread will return from the VM and will continue
>>> executing Java code. But at this point, the yellow pages are still
>>> disabled and there's no systematic check to ensure that Yellow Pages
>>> are re-enabled when returning to Java. The only places where the JVM
>>> checks if Yellow Pages need to be re-activated are when returning from
>>> native code or in the exception propagation code (but not all paths
>>> reactivate the Yellow Zone).
>>>
>>> Once the execution of Java code has resumed with the yellow zone
>>> disabled, the thread is not protected any more against stack overflows.
>>> The only remaining protection is the red zone, and if it is hit, the VM
>>> will generate a crash report and die. Note that having the Yellow Zone
>>> de-activated makes the stack banging of StackShadowPages ineffective.
>>> Stack banging relies on the Yellow Pages being activated, so that
>>> touching them triggers a signal. If Yellow Pages are de-activated
>>> (unprotected), no signal is sent unless the stack banging hits the Red
>>> Page, which triggers a VM crash with hs_error file generation.
>>>
>>>
>>> To summarize: an undersized StackShadowPages on Windows can lead to a
>>> JavaThread executing Java code with Yellow Pages disabled, which means
>>> without any stack overflow protection except the Red Zone which is the
>>> one triggering VM crashes with hs_error file generation.
>>>
>>> Note that the Yellow Pages can be "incidentally" re-activated by a call
>>> to native code or by throwing an exception, which could explain why
>>> stack overflow crashes are not so frequent: the time window during
>>> which Java code is executed without stack overflow protection might be
>>> small for some applications.
>>>
>>>
>>> Proposed fixes for this issue:
>>>   - increase StackShadowPages for the Windows platform
>>>   - add an assertion in the signal handler to detect a thread hitting
>>> the Yellow Zone while executing JVM code (to detect undersized
>>> StackShadowPages during our testing)
>>>   - ensure Yellow Pages are activated when transitioning from
>>> _thread_in_vm to _thread_in_java
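
The third proposed fix, restoring the Yellow Pages on the transition back to Java, fits a scope-guard pattern. The sketch below assumes a hypothetical `InVMScope` class and `ThreadModel` struct; it is not the code in interfaceSupport.hpp and only shows where such a check could sit in a destructor.

```cpp
#include <cassert>

enum class ThreadState { InJava, InVM };

// Hypothetical per-thread view used by this sketch.
struct ThreadModel {
  ThreadState state = ThreadState::InJava;
  bool yellow_enabled = true;
};

// Hypothetical scope guard for a Java -> VM -> Java round trip.
class InVMScope {
 public:
  explicit InVMScope(ThreadModel& thread) : thread_(thread) {
    assert(thread_.state == ThreadState::InJava);
    thread_.state = ThreadState::InVM;
  }
  ~InVMScope() {
    // Systematic check on the way back to Java: restore the yellow
    // zone if a fault handler disabled it while the thread was in the VM.
    if (!thread_.yellow_enabled) {
      thread_.yellow_enabled = true;
    }
    thread_.state = ThreadState::InJava;
  }
  InVMScope(const InVMScope&) = delete;
  InVMScope& operator=(const InVMScope&) = delete;

 private:
  ThreadModel& thread_;
};

// Simulates a VM call; overflow_in_vm stands for the fault handler
// disabling the yellow zone while VM code is running.
bool yellow_enabled_after_vm_call(ThreadModel& thread, bool overflow_in_vm) {
  {
    InVMScope scope(thread);
    if (overflow_in_vm) {
      thread.yellow_enabled = false;
    }
  }
  return thread.yellow_enabled;
}
```

Whether or not the simulated VM call overflows, the thread is back in the Java state with the yellow zone enabled once the scope ends. Putting the check in the destructor means every exit path from the scope goes through it.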
>>>


