Question about stack overflows in native code

Tue Apr 4 19:22:23 UTC 2017

I have a couple of comments below but don't have that much to add to 
Frederic's email.

On 4/4/17 3:04 PM, Thomas Stüfe wrote:
> Hi Fred,
>
> On Tue, 4 Apr 2017 at 20:56, Frederic Parain <frederic.parain at oracle.com>
> wrote:
>
>>
>> On 04/04/2017 02:31 PM, Thomas Stüfe wrote:
>>> Hi David,
>>>
>>> On Tue, Apr 4, 2017 at 12:11 PM, David Holmes
>>> <david.holmes at oracle.com> wrote:
>>>
>>>      On 4/04/2017 6:30 PM, Thomas Stüfe wrote:
>>>
>>>          Hi David,
>>>
>>>          On Mon, Apr 3, 2017 at 11:02 PM, David Holmes
>>>          <david.holmes at oracle.com> wrote:
>>>
>>>              Just to follow up on what Fred responded ...
>>>
>>>              On 4/04/2017 4:42 AM, Thomas Stüfe wrote:
>>>
>>>                  Hi Fred,
>>>
>>>                  thanks! Some more questions inline.
>>>
>>>                  On Mon, Apr 3, 2017 at 8:29 PM, Frederic Parain
>>>                  <frederic.parain at oracle.com> wrote:
>>>
>>>                      When the yellow zone is hit and the thread state is
>>>                      not in _thread_in_java (which means thread state is
>>>                      _thread_in_native or _thread_in_vm), the yellow zone
>>>                      is silently disabled and the thread is allowed to
>>>                      resume its execution.
>>>
>>>
>>>                  Disabled by whom exactly?
>>>
>>>                  Normally, this would be done in the signal handler, but
>>>                  that requires enough stack space to run. AFAIK jitted or
>>>                  interpreted code does stack banging in order to trigger
>>>                  the yellow-page-segfault at a point where there are
>>>                  enough pages left on the stack to invoke the signal
>>>                  handler (n shadow pages before), but that is not
>>>                  guaranteed to work with native C-compiled code, no?
>>>
>>>
>>>              The stack banging is done to ensure the stack overflow is
>>>              hit before we start doing the actual operation. The sizes of
>>>              the yellow and red zones are supposed to be sufficient to
>>>              allow the respective signal processing and response to be
>>>              executed.
>>>
>>>
>>>          And the size of the shadow pages should be sufficient to invoke
>>>          the initial signal handler which will unprotect the yellow or
>>>          red zone, right? So, back to my original question, if native C
>>>          code does not bang the stack but simply runs into the yellow
>>>          zone, the process will simply die, or?
>>>
>>>
>>>      I thought Fred already answered that. The signal handler simply
>>>      disables the yellow zone and returns:
>>>
>>>                } else {
>>>                  // Thread was in the vm or native code.  Return and try to finish.
>>>                  thread->disable_stack_yellow_reserved_zone();
>>>                  return 1;
>>>                }
>>>
>>>
>>> But in order to do this it needs at least enough stack space to invoke
>>> the signal handler and call mprotect on the yellow page, right? So, for
>>> native code compiled by a C-compiler, this may or may not work,
>>> depending on whether and what form of stack-banging code the C-Compiler
>>> does generate? (It may generate some sort of stack banging to trigger
>>> the OS guard page and do OS stack overflow handling, or it may just
>>> blindly run into the yellow page when pushing a new frame).
>> The yellow zone by itself doesn't provide protection against dying
>> from a stack overflow. It has been designed to work in coordination
>> with the stack banging. With stack banging, a thread will try to
>> "touch" some pages down its stack *before* it really needs them.
>> This way, if the yellow zone is hit during the stack banging, there's
>> enough remaining free stack space before the yellow zone to execute
>> the signal handler (which doesn't need a lot of stack space). And
>> if the signal handler can disable the yellow zone, then the thread
>> has enough stack space to perform more complex operations like
>> generating and throwing a StackOverflowError.
>>
>> Without stack banging, the thread will use its stack space until
>> the yellow zone is hit, and usually when it is hit, the process
>> will die because there wasn't enough remaining space to execute
>> the signal handler.
>>
>> In Java code, stack banging is performed each time a method is
>> invoked, to ensure the thread has enough stack space to execute
>> it (the class file provides information about the maximum number
>> of local variables and the deepest execution stack the method
>> will need). Of course, with JIT-compiled code, stack banging is
>> performed differently because of inlining.
>>
>> For code which cannot perform stack banging on method boundaries,
>> like the VM code, the approach is different. Each time a thread
>> is about to call into the VM runtime, stack banging is performed
>> using the StackShadowPages sizing. The shadow pages are supposed to
>> represent a stack space big enough to execute *any* call to the VM
>> runtime. So, if this stack banging passes, all the runtime code
>> is executed without any additional check, hoping that the shadow
>> pages have been sized correctly.

The StackShadowPages mechanism is what is supposed to protect you in this
situation, but it's been known to be difficult to size correctly,
especially at a customer site, and you may not want to globally have a
large number of shadow pages.
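
To make the banging idea concrete, here is a rough sketch in C of what
"touching pages down the stack before you need them" can look like in
native code. It is only an illustration: bang_stack() and BANG_MAX_PAGES
are made-up names, not HotSpot functions, it assumes a downward-growing
stack and a compiler that supports variable-length arrays, and it is not
how the interpreter or the JIT emit their bangs.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical cap so a bad argument cannot request a huge frame. */
#define BANG_MAX_PAGES 256

/*
 * Hypothetical helper: touch one byte in each of the next `pages` pages
 * of stack below the caller. If a guard zone lies within that range, the
 * fault happens here, early, instead of deep inside the real work.
 * Returns the number of pages touched, or -1 on a bad argument.
 */
static int bang_stack(size_t pages) {
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0 || pages == 0 || pages > BANG_MAX_PAGES) {
        return -1;
    }
    size_t page = (size_t)ps;
    size_t len = pages * page;

    /* Reserve the region as part of this frame (VLA; assumed supported). */
    volatile char probe[len];

    /* Assuming the stack grows down, the highest address is closest to
     * the frames already in use, so touch from the top downwards. */
    for (size_t i = 0; i < pages; i++) {
        probe[len - 1 - i * page] = 0;
    }
    return (int)pages;
}
```

A cooperative native library could call something like bang_stack(n) at a
few well-known entry points, with n chosen as a generous guess at what
the code below that point needs.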
>>
>> You can try to add, in your native code, some stack banging code,
>> or some logic computing the remaining stack space before the
>> guard pages. Not necessarily on every method call, but at well
>> known points in your code. The hardest part is usually to know
>> how much stack space your native code will need. It's possible
>> to start with a generous overestimate, and refine it later.
>> The sizing of the different zones has been determined with a
>> trial and error process that still continues today as the
>> JVM code and native JDK code evolve.
>>
>> Fred
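
For the "logic computing the remaining stack space" option, a minimal
sketch follows, assuming Linux with glibc (pthread_getattr_np is a GNU
extension; build with -pthread). The helper names
remaining_stack_bytes() and have_stack_for() are invented for this
example. Note that it measures down to the lowest address of the stack
mapping, not to the JVM's yellow zone, so the caller has to pass a
margin that covers the guard zones plus room for signal handling.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for pthread_getattr_np (glibc extension) */
#endif
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: bytes between the current frame and the lowest
 * address of the calling thread's stack. Returns 0 if it cannot tell.
 */
static size_t remaining_stack_bytes(void) {
    pthread_attr_t attr;
    void *stack_low = NULL;    /* lowest address of the stack */
    size_t stack_size = 0;
    size_t remaining = 0;
    char marker;               /* its address approximates the current SP */

    if (pthread_getattr_np(pthread_self(), &attr) != 0) {
        return 0;
    }
    if (pthread_attr_getstack(&attr, &stack_low, &stack_size) == 0) {
        uintptr_t here = (uintptr_t)&marker;
        uintptr_t low = (uintptr_t)stack_low;
        /* Only trust the result if we really are inside that range. */
        if (here > low && here - low <= stack_size) {
            remaining = here - low;
        }
    }
    pthread_attr_destroy(&attr);
    return remaining;
}

/*
 * Hypothetical check for well-known points in native code: is there room
 * for `needed` bytes while still leaving `margin` bytes untouched?
 */
static int have_stack_for(size_t needed, size_t margin) {
    size_t left = remaining_stack_bytes();
    return left > margin && left - margin >= needed;
}
```

When have_stack_for() returns 0 the native code can fail gracefully
(return an error, throw through JNI) instead of running into the guard
zones.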

Of course, there's always alternate signal stacks.  It's been a couple 
of years since they came up last.

http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2011-August/002403.html

Coleen
> Thanks a lot for this excellent and complete explanation!
>
> Kind regards, Thomas
>
>
>>> If stack space is not sufficient to invoke the signal handler to
>>> unprotect the yellow/red page, process would silently die, right?
>> Correct.
>>
>>>      If it keeps going and hits the red zone then the red zone will be
>>>      disabled, we print some error messages, and then should call
>>>      VMError::report_and_die(). But I admit the signal handler logic is
>>>      quite complex so I may have missed something. :)
>>>
>>>
>>>
>>>              But that assumes you simply advance into the guard zones -
>>>              if your native code suddenly jumped to the end of the yellow
>>>              zone for example, then signal processing would hit the red
>>>              zone; similarly if you jump to the end of the red zone then
>>>              signal processing will hit the OS guard page. If you jump
>>>              past all guard pages you simply die.
>>>
>>>
>>>          Thank you!
>>>
>>>          See also my response to Fred. We wondered whether exporting a
>>>          simple JNI helper function to check the stack size on behalf of
>>>          the native code would be something helpful, for cooperative
>>>          native code at least.
>>>
>>>
>>>      Perhaps. Haven't really thought about it. :)
>>>
>>>
>>> We may experiment a bit. The VM silently dying on native code stack
>>> overflows is a huge annoyance, especially since it depends on the
>>> user-adjustable stack size. Typically not even a hs_err file is
>>> generated.
>>> Actually not a theoretical problem, I am currently running into this:
>>> http://www-01.ibm.com/support/docview.wss?uid=swg1IV23033 for our
>>> commercial code base at a customer (not j9 obviously), and while the
>>> recursion in the vector calculations can be fixed, it would be nice to
>>> at least have an hs_err file...
>>>
>>>      Cheers,
>>>      David
>>>
>>>
>>> Kind Regards, Thomas
>>>
>>>
>>>          Kind Regards, Thomas
>>>
>>>
>>>              David
>>>
>>>
>>>                  (not just a theory, we have a test case here where a
>>>                  stack overflow in native code just silently kills the
>>>                  process.)
>>>
>>>                  I guess it may work accidentally if the C-compiled code
>>>                  itself does some form of stack banging when establishing
>>>                  frames, in order to detect OS stack overflows? Very
>>>                  fuzzy here. But whatever the C-compiled code does, it
>>>                  has no notion about how much space we need to invoke the
>>>                  signal handler and handle stack overflows, no?
>>>
>>>                      When the red zone is hit, whatever the current
>>>                      thread state is, the red zone is disabled and
>>>                      VMError::report_and_die() is called, which should
>>>                      generate a hs_err file unless the generation of the
>>>                      error file requires more memory than the red zone
>>>                      provides.
>>>
>>>                      Fred
>>>
>>>
>>>                  Thanks, Thomas
>>>
>>>
>>>
>>>
>>>                      On 04/03/2017 02:08 PM, Thomas Stüfe wrote:
>>>
>>>                          Hi,
>>>
>>>                          Today we wondered what would happen when a stack
>>>                          overflow occurs in native code running in a Java
>>>                          thread (an attached thread or one created by the
>>>                          VM).
>>>
>>>                          In that case yellow and red pages are in place,
>>>                          but this would not help much, would it, because
>>>                          the native code would not do any stack banging?
>>>
>>>                          So, native code would hit the yellow page, and
>>>                          then there would probably not be enough space
>>>                          left on the stack to invoke the signal handler.
>>>                          The result would be immediate VM death - not
>>>                          even an hs-err file - is that correct?
>>>
>>>                          Also, we would hit our own yellow page, not the
>>>                          guard page the OS may or may not have
>>>                          established, so - on UNIX - this would show up
>>>                          as "Segmentation Fault", not "Stack Overflow",
>>>                          or?
>>>
>>>                          Thank you,
>>>
>>>                          Thomas
>>>
>>>
>>>
>>>