Question about stack overflows in native code

Tue Apr 4 19:04:02 UTC 2017

Hi Fred,

On Tue, 4 Apr 2017 at 20:56, Frederic Parain <frederic.parain at oracle.com>
wrote:

>
>
> On 04/04/2017 02:31 PM, Thomas Stüfe wrote:
> > Hi David,
> >
> > On Tue, Apr 4, 2017 at 12:11 PM, David Holmes <david.holmes at oracle.com> wrote:
> >
> >     On 4/04/2017 6:30 PM, Thomas Stüfe wrote:
> >
> >         Hi David,
> >
> >         On Mon, Apr 3, 2017 at 11:02 PM, David Holmes
> >         <david.holmes at oracle.com> wrote:
> >
> >             Just to follow up on what Fred responded ...
> >
> >             On 4/04/2017 4:42 AM, Thomas Stüfe wrote:
> >
> >                 Hi Fred,
> >
> >                 thanks! Some more questions inline.
> >
> >                 On Mon, Apr 3, 2017 at 8:29 PM, Frederic Parain
> >                 <frederic.parain at oracle.com> wrote:
> >
> >                     When the yellow zone is hit and the thread state is not
> >                     in _thread_in_java (which means thread state is
> >                     _thread_in_native or _thread_in_vm), the yellow zone is
> >                     silently disabled and the thread is allowed to resume
> >                     its execution.
> >
> >
> >                 Disabled by whom exactly?
> >
> >                 Normally, this would be done in the signal handler, but that
> >                 requires enough stack space to run. AFAIK jitted or
> >                 interpreted code does stack banging in order to trigger the
> >                 yellow-page segfault at a point where there are enough pages
> >                 left on the stack to invoke the signal handler (n shadow
> >                 pages before), but that is not guaranteed to work with
> >                 native C-compiled code, no?
> >
> >
> >             The stack banging is done to ensure the stack overflow is hit
> >             before we start doing the actual operation. The sizes of the
> >             yellow and red zones are supposed to be sufficient to allow the
> >             respective signal processing and response to be executed.
> >
> >
> >         And the size of the shadow pages should be sufficient to invoke the
> >         initial signal handler, which will unprotect the yellow or red zone,
> >         right? So, back to my original question: if native C code does not
> >         bang the stack but simply runs into the yellow zone, the process
> >         will simply die, or?
> >
> >
> >     I thought Fred already answered that. The signal handler simply
> >     disables the yellow zone and returns:
> >
> >               } else {
> >                 // Thread was in the vm or native code.  Return and try to finish.
> >                 thread->disable_stack_yellow_reserved_zone();
> >                 return 1;
> >               }
> >
> >
> > But in order to do this it needs at least enough stack space to invoke
> > the signal handler and call mprotect on the yellow page, right? So, for
> > native code compiled by a C-compiler, this may or may not work,
> > depending on whether and what form of stack-banging code the C-Compiler
> > does generate? (It may generate some sort of stack banging to trigger
> > the OS guard page and do OS stack overflow handling, or it may just
> > blindly run into the yellow page when pushing a new frame).
>
> The yellow zone by itself doesn't provide protection against dying
> from a stack overflow. It has been designed to work in coordination
> with the stack banging. With stack banging, a thread will try to
> "touch" some pages down its stack *before* it really needs them.
> This way, if the yellow zone is hit during the stack banging, there's
> enough remaining free stack space before the yellow zone to execute
> the signal handler (which doesn't need a lot of stack space). And
> if the signal handler can disable the yellow zone, then the thread
> has enough stack space to perform more complex operations like
> generating and throwing a StackOverflowError.
>
> Without stack banging, the thread will use its stack space until
> the yellow zone is hit, and usually when it is hit, the process
> will die because there wasn't enough remaining space to execute
> the signal handler.
>
> In Java code, stack banging is performed each time a method is
> invoked, to ensure the thread has enough stack space to execute
> it (the class file provides information about the maximum number
> of local variables and the deepest execution stack the method
> will need). Of course, with JIT-compiled code, stack banging is
> performed differently because of inlining.
>
> For code which cannot perform stack banging on method boundaries,
> like the VM code, the approach is different. Each time a thread
> is about to call into the VM runtime, stack banging is performed
> using the StackShadowPages sizing. The shadow pages are supposed to
> represent a stack space big enough to execute *any* call to VM
> runtime. So, if this stack banging passes, all the runtime code
> is executed without any additional check, hoping that shadow
> pages have been sized correctly.
>
> You can try to add, in your native code, some stack banging code,
> or logic that computes the remaining stack space before the
> guard pages. Not necessarily on every method call, but at well
> known points in your code. The hardest part is usually to know
> how much stack space your native code will need. It's possible
> to start with a generous overestimate, and refine it later.
> The sizing of the different zones has been determined with a
> trial-and-error process which still continues today as the
> JVM code and native JDK code evolve.
>
> Fred

Thanks a lot for this excellent and complete explanation!

Kind regards, Thomas
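
For reference, here is a minimal sketch of the "logic computing the remaining stack space" that Fred suggests above. It is not HotSpot code. It assumes Linux with glibc (pthread_getattr_np is a non-portable GNU extension) and a stack that grows downward. The helper names remaining_stack_bytes and stack_has_room are made up for illustration, and the reserve argument is a placeholder for whatever space the VM's guard zones plus the signal handler need, a value the native code has to estimate itself.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* needed for pthread_getattr_np (glibc extension) */
#endif
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns the number of bytes between the current
 * frame and the lowest address of the calling thread's stack, or 0 if the
 * stack bounds cannot be queried. Assumes a downward-growing stack.
 */
static size_t remaining_stack_bytes(void)
{
    pthread_attr_t attr;
    void *stack_low = NULL;    /* lowest address of the stack region */
    size_t stack_size = 0;
    char marker;               /* its address approximates the current stack pointer */

    if (pthread_getattr_np(pthread_self(), &attr) != 0) {
        return 0;
    }
    int rc = pthread_attr_getstack(&attr, &stack_low, &stack_size);
    pthread_attr_destroy(&attr);
    if (rc != 0) {
        return 0;
    }

    uintptr_t here = (uintptr_t)&marker;
    uintptr_t low = (uintptr_t)stack_low;
    return here > low ? (size_t)(here - low) : 0;
}

/*
 * Hypothetical helper: returns 1 if at least `needed` bytes are available
 * above a caller-chosen `reserve`, 0 otherwise. The reserve stands in for
 * the guard zones and signal-handler space; its value is an assumption the
 * caller must supply.
 */
static int stack_has_room(size_t needed, size_t reserve)
{
    size_t remaining = remaining_stack_bytes();
    return remaining > reserve && remaining - reserve >= needed;
}
```

A recursive native routine could call stack_has_room() at a well known point, for example on entry to each recursion level, and return an error instead of recursing further when it reports 0. As Fred notes, the hard part is choosing the numbers, so starting with a generous reserve and refining it later seems the safer approach.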

>
> > If stack space is not sufficient to invoke the signal handler to
> > unprotect the yellow/red page, process would silently die, right?
>
> Correct.
>
> >
> >     If it keeps going and hits the red zone then the red zone will be
> >     disabled, we print some error messages, and then should call
> >     VMError::report_and_die(). But I admit the signal handler logic is
> >     quite complex so I may have missed something. :)
> >
> >
> >
> >             But that assumes you simply advance into the guard zones - if
> >             your native code suddenly jumped to the end of the yellow zone,
> >             for example, then signal processing would hit the red zone;
> >             similarly, if you jump to the end of the red zone then signal
> >             processing will hit the OS guard page. If you jump past all
> >             guard pages you simply die.
> >
> >
> >         Thank you!
> >
> >         See also my response to Fred. We wondered whether exporting a simple
> >         JNI helper function to check the stack size on behalf of the native
> >         code would be something helpful, for cooperative native code at
> >         least.
> >
> >
> >     Perhaps. Haven't really thought about it. :)
> >
> >
> > We may experiment a bit. The VM silently dying on native code stack
> > overflows is a huge annoyance, especially since it depends on the
> > user-adjustable stack size. Typically not even an hs_err file is
> > generated.
> >
> > Actually not a theoretical problem, I am currently running into this:
> > http://www-01.ibm.com/support/docview.wss?uid=swg1IV23033 for our
> > commercial code base at a customer (not j9 obviously), and while the
> > recursion in the vector calculations can be fixed, it would be nice to
> > at least have an hs_err file...
> >
> >     Cheers,
> >     David
> >
> >
> > Kind Regards, Thomas
> >
> >
> >         Kind Regards, Thomas
> >
> >
> >             David
> >
> >
> >                 (not just a theory, we have a test case here where a stack
> >                 overflow in native code just silently kills the process.)
> >
> >                 I guess it may work accidentally if the C-compiled code
> >                 itself does some form of stack banging when establishing
> >                 frames, in order to detect OS stack overflows? Very fuzzy
> >                 here. But whatever the C-compiled code does, it has no
> >                 notion about how much space we need to invoke the signal
> >                 handler and handle stack overflows, no?
> >
> >                     When the red zone is hit, whatever the current thread
> >                     state is, the red zone is disabled and
> >                     VMError::report_and_die() is called, which should
> >                     generate an hs_err file unless the generation of the
> >                     error file requires more memory than the red zone
> >                     provides.
> >
> >                     Fred
> >
> >
> >                 Thanks, Thomas
> >
> >
> >
> >
> >                     On 04/03/2017 02:08 PM, Thomas Stüfe wrote:
> >
> >                         Hi,
> >
> >                         Today we wondered what would happen when a stack
> >                         overflow occurs in native code running in a Java
> >                         thread (an attached thread or one created by the
> >                         VM).
> >
> >                         In that case yellow and red pages are in place, but
> >                         this would not help much, would it not, because the
> >                         native code would not do any stack banging?
> >
> >                         So, native code would hit the yellow page, and then
> >                         there would probably not be enough space left on
> >                         the stack to invoke the signal handler. The result
> >                         would be immediate VM death - not even an hs_err
> >                         file - is that correct?
> >
> >                         Also, we would hit our own yellow page, not the
> >                         guard page the OS may or may not have established,
> >                         so - on UNIX - this would show up as "Segmentation
> >                         Fault", not "Stack Overflow", or?
> >
> >                         Thank you,
> >
> >                         Thomas
> >
> >
> >
> >
>