Question about stack overflows in native code

Tue Apr 4 18:31:05 UTC 2017

Hi David,

On Tue, Apr 4, 2017 at 12:11 PM, David Holmes <david.holmes at oracle.com>
wrote:

> On 4/04/2017 6:30 PM, Thomas Stüfe wrote:
>
>> Hi David,
>>
>> On Mon, Apr 3, 2017 at 11:02 PM, David Holmes <david.holmes at oracle.com>
>> wrote:
>>
>>     Just to follow up on what Fred responded ...
>>
>>     On 4/04/2017 4:42 AM, Thomas Stüfe wrote:
>>
>>         Hi Fred,
>>
>>         thanks! Some more questions inline.
>>
>>         On Mon, Apr 3, 2017 at 8:29 PM, Frederic Parain
>>         <frederic.parain at oracle.com> wrote:
>>
>>             When the yellow zone is hit and the thread state is not in
>>             _thread_in_java (which means thread state is
>>             _thread_in_native or
>>             _thread_in_vm), the yellow zone is silently disabled and the
>>             thread
>>             is allowed to resume its execution.
>>
>>
>>         Disabled by whom exactly?
>>
>>         Normally, this would be done in the signal handler, but that
>>         requires
>>         enough stack space to run. AFAIK jitted or interpreted code does
>>         stack
>>         banging in order to trigger the yellow-page-segfault at a point
>>         where there
>>         are enough pages left on the stack to invoke the signal handler
>>         (n shadow
>>         pages before), but that is not guaranteed to work with native
>>         C-compiled
>>         code, no?
>>
>>
>>     The stack banging is done to ensure the stackoverflow is hit before
>>     we start doing the actual operation. The size of the yellow and red
>>     zones are supposed to be sufficient to allow the respective signal
>>     processing and response to be executed.
>>
>>
>> And the size of the shadow pages should be sufficient to invoke the initial
>> signal handler, which will unprotect the yellow or red zone, right?
>> So, back to my original question: if native C code does not bang the
>> stack but simply runs into the yellow zone, the process will simply die,
>> right?
>>
>
> I thought Fred already answered that. The signal handler simply disables
> the yellow zone and returns:
>
>           } else {
>             // Thread was in the vm or native code.  Return and try to finish.
>             thread->disable_stack_yellow_reserved_zone();
>             return 1;
>           }
>
>
But in order to do this it needs at least enough stack space to invoke the
signal handler and call mprotect on the yellow page, right? So, for native
code compiled by a C compiler, this may or may not work, depending on
whether the C compiler generates stack-banging code, and in what form?
(It may generate some sort of stack banging to trigger the OS guard page
and do OS stack overflow handling, or it may just blindly run into the
yellow page when pushing a new frame.)

If the stack space is not sufficient to invoke the signal handler to
unprotect the yellow/red page, the process would silently die, right?
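
To make the "unprotect and return" step concrete, here is a minimal sketch,
assuming Linux/POSIX. It is not HotSpot code: the names yellow_page, on_fault
and yellow_zone_demo are made up for the example, and the "yellow page" is an
ordinary anonymous mapping rather than part of a real thread stack. That
means the handler has plenty of stack to run on, which is exactly the thing
that is not guaranteed in the native-code case I am asking about.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1  /* MAP_ANONYMOUS and sigaction under strict -std modes */
#endif
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical stand-in for the yellow zone: one PROT_NONE page. */
static char *yellow_page;
static size_t yellow_size;
static volatile sig_atomic_t yellow_hits;

static void on_fault(int sig, siginfo_t *info, void *ucontext) {
    char *addr = (char *)info->si_addr;
    (void)ucontext;
    if (addr >= yellow_page && addr < yellow_page + yellow_size) {
        /* "Disable" the zone and return, so the faulting store is retried.
         * mprotect is not on the POSIX async-signal-safe list; it is used
         * here because it is a plain system call on Linux. */
        mprotect(yellow_page, yellow_size, PROT_READ | PROT_WRITE);
        yellow_hits++;
        return;
    }
    /* Not our page: restore the default action; the retried access
     * then terminates the process as usual. */
    signal(sig, SIG_DFL);
}

/* Returns the number of handled faults, or -1 on setup failure. */
int yellow_zone_demo(void) {
    struct sigaction sa, old_segv, old_bus;
    long page = sysconf(_SC_PAGESIZE);
    void *p;

    if (page <= 0) return -1;
    yellow_size = (size_t)page;
    p = mmap(NULL, yellow_size, PROT_NONE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    yellow_page = (char *)p;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    /* Some platforms report protection faults as SIGBUS, so cover both. */
    if (sigaction(SIGSEGV, &sa, &old_segv) != 0) return -1;
    if (sigaction(SIGBUS, &sa, &old_bus) != 0) return -1;

    yellow_hits = 0;
    *(volatile char *)yellow_page = 1;  /* faults, handler unprotects, retried */

    sigaction(SIGSEGV, &old_segv, NULL);
    sigaction(SIGBUS, &old_bus, NULL);
    munmap(yellow_page, yellow_size);
    return (int)yellow_hits;
}
```

Calling yellow_zone_demo() should return 1: one fault, handled once, after
which the store succeeds. If the handler itself had to run on a stack with no
room left, it would never get that far.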

> If it keeps going and hits the red zone then the red zone will be
> disabled, we print some error messages, and then should call
> VMError::report_and_die(). But I admit the signal handler logic is quite
> complex so I may have missed something. :)
>
>
>>
>>     But that assumes you simply advance into the guard zones - if your
>>     native code suddenly jumped to the end of the yellow zone for
>>     example, then signal processing would hit the red zone; similarly if
>>     you jump to the end of the red zone then signal processing will hit
>>     the OS guard page. If you jump past all guard pages you simply die.
>>
>>
>> Thank you!
>>
>> See also my response to Fred. We wondered whether exporting a simple JNI
>> helper function to check the stack size on behalf of the native code
>> would be something helpful, for cooperative native code at least.
>>
>
> Perhaps. Haven't really thought about it. :)
>
>
We may experiment a bit. The VM silently dying on native code stack
overflows is a huge annoyance, especially since it depends on the
user-adjustable stack size. Typically not even an hs_err file is generated.

Actually, this is not a theoretical problem. I am currently running into this:
http://www-01.ibm.com/support/docview.wss?uid=swg1IV23033 for our
commercial code base at a customer (not J9, obviously), and while the
recursion in the vector calculations can be fixed, it would be nice to at
least have an hs_err file...
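
As a rough idea of what a cooperative check could look like on the native
side, here is a minimal sketch in C. Everything in it is hypothetical:
stack_limit_t, limit_from_budget, has_headroom and recurse_guarded are names
invented for this mail, and the limit is derived from a caller-supplied
budget, whereas the helper we have in mind would get the real limit from the
VM. It also assumes a downward-growing stack and uses the address of a local
as an approximation of the stack pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limit: lowest stack address the native code should use. */
typedef struct {
    uintptr_t low_limit;
} stack_limit_t;

/* Build a limit 'budget' bytes below 'here' (the address of a local in the
 * caller). A VM-provided helper would report the real limit instead. */
stack_limit_t limit_from_budget(const volatile void *here, size_t budget) {
    stack_limit_t lim;
    lim.low_limit = (uintptr_t)here - budget;
    return lim;
}

/* Nonzero if at least 'needed' bytes remain between 'here' and the limit. */
int has_headroom(const stack_limit_t *lim, const volatile void *here,
                 size_t needed) {
    uintptr_t sp = (uintptr_t)here;
    return sp > lim->low_limit && sp - lim->low_limit >= needed;
}

/* Recursive worker that stops and reports the depth it reached instead of
 * running into the guard pages. max_depth is an extra safety cap. */
long recurse_guarded(const stack_limit_t *lim, long depth, long max_depth) {
    volatile char frame[512];           /* stands in for real frame locals */
    long reached;

    frame[0] = (char)depth;
    if (depth >= max_depth || !has_headroom(lim, frame, 4096)) {
        return depth;                   /* back out cleanly */
    }
    reached = recurse_guarded(lim, depth + 1, max_depth);
    /* Touch the frame after the call so it stays live in every activation. */
    return reached + (frame[0] - (char)depth);
}
```

A caller would take the address of one of its own locals, build the limit
once, and then pass it down the recursion; with a 64 KB budget the worker
above gives up after a modest number of frames rather than overflowing.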

> Cheers,
> David
>
>
Kind Regards, Thomas

>
> Kind Regards, Thomas
>>
>>
>>     David
>>
>>
>>         (not just a theory, we have a test case here where a stack
>>         overflow in
>>         native code just silently kills the process.)
>>
>>         I guess it may work accidentally if the C-compiled code itself
>>         does some
>>         form of stack banging when establishing frames, in order to
>>         detect OS stack
>>         overflows? Very fuzzy here. But whatever the C-compiled code
>>         does, it has
>>         no notion about how much space we need to invoke the signal
>>         handler and
>>         handle stack overflows, no?
>>
>>             When the red zone is hit, whatever the current thread state is,
>>             the red zone is disabled and VMError::report_and_die() is
>>             called,
>>             which should generate an hs_err file unless the generation of
>>             the error file requires more memory than the red zone provides.
>>
>>             Fred
>>
>>
>>         Thanks, Thomas
>>
>>
>>
>>
>>             On 04/03/2017 02:08 PM, Thomas Stüfe wrote:
>>
>>                 Hi,
>>
>>                 Today we wondered what would happen when a stack
>>                 overflow occurs in native
>>                 code running in a java thread (an attached thread or one
>>                 created by the
>>                 VM).
>>
>>                 In that case yellow and red pages are in place, but this
>>                 would not help
>>                 much, would it not, because the native code would not do
>>                 any stack
>>                 banging?
>>
>>                 So, native code would hit the yellow page, and then
>>                 there would probably
>>                 not be enough space left on the stack to invoke the
>>                 signal handler. The
>>                 result would be immediate VM death - not even an hs_err
>>                 file - is that
>>                 correct?
>>
>>                 Also, we would hit our own yellow page, not the
>>                 guard page the OS may
>>                 or may not have established, so - on UNIX - this would
>>                 show up as
>>                 "Segmentation Fault", not "Stack Overflow", right?
>>
>>                 Thank you,
>>
>>                 Thomas
>>
>>
>>
>>