Question about stack overflows in native code
coleen.phillimore at oracle.com
Tue Apr 4 19:22:23 UTC 2017
I have a couple of comments below but don't have that much to add to
Frederic's email.
On 4/4/17 3:04 PM, Thomas Stüfe wrote:
> Hi Fred,
>
> On Tue, 4 Apr 2017 at 20:56, Frederic Parain <frederic.parain at oracle.com>
> wrote:
>
>>
>> On 04/04/2017 02:31 PM, Thomas Stüfe wrote:
>>> Hi David,
>>>
>>> On Tue, Apr 4, 2017 at 12:11 PM, David Holmes
>>> <david.holmes at oracle.com> wrote:
>>>
>>> On 4/04/2017 6:30 PM, Thomas Stüfe wrote:
>>>
>>> Hi David,
>>>
>>> On Mon, Apr 3, 2017 at 11:02 PM, David Holmes
>>> <david.holmes at oracle.com> wrote:
>>>
>>> Just to follow up on what Fred responded ...
>>>
>>> On 4/04/2017 4:42 AM, Thomas Stüfe wrote:
>>>
>>> Hi Fred,
>>>
>>> thanks! Some more questions inline.
>>>
>>> On Mon, Apr 3, 2017 at 8:29 PM, Frederic Parain
>>> <frederic.parain at oracle.com> wrote:
>>>
>>> When the yellow zone is hit and the thread state is not in
>>> _thread_in_java (which means thread state is _thread_in_native
>>> or _thread_in_vm), the yellow zone is silently disabled and the
>>> thread is allowed to resume its execution.
>>>
>>>
>>> Disabled by whom exactly?
>>>
>>> Normally, this would be done in the signal handler, but that
>>> requires enough stack space to run. AFAIK jitted or interpreted
>>> code does stack banging in order to trigger the
>>> yellow-page-segfault at a point where there are enough pages
>>> left on the stack to invoke the signal handler (n shadow pages
>>> before), but that is not guaranteed to work with native
>>> C-compiled code, no?
>>>
>>>
>>> The stack banging is done to ensure the stack overflow is hit
>>> before we start doing the actual operation. The sizes of the
>>> yellow and red zones are supposed to be sufficient to allow the
>>> respective signal processing and response to be executed.
>>>
>>>
>>> And the size of the shadow pages should be sufficient to invoke
>>> the initial signal handler, which will unprotect the yellow or
>>> red zone, right? So, back to my original question: if native C
>>> code does not bang the stack but simply runs into the yellow
>>> zone, the process will simply die, or?
>>>
>>>
>>> I thought Fred already answered that. The signal handler simply
>>> disables the yellow zone and returns:
>>>
>>>     } else {
>>>       // Thread was in the vm or native code. Return and try to finish.
>>>       thread->disable_stack_yellow_reserved_zone();
>>>       return 1;
>>>     }
>>>
>>>
>>> But in order to do this it needs at least enough stack space to invoke
>>> the signal handler and call mprotect on the yellow page, right? So, for
>>> native code compiled by a C-compiler, this may or may not work,
>>> depending on whether and what form of stack-banging code the C-Compiler
>>> does generate? (It may generate some sort of stack banging to trigger
>>> the OS guard page and do OS stack overflow handling, or it may just
>>> blindly run into the yellow page when pushing a new frame).
>> The yellow zone by itself doesn't provide protection against dying
>> from a stack overflow. It has been designed to work in coordination
>> with the stack banging. With stack banging, a thread will try to
>> "touch" some pages down its stack *before* it really needs them.
>> This way, if the yellow zone is hit during the stack banging, there's
>> enough remaining free stack space before the yellow zone to execute
>> the signal handler (which doesn't need a lot of stack space). And
>> if the signal handler can disable the yellow zone, then the thread
>> has enough stack space to perform more complex operations like
>> generating and throwing a StackOverflowError.
>>
>> Without stack banging, the thread will use its stack space until
>> the yellow zone is hit, and usually when it is hit, the process
>> will die because there wasn't enough remaining space to execute
>> the signal handler.
>>
>> In Java code, stack banging is performed each time a method is
>> invoked, to ensure the thread has enough stack space to execute
>> it (the class file provides information about the maximum number
>> of local variables and the deepest execution stack the method
>> will need). Of course, with JIT-compiled code, stack banging is
>> performed differently because of inlining.
>>
>> For code which cannot perform stack banging on method boundaries,
>> like the VM code, the approach is different. Each time a thread
>> is about to call into the VM runtime, a stack banging is performed
>> using the StackShadowPages sizing. Shadow pages are supposed to
>> represent a stack space big enough to execute *any* call to VM
>> runtime. So, if this stack banging passes, all the runtime code
>> is executed without any additional check, hoping that shadow
>> pages have been sized correctly.
The StackShadowPages mechanism is what is supposed to protect you in this
situation, but it's been known to be difficult to size correctly,
especially at a customer site, and you may not want to have a large
number of shadow pages globally.
>>
>> You can try to add, in your native code, some stack banging code,
>> or a logic computing the remaining stack space before the
>> guard pages. Not necessarily on every method call, but on well
>> known points in your code. The hardest part is usually to know
>> how much stack space your native code will need. It's possible
>> to start with a big over-estimating value, and refine it later.
>> The sizing of the different zones has been determined by a
>> trial-and-error process which still continues today as the
>> JVM code and native JDK code evolve.
>>
>> Fred
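For cooperative native code, the "logic computing the remaining stack
space" that Fred mentions might look roughly like the sketch below. It is
only an illustration under assumptions: it relies on the glibc-specific
pthread_getattr_np(), the helper names remaining_stack_bytes() and
stack_has_room() are made up for this example, and the "reserve" argument
is a caller-chosen estimate that has to cover whatever lies at the low end
of the stack (on a Java thread, the VM's guard zones and shadow pages).

```c
#define _GNU_SOURCE            /* for pthread_getattr_np (glibc extension) */
#include <assert.h>            /* callers can assert() on the results */
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes between the current frame and the lowest
 * address of the calling thread's stack. Assumes a downward-growing
 * stack. Returns 0 if the bounds cannot be determined. */
size_t remaining_stack_bytes(void) {
    pthread_attr_t attr;
    void *low = NULL;          /* lowest address of the stack */
    size_t size = 0;
    char marker;               /* its address approximates the current sp */

    if (pthread_getattr_np(pthread_self(), &attr) != 0) {
        return 0;
    }
    int rc = pthread_attr_getstack(&attr, &low, &size);
    pthread_attr_destroy(&attr);
    if (rc != 0) {
        return 0;
    }

    uintptr_t here = (uintptr_t)&marker;
    uintptr_t base = (uintptr_t)low;
    if (here <= base || here > base + size) {
        return 0;              /* not on this stack, e.g. an alternate stack */
    }
    return (size_t)(here - base);
}

/* Hypothetical helper: nonzero if `needed` bytes fit above a `reserve`
 * that is kept free at the low end of the stack. */
int stack_has_room(size_t needed, size_t reserve) {
    size_t left = remaining_stack_bytes();
    return left > reserve && left - reserve >= needed;
}
```

A recursive native routine could call stack_has_room() at a few well-known
points and return an error instead of recursing further. The values passed
for needed and reserve would have to be found by the same kind of trial and
error described above.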
Of course, there are always alternate signal stacks. It's been a couple
of years since they last came up.
http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2011-August/002403.html
Coleen
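As a rough illustration of the alternate signal stack idea, the sketch
below shows the plain POSIX mechanism in a standalone C program, not what
HotSpot does. The function name install_alt_stack_handler(), the handler,
and the 64 KB size are assumptions made for this example. Note that
sigaltstack() applies per thread, and that installing a SIGSEGV handler
like this replaces any handler already in place, so it is not something to
drop into a JVM process as-is.

```c
#define _GNU_SOURCE
#include <assert.h>            /* callers can assert() on the result */
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALT_STACK_BYTES (64 * 1024)   /* assumed size, not a tuned value */

/* Runs on the alternate stack, so it still works when the normal stack
 * is exhausted. Uses only async-signal-safe calls. */
static void on_segv(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)info; (void)ctx;
    static const char msg[] = "SIGSEGV handled on alternate stack\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof msg - 1);
    (void)ignored;
    _exit(1);
}

/* Hypothetical helper: gives the calling thread an alternate signal stack
 * and registers a SIGSEGV handler that runs on it. Returns 0 on success. */
int install_alt_stack_handler(void) {
    stack_t ss;
    memset(&ss, 0, sizeof ss);
    ss.ss_sp = malloc(ALT_STACK_BYTES);
    if (ss.ss_sp == NULL) {
        return -1;
    }
    ss.ss_size = ALT_STACK_BYTES;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0) {
        free(ss.ss_sp);
        return -1;
    }

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO | SA_ONSTACK;   /* SA_ONSTACK selects the alt stack */
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

With this in place, a stack overflow in the thread that called
install_alt_stack_handler() reaches on_segv() even though no space is left
on the normal stack, which is the property that makes the approach
interesting for getting at least an error report out.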
> Thanks a lot for this excellent and complete explanation!
>
> Kind regards, Thomas
>
>
>>> If stack space is not sufficient to invoke the signal handler to
>>> unprotect the yellow/red page, process would silently die, right?
>> Correct.
>>
>>> If it keeps going and hits the red zone then the red zone will be
>>> disabled, we print some error messages, and then should call
>>> VMError::report_and_die(). But I admit the signal handler logic is
>>> quite complex so I may have missed something. :)
>>>
>>>
>>>
>>> But that assumes you simply advance into the guard zones - if
>>> your native code suddenly jumped to the end of the yellow zone
>>> for example, then signal processing would hit the red zone;
>>> similarly if you jump to the end of the red zone then signal
>>> processing will hit the OS guard page. If you jump past all
>>> guard pages you simply die.
>>>
>>>
>>> Thank you!
>>>
>>> See also my response to Fred. We wondered whether exporting a
>>> simple JNI helper function to check the stack size on behalf of
>>> the native code would be something helpful, for cooperative
>>> native code at least.
>>>
>>>
>>> Perhaps. Haven't really thought about it. :)
>>>
>>>
>>> We may experiment a bit. The VM silently dying on native code stack
>>> overflows is a huge annoyance, especially since it depends on the
>>> user-adjustable stack size. Typically not even an hs_err file is
>>> generated.
>>> Actually not a theoretical problem, I am currently running into this:
>>> http://www-01.ibm.com/support/docview.wss?uid=swg1IV23033 for our
>>> commercial code base at a customer (not j9 obviously), and while the
>>> recursion in the vector calculations can be fixed, it would be nice to
>>> at least have an hs_err file...
>>>
>>> Cheers,
>>> David
>>>
>>>
>>> Kind Regards, Thomas
>>>
>>>
>>> Kind Regards, Thomas
>>>
>>>
>>> David
>>>
>>>
>>> (not just a theory, we have a test case here where a stack
>>> overflow in native code just silently kills the process.)
>>>
>>> I guess it may work accidentally if the C-compiled code itself
>>> does some form of stack banging when establishing frames, in
>>> order to detect OS stack overflows? Very fuzzy here. But
>>> whatever the C-compiled code does, it has no notion about how
>>> much space we need to invoke the signal handler and handle
>>> stack overflows, no?
>>>
>>> When the red zone is hit, whatever the current thread state is,
>>> the red zone is disabled and VMError::report_and_die() is
>>> called, which should generate an hs_err file unless the
>>> generation of the error file requires more memory than the red
>>> zone provides.
>>>
>>> Fred
>>>
>>>
>>> Thanks, Thomas
>>>
>>>
>>>
>>>
>>> On 04/03/2017 02:08 PM, Thomas Stüfe wrote:
>>>
>>> Hi,
>>>
>>> Today we wondered what would happen when a stack overflow
>>> occurs in native code running in a Java thread (an attached
>>> thread or one created by the VM).
>>>
>>> In that case yellow and red pages are in place, but this would
>>> not help much, would it, because the native code would not do
>>> any stack banging?
>>>
>>> So, native code would hit the yellow page, and then there would
>>> probably not be enough space left on the stack to invoke the
>>> signal handler. The result would be immediate VM death - not
>>> even an hs_err file - is that correct?
>>>
>>> Also, we would hit our own yellow page, not the guard page the
>>> OS may or may not have established, so - on UNIX - this would
>>> show up as "Segmentation Fault", not "Stack Overflow", or?
>>>
>>> Thank you,
>>>
>>> Thomas
>>>
>>>
>>>
>>>
More information about the hotspot-runtime-dev
mailing list