Can't get hs_err log on native stack overflow on Linux

David Holmes David.Holmes at oracle.com
Tue Aug 9 18:18:13 PDT 2011


There would need to be some further discussion about why the alternate 
signal stack was dropped before we could consider reinstating it. Things may 
not be quite as simple as they seem.

David

On 10/08/2011 10:54 AM, Yasumasa Suenaga wrote:
> Hi,
>
> I agree with David.
>
>
> BTW, I would like to explain the trouble my customer ran into.
>
> My customer runs a J2EE application (pure Java) on JBoss.
> The java process (on RHEL5 x86_64) that runs JBoss suddenly disappeared.
>
> I requested the hs_err log and a core image. However, the customer
> couldn't find an hs_err log.
> I got the core image and syslog (/var/log/messages), and checked the
> Java-level stack trace with jstack.
>
> /***************/
> Thread 11548: (state = IN_NATIVE)
> - java.net.SocketOutputStream.socketWrite0(java.io.FileDescriptor, byte[],
> int, int) @bci=0 (Interpreted frame)
> - java.net.SocketOutputStream.socketWrite(byte[], int, int) @bci=44, line=92
> (Interpreted frame)
> - java.net.SocketOutputStream.write(byte[], int, int) @bci=4, line=136
> (Interpreted frame)
> - oracle.net.ns.DataPacket.send(int) @bci=144, line=199 (Interpreted frame)
> - oracle.net.ns.NetOutputStream.flush() @bci=15, line=211 (Interpreted frame)
> - oracle.net.ns.NetInputStream.getNextPacket() @bci=41, line=227
> (Interpreted frame)
>
> :
>
> /***************/
>
> This thread has 397 Java frames!
> In the core image, the crashing instruction is a "MOV" which has the RSP
> register in its destination operand. The value of RSP points to a memory
> region that has no access permissions.
>
> /***************/
> Program Headers:
>   Type           Offset             VirtAddr           PhysAddr
>                  FileSiz            MemSiz              Flags  Align
>
> :
>
>   LOAD           0x0000000001fba000 0x0000000042e2e000 0x0000000000000000
>                  0x0000000000003000 0x0000000000003000         1000
>
> :
>
> /***************/
>
> Thus I was convinced that this crash was caused by a native stack overflow.
> I suggested increasing the stack size (-Xss), and the customer has not
> reproduced the problem since.
>
>
> My customer sets "-Xss128k" to reduce physical memory usage (for native
> thread stacks, not the Java heap) because JBoss may create thousands of
> threads.
>
>
> In this case, frankly speaking, the Java application is at fault :-p
> and I think that this is an unusual case.
> However, the Java class library has JNI implementations, such as network I/O.
> So this problem can happen anywhere in a pure Java application.
>
> Thus I made a patch and posted it, and I think that we should fix this
> problem so that this function works.
>
>
> Thanks,
>
> Yasumasa
>
>
> (2011/08/09 19:38), David Holmes wrote:
>> Dmitry Samersoff said the following on 08/09/11 19:30:
>>> Yasumasa,
>>>
>>> Try to increase the stack guard size with -XX:StackShadowPages=...
>>> It should work since 6u25 (hs20); see 6983240 for details.
>>
>> Changing the number of shadow pages has no effect here. It seems that when
>> native code consumes all the stack the VM does not trap it or report it:
>>
>> // Handle ALL stack overflow variations here
>> if (sig == SIGSEGV && info->si_code == SEGV_ACCERR) {
>>   address addr = (address) info->si_addr;
>>   if (thread->in_stack_yellow_zone(addr)) {
>>     thread->disable_stack_yellow_zone();
>>     if (thread->thread_state() == _thread_in_Java) {
>>       // Throw a stack overflow exception. Guard pages will be reenabled
>>       // while unwinding the stack.
>>       stub = SharedRuntime::continuation_for_implicit_exception(thread, pc,
>>                SharedRuntime::STACK_OVERFLOW);
>>     } else {
>>       // Thread was in the vm or native code. Return and try to finish.
>>       return true;
>>     }
>>   } else if (thread->in_stack_red_zone(addr)) {
>>     // Fatal red zone violation. Disable the guard pages and fall through
>>     // to handle_unexpected_exception way down below.
>>     thread->disable_stack_red_zone();
>>     tty->print_raw_cr("An irrecoverable stack overflow has occurred.");
>>   }
>>
>> If we hit the yellow zone while in native the signal handler just returns.
>> If we hit the red zone then we should enter fatal error handling but that
>> doesn't seem to happen. I'd need to trace through the signal code to see
>> exactly where we end up.
>>
>> David
>>
>>> -Dmitry
>>>
>>>
>>> On 2011-08-09 12:46, Yasumasa Suenaga wrote:
>>>> Hi, David,
>>>>
>>>> Thank you for checking the history.
>>>>
>>>>> What I can say is that the stack-banging that we do with the guard pages
>>>>> was considered generally more reliable, and could be applied the same
>>>>> way across all platforms. (The Solaris version also dropped all use of
>>>>> alternate signal stacks for other reasons.)
>>>>
>>>> I understand the history.
>>>> I guess that refers to "-XX:AltStackSize".
>>>> http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
>>>>
>>>>
>>>>
>>>> However, at least the VM stack guard page (red zone: -XX:StackRedPages)
>>>> does not work in the current implementation (on Linux x86 / AMD64). So I
>>>> think that we should fix this problem so that this function works.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Yasumasa
>>>>
>>>> (2011/08/09 17:16), David Holmes wrote:
>>>>> Well I was right about there being history and wrong about the nature of
>>>>> the history. Seems we used alternate signal stacks on Linux up till 1.5
>>>>> when it was explicitly dropped:
>>>>>
>>>>> 4852809: Linux: do not use alternate signal stack
>>>>>
>>>>> Unfortunately that bug is not public so I can't divulge the reasoning
>>>>> behind the change.
>>>>>
>>>>> What I can say is that the stack-banging that we do with the guard pages
>>>>> was considered generally more reliable, and could be applied the same
>>>>> way across all platforms. (The Solaris version also dropped all use of
>>>>> alternate signal stacks for other reasons.)
>>>>>
>>>>> David
>>>>>
>>>>> Yasumasa Suenaga said the following on 08/09/11 17:26:
>>>>>> Hi, David,
>>>>>> Thank you for replying.
>>>>>>
>>>>>> (2011/08/09 15:51), David Holmes wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I could be mistaken here but I believe the intent/hope is that any
>>>>>>> stackoverflow will be caught when the guard pages set up by the VM are
>>>>>>> accessed. In that way we haven't run out of true native stack and so we
>>>>>>> can still process the signal that indicates the stack overflow. This is
>>>>>>> not a perfect mechanism of course and there may be situations where you
>>>>>>> can jump over the guard pages and truly exhaust the stack.
>>>>>>
>>>>>> Yes, I agree.
>>>>>>
>>>>>>> I also believe there is a bit of bad history here, where we had problems
>>>>>>> trying to use alternative signal stacks on Linux. It will take me a bit
>>>>>>> of archaeology to dig up relevant info on that.
>>>>>>
>>>>>> If you've dug up relevant info, please tell me.
>>>>>>
>>>>>>
>>>>>> BTW, my patch provides a new VM option, "UseAlternateSignalStack".
>>>>>> If this option is set to false, the patch (sigaltstack) has no effect.
>>>>>>
>>>>>> From a troubleshooting point of view, I want this function.
>>>>>> If I can get an hs_err log on native stack overflow, I can confidently
>>>>>> suggest increasing the stack size (-Xss).
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Yasumasa
>>>>>>
>>>>>>> David Holmes
>>>>>>>
>>>>>>> Yasumasa Suenaga said the following on 08/09/11 16:06:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I encountered a native stack overflow in JNI code on Linux (Fedora 15
>>>>>>>> and Ubuntu 11).
>>>>>>>> I got a core dump image; however, I could not get an hs_err log.
>>>>>>>>
>>>>>>>> In the case of SIGSEGV, the hs_err log is generated in the signal
>>>>>>>> handler. If a native stack overflow has occurred, Linux cannot use the
>>>>>>>> stack area, so the SIGSEGV handler (JVM_handle_linux_signal) is never
>>>>>>>> called.
>>>>>>>>
>>>>>>>> manpage of sigaltstack(2):
>>>>>>>> /****************/
>>>>>>>> NOTES
>>>>>>>> The most common usage of an alternate signal stack is to handle the
>>>>>>>> SIGSEGV signal that is generated if the space available for the normal
>>>>>>>> process stack is exhausted: in this case, a signal handler for SIGSEGV
>>>>>>>> cannot be invoked on the process stack; if we wish to handle it, we
>>>>>>>> must use an alternate signal stack.
>>>>>>>> /****************/
>>>>>>>>
>>>>>>>>
>>>>>>>> If this patch is applied, we can get hs_err log on native stack
>>>>>>>> overflow as follows:
>>>>>>>>
>>>>>>>> /****************/
>>>>>>>> #
>>>>>>>> # SIGSEGV (0xb) at pc=0x00007fb23f1265f7, pid=25748, tid=140403650643712
>>>>>>>> # java.lang.StackOverflowError: Native stack
>>>>>>>> #
>>>>>>>> # JRE version: 8.0
>>>>>>>> # Java VM: OpenJDK 64-Bit Server VM (22.0-b01 mixed mode linux-amd64 compressed oops)
>>>>>>>> # Problematic frame:
>>>>>>>> # C [liboverflow.so+0x5f7] Java_Main_doStackOverflow+0x3b
>>>>>>>> /****************/
>>>>>>>>
>>>>>>>>
>>>>>>>> I've attached the patch and a test case to this email. Please check them.
>>>>>>>>
>>>>>>>>
>>>>>>>> I would like to contribute this patch, and I hope it can be applied to
>>>>>>>> JDK 6 / 7 / 8.
>>>>>>>>
>>>>>>>>
>>>>>>>> Please cooperate.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Yasumasa
>>>>>>>>
>>>>>>
>>>
>>>
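
For readers following the thread, the sigaltstack(2) idea quoted from the man
page can be sketched in plain C. This is only a minimal illustration under
stated assumptions, not the patch that was posted: the names on_segv,
install_alt_stack_handler, recurse and overflow_is_caught are hypothetical, the
64 KiB alternate stack size is an arbitrary choice, and recovering with
siglongjmp is purely a demonstration device (the thread describes the patched
VM as writing an hs_err log instead). It also assumes the process has a finite
stack limit.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* for sigaltstack, sigaction, sigsetjmp */
#endif
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

static sigjmp_buf recover_point;                 /* where the handler jumps back to */
static volatile sig_atomic_t overflow_seen = 0;  /* set by the handler */

/* Runs on the alternate stack because the handler is installed with SA_ONSTACK. */
static void on_segv(int sig, siginfo_t *info, void *ctx) {
    static const char msg[] = "SIGSEGV handled on alternate stack\n";
    (void)sig; (void)info; (void)ctx;
    overflow_seen = 1;
    if (write(STDERR_FILENO, msg, sizeof msg - 1) < 0) {
        /* nothing useful to do if the diagnostic cannot be written */
    }
    siglongjmp(recover_point, 1);   /* demo only: unwind back to the caller */
}

/* Registers an alternate signal stack and a SIGSEGV handler that uses it. */
static int install_alt_stack_handler(void) {
    static char alt_stack[64 * 1024];   /* arbitrary size, assumed sufficient */
    stack_t ss;
    struct sigaction sa;

    ss.ss_sp = alt_stack;
    ss.ss_size = sizeof alt_stack;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0) {
        return -1;
    }

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO | SA_ONSTACK;   /* without SA_ONSTACK the handler
                                                would need the exhausted stack */
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}

/* Deliberately unbounded, non-tail recursion that exhausts the normal stack. */
static int recurse(int depth) {
    volatile char pad[1024];
    pad[0] = (char)depth;
    return recurse(depth + 1) + pad[0];
}

/* Returns 1 if the overflow was observed by the handler, 0 otherwise. */
static int overflow_is_caught(void) {
    if (install_alt_stack_handler() != 0) {
        return 0;
    }
    if (sigsetjmp(recover_point, 1) == 0) {
        recurse(0);   /* never returns normally */
    }
    return overflow_seen;
}
```

Calling overflow_is_caught() from main is enough to try it. If SA_ONSTACK is
removed from sa_flags, the kernel has no usable stack on which to run the
handler, and the process is simply killed, which matches the behaviour reported
at the start of the thread.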

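The yellow-zone / red-zone decision in the handler excerpt quoted in the thread
can be modelled as a small pure function. This is a simplified sketch, not
HotSpot code: the type and function names are hypothetical, the zone sizes in
any example are illustrative, and the layout (stack growing downward, red zone
at the lowest addresses with the yellow zone directly above it) is an
assumption about how the guard pages are arranged.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model types; none of these names come from HotSpot. */
typedef enum { ZONE_NONE, ZONE_YELLOW, ZONE_RED } zone_t;

typedef enum {
    ACTION_NOT_OVERFLOW,        /* fault address is outside the guard zones */
    ACTION_THROW_SOE,           /* yellow zone hit while running Java code */
    ACTION_RETURN_AND_CONTINUE, /* yellow zone hit in VM or native code */
    ACTION_FATAL_ERROR          /* red zone hit: fatal error handling */
} action_t;

typedef struct {
    uintptr_t stack_base;   /* highest address; the stack grows downward */
    size_t    stack_size;   /* total size, e.g. the -Xss value */
    size_t    red_size;     /* assumed to sit at the lowest addresses */
    size_t    yellow_size;  /* assumed to sit directly above the red zone */
} stack_layout_t;

/* Classifies a faulting address against the assumed guard-zone layout. */
static zone_t classify(const stack_layout_t *s, uintptr_t addr) {
    uintptr_t stack_end  = s->stack_base - s->stack_size;
    uintptr_t red_top    = stack_end + s->red_size;
    uintptr_t yellow_top = red_top + s->yellow_size;

    if (addr >= stack_end && addr < red_top) {
        return ZONE_RED;
    }
    if (addr >= red_top && addr < yellow_top) {
        return ZONE_YELLOW;
    }
    return ZONE_NONE;
}

/* Mirrors the branch structure of the quoted excerpt. */
static action_t decide(const stack_layout_t *s, uintptr_t fault_addr, int in_java) {
    switch (classify(s, fault_addr)) {
    case ZONE_YELLOW:
        return in_java ? ACTION_THROW_SOE : ACTION_RETURN_AND_CONTINUE;
    case ZONE_RED:
        return ACTION_FATAL_ERROR;
    default:
        return ACTION_NOT_OVERFLOW;
    }
}
```

In this model a native thread that hits the yellow zone gets
ACTION_RETURN_AND_CONTINUE, which corresponds to the "return true" branch
discussed in the thread; only a red-zone hit leads to fatal error handling, and
that step still requires the signal handler to be able to run at all.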

More information about the hotspot-runtime-dev mailing list