Can't get hs_err log on native stack overflow on Linux

Thu Aug 11 01:30:55 PDT 2011

(2011/08/11 11:03), David Holmes wrote:
> Yasumasa Suenaga said the following on 08/10/11 17:16:
>>>> 2. signal mask setting
>>>>      In the current implementation, the SIGSEGV handler is registered with sa_mask filled by
>>>>      sigfillset(3). So, while the SIGSEGV handler runs, all other signals are blocked.
>>>>      However, JVM_handle_linux_signal() unblocks the current signal (including SIGSEGV).
>>>>      Thus, if we remove that sigprocmask(2) call, the alternate signal stack works fine (no stack conflict).
>>> Again I don't see how this solves the problem of two threads trying to
>>> use the same stack when they each receive a signal?
>>
>> Yes... You're right.
>>
>> I think a solution is to allocate the alternate stack area on the thread stack with alloca(3) in
>> the pthread entry point.
>> In this way, the alternate stack lives in the thread's own stack area, so stack memory
>> corruption will never happen.
>>
>> I made a small sample in C. Please check the source code attached to this email.
>> A weak point of this sample is that part of the native stack area is reserved for sigaltstack(2).
>> However, we can tune the thread stack size with -Xss.
>>
>> If this sample code is valid, I will rewrite the patch in this way.
> 
> I still don't see how this addresses the Linux bug. Without the patch
> you referred to, a thread created by native code from a Java thread will
> inherit the alternate-stack of the Java thread, regardless of where that
> alternate stack is located.

Hmm, that is a difficult problem.
The only approach I can think of is hooking pthread_create() with LD_PRELOAD.

In recent Linux kernels, this problem has been fixed.
I ran the test program on Fedora 15 (2.6.40-4.fc15.x86_64) and Ubuntu 11 (2.6.38-10-server),
and it works fine (the alternate signal stack was not inherited).

I also checked the source code of 2.6.40-4.fc15.x86_64, and the following patch has been applied:
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.39.y.git;a=commit;h=f9a3879abf2f1a27c39915e6074b8ff15a24cb55
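
For reference, here is a minimal sketch of the kind of check I mean. It is not
the attached test program, and the names (struct probe, query_altstack,
altstack_inherited_by_new_thread) are made up for illustration. The caller
installs an alternate signal stack, creates a thread, and the new thread only
queries its own setting with sigaltstack(NULL, ...).

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper type: what the new thread sees for its alternate stack. */
struct probe {
    stack_t ss;   /* filled in by sigaltstack(NULL, &ss) in the new thread */
    int rc;       /* return value of that sigaltstack() call */
};

/* Thread entry: query, without changing, this thread's alternate stack. */
static void *query_altstack(void *arg)
{
    struct probe *p = arg;
    p->rc = sigaltstack(NULL, &p->ss);
    return NULL;
}

/*
 * Returns 1 if a thread created by the caller inherits the caller's
 * alternate signal stack, 0 if it does not, and -1 on error.
 * Note: this replaces and then disables any alternate stack the caller
 * had installed before.
 */
static int altstack_inherited_by_new_thread(void)
{
    struct probe child;
    stack_t ss;
    pthread_t tid;
    int result = -1;
    void *mem = malloc(SIGSTKSZ);

    if (mem == NULL)
        return -1;

    /* Install an alternate signal stack for the calling thread. */
    memset(&ss, 0, sizeof(ss));
    ss.ss_sp = mem;
    ss.ss_size = SIGSTKSZ;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0) {
        free(mem);
        return -1;
    }

    memset(&child, 0, sizeof(child));
    child.rc = -1;
    if (pthread_create(&tid, NULL, query_altstack, &child) == 0 &&
        pthread_join(tid, NULL) == 0 && child.rc == 0) {
        /* Inherited means: enabled in the new thread and same memory. */
        int enabled = !(child.ss.ss_flags & SS_DISABLE);
        result = (enabled && child.ss.ss_sp == mem) ? 1 : 0;
    }

    /* Disable the caller's alternate stack before freeing its memory. */
    ss.ss_flags = SS_DISABLE;
    sigaltstack(&ss, NULL);
    free(mem);
    return result;
}
```

Calling altstack_inherited_by_new_thread() from main() and printing the result
(built with -pthread) should be enough. I would expect 0 on kernels that have
the fix above and 1 on affected kernels, but I have not verified this sketch
on an affected kernel.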

>>> The request is to get a hs_err log produced when native stack overflow occurs.
>>
>> Yes. And I expect "stack overflow" to be stated explicitly in the hs_err log.
> 
> Okay. Presently hotspot does not attempt to recognize such problems in
> native code and ignores faults in the yellow zone, for example. So we
> may need to consider an RFE to change this. Also when it does generate
> the hs_err file it doesn't indicate explicitly that it was a
> stackoverflow (the printf from the signal handler is isolated from the
> hs_err file).
> 
>>> So what path in the current VM logic would you expect to be followed such that
>>> the hs_err log is produced? Is it the red-zone fault detection logic?
>>
>> In most cases, I think that we can use the red-zone fault detection logic.
>>
>> Part of os_linux_x86.cpp:
>> /********************/
>>          } else if (thread->in_stack_red_zone(addr)) {
>>            // Fatal red zone violation.  Disable the guard pages and fall through
>>            // to handle_unexpected_exception way down below.
>>            thread->disable_stack_red_zone();
>>            tty->print_raw_cr("An irrecoverable stack overflow has occurred.");
>> /********************/
>>
>> I guess that the programmer who wrote this code expected the red zone to be used
>> ( disable_stack_red_zone() ).
>> However, the signal handler is never called. The detection logic itself is valid
>> (but incomplete: this logic can't detect a larger stack overflow).
> 
> In relation to your original test case, in the loop you allocated 1K per
> iteration. I don't see in this case how this is too large for the red
> zone detection. Granted we only use one red page by default but that
> should be at least 4K these days and likely bigger. Plus I just tried
> bumping up the red page count to 100 (with 512K stack) and it didn't
> change the behaviour!
> 
> So in my mind the current code is not working as I would (perhaps
> incorrectly) expect. On top of that the current code is limited in that
> it will only detect faults in the yellow/red zones and so will not cover
> all cases - but it may be that if the yellow/red zone detection actually
> works as I expect then we won't need to cover all cases.

I agree.
I think we only need to cover generating the hs_err log when a native stack
overflow faults in the yellow/red zone.

> Also please note that while I enjoy discussing these things that doesn't
> imply that this will automatically be added as either a bug or a RFE, or
> that I will be able to investigate why things do not seem to be working
> as I currently expect. Also I'm about to commence vacation until August
> 21 (though I will be checking email when feasible).

All right.
However, I would like this issue to be filed as an RFE.

I will also be on vacation from 13 Aug to 22 Aug.
I will follow hotspot-runtime-dev through the ML archive :-)

Have a good vacation!

Yasumasa

> Cheers,
> David
> 
>>
>> Thanks,
>>
>> Yasumasa
>>
>>
>> (2011/08/10 13:35), David Holmes wrote:
>>> Yasumasa Suenaga said the following on 08/10/11 13:06:
>>>> I found the following information. Is this what you mean?
>>>> http://us.generation-nt.com/patch-fix-sigaltstack-corruption-among-cloned-threads-help-180626641.html
>>>>
>>>>
>>>> If "Linux bug" means this, I think that we can approach it in 2 ways.
>>> This is one issue.
>>>
>>>> 1. Prepare an alternate signal stack for each thread
>>>>      We call sigaltstack(2) in the pthread entry point ( static void *java_start(Thread *thread) ),
>>>>      and register a memory free routine with pthread_cleanup_push() / pthread_cleanup_pop() .
>>>>      In this way, an alternate signal stack is available in each thread; however, we need more
>>>>      memory.
>>> I don't see how this helps with the bug referred to. If the Java thread
>>> calls native code that creates a native thread then they will share the
>>> same alternate-stack.
>>>
>>> The significance of the cost per thread must not be under-estimated.
>>> There are users out there that run their VMs on the edge of its limits.
>>> They tune stack sizes to maximize the number of threads and a sudden
>>> addition of 40K per thread would bring their applications crashing down.
>>> I realize your proposal is to control this via a flag (default off) but
>>> I wanted to stress the memory concern.
>>>
>>>> 2. signal mask setting
>>>>      In the current implementation, the SIGSEGV handler is registered with sa_mask filled by
>>>>      sigfillset(3). So, while the SIGSEGV handler runs, all other signals are blocked.
>>>>      However, JVM_handle_linux_signal() unblocks the current signal (including SIGSEGV).
>>>>      Thus, if we remove that sigprocmask(2) call, the alternate signal stack works fine (no stack conflict).
>>> Again I don't see how this solves the problem of two threads trying to
>>> use the same stack when they each receive a signal?
>>>
>>>
>>> Reading our internal info more carefully it seems that we were only
>>> using the alternate stack as a means of dealing with SEGVs that occurred
>>> as part of the stack-banging process, not as some more general
>>> stackoverflow management approach. So its use in that old form would
>>> not, I believe, correct the current situation.
>>>
>>> I need to take a step back here to clearly understand the actual
>>> problem. The request is to get a hs_err log produced when native stack
>>> overflow occurs. As I've said I'm not sure that the VM will even attempt
>>> to do that in general today. So what path in the current VM logic would
>>> you expect to be followed such that the hs_err log is produced? Is it
>>> the red-zone fault detection logic?
>>>
>>> Aside: another potential issue with alternate-stacks is that many signal
>>> handlers don't follow the rules about only calling signal-safe functions
>>> and some functions may depend on stack information that is invalidated
>>> if used on an alternate-stack. (For example the old LinuxThreads
>>> implementations used a shift of the stack address as the key to identify
>>> the current thread - not that we need be concerned with LinuxThreads
>>> these days, but applications might use similar tricks).
>>>
>>> But as I said I'd like to understand exactly what it is we think should
>>> be happening, and why it is not, before determining whether
>>> alternate-stacks is a potential solution or not.
>>>
>>> Thanks,
>>> David
>>>
>>>
>>>> /************************/
>>>>     // unmask current signal
>>>>     sigset_t newset;
>>>>     sigemptyset(&newset);
>>>>     sigaddset(&newset, sig);
>>>>     sigprocmask(SIG_UNBLOCK,&newset, NULL);
>>>>
>>>>     VMError err(t, sig, pc, info, ucVoid);
>>>>     err.report_and_die();
>>>>
>>>>     ShouldNotReachHere();
>>>> /************************/
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Yasumasa
>>>>
>>>> (2011/08/09 18:59), Coleen Phillimore wrote:
>>>>> To answer my own question, alternate signal stacks consumed more memory
>>>>> and decreased the number of threads that can be created (if I'm reading
>>>>> this correctly).
>>>>>
>>>>> Coleen
>>>>>
>>>>> On 8/9/2011 5:47 AM, Coleen Phillimore wrote:
>>>>>> To handle large native stacks, you have to increase the StackShadowPages
>>>>>> so that they cover the estimated size of the native stacks.
>>>>>> StackRedPages and StackYellowPages should stay the same. That's how the
>>>>>> design is supposed to work, and it should work correctly on linux x86
>>>>>> and arm. If you have an infinite recursion on native frames you should
>>>>>> see that in a core file, as you would in a C or C++ implementation. The
>>>>>> JVM is only trying to handle Java stack overflows and tolerate native
>>>>>> code mixed in.
>>>>>>
>>>>>> That said, I don't know why these linux alternate signal stacks were so
>>>>>> buggy or what versions of linux they were buggy on. Maybe it is worth
>>>>>> having this change if we can resolve it.
>>>>>>
>>>>>> Coleen
>>>>>>
>>>>>> On 8/9/2011 4:46 AM, Yasumasa Suenaga wrote:
>>>>>>> Hi, David,
>>>>>>>
>>>>>>> Thank you for checking the history.
>>>>>>>
>>>>>>>> What I can say is that the stack-banging that we do with the guard pages
>>>>>>>> was considered generally more reliable, and could be applied the same
>>>>>>>> way across all platforms. (The Solaris version also dropped all use of
>>>>>>>> alternate signal stacks for other reasons.)
>>>>>>> I've understood the history.
>>>>>>> I guess that is "-XX:AltStackSize" .
>>>>>>> http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
>>>>>>>
>>>>>>>
>>>>>>> However, at least, the VM stack guard page (RedZone: -XX:StackRedPages) does not
>>>>>>> work in the current implementation (on Linux x86 / AMD64). So, I think that we should
>>>>>>> fix this problem so that this function works.
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Yasumasa
>>>>>>>
>>>>>>> (2011/08/09 17:16), David Holmes wrote:
>>>>>>>> Well I was right about there being history and wrong about the nature of
>>>>>>>> the history. Seems we used alternate signal stacks on Linux up till 1.5
>>>>>>>> when it was explicitly dropped:
>>>>>>>>
>>>>>>>> 4852809: Linux: do not use alternate signal stack
>>>>>>>>
>>>>>>>> Unfortunately that bug is not public so I can't divulge the reasoning
>>>>>>>> behind the change.
>>>>>>>>
>>>>>>>> What I can say is that the stack-banging that we do with the guard pages
>>>>>>>> was considered generally more reliable, and could be applied the same
>>>>>>>> way across all platforms. (The Solaris version also dropped all use of
>>>>>>>> alternate signal stacks for other reasons.)
>>>>>>>>
>>>>>>>> David
>>>>>>>>
>>>>>>>> Yasumasa Suenaga said the following on 08/09/11 17:26:
>>>>>>>>> Hi, David,
>>>>>>>>> Thank you for replying.
>>>>>>>>>
>>>>>>>>> (2011/08/09 15:51), David Holmes wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I could be mistaken here but I believe the intent/hope is that any
>>>>>>>>>> stackoverflow will be caught when the guard pages set up by the VM are
>>>>>>>>>> accessed. In that way we haven't run out of true native stack and so we
>>>>>>>>>> can still process the signal that indicates the stack overflow. This is
>>>>>>>>>> not a perfect mechanism of course and there may be situations where you
>>>>>>>>>> can jump over the guard pages and truly exhaust the stack.
>>>>>>>>> Yes, I agree.
>>>>>>>>>
>>>>>>>>>> I also believe there is a bit of bad history here, where we had problems
>>>>>>>>>> trying to use alternative signal stacks on Linux. It will take me a bit
>>>>>>>>>> of archaeology to dig up relevant info on that.
>>>>>>>>> If you dig up any relevant info, please tell me.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> BTW, my patch provides a new VM option "UseAlternateSignalStack" .
>>>>>>>>> If this option is set to false, this patch (sigaltstack) has no effect.
>>>>>>>>>
>>>>>>>>> From a troubleshooting viewpoint, I want this function.
>>>>>>>>> If I can get an hs_err log on native stack overflow, I can confidently suggest
>>>>>>>>> expanding the stack area (-Xss).
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Yasumasa
>>>>>>>>>
>>>>>>>>>> David Holmes
>>>>>>>>>>
>>>>>>>>>> Yasumasa Suenaga said the following on 08/09/11 16:06:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I encountered a native stack overflow in JNI code on Linux (Fedora 15 and Ubuntu 11).
>>>>>>>>>>> I got a core dump image; however, I could not get an hs_err log.
>>>>>>>>>>>
>>>>>>>>>>> In the case of SIGSEGV, the hs_err log is generated in the signal handler. If a native
>>>>>>>>>>> stack overflow occurs, Linux has no stack area left to run the handler on. So, the
>>>>>>>>>>> SIGSEGV handler (JVM_handle_linux_signal) is never called.
>>>>>>>>>>>
>>>>>>>>>>> manpage of sigaltstack(2):
>>>>>>>>>>> /****************/
>>>>>>>>>>> NOTES
>>>>>>>>>>>             The most common usage of an alternate signal stack is to handle the SIGSEGV
>>>>>>>>>>>             signal that is generated if the space available for the normal process stack is
>>>>>>>>>>>             exhausted: in this case, a signal handler for SIGSEGV cannot be invoked on the
>>>>>>>>>>>             process stack; if we wish to handle it, we must use an alternate signal stack.
>>>>>>>>>>> /****************/
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> If this patch is applied, we can get an hs_err log on native stack overflow as follows:
>>>>>>>>>>>
>>>>>>>>>>> /****************/
>>>>>>>>>>> #
>>>>>>>>>>> #  SIGSEGV (0xb) at pc=0x00007fb23f1265f7, pid=25748, tid=140403650643712
>>>>>>>>>>> #  java.lang.StackOverflowError: Native stack
>>>>>>>>>>> #
>>>>>>>>>>> # JRE version: 8.0
>>>>>>>>>>> # Java VM: OpenJDK 64-Bit Server VM (22.0-b01 mixed mode linux-amd64 compressed oops)
>>>>>>>>>>> # Problematic frame:
>>>>>>>>>>> # C  [liboverflow.so+0x5f7]  Java_Main_doStackOverflow+0x3b
>>>>>>>>>>> /****************/
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I've attached the patch and a test case to this email. Please check them.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I would like to contribute this patch, and I hope to apply this patch to
>>>>>>>>>>> JDK 6 / 7 / 8.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I would appreciate your cooperation.
>>>>>>>>>>>
>>>>>>>>>>> Best regards,
>>>>>>>>>>> Yasumasa
>>>>>>>>>>>