8065585: Change ShouldNotReachHere() to never return
Mikael Gerdin
mikael.gerdin at oracle.com
Mon Apr 20 14:45:21 UTC 2015
On 2015-04-17 17:03, Stefan Karlsson wrote:
> On 2015-04-17 16:55, Mikael Gerdin wrote:
>> On 2015-04-17 14:52, Stefan Karlsson wrote:
>>>
>>>
>>> On 2015-04-17 13:49, Mikael Gerdin wrote:
>>>> On 2015-04-16 15:32, Stefan Karlsson wrote:
>>>>> On 2015-04-16 14:33, David Holmes wrote:
>>>>>> Hi Stefan,
>>>>>>
>>>>>> trimming ...
>>>>>>
>>>>>> On 16/04/2015 10:07 PM, Stefan Karlsson wrote:
>>>>>>> On 2015-04-16 04:23, David Holmes wrote:
>>>>>>>> Second, more important question: have you examined how this
>>>>>>>> attribute
>>>>>>>> affects the ability to walk the stack? We have already seen
>>>>>>>> issues on
>>>>>>>> some platforms where library functions, like abort(), have the
>>>>>>>> noreturn attribute and as a result the call is optimized in a way
>>>>>>>> that
>>>>>>>> prevents the stack from being walked - see eg:
>>>>>>>>
>>>>>>>> https://git.matricom.net/Firmware/bionic/commit/5f32207a3db0bea3ca1c7f4b2b563c11b895f276
>>>>>>>>
>>>>>>>> though this:
>>>>>>>>
>>>>>>>> https://www.raspberrypi.org/forums/viewtopic.php?t=60540&p=451729
>>>>>>>>
>>>>>>>> suggests that problem may have been addressed by the libc folk.
>>>>>>>> But it
>>>>>>>> still raises the question as to how our own noreturn functions
>>>>>>>> will be
>>>>>>>> handled and how they will affect stacktrace generation in hs_err
>>>>>>>> logs
>>>>>>>> or via gdb.
>>>>>>>
>>>>>>> I added a call to fatal(...) in the GC code. I get correct
>>>>>>> stacktraces
>>>>>>> in gdb, but the stacktraces in the hs_err files are broken with
>>>>>>> fastdebug and product builds:
>>>>>>
>>>>>> Which platforms?
>>>>>
>>>>> On Linux x86 and x86_64.
>>>>>
>>>>>>
>>>>>>> Stack: [0x00007f12518d2000,0x00007f12519d3000],
>>>>>>> sp=0x00007f12519d0eb0,
>>>>>>> free space=1019k
>>>>>>> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code,
>>>>>>> C=native code)
>>>>>>> V [libjvm.so+0x11db44a] VMError::report_and_die()+0x1ba
>>>>>>> V [libjvm.so+0x7efb80] report_vm_error(char const*, int, char
>>>>>>> const*,
>>>>>>> char const*)+0x90
>>>>>>> V [libjvm.so+0x7efc49] report_vm_error_noreturn(char const*, int,
>>>>>>> char
>>>>>>> const*, char const*)+0x9
>>>>>>> V [libjvm.so+0x7efc63]
>>>>>>> V [libjvm.so+0xfd7937]
>>>>>>> V [libjvm.so+0xfeec51]
>>>>>>> ...
>>>>>>
>>>>>> So what is the plan: try to get hs_err working again? Or file this
>>>>>> under "well it seemed like a good idea"? ;-)
>>>>>
>>>>> I'm leaning towards "seemed like a good idea", unless someone has an
>>>>> easy fix for these problems.
>>>>
>>>> I've been looking a bit at this. It's not the stack trace per se that
>>>> is broken, but the decoding of the function names is not working for
>>>> some of the callers of the noreturn functions.
>>>>
>>>> I tried this with report_fatal using -XX:ErrorHandlerTest=5 and got
>>>> the following:
>>>>
>>>> 0x7fb71ccd98d0 <report_fatal>: push %rbp
>>>> 0x7fb71ccd98d1 <report_fatal+1>: mov %rdx,%rcx
>>>> 0x7fb71ccd98d4 <report_fatal+4>: lea 0x9b4b34(%rip),%rdx
>>>> 0x7fb71ccd98db <report_fatal+11>: mov %rsp,%rbp
>>>> 0x7fb71ccd98de <report_fatal+14>: callq 0x7fb71ccd98c0
>>>> 0x7fb71ccd98e3: data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
>>>>
>>>> So the report_fatal frame has ...98e3 as its return address, but that
>>>> is actually outside the function and this causes dladdr() to return
>>>> NULL in dli_saddr and dli_sname.
>>>>
>>>> The JVM then attempts to decode using Decoder::decode but I wasn't
>>>> able to follow that code to understand why that fails.
>>>>
>>>> The same appears to happen for the caller of report_fatal
>>>> (controlled_crash in my case) but there I can't explain why dladdr
>>>> returns NULL values there.
>>>>
>>>> After these two functions the rest of the stack trace appears to be
>>>> correctly decoded.
>>>>
>>>> One approach could be to attempt to inject a "nop" at the end of
>>>> functions which call a "noreturn" function. This would hopefully make
>>>> the instruction after the call to the noreturn function part of the
>>>> caller and would make symbol decoding work.
>>>
>>> I found this mail thread:
>>> https://sourceware.org/bugzilla/show_bug.cgi?id=6522
>>>
>>> which blames the -fcross-jumping optimization.
>>>
>>> I recompiled hotspot with OPT_CFLAGS/debug.o=-fno-crossjumping, and now
>>> I get correct stack traces with fastdebug on Linux 64 bits.
>>
>> I did a more thorough investigation into this on a slowdebug build,
>> and the reason for the symbols missing appears to be that after the
>> JVM's ELF Decoder runs into an un-decodeable symbol because a return
>> PC points to a nop in-between two symbols (because it's just called a
>> noreturn function) the Decoder sets m_status to FileInvalid and
>> refuses to decode any more symbols.
>> If I comment out the code to set the fail status I get a fairly normal
>> hs err stacktrace:
>>
>> V [libjvm.so+0xf184c8] VMError::report(outputStream*)+0x133c
>> V [libjvm.so+0xf19865] VMError::report_and_die()+0x411
>> V [libjvm.so+0x7876de] report_vm_error(char const*, int, char
>> const*, char const*)+0xba
>> V [libjvm.so+0x7877d7] report_vm_error_noreturn(char const*, int,
>> char const*, char const*)+0x3d
>> V [libjvm.so+0x78781b] report_should_not_call(char const*, int)+0x0
>> V [libjvm.so+0x92bfeb]
>> V [libjvm.so+0x6e10ff] GenCollectorPolicy::mem_allocate_work(unsigned
>> long, bool, bool*)+0x283
>> V [libjvm.so+0x92c049] GenCollectedHeap::mem_allocate(unsigned long,
>> bool*)+0x5d
>> V [libjvm.so+0x45dbe5]
>> CollectedHeap::common_mem_allocate_noinit(KlassHandle, unsigned long,
>> Thread*)+0x103
>> V [libjvm.so+0x45dda2]
>> CollectedHeap::common_mem_allocate_init(KlassHandle, unsigned long,
>> Thread*)+0x4e
>> V [libjvm.so+0x45e034] CollectedHeap::array_allocate(KlassHandle,
>> int, int, Thread*)+0xac
>> V [libjvm.so+0xed2f04] TypeArrayKlass::allocate_common(int, bool,
>> Thread*)+0xf0
>> V [libjvm.so+0x44ae3e] TypeArrayKlass::allocate(int, Thread*)+0x3e
>> V [libjvm.so+0xcef2d5] oopFactory::new_typeArray(BasicType, int,
>> Thread*)+0x55
>> V [libjvm.so+0x9c5aa9] InterpreterRuntime::newarray(JavaThread*,
>> BasicType, int)+0x147
>> j alloc.AllocArrays.main([Ljava/lang/String;)V+237
>> v ~StubRoutines::call_stub
>> V [libjvm.so+0x9df121] JavaCalls::call_helper(JavaValue*,
>> methodHandle*, JavaCallArguments*, Thread*)+0x6b1
>> V [libjvm.so+0xd091d7] os::os_exception_wrapper(void (*)(JavaValue*,
>> methodHandle*, JavaCallArguments*, Thread*), JavaValue*,
>> methodHandle*, JavaCallArguments*, Thread*)+0x41
>> V [libjvm.so+0x9dea5a] JavaCalls::call(JavaValue*, methodHandle,
>> JavaCallArguments*, Thread*)+0x86
>> V [libjvm.so+0xa42306] jni_invoke_static(JNIEnv_*, JavaValue*,
>> _jobject*, JNICallType, _jmethodID*, JNI_ArgumentPusher*, Thread*)+0x200
>> V [libjvm.so+0xa5964a] jni_CallStaticVoidMethod+0x353
>> C [libjli.so+0x86ed] JavaMain+0x93c
>> C [libpthread.so.0+0x80a5] start_thread+0xc5
>>
>> One problem is the line
>> V [libjvm.so+0x78781b] report_should_not_call(char const*, int)+0x0
>> I actually added a call to fatal(), but since fatal calls a noreturn
>> function the return pc of that frame accidentally points to the first
>> instruction in the next function, which happens to be
>> report_should_not_call.
>>
>> I wonder if this could be fixed by forcing gcc to emit a nop after
>> the call to report_vm_error_noreturn in report_fatal and friends.
>> __asm__ __volatile__ ("nop" : : :);
>> appears to not be enough. GCC is very aggressive with noreturn, even
>> with -O0.
>
> And the reason why m_status was set to FileInvalid seems to be the bug
> in ElfSymbolTable::lookup, which returns true instead of false if it
> fails to find a symbol!:
>
> bool ElfSymbolTable::lookup(address addr, int* stringtableIndex, int*
> posIndex, int* offset, ElfFuncDescTable* funcDescTable) {
> ...
> return true;
> }
>
> The caller will then think that the symbol was found and use the
> uninitialized output parameters.
Excellent!
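To make the contract concrete, here is a minimal sketch of a lookup that
reports a miss properly. The Sym record and this free-standing lookup()
are hypothetical simplifications for illustration, not the actual
ElfSymbolTable code:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified symbol record; not HotSpot's ElfSymbolTable.
struct Sym {
  std::uintptr_t start;       // first address covered by the symbol
  std::size_t    size;        // number of bytes covered
  int            name_index;  // index into a (hypothetical) string table
};

// Returns true and fills the output parameters only when addr lies inside
// one of the symbols. On a miss it returns false and leaves the outputs
// untouched, so the caller cannot mistake garbage for a result.
bool lookup(const Sym* syms, std::size_t count, std::uintptr_t addr,
            int* name_index, int* offset) {
  for (std::size_t i = 0; i < count; i++) {
    const Sym& s = syms[i];
    if (addr >= s.start && addr - s.start < s.size) {
      *name_index = s.name_index;
      *offset = static_cast<int>(addr - s.start);
      return true;
    }
  }
  return false;  // not found, e.g. a return PC in padding between functions
}
```

With this shape, an address that falls between two symbols is a plain
"not found" and the caller can move on to the next frame instead of
marking the whole file as invalid.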
I've messed around a bit to try to work around the correctness problem
of the stack trace and I think I have a solution:
By inlining the call to noreturn_function in the macros wrapping
report_*, the return PC of the calling frame will never point into the
space between functions at the point of the stack walk operation. The
return PC will instead almost always point to a call to
noreturn_function, which will always be a part of the correct calling
function.
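Roughly, the idea looks like the sketch below. All names here
(report_error, SKETCH_FATAL, checked_divide, last_report) are made up
for illustration and are not the declarations in the webrev, and the
noreturn function throws only so that the sketch can be exercised; in
the VM it would terminate the process:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-ins; these are not the HotSpot declarations.
static std::string last_report;

// An ordinary, returning function that records the error. Because it can
// return, the compiler has to emit a valid continuation after each call.
inline void report_error(const char* file, int line, const char* msg) {
  last_report = std::string(file) + ":" + std::to_string(line) + ": " + msg;
}

// The only function marked noreturn. It throws here so the sketch is
// testable (an assumption of this example); a VM would abort instead.
[[noreturn]] inline void noreturn_function() {
  throw std::runtime_error("fatal error reported");
}

// The macro expands both calls at the use site. The return address pushed
// for report_error is then the following call to noreturn_function, which
// is an instruction inside the function that used the macro.
#define SKETCH_FATAL(msg)                      \
  do {                                         \
    report_error(__FILE__, __LINE__, (msg));   \
    noreturn_function();                       \
  } while (0)

// Hypothetical caller that uses the macro on its error path.
inline int checked_divide(int a, int b) {
  if (b == 0) {
    SKETCH_FATAL("division by zero");
  }
  return a / b;
}
```

The reporting function itself is no longer noreturn, so the stack walk
that happens inside it sees a return PC that belongs to the caller.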
Webrev (incremental on Stefan's changes):
http://cr.openjdk.java.net/~mgerdin/8065585/webrev.incr/
Full webrev (for completeness):
http://cr.openjdk.java.net/~mgerdin/8065585/webrev.full/
I've manually verified that a call to fatal() at nontrivial stack depth
will generate a correct stack trace on all Oracle supported platforms.
/Mikael
>
> StefanK
>
>>
>> /Mikael
>>
>>>
>>> StefanK
>>>>
>>>> /Mikael
>>>>
>>>>>
>>>>> Thanks,
>>>>> StefanK
>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>> David
>>>>>>
>>>>>>> Thanks,
>>>>>>> StefanK
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> David
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> StefanK
>>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>