Need help on debugging JVM crash
Stefan Karlsson
stefan.karlsson at oracle.com
Wed Mar 4 08:37:03 UTC 2020
FWIW, I see that this is run with -XX:-OmitStackTraceInFastThrow.
Maybe there's a problem with that flag?
Some more info from the hs_err file that could give further clues to the problem:
The Java thread the GC is scanning is creating a ThrowableProxy, and is
in the process of taking a slow path to allocate an array. Looking at
the code, it seems like it first calls Throwable.getStackTrace(), and then
creates an array of proxies to those elements.
One of the hs_err files reports more than 800 OutOfMemoryErrors.
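For reference, the allocation pattern described above can be sketched as
follows (a simplified illustration, not logback's actual internals; the
class and method names here are made up for the example):

```java
// Sketch of the pattern: fetch the stack trace, then take a (possibly
// slow-path) array allocation sized to it, one proxy per element.
class StackTraceProxySketch {
    static Object[] proxy(Throwable t) {
        StackTraceElement[] elements = t.getStackTrace(); // first call
        Object[] proxies = new Object[elements.length];   // array allocation
        for (int i = 0; i < elements.length; i++) {
            proxies[i] = elements[i].toString();          // stand-in for a real proxy
        }
        return proxies;
    }
}
```

If the heap is nearly exhausted, that array allocation is exactly where an
OutOfMemoryError would surface, which fits the flood of OOMEs in the hs_err
file.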
StefanK
On 2020-03-03 21:12, Ioi Lam wrote:
> The crash happened while the GC was running. I disassembled the
> crashing address PCMarkAndPushClosure::do_oop(oopDesc**)+0x51 (from
> the 13.0.1+9 GA binaries of AdoptOpenJDK):
>
> (gdb) x/80i _ZN20PCMarkAndPushClosure6do_oopEPP7oopDesc
> <>:    push %rbp
> <+1>:  mov  %rsp,%rbp
> <+4>:  push %r13
> <+6>:  push %r12
> <+8>:  push %rbx
> <+9>:  sub  $0x8,%rsp
> <+13>: mov  (%rsi),%rbx          ;;; rbx = oop
> <+16>: test %rbx,%rbx            ;;; oop != null?
> <+19>: je   0x7ffff67ca317 <_ZN20PCMarkAndPushClosure6do_oopEPP7oopDesc+87>
> <+21>: lea  0x9c0afc(%rip),%rax  # 0x7ffff718add8 <_ZN20ParCompactionManager12_mark_bitmapE>
> <+28>: mov  %rbx,%rcx            ;;; rcx = oop
> <+31>: mov  (%rax),%rdx          ;;; rdx = ParCompactionManager::_mark_bitmap
> <+34>: sub  (%rdx),%rcx          ;;; rcx = oop - _mark_bitmap->_region_start
> <+37>: mov  0x10(%rdx),%rdx      ;;; rdx = _mark_bitmap->_beg_bits->_map
> <+41>: mov  %rcx,%rax            ;;; rax = oop - _mark_bitmap->_region_start
> <+44>: lea  0x93b935(%rip),%rcx  # 0x7ffff7105c28 <LogMinObjAlignment>
> <+51>: shr  $0x3,%rax
> <+55>: mov  (%rcx),%ecx
> <+57>: shr  %cl,%rax
> <+60>: mov  %rax,%rcx
> <+63>: mov  %rax,%rsi            ;;; rsi = index of oop inside mark_bitmap
> <+66>: mov  $0x1,%eax
> <+71>: and  $0x3f,%ecx
> <+74>: shr  $0x6,%rsi
> <+78>: shl  %cl,%rax
> <+81>: test %rax,(%rdx,%rsi,8)   ;;; << crash
>
>
> It looks like the oop we are trying to mark lies outside the heap
> range, so trying to mark it in the mark_bitmap causes this:
>
>
> siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr:
> 0x0000000000000000
>
>
> Here are the values of the registers for the "test" instruction above:
>
>
> RAX=0x0000000000000001 is an unknown value
> RDX=0x00007f55af000000 points into unknown readable memory: 01 00 00 00 01 00 00 04
> RSI=0x007fffc05491d000 is an unknown value
>
>
> As you can see, RSI is very large, which means there is an invalid
> oop on the thread's stack.
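The index arithmetic in the disassembly above can be sketched in Java for
illustration. The constants are assumptions for a 64-bit Parallel GC build:
8-byte HeapWords (the `shr $0x3`), 64 bits per map word (the `shr $0x6`),
and a LogMinObjAlignment of 0 standing in for the value the VM loads:

```java
// Illustrative mirror of the mark-bitmap addressing in do_oop().
class MarkBitmapIndexSketch {
    static final int LOG_MIN_OBJ_ALIGNMENT = 0; // assumed value

    // Bit number of the oop within the mark bitmap.
    static long bitNumber(long oopAddr, long regionStart) {
        return ((oopAddr - regionStart) >>> 3) >>> LOG_MIN_OBJ_ALIGNMENT;
    }

    // Index of the 64-bit word in _beg_bits->_map holding that bit
    // (the RSI value at the crashing `test` instruction).
    static long wordIndex(long oopAddr, long regionStart) {
        return bitNumber(oopAddr, regionStart) >>> 6;
    }

    // Mask selecting the bit within that word (the RAX value).
    static long mask(long oopAddr, long regionStart) {
        return 1L << (bitNumber(oopAddr, regionStart) & 0x3f);
    }
}
```

A garbage "oop" far outside the heap yields a word index terabytes past the
map, so the `test %rax,(%rdx,%rsi,8)` load faults, matching the SIGSEGV
with SI_KERNEL seen here.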
>
>
> [libjvm.so+0xcc0121] PCMarkAndPushClosure::do_oop(oopDesc**)+0x51
> [libjvm.so+0xc58c8b] OopMapSet::oops_do()
> [libjvm.so+0x7521e9] frame::oops_do_internal()+0x99 <<<< HERE
> [libjvm.so+0xf55757] JavaThread::oops_do()+0x187
>
>
> As others have mentioned, this kind of error is usually caused by
> invalid use of Unsafe or JNI that leads to heap corruption. However,
> it's plausible that somehow the VM has messed up the frame and tries
> to mark an invalid oop.
>
> Thanks
> - Ioi
>
>
> On 3/3/20 8:02 AM, Sundara Mohan M wrote:
>> Hi,
>> I am seeing JVM crashes in a GC thread with Parallel GC on x86 Linux.
>> The same crash happens on JDK 11.0.6, 13.0.2, and 13.0.1 GA builds.
>> Adding some log lines to give some context.
>>
>> #
>> # A fatal error has been detected by the Java Runtime Environment:
>> #
>> # SIGSEGV (0xb) at pc=0x00007f669c964311, pid=66684, tid=71106
>> #
>> # JRE version: OpenJDK Runtime Environment (13.0.1+9) (build 13.0.1+9)
>> # Java VM: OpenJDK 64-Bit Server VM (13.0.1+9, mixed mode, tiered,
>> parallel
>> gc, linux-amd64)
>> # Problematic frame:
>> # V [libjvm.so+0xcd3311] PCMarkAndPushClosure::do_oop(oopDesc**)+0x51
>> #
>> # No core dump will be written. Core dumps have been disabled. To enable
>> core dumping, try "ulimit -c unlimited" before starting Java again
>> #
>> # If you would like to submit a bug report, please visit:
>> # https://github.com/AdoptOpenJDK/openjdk-build/issues
>> #
>>
>> Host: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 48 cores, 125G, Red Hat
>> Enterprise Linux Server release 6.10 (Santiago)
>> Time: Thu Feb 6 11:43:48 2020 UTC elapsed time: 198626 seconds (2d 7h 10m 26s)
>>
>>
>> Following are the stack traces:
>> ex1:
>> Stack: [0x00007fd01cbdb000,0x00007fd01ccdb000], sp=0x00007fd01ccd8890,
>> free space=1014k
>> Native frames: (J=compiled Java code, A=aot compiled Java code,
>> j=interpreted, Vv=VM code, C=native code)
>> V [libjvm.so+0xcc0121] PCMarkAndPushClosure::do_oop(oopDesc**)+0x51
>> V [libjvm.so+0xc58c8b] OopMapSet::oops_do(frame const*, RegisterMap
>> const*, OopClosure*)+0x2eb
>> V [libjvm.so+0x7521e9] frame::oops_do_internal(OopClosure*,
>> CodeBlobClosure*, RegisterMap*, bool)+0x99
>> V [libjvm.so+0xf55757] JavaThread::oops_do(OopClosure*,
>> CodeBlobClosure*)+0x187
>> V [libjvm.so+0xcbb100] ThreadRootsMarkingTask::do_it(GCTaskManager*,
>> unsigned int)+0xb0
>> V [libjvm.so+0x7e0f8b] GCTaskThread::run()+0x1eb
>> V [libjvm.so+0xf5d43d] Thread::call_run()+0x10d
>> V [libjvm.so+0xc74337] thread_native_entry(Thread*)+0xe7
>>
>> JavaThread 0x00007fbeb9209800 (nid = 82380) was being processed
>> Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
>> v ~RuntimeStub::_new_array_Java
>> J 62465 c2
>> ch.qos.logback.classic.spi.ThrowableProxy.<init>(Ljava/lang/Throwable;)V
>> (207 bytes) @ 0x00007fd00ad43704 [0x00007fd00ad41420+0x00000000000022e4]
>> J 474206 c2
>> org.eclipse.jetty.util.log.JettyAwareLogger.log(Lorg/slf4j/Marker;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
>>
>> (134 bytes) @ 0x00007fd00c4e81ec [0x00007fd00c4e7ee0+0x000000000000030c]
>> j
>> org.eclipse.jetty.util.log.JettyAwareLogger.warn(Ljava/lang/String;Ljava/lang/Throwable;)V+7
>> j
>> org.eclipse.jetty.util.log.Slf4jLog.warn(Ljava/lang/String;Ljava/lang/Throwable;)V+6
>> j
>> org.eclipse.jetty.server.HttpChannel.handleException(Ljava/lang/Throwable;)V+181
>> j
>> org.eclipse.jetty.server.HttpChannelOverHttp.handleException(Ljava/lang/Throwable;)V+13
>> J 64106 c2 org.eclipse.jetty.server.HttpChannel.handle()Z (997 bytes) @
>> 0x00007fd00c6d2cd4 [0x00007fd00c6cdec0+0x0000000000004e14]
>> J 280430 c2 org.eclipse.jetty.server.HttpConnection.onFillable()V (334
>> bytes) @ 0x00007fd00da925f0 [0x00007fd00da91e40+0x00000000000007b0]
>> J 41979 c2 org.eclipse.jetty.io.ChannelEndPoint$2.run()V (12 bytes) @
>> 0x00007fd00a14f604 [0x00007fd00a14f4e0+0x0000000000000124]
>> J 86362 c2 org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run()V
>> (565 bytes) @ 0x00007fd0087d7e34 [0x00007fd0087d7cc0+0x0000000000000174]
>> J 75998 c2 java.lang.Thread.run()V java.base@13.0.2 (17 bytes) @
>> 0x00007fd00c93b8d8 [0x00007fd00c93b8a0+0x0000000000000038]
>> v ~StubRoutines::call_stub
>>
>> ex2:
>> Stack: [0x00007f669869f000,0x00007f669879f000], sp=0x00007f669879c890,
>> free space=1014k
>> Native frames: (J=compiled Java code, A=aot compiled Java code,
>> j=interpreted, Vv=VM code, C=native code)
>>
>> V [libjvm.so+0xcd3311] PCMarkAndPushClosure::do_oop(oopDesc**)+0x51
>> V [libjvm.so+0xc6bf0b] OopMapSet::oops_do(frame const*, RegisterMap
>> const*, OopClosure*)+0x2eb
>> V [libjvm.so+0x765489] frame::oops_do_internal(OopClosure*,
>> CodeBlobClosure*, RegisterMap*, bool)+0x99
>> V [libjvm.so+0xf68b17] JavaThread::oops_do(OopClosure*,
>> CodeBlobClosure*)+0x187
>> V [libjvm.so+0xcce2f0] ThreadRootsMarkingTask::do_it(GCTaskManager*,
>> unsigned int)+0xb0
>> V [libjvm.so+0x7f422b] GCTaskThread::run()+0x1eb
>> V [libjvm.so+0xf707fd] Thread::call_run()+0x10d
>> V [libjvm.so+0xc875b7] thread_native_entry(Thread*)+0xe7
>>
>> JavaThread 0x00007f5518004000 (nid = 75659) was being processed
>> Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
>> v ~RuntimeStub::_new_array_Java
>> J 54174 c2
>> ch.qos.logback.classic.spi.ThrowableProxy.<init>(Ljava/lang/Throwable;)V
>> (207 bytes) @ 0x00007f6687d92678 [0x00007f6687d8c700+0x0000000000005f78]
>> J 334031 c2
>> com.xmas.webservice.exception.ExceptionLoggingWrapper.execute()V (1004
>> bytes) @ 0x00007f6686ede430 [0x00007f6686edd580+0x0000000000000eb0]
>> J 53431 c2
>> com.xmas.webservice.exception.mapper.AbstractExceptionMapper.toResponse(Lcom/xmas/beans/exceptions/mapper/V3ErrorCode;Ljava/lang/Exception;)Ljavax/ws/rs/core/Response;
>>
>> (105 bytes) @ 0x00007f6687db88b0 [0x00007f6687db8660+0x0000000000000250]
>> J 63819 c2
>> com.xmas.webservice.exception.mapper.RequestBlockedExceptionMapper.toResponse(Ljava/lang/Exception;)Ljavax/ws/rs/core/Response;
>>
>> (9 bytes) @ 0x00007f6686a6ed9c [0x00007f6686a6ecc0+0x00000000000000dc]
>> J 334032 c2
>> com.xmas.webservice.filters.ResponseSerializationWorker.processException()Ljava/io/InputStream;
>>
>> (332 bytes) @ 0x00007f668992ad34 [0x00007f668992a840+0x00000000000004f4]
>> J 403918 c2
>> com.xmas.webservice.filters.ResponseSerializationWorker.execute()Z (272
>> bytes) @ 0x00007f66869d67fc [0x00007f66869d5e80+0x000000000000097c]
>> J 17530 c2
>> com.lafaspot.common.concurrent.internal.WorkerWrapper.execute()Z
>> (208 bytes) @ 0x00007f66848b3708 [0x00007f66848b36a0+0x0000000000000068]
>> J 31970% c2
>> com.lafaspot.common.concurrent.internal.WorkerManagerOneThread.call()Lcom/lafaspot/common/concurrent/internal/WorkerManagerState;
>>
>> (486 bytes) @ 0x00007f668608dcb0 [0x00007f668608d5e0+0x00000000000006d0]
>> j
>> com.lafaspot.common.concurrent.internal.WorkerManagerOneThread.call()Ljava/lang/Object;+1
>> J 4889 c1 java.util.concurrent.FutureTask.run()V java.base@13.0.1 (123
>> bytes) @ 0x00007f667d0be604 [0x00007f667d0bdf80+0x0000000000000684]
>> J 7487 c1
>> java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V
>> java.base@13.0.1 (187 bytes) @ 0x00007f667dd45854
>> [0x00007f667dd44a60+0x0000000000000df4]
>> J 7486 c1 java.util.concurrent.ThreadPoolExecutor$Worker.run()V
>> java.base@13.0.1 (9 bytes) @ 0x00007f667d1f643c
>> [0x00007f667d1f63c0+0x000000000000007c]
>> J 7078 c1 java.lang.Thread.run()V java.base@13.0.1 (17 bytes) @
>> 0x00007f667d1f2d74 [0x00007f667d1f2c40+0x0000000000000134]
>> v ~StubRoutines::call_stub
>>
>> Not very frequent, but ~120 crashes over ~90 days, all with the
>> following signal:
>> siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr:
>> 0x0000000000000000
>> This signal is generated when we try to access a non-canonical address
>> on Linux.
>>
>> As suggested by Stefan in another thread, I tried adding
>> -XX:+VerifyBeforeGC/-XX:+VerifyAfterGC, but they increase latency so
>> much that the application does not survive our production traffic
>> (requests time out and fail).
>>
>> Questions
>> 1. When I looked at the source code for printing the stack trace, I see:
>> https://github.com/openjdk/jdk11u/blob/master/src/hotspot/share/utilities/vmError.cpp#L696
>> (prints the native stack trace)
>> https://github.com/openjdk/jdk11u/blob/master/src/hotspot/share/utilities/vmError.cpp#L718
>> (prints the Java thread's stack trace if it was involved in the crash)
>> a. How do you know this Java thread was involved in the JVM crash?
>> b. Can I assume the Java thread printed after the native stack trace
>> was the culprit?
>> c. Since I am seeing the same frames (~RuntimeStub::_new_array_Java, J
>> 54174 c2 ch.qos.logback.classic.spi.ThrowableProxy.<init>...) but
>> different stack traces in both crashes, can this be the root cause?
>>
>> 2. I am thinking of excluding compilation of the
>> ch.qos.logback.classic.spi.ThrowableProxy constructor and running in
>> production to see if compiled code for this method is the cause. Does
>> that make sense?
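For that experiment, one way to exclude the method (a sketch using the
standard HotSpot `-XX:CompileCommand` option; `app.jar` is a placeholder
for the actual application, and the method signature should be verified
against the logback version in use) would be:

```shell
# Exclude JIT compilation of the ThrowableProxy constructor so it always
# runs interpreted; if the crashes stop, the compiled code is implicated.
java -XX:CompileCommand='exclude,ch.qos.logback.classic.spi.ThrowableProxy::<init>' \
     -jar app.jar
```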
>>
>> 3. Any other suggestion on debugging this further?
>>
>> TIA
>> Sundar
>
More information about the hotspot-gc-dev
mailing list