Need help on debugging JVM crash
Sundara Mohan M
m.sundar85 at gmail.com
Wed Mar 4 01:01:28 UTC 2020
Hi Ioi,
Thanks for the analysis.
On Tue, Mar 3, 2020 at 3:12 PM Ioi Lam <ioi.lam at oracle.com> wrote:
> The crash happened while the GC was running. I disassembled the code at
> the crashing address, PCMarkAndPushClosure::do_oop(oopDesc**)+0x51 (from
> the 13.0.1+9 GA binaries of AdoptOpenJDK):
>
> (gdb) x/80i _ZN20PCMarkAndPushClosure6do_oopEPP7oopDesc
> <>:    push %rbp
> <+1>:  mov %rsp,%rbp
> <+4>:  push %r13
> <+6>:  push %r12
> <+8>:  push %rbx
> <+9>:  sub $0x8,%rsp
> <+13>: mov (%rsi),%rbx          ;;; rbx = oop
> <+16>: test %rbx,%rbx           ;;; oop != null?
> <+19>: je 0x7ffff67ca317 <_ZN20PCMarkAndPushClosure6do_oopEPP7oopDesc+87>
> <+21>: lea 0x9c0afc(%rip),%rax  # 0x7ffff718add8 <_ZN20ParCompactionManager12_mark_bitmapE>
> <+28>: mov %rbx,%rcx            ;;; rcx = oop
> <+31>: mov (%rax),%rdx          ;;; rdx = ParCompactionManager::_mark_bitmap
> <+34>: sub (%rdx),%rcx          ;;; rcx = oop - _mark_bitmap->_region_start
> <+37>: mov 0x10(%rdx),%rdx      ;;; rdx = _mark_bitmap->_beg_bits->_map
> <+41>: mov %rcx,%rax            ;;; rax = oop - _mark_bitmap->_region_start
> <+44>: lea 0x93b935(%rip),%rcx  # 0x7ffff7105c28 <LogMinObjAlignment>
> <+51>: shr $0x3,%rax
> <+55>: mov (%rcx),%ecx
> <+57>: shr %cl,%rax
> <+60>: mov %rax,%rcx
> <+63>: mov %rax,%rsi            ;;; rsi = index of oop inside mark_bitmap
> <+66>: mov $0x1,%eax
> <+71>: and $0x3f,%ecx
> <+74>: shr $0x6,%rsi
> <+78>: shl %cl,%rax
> <+81>: test %rax,(%rdx,%rsi,8)  << crash
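>
> In C++ terms, the faulting sequence is roughly this (a simplified
> sketch reconstructed from the disassembly above, not the exact HotSpot
> source):
>
>   // Bit index of the oop inside the mark bitmap:
>   size_t bit = ((uintptr_t)obj - region_start) >> (3 + LogMinObjAlignment);
>   size_t word = bit >> 6;                    // which 64-bit word of the map
>   uint64_t mask = (uint64_t)1 << (bit & 63); // bit position within that word
>   if (map[word] & mask) ...                  // <+81>: the faulting "test"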
>
>
> It looks like the oop we are trying to mark is actually outside of the
> heap range, so trying to mark it in the mark_bitmap causes this:
>
>
> siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 0x0000000000000000
>
>
> Here are the values of the registers for the "test" instruction above:
>
>
> RAX=0x0000000000000001 is an unknown value
> RDX=0x00007f55af000000 points into unknown readable memory: 01 00 00 00 01 00 00 04
> RSI=0x007fffc05491d000 is an unknown value
>
>
> As you can see, RSI is very large, which means the stack being scanned
> contains an invalid oop whose value is far outside the heap.
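>
> To make this concrete: the faulting "test" accesses rdx + rsi*8, which
> works out to
>
>   0x00007f55af000000 + 0x007fffc05491d000 * 8 = 0x04007d58538e8000
>
> The upper bits (63:48) of that result are not a sign-extension of bit
> 47, so the address is non-canonical on x86-64. That is why the kernel
> reports SI_KERNEL with si_addr 0x0 instead of the actual faulting
> address.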
>
Can you please explain what "stack" means here? Is it a function's stack
variable or something the GC uses internally?
>
>
> [libjvm.so+0xcc0121] PCMarkAndPushClosure::do_oop(oopDesc**)+0x51
> [libjvm.so+0xc58c8b] OopMapSet::oops_do()
> [libjvm.so+0x7521e9] frame::oops_do_internal()+0x99 <<<< HERE
> [libjvm.so+0xf55757] JavaThread::oops_do()+0x187
>
>
> As others have mentioned, this kind of error is usually caused by
> invalid use of Unsafe or JNI that leads to heap corruption. However,
> it's also plausible that the VM has somehow messed up the frame and is
> trying to mark an invalid oop.
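>
> For example, a native method that writes past the end of a Java array
> can silently corrupt whatever object the GC allocated next to it, and
> the damage only surfaces at the next marking cycle. A contrived sketch
> (hypothetical class/method names, not from your code base):
>
>   #include <jni.h>
>   #include <cstring>
>
>   extern "C" JNIEXPORT void JNICALL
>   Java_Example_fill(JNIEnv* env, jclass, jbyteArray arr) {
>     jsize len = env->GetArrayLength(arr);
>     void* p = env->GetPrimitiveArrayCritical(arr, nullptr);
>     // Off-by-16 overwrite: if p is a direct pointer into the Java
>     // heap, this smashes the header/fields of the object placed
>     // right after arr, and a GC thread later crashes trying to
>     // mark the garbage "oops" it reads from there.
>     std::memset(p, 0, len + 16);
>     env->ReleasePrimitiveArrayCritical(arr, p, 0);
>   }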
>
I was trying to avoid using JNI calls to check whether they are the
cause, but that doesn't seem to be an option for now.
Can you think of any other way to get to the root cause of this?
> Thanks
> - Ioi
>
>
> On 3/3/20 8:02 AM, Sundara Mohan M wrote:
> > Hi,
> > I am seeing JVM crashes on our system in a GC thread with Parallel GC
> > on x86 Linux. I have observed the same crash on the JDK 11.0.6, 13.0.2,
> > and 13.0.1 GA builds.
> > Adding some log lines to give context.
> >
> > #
> > # A fatal error has been detected by the Java Runtime Environment:
> > #
> > # SIGSEGV (0xb) at pc=0x00007f669c964311, pid=66684, tid=71106
> > #
> > # JRE version: OpenJDK Runtime Environment (13.0.1+9) (build 13.0.1+9)
> > # Java VM: OpenJDK 64-Bit Server VM (13.0.1+9, mixed mode, tiered, parallel gc, linux-amd64)
> > # Problematic frame:
> > # V [libjvm.so+0xcd3311] PCMarkAndPushClosure::do_oop(oopDesc**)+0x51
> > #
> > # No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
> > #
> > # If you would like to submit a bug report, please visit:
> > # https://github.com/AdoptOpenJDK/openjdk-build/issues
> > #
> >
> > Host: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 48 cores, 125G, Red Hat Enterprise Linux Server release 6.10 (Santiago)
> > Time: Thu Feb 6 11:43:48 2020 UTC elapsed time: 198626 seconds (2d 7h 10m 26s)
> >
> >
> > Following are the stack traces from two of the crashes:
> > ex1:
> > Stack: [0x00007fd01cbdb000,0x00007fd01ccdb000], sp=0x00007fd01ccd8890, free space=1014k
> > Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
> > V [libjvm.so+0xcc0121] PCMarkAndPushClosure::do_oop(oopDesc**)+0x51
> > V [libjvm.so+0xc58c8b] OopMapSet::oops_do(frame const*, RegisterMap const*, OopClosure*)+0x2eb
> > V [libjvm.so+0x7521e9] frame::oops_do_internal(OopClosure*, CodeBlobClosure*, RegisterMap*, bool)+0x99
> > V [libjvm.so+0xf55757] JavaThread::oops_do(OopClosure*, CodeBlobClosure*)+0x187
> > V [libjvm.so+0xcbb100] ThreadRootsMarkingTask::do_it(GCTaskManager*, unsigned int)+0xb0
> > V [libjvm.so+0x7e0f8b] GCTaskThread::run()+0x1eb
> > V [libjvm.so+0xf5d43d] Thread::call_run()+0x10d
> > V [libjvm.so+0xc74337] thread_native_entry(Thread*)+0xe7
> > JavaThread 0x00007fbeb9209800 (nid = 82380) was being processed
> > Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
> > v ~RuntimeStub::_new_array_Java
> > J 62465 c2 ch.qos.logback.classic.spi.ThrowableProxy.<init>(Ljava/lang/Throwable;)V (207 bytes) @ 0x00007fd00ad43704 [0x00007fd00ad41420+0x00000000000022e4]
> > J 474206 c2 org.eclipse.jetty.util.log.JettyAwareLogger.log(Lorg/slf4j/Marker;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V (134 bytes) @ 0x00007fd00c4e81ec [0x00007fd00c4e7ee0+0x000000000000030c]
> > j org.eclipse.jetty.util.log.JettyAwareLogger.warn(Ljava/lang/String;Ljava/lang/Throwable;)V+7
> > j org.eclipse.jetty.util.log.Slf4jLog.warn(Ljava/lang/String;Ljava/lang/Throwable;)V+6
> > j org.eclipse.jetty.server.HttpChannel.handleException(Ljava/lang/Throwable;)V+181
> > j org.eclipse.jetty.server.HttpChannelOverHttp.handleException(Ljava/lang/Throwable;)V+13
> > J 64106 c2 org.eclipse.jetty.server.HttpChannel.handle()Z (997 bytes) @ 0x00007fd00c6d2cd4 [0x00007fd00c6cdec0+0x0000000000004e14]
> > J 280430 c2 org.eclipse.jetty.server.HttpConnection.onFillable()V (334 bytes) @ 0x00007fd00da925f0 [0x00007fd00da91e40+0x00000000000007b0]
> > J 41979 c2 org.eclipse.jetty.io.ChannelEndPoint$2.run()V (12 bytes) @ 0x00007fd00a14f604 [0x00007fd00a14f4e0+0x0000000000000124]
> > J 86362 c2 org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run()V (565 bytes) @ 0x00007fd0087d7e34 [0x00007fd0087d7cc0+0x0000000000000174]
> > J 75998 c2 java.lang.Thread.run()V java.base@13.0.2 (17 bytes) @ 0x00007fd00c93b8d8 [0x00007fd00c93b8a0+0x0000000000000038]
> > v ~StubRoutines::call_stub
> >
> > ex2:
> > Stack: [0x00007f669869f000,0x00007f669879f000], sp=0x00007f669879c890, free space=1014k
> > Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
> > V [libjvm.so+0xcd3311] PCMarkAndPushClosure::do_oop(oopDesc**)+0x51
> > V [libjvm.so+0xc6bf0b] OopMapSet::oops_do(frame const*, RegisterMap const*, OopClosure*)+0x2eb
> > V [libjvm.so+0x765489] frame::oops_do_internal(OopClosure*, CodeBlobClosure*, RegisterMap*, bool)+0x99
> > V [libjvm.so+0xf68b17] JavaThread::oops_do(OopClosure*, CodeBlobClosure*)+0x187
> > V [libjvm.so+0xcce2f0] ThreadRootsMarkingTask::do_it(GCTaskManager*, unsigned int)+0xb0
> > V [libjvm.so+0x7f422b] GCTaskThread::run()+0x1eb
> > V [libjvm.so+0xf707fd] Thread::call_run()+0x10d
> > V [libjvm.so+0xc875b7] thread_native_entry(Thread*)+0xe7
> >
> > JavaThread 0x00007f5518004000 (nid = 75659) was being processed
> > Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
> > v ~RuntimeStub::_new_array_Java
> > J 54174 c2 ch.qos.logback.classic.spi.ThrowableProxy.<init>(Ljava/lang/Throwable;)V (207 bytes) @ 0x00007f6687d92678 [0x00007f6687d8c700+0x0000000000005f78]
> > J 334031 c2 com.xmas.webservice.exception.ExceptionLoggingWrapper.execute()V (1004 bytes) @ 0x00007f6686ede430 [0x00007f6686edd580+0x0000000000000eb0]
> > J 53431 c2 com.xmas.webservice.exception.mapper.AbstractExceptionMapper.toResponse(Lcom/xmas/beans/exceptions/mapper/V3ErrorCode;Ljava/lang/Exception;)Ljavax/ws/rs/core/Response; (105 bytes) @ 0x00007f6687db88b0 [0x00007f6687db8660+0x0000000000000250]
> > J 63819 c2 com.xmas.webservice.exception.mapper.RequestBlockedExceptionMapper.toResponse(Ljava/lang/Exception;)Ljavax/ws/rs/core/Response; (9 bytes) @ 0x00007f6686a6ed9c [0x00007f6686a6ecc0+0x00000000000000dc]
> > J 334032 c2 com.xmas.webservice.filters.ResponseSerializationWorker.processException()Ljava/io/InputStream; (332 bytes) @ 0x00007f668992ad34 [0x00007f668992a840+0x00000000000004f4]
> > J 403918 c2 com.xmas.webservice.filters.ResponseSerializationWorker.execute()Z (272 bytes) @ 0x00007f66869d67fc [0x00007f66869d5e80+0x000000000000097c]
> > J 17530 c2 com.lafaspot.common.concurrent.internal.WorkerWrapper.execute()Z (208 bytes) @ 0x00007f66848b3708 [0x00007f66848b36a0+0x0000000000000068]
> > J 31970% c2 com.lafaspot.common.concurrent.internal.WorkerManagerOneThread.call()Lcom/lafaspot/common/concurrent/internal/WorkerManagerState; (486 bytes) @ 0x00007f668608dcb0 [0x00007f668608d5e0+0x00000000000006d0]
> > j com.lafaspot.common.concurrent.internal.WorkerManagerOneThread.call()Ljava/lang/Object;+1
> > J 4889 c1 java.util.concurrent.FutureTask.run()V java.base@13.0.1 (123 bytes) @ 0x00007f667d0be604 [0x00007f667d0bdf80+0x0000000000000684]
> > J 7487 c1 java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V java.base@13.0.1 (187 bytes) @ 0x00007f667dd45854 [0x00007f667dd44a60+0x0000000000000df4]
> > J 7486 c1 java.util.concurrent.ThreadPoolExecutor$Worker.run()V java.base@13.0.1 (9 bytes) @ 0x00007f667d1f643c [0x00007f667d1f63c0+0x000000000000007c]
> > J 7078 c1 java.lang.Thread.run()V java.base@13.0.1 (17 bytes) @ 0x00007f667d1f2d74 [0x00007f667d1f2c40+0x0000000000000134]
> > v ~StubRoutines::call_stub
> >
> > The crashes are not very frequent: ~120 crashes over ~90 days, all with the following signal:
> > siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr: 0x0000000000000000
> > This signal is generated when we try to access a non-canonical address on Linux.
> >
> > As suggested by Stefan in another thread, I tried adding
> > -XX:+VerifyAfterGC/-XX:+VerifyBeforeGC, but they increase latency so
> > much that the application does not survive our production traffic
> > (requests time out and fail).
> >
> > Questions
> > 1. Looking at the source code that prints the stack trace, I see the following:
> > https://github.com/openjdk/jdk11u/blob/master/src/hotspot/share/utilities/vmError.cpp#L696
> > (prints the native stack trace)
> > https://github.com/openjdk/jdk11u/blob/master/src/hotspot/share/utilities/vmError.cpp#L718
> > (prints a Java thread's stack trace if it was involved in the GC crash)
> > a. How do you know this Java thread was involved in the JVM crash?
> > b. Can I assume the Java thread printed after the native stack trace is the culprit?
> > c. Since I am seeing the same frames (~RuntimeStub::_new_array_Java, J 54174 c2 ch.qos.logback.classic.spi.ThrowableProxy.<init>..) but different stack traces in the two crashes, can this be the root cause?
> >
> > 2. I am thinking of excluding compilation of the
> > ch.qos.logback.classic.spi.ThrowableProxy constructor and running that
> > in production, to see whether compilation of this method is the cause
> > (see the flag note below question 3). Does that make sense?
> >
> > 3. Any other suggestions for debugging this further?
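> >
> > (For question 2: I assume the way to exclude that constructor from
> > compilation would be a flag along the lines of
> > -XX:CompileCommand=exclude,ch.qos.logback.classic.spi.ThrowableProxy::<init>
> > though I would need to double-check the exact CompileCommand syntax.)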
> >
> > TIA
> > Sundar
>
>
Thanks
Sundar