Need help on debugging JVM crash
Ioi Lam
ioi.lam at oracle.com
Tue Mar 3 20:12:04 UTC 2020
The crash happened while the GC was running. I disassembled the
crashing address, PCMarkAndPushClosure::do_oop(oopDesc**)+0x51 (from the
13.0.1+9 GA binaries of AdoptOpenJDK):
(gdb) x/80i _ZN20PCMarkAndPushClosure6do_oopEPP7oopDesc
<+0>:  push %rbp
<+1>:  mov %rsp,%rbp
<+4>:  push %r13
<+6>:  push %r12
<+8>:  push %rbx
<+9>:  sub $0x8,%rsp
<+13>: mov (%rsi),%rbx         ;;; rbx = oop
<+16>: test %rbx,%rbx          ;;; oop != null?
<+19>: je 0x7ffff67ca317 <_ZN20PCMarkAndPushClosure6do_oopEPP7oopDesc+87>
<+21>: lea 0x9c0afc(%rip),%rax # 0x7ffff718add8 <_ZN20ParCompactionManager12_mark_bitmapE>
<+28>: mov %rbx,%rcx           ;;; rcx = oop
<+31>: mov (%rax),%rdx         ;;; rdx = ParCompactionManager::_mark_bitmap
<+34>: sub (%rdx),%rcx         ;;; rcx = oop - _mark_bitmap->_region_start
<+37>: mov 0x10(%rdx),%rdx     ;;; rdx = _mark_bitmap->_beg_bits->_map
<+41>: mov %rcx,%rax           ;;; rax = oop - _mark_bitmap->_region_start
<+44>: lea 0x93b935(%rip),%rcx # 0x7ffff7105c28 <LogMinObjAlignment>
<+51>: shr $0x3,%rax
<+55>: mov (%rcx),%ecx
<+57>: shr %cl,%rax
<+60>: mov %rax,%rcx
<+63>: mov %rax,%rsi           ;;; rsi = index of oop inside mark_bitmap
<+66>: mov $0x1,%eax
<+71>: and $0x3f,%ecx
<+74>: shr $0x6,%rsi
<+78>: shl %cl,%rax
<+81>: test %rax,(%rdx,%rsi,8) ;;; <<< crash here
It looks like the oop we are trying to mark is outside the heap range,
so trying to mark it in the mark_bitmap causes this:
siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr:
0x0000000000000000
Here are the values of the registers at the "test" instruction above:
RAX=0x0000000000000001 is an unknown value
RDX=0x00007f55af000000 points into unknown readable memory: 01 00 00 00 01 00 00 04
RSI=0x007fffc05491d000 is an unknown value
As you can see, RSI (the word index into the bitmap) is far too large,
which means the frame contains an invalid oop value that lies well
outside the heap.
[libjvm.so+0xcc0121] PCMarkAndPushClosure::do_oop(oopDesc**)+0x51
[libjvm.so+0xc58c8b] OopMapSet::oops_do()
[libjvm.so+0x7521e9] frame::oops_do_internal()+0x99 <<<< HERE
[libjvm.so+0xf55757] JavaThread::oops_do()+0x187
As others have mentioned, this kind of error is usually caused by
invalid use of Unsafe or JNI that leads to heap corruption. However,
it is also plausible that the VM itself has somehow messed up the frame
and is trying to mark an invalid oop.
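Regarding your questions 2 and 3: rather than VerifyBeforeGC/VerifyAfterGC,
two cheaper experiments you could try (both are standard HotSpot/JDK
switches; the class and method below are just your suspect frame):

```shell
# Exclude the suspect constructor from JIT compilation (question 2);
# much cheaper than VerifyBeforeGC/VerifyAfterGC:
java -XX:CompileCommand='exclude,ch/qos/logback/classic/spi/ThrowableProxy::<init>' ...

# Catch many JNI misuses (bad handles, bad array regions) at the call
# site instead of at the next GC:
java -Xcheck:jni ...
```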
Thanks
- Ioi
On 3/3/20 8:02 AM, Sundara Mohan M wrote:
> Hi,
> I am seeing JVM crashes on our system in the GC thread with the
> parallel GC on x86 Linux. The same crash happens on JDK
> 11.0.6/13.0.2/13.0.1 GA builds.
> Adding some log lines to give some context.
>
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x00007f669c964311, pid=66684, tid=71106
> #
> # JRE version: OpenJDK Runtime Environment (13.0.1+9) (build 13.0.1+9)
> # Java VM: OpenJDK 64-Bit Server VM (13.0.1+9, mixed mode, tiered, parallel
> gc, linux-amd64)
> # Problematic frame:
> # V [libjvm.so+0xcd3311] PCMarkAndPushClosure::do_oop(oopDesc**)+0x51
> #
> # No core dump will be written. Core dumps have been disabled. To enable
> core dumping, try "ulimit -c unlimited" before starting Java again
> #
> # If you would like to submit a bug report, please visit:
> # https://github.com/AdoptOpenJDK/openjdk-build/issues
> #
>
> Host: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz, 48 cores, 125G, Red Hat
> Enterprise Linux Server release 6.10 (Santiago)
> Time: Thu Feb 6 11:43:48 2020 UTC elapsed time: 198626 seconds (2d 7h 10m
> 26s)
>
>
> Following is the stack trace
> ex1:
> Stack: [0x00007fd01cbdb000,0x00007fd01ccdb000], sp=0x00007fd01ccd8890,
> free space=1014k
> Native frames: (J=compiled Java code, A=aot compiled Java code,
> j=interpreted, Vv=VM code, C=native code)
> *V [libjvm.so+0xcc0121] PCMarkAndPushClosure::do_oop(oopDesc**)+0x51*
> V [libjvm.so+0xc58c8b] OopMapSet::oops_do(frame const*, RegisterMap
> const*, OopClosure*)+0x2eb
> V [libjvm.so+0x7521e9] frame::oops_do_internal(OopClosure*,
> CodeBlobClosure*, RegisterMap*, bool)+0x99
> V [libjvm.so+0xf55757] JavaThread::oops_do(OopClosure*,
> CodeBlobClosure*)+0x187
> V [libjvm.so+0xcbb100] ThreadRootsMarkingTask::do_it(GCTaskManager*,
> unsigned int)+0xb0
> V [libjvm.so+0x7e0f8b] GCTaskThread::run()+0x1eb
> V [libjvm.so+0xf5d43d] Thread::call_run()+0x10d
> V [libjvm.so+0xc74337] thread_native_entry(Thread*)+0xe7
>
> JavaThread 0x00007fbeb9209800 (nid = 82380) was being processed
> Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
> v ~RuntimeStub::_new_array_Java
> J 62465 c2
> ch.qos.logback.classic.spi.ThrowableProxy.<init>(Ljava/lang/Throwable;)V
> (207 bytes) @ 0x00007fd00ad43704 [0x00007fd00ad41420+0x00000000000022e4]
> J 474206 c2
> org.eclipse.jetty.util.log.JettyAwareLogger.log(Lorg/slf4j/Marker;ILjava/lang/String;[Ljava/lang/Object;Ljava/lang/Throwable;)V
> (134 bytes) @ 0x00007fd00c4e81ec [0x00007fd00c4e7ee0+0x000000000000030c]
> j
> org.eclipse.jetty.util.log.JettyAwareLogger.warn(Ljava/lang/String;Ljava/lang/Throwable;)V+7
> j
> org.eclipse.jetty.util.log.Slf4jLog.warn(Ljava/lang/String;Ljava/lang/Throwable;)V+6
> j
> org.eclipse.jetty.server.HttpChannel.handleException(Ljava/lang/Throwable;)V+181
> j
> org.eclipse.jetty.server.HttpChannelOverHttp.handleException(Ljava/lang/Throwable;)V+13
> J 64106 c2 org.eclipse.jetty.server.HttpChannel.handle()Z (997 bytes) @
> 0x00007fd00c6d2cd4 [0x00007fd00c6cdec0+0x0000000000004e14]
> J 280430 c2 org.eclipse.jetty.server.HttpConnection.onFillable()V (334
> bytes) @ 0x00007fd00da925f0 [0x00007fd00da91e40+0x00000000000007b0]
> J 41979 c2 org.eclipse.jetty.io.ChannelEndPoint$2.run()V (12 bytes) @
> 0x00007fd00a14f604 [0x00007fd00a14f4e0+0x0000000000000124]
> J 86362 c2 org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run()V
> (565 bytes) @ 0x00007fd0087d7e34 [0x00007fd0087d7cc0+0x0000000000000174]
> J 75998 c2 java.lang.Thread.run()V java.base@13.0.2 (17 bytes) @
> 0x00007fd00c93b8d8 [0x00007fd00c93b8a0+0x0000000000000038]
> v ~StubRoutines::call_stub
>
> ex2:
> Stack: [0x00007f669869f000,0x00007f669879f000], sp=0x00007f669879c890,
> free space=1014k
> Native frames: (J=compiled Java code, A=aot compiled Java code,
> j=interpreted, Vv=VM code, C=native code)
>
> *V [libjvm.so+0xcd3311] PCMarkAndPushClosure::do_oop(oopDesc**)+0x51*
> V [libjvm.so+0xc6bf0b] OopMapSet::oops_do(frame const*, RegisterMap const*,
> OopClosure*)+0x2eb
> V [libjvm.so+0x765489] frame::oops_do_internal(OopClosure*,
> CodeBlobClosure*, RegisterMap*, bool)+0x99
> V [libjvm.so+0xf68b17] JavaThread::oops_do(OopClosure*,
> CodeBlobClosure*)+0x187
> V [libjvm.so+0xcce2f0] ThreadRootsMarkingTask::do_it(GCTaskManager*,
> unsigned int)+0xb0
> V [libjvm.so+0x7f422b] GCTaskThread::run()+0x1eb
> V [libjvm.so+0xf707fd] Thread::call_run()+0x10d
> V [libjvm.so+0xc875b7] thread_native_entry(Thread*)+0xe7
>
> JavaThread 0x00007f5518004000 (nid = 75659) was being processed
> Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
> v ~RuntimeStub::_new_array_Java
> J 54174 c2
> ch.qos.logback.classic.spi.ThrowableProxy.<init>(Ljava/lang/Throwable;)V
> (207 bytes) @ 0x00007f6687d92678 [0x00007f6687d8c700+0x0000000000005f78]
> J 334031 c2
> com.xmas.webservice.exception.ExceptionLoggingWrapper.execute()V (1004
> bytes) @ 0x00007f6686ede430 [0x00007f6686edd580+0x0000000000000eb0]
> J 53431 c2
> com.xmas.webservice.exception.mapper.AbstractExceptionMapper.toResponse(Lcom/xmas/beans/exceptions/mapper/V3ErrorCode;Ljava/lang/Exception;)Ljavax/ws/rs/core/Response;
> (105 bytes) @ 0x00007f6687db88b0 [0x00007f6687db8660+0x0000000000000250]
> J 63819 c2
> com.xmas.webservice.exception.mapper.RequestBlockedExceptionMapper.toResponse(Ljava/lang/Exception;)Ljavax/ws/rs/core/Response;
> (9 bytes) @ 0x00007f6686a6ed9c [0x00007f6686a6ecc0+0x00000000000000dc]
> J 334032 c2
> com.xmas.webservice.filters.ResponseSerializationWorker.processException()Ljava/io/InputStream;
> (332 bytes) @ 0x00007f668992ad34 [0x00007f668992a840+0x00000000000004f4]
> J 403918 c2
> com.xmas.webservice.filters.ResponseSerializationWorker.execute()Z (272
> bytes) @ 0x00007f66869d67fc [0x00007f66869d5e80+0x000000000000097c]
> J 17530 c2 com.lafaspot.common.concurrent.internal.WorkerWrapper.execute()Z
> (208 bytes) @ 0x00007f66848b3708 [0x00007f66848b36a0+0x0000000000000068]
> J 31970% c2
> com.lafaspot.common.concurrent.internal.WorkerManagerOneThread.call()Lcom/lafaspot/common/concurrent/internal/WorkerManagerState;
> (486 bytes) @ 0x00007f668608dcb0 [0x00007f668608d5e0+0x00000000000006d0]
> j
> com.lafaspot.common.concurrent.internal.WorkerManagerOneThread.call()Ljava/lang/Object;+1
> J 4889 c1 java.util.concurrent.FutureTask.run()V java.base@13.0.1 (123
> bytes) @ 0x00007f667d0be604 [0x00007f667d0bdf80+0x0000000000000684]
> J 7487 c1
> java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V
> java.base@13.0.1 (187 bytes) @ 0x00007f667dd45854
> [0x00007f667dd44a60+0x0000000000000df4]
> J 7486 c1 java.util.concurrent.ThreadPoolExecutor$Worker.run()V
> java.base@13.0.1 (9 bytes) @ 0x00007f667d1f643c
> [0x00007f667d1f63c0+0x000000000000007c]
> J 7078 c1 java.lang.Thread.run()V java.base@13.0.1 (17 bytes) @
> 0x00007f667d1f2d74 [0x00007f667d1f2c40+0x0000000000000134]
> v ~StubRoutines::call_stub
>
> Not very frequent, but over ~90 days we saw ~120 crashes, all with the
> following signal:
> siginfo: si_signo: 11 (SIGSEGV), si_code: 128 (SI_KERNEL), si_addr:
> 0x0000000000000000
> This signal is generated when we try to access a non-canonical address
> on Linux.
>
> As suggested by Stefan in another thread, I tried adding
> VerifyBeforeGC/VerifyAfterGC, but it increases latency enough that the
> application does not survive our production traffic (requests time out
> and fail).
>
> Questions
> 1. When I looked at the source code for printing the stack trace, I see
> the following:
> https://github.com/openjdk/jdk11u/blob/master/src/hotspot/share/utilities/vmError.cpp#L696
> (prints the native stack trace)
> https://github.com/openjdk/jdk11u/blob/master/src/hotspot/share/utilities/vmError.cpp#L718
> (prints the Java thread stack trace if it was involved in the GC crash)
> a. How do you know this Java thread was involved in the JVM crash?
> b. Can I assume the Java thread printed after the native stack trace
> was the culprit?
> c. Since I am seeing the same frame (~RuntimeStub::_new_array_Java, J
> 54174 c2 ch.qos.logback.classic.spi.ThrowableProxy.<init>..) but a
> different stack trace in both crashes, can this be the root cause?
>
> 2. I am thinking of excluding compilation of the
> ch.qos.logback.classic.spi.ThrowableProxy class and running in
> production to see whether compilation of this method is the cause.
> Does that make sense?
>
> 3. Any other suggestions on debugging this further?
>
> TIA
> Sundar
More information about the hotspot-gc-dev mailing list