Crash in the MallocTracker::record_free

Mon Aug 15 16:04:38 UTC 2022

>From the hex dump, I think it possible that you try to delete a HandleMark,
and that crashes.

This corresponds with the code location in the stack, which is

```
V  [libjvm.so+0x1a6d77a]  Thread::~Thread()+0x8a
```

if gdb does not lie to me, this is near:

```
0x7ffff74f91a9 <Thread::~Thread()+137>:      callq  0x7ffff6843210
<HandleMark::operator delete(void*)>
```

He complains about a broken NMT header at 0x00007fba3f56ca90. So this is
the assumed NMT header:

```
0x00007fba3f56ca90:   00 00 00 00 00 00 00 00 98 d0 37 3f ba 7f 00 00
```

But this is what the caller actually freed. Since NMT header precedes the
allocation:

```
0x00007fba3f56caa0:   90 98 02 38 ba 7f 00 00   80 a3 02 38 ba 7f 00 00
0x00007fba3f56cab0:   e0 a3 02 38 ba 7f 00 00   f0 a3 02 38 ba 7f 00 00
0x00007fba3f56cac0:   c8 a4 02 38 ba 7f 00 00   d8 00 00 00 00 00 00 00
0x00007fba3f56cad0:   60 a6 02 38 ba 7f 00 00   a0 e4 89 3e ba 7f 00 00
0x00007fba3f56cae0:   90 98 02 38 ba 7f 00 00   90 ac 02 38 ba 7f 00 00
0x00007fba3f56caf0:   78 05 40 f0 b9 7f 00 00   00 00 00 00 00 00 00 00
0x00007fba3f56cb00:   00 00 00 00 00 00 00 00   00 00 00 00 00 00 00 00
```

HandleMark has seven pointer-sized members:

```
  Thread *_thread;              // thread that owns this mark
  HandleArea *_area;            // saved handle area
  Chunk *_chunk;                // saved arena chunk
  char *_hwm, *_max;            // saved arena info
  size_t _size_in_bytes;        // size of handle area
  // Link to previous active HandleMark in thread
  HandleMark* _previous_handle_mark;
```

If I interprete 0x00007fba3f56caa0 as HandleMark*:

```
0x00007fba3f56caa0:   90 98 02 38 ba 7f 00 00   80 a3 02 38 ba 7f 00 00
                      _thread                   _area
0x00007fba3f56cab0:   e0 a3 02 38 ba 7f 00 00   f0 a3 02 38 ba 7f 00 00
                      _chunk                    _hwm
0x00007fba3f56cac0:   c8 a4 02 38 ba 7f 00 00   d8 00 00 00 00 00 00 00
                      _max                      _size_in_bytes
0x00007fba3f56cad0:   60 a6 02 38 ba 7f 00 00
                      _previous_handle_mark
```

Note how this fits:

- The only 8-byte datum in this range is a size-like value for
_size_in_bytes (d8 = 216). All other slots look like pointers.

- _chunk (0x7fba3802a3e0), _hwm (0x7fba3802a3f0) and _max (0x7fba3802a4c8)
look like they fit an Arena of size 216 perfectly:
_max - _chunk = 232. A Chunk looks like this:

```
  //
  // +-----------+--+--------------------------------------------+
  // |           |g |                                            |
  // | Chunk     |a |               Payload                      |
  // |           |p |                                            |
  // +-----------+--+--------------------------------------------+
  // A           B  C                                            D
```

and it has two pointer-sized members, so sizeof(Chunk) = 16. (232 - 16)
gives you the arena size of 216.

Also, _hwm points to the beginning of the Arena (_chunk + 16). So this
looks like a mark for a HandleArea that has been cleared or never used.

-------------

I see two possibilities. Either someone overwrote the NMT header of the
block. Or, this HandleMark has never been allocated on the C-heap but lives
on the stack. In that case, something with
`Thread::thread->last_handle_mark()` could have gone wrong. Most HandleMark
instances live on the stack, the only ones that don't are the initial marks
created in `Thread::Thread()`.

-----

This is all pure hypotheses, since I don't have an hs-err file or any
information other than the stack and the hex dump.

Cheers, Thomas

On Mon, Aug 15, 2022 at 4:02 PM <daniel.daugherty at oracle.com> wrote:

> This discussion reminded me of one of the bugs that I filed last week:
>
>      JDK-8292318
> runtime/cds/appcds/dynamicArchive/methodHandles/MethodHandlesAsCollectorTest.java
>
> fails with "fatal error: NMT corruption: Block at <addr>: block address
> is unaligned"
>      https://bugs.openjdk.org/browse/JDK-8292318
>
> So there might some overlap here...
>
> Dan
>
>
>
> On 8/15/22 9:17 AM, David Holmes wrote:
> > On 15/08/2022 7:21 pm, Thomas Stüfe wrote:
> >>
> >>
> >> On Mon, Aug 15, 2022 at 11:16 AM David Holmes
> >> <david.holmes at oracle.com <mailto:david.holmes at oracle.com>> wrote:
> >>
> >>     On 15/08/2022 6:28 pm, Thomas Stüfe wrote:
> >>      > On Mon, Aug 15, 2022 at 10:27 AM Thomas Stüfe
> >>     <thomas.stuefe at gmail.com <mailto:thomas.stuefe at gmail.com>
> >>      > <mailto:thomas.stuefe at gmail.com
> >>     <mailto:thomas.stuefe at gmail.com>>> wrote:
> >>      >
> >>      >
> >>      >     On Mon, Aug 15, 2022 at 10:08 AM David Holmes
> >>      >     <david.holmes at oracle.com <mailto:david.holmes at oracle.com>
> >>     <mailto:david.holmes at oracle.com <mailto:david.holmes at oracle.com>>>
> >>     wrote:
> >>      >
> >>      >         Hi Thomas,
> >>      >
> >>      >         On 15/08/2022 3:42 pm, Thomas Stüfe wrote:
> >>      >          > ..The only way I could see this happening is if tty
> >>     itself is
> >>      >         not valid.
> >>      >          > Is this very early during initialization?
> >>      >
> >>      >         Quite the opposite, it is during VM termination.
> >>      >
> >>      >
> >>      >     Yes, that makes sense too. tty is defaultStream::instance,
> >>     which we
> >>      >     delete in ostream_exit() in DestroyVM.
> >>      >
> >>      >     Since this has bugging me for a while (more during
> >>      >     pre-initialization, but its the same problem) I opened
> >>      > https://bugs.openjdk.org/browse/JDK-8292351
> >>     <https://bugs.openjdk.org/browse/JDK-8292351>
> >>      >     <https://bugs.openjdk.org/browse/JDK-8292351
> >>     <https://bugs.openjdk.org/browse/JDK-8292351>> to track this.
> Should
> >>      >     be trivial to fix.
> >>
> >>     Perhaps trivial to not crash, but it means we are attempting various
> >>     "logging" actions after they can't be done - which is a pain.
> >> Afterall
> >>     we do want to see the real error here.
> >>
> >>
> >> That can happen though. As in your case, NMT telling us that the
> >> C-heap is corrupted during VM cleanup. So it is a valid scenario.
> >
> > Sure, I guess we can try and do the stream/tty deletion a bit later ...
> >
> >>      >
> >>      > As for your crash, if you comment out "delete
> >>     defaultStream::instance;"
> >>      > in ostream.cpp, you should be able to see the real issue.
> >>
> >>     Thanks - needed to do a little more than just that but I got the
> >> idea.
> >>     So that leads me to:
> >>
> >>     #  Internal Error
> >>
> (/scratch/users/daholme/jdk-dev3.git/open/src/hotspot/share/services/mallocHeader.inline.hpp:61),
> >>
> >>     pid=2666, tid=2668
> >>     #  fatal error: NMT corruption: Block at 0x00007fba3f56ca90: header
> >>     canary broken
> >>
> >>
> >> No hex dump?
> >
> > stdout: [NMT Block at 0x00007fba3f56ca90, corruption at:
> > 0x00007fba3f56ca90:
> > 0x00007fba3f56ca10:   e0 ca 56 3f ba 7f 00 00 f8 cc 56 3f ba 7f 00 00
> > 0x00007fba3f56ca20:   0a 00 00 00 00 00 00 00 78 05 40 f0 b9 7f 00 00
> > 0x00007fba3f56ca30:   40 03 7e 27 ba 7f 00 00 08 cc 56 3f ba 7f 00 00
> > 0x00007fba3f56ca40:   50 cb 56 3f ba 7f 00 00 f0 27 6f 3d ba 7f 00 00
> > 0x00007fba3f56ca50:   01 00 00 00 00 00 00 00 90 98 02 38 ba 7f 00 00
> > 0x00007fba3f56ca60:   02 00 00 00 00 00 00 00 a0 e4 89 3e ba 7f 00 00
> > 0x00007fba3f56ca70:   a0 ca 56 3f ba 7f 00 00 0a 00 00 00 0e 00 00 00
> > 0x00007fba3f56ca80:   40 03 7e 27 ba 7f 00 00 f8 cc 56 3f ba 7f 00 00
> > 0x00007fba3f56ca90:   00 00 00 00 00 00 00 00 98 d0 37 3f ba 7f 00 00
> > 0x00007fba3f56caa0:   90 98 02 38 ba 7f 00 00 80 a3 02 38 ba 7f 00 00
> > 0x00007fba3f56cab0:   e0 a3 02 38 ba 7f 00 00 f0 a3 02 38 ba 7f 00 00
> > 0x00007fba3f56cac0:   c8 a4 02 38 ba 7f 00 00 d8 00 00 00 00 00 00 00
> > 0x00007fba3f56cad0:   60 a6 02 38 ba 7f 00 00 a0 e4 89 3e ba 7f 00 00
> > 0x00007fba3f56cae0:   90 98 02 38 ba 7f 00 00 90 ac 02 38 ba 7f 00 00
> > 0x00007fba3f56caf0:   78 05 40 f0 b9 7f 00 00 00 00 00 00 00 00 00 00
> > 0x00007fba3f56cb00:   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >
> > David
> > -----
> >
> >>
> >>      >
> >>      >         Cheers,
> >>      >         David
> >>      >
> >>      >          > On Mon, Aug 15, 2022 at 7:40 AM Thomas Stüfe
> >>      >         <thomas.stuefe at gmail.com <mailto:thomas.stuefe at gmail.com
> >
> >>     <mailto:thomas.stuefe at gmail.com <mailto:thomas.stuefe at gmail.com>>
> >>      >          > <mailto:thomas.stuefe at gmail.com
> >>     <mailto:thomas.stuefe at gmail.com>
> >>      >         <mailto:thomas.stuefe at gmail.com
> >>     <mailto:thomas.stuefe at gmail.com>>>> wrote:
> >>      >          >
> >>      >          >     Hi David,
> >>      >          >
> >>      >          >     NMT detected a heap corruption or the pointer to
> >>     be freed is
> >>      >          >     invalid. And then the reporter itself crashed
> >> while
> >>      >         printing the
> >>      >          >     report. That is strange, since the reporter
> >> should be
> >>      >         resilient
> >>      >          >     against crashes. It uses os::print_hex_dump, which
> >>     should use
> >>      >          >     SafeFetch to check if the to-be-dumped info is
> >>     accessible.
> >>      >          >
> >>      >          >     Any more information?
> >>      >          >
> >>      >          >     Cheers, Thomas
> >>      >          >
> >>      >          >     On Mon, Aug 15, 2022 at 7:18 AM David Holmes
> >>      >          >     <david.holmes at oracle.com
> >>     <mailto:david.holmes at oracle.com> <mailto:david.holmes at oracle.com
> >>     <mailto:david.holmes at oracle.com>>
> >>      >         <mailto:david.holmes at oracle.com
> >>     <mailto:david.holmes at oracle.com>
> >>      >         <mailto:david.holmes at oracle.com
> >>     <mailto:david.holmes at oracle.com>>>> wrote:
> >>      >          >
> >>      >          >         I'm getting the below crash:
> >>      >          >
> >>      >          >         #
> >>      >          >         # A fatal error has been detected by the Java
> >>     Runtime
> >>      >         Environment:
> >>      >          >         #
> >>      >          >         #  SIGSEGV (0xb) at pc=0x00007f3e3e24b417,
> >>     pid=4537,
> >>      >         tid=4538
> >>      >          >         #
> >>      >          >         # JRE version: Java(TM) SE Runtime Environment
> >>     (20.0)
> >>      >         (fastdebug
> >>      >          >         build
> >>      >          >  20-internal-2022-06-10-0001400.daholme...)
> >>      >          >         # Java VM: Java HotSpot(TM) 64-Bit Server VM
> >>     (fastdebug
> >>      >          >  20-internal-2022-06-10-0001400.daholme..., mixed
> >>      >         mode, sharing,
> >>      >          >         tiered,
> >>      >          >         compressed oops, compressed class ptrs, g1 gc,
> >>      >         linux-amd64)
> >>      >          >         # Problematic frame:
> >>      >          >         # V  [libjvm.so+0x174b417]
> >>      >         outputStream::print_cr(char const*,
> >>      >          >         ...)+0x67
> >>      >          >
> >>      >          >         ---------------  T H R E A D ---------------
> >>      >          >
> >>      >          >         Current thread (0x00007f3e38029890):  Thread
> >>     "Unknown
> >>      >         thread"
> >>      >          >         [stack:
> >>      >          >  0x00007f3e3f8a9000,0x00007f3e3f9aa000] [id=4538]
> >>      >          >
> >>      >          >         Stack:
> >> [0x00007f3e3f8a9000,0x00007f3e3f9aa000],
> >>      >          >         sp=0x00007f3e3f9a84b0,
> >>      >          >         free space=1021k
> >>      >          >         Native frames: (J=compiled Java code,
> >>     j=interpreted,
> >>      >         Vv=VM code,
> >>      >          >         C=native code)
> >>      >          >         V  [libjvm.so+0x174b417]
> >> outputStream::print_cr(char
> >>      >         const*,
> >>      >          >         ...)+0x67
> >>      >          >         V  [libjvm.so+0x157d708]
> >>      >          >  MallocHeader::print_block_on_error(outputStream*,
> >>      >         unsigned
> >>      >          >         char*) const+0x38
> >>      >          >         V  [libjvm.so+0x157f2ef]
> >>      >         MallocHeader::assert_block_integrity()
> >>      >          >         const+0x10f
> >>      >          >         V  [libjvm.so+0x157eb6b]
> >>      >         MallocTracker::record_free(void*)+0x3b
> >>      >          >         V  [libjvm.so+0x172d3ef] os::free(void*)+0xbf
> >>      >          >         V  [libjvm.so+0x1a6d77a]
> >> Thread::~Thread()+0x8a
> >>      >          >         V  [libjvm.so+0x1060292]
> >> JavaThread::~JavaThread()+0x12
> >>      >          >         V  [libjvm.so+0x1a8b174]
> >> Threads::destroy_vm()+0x264
> >>      >          >
> >>      >          >         Any tips on trying to figure out what is going
> >>     wrong?
> >>      >          >
> >>      >          >         Thanks,
> >>      >          >         David
> >>      >          >
> >>      >
> >>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/hotspot-runtime-dev/attachments/20220815/78262011/attachment-0001.htm>