RFR: 8215355: Object monitor deadlock with no threads holding the monitor (using jemalloc 5.1)

Mon Nov 18 17:36:46 UTC 2019

Hi David,

On 11/17/19 9:30 PM, David Holmes wrote:
> Bug: https://bugs.openjdk.java.net/browse/JDK-8215355
> webrev: http://cr.openjdk.java.net/~dholmes/8215355/webrev/

src/hotspot/share/runtime/thread.hpp
     Nice catch!

src/hotspot/share/runtime/thread.cpp
     Nice catch!

     Not your issue, but these two lines feel strange/wrong:

         L1008:   // Allow non Java threads to call this without stack_base
         L1009:   if (_stack_base == NULL) return true;

     When _stack_base is NULL, any 'adr' is in the caller's stack? The
     comment is not helping understand why this is so...

src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/JavaThread.java
     Nice catch!

     Again, not your issue, but these four lines are questionable:

         L383     Address sp      = lastSPDbg();
         L384     Address stackBase = getStackBase();
         L385     // Be robust
         L386     if (sp == null) return false;

     I can see why a NULL sp would cause a "false" return since obviously
     something is a amiss in the frame. However, the C++ code doesn't make
     this check so why does the SA code?

     And this code doesn't check stackBase == NULL so it's not matching
     the C++ code either.

Thumbs up on the change itself. My queries above and below might warrant
new bugs or RFEs to be filed.

>
> This was a very difficult bug to track down and I want to publicly 
> acknowledge and thank the jemalloc folk (users and developers) for 
> continuing to investigate this issue from their side. Without their 
> persistence this issue would have languished.

You also deserve thanks for sticking with this bug: Thanks David!!

> The thread stack_base() is the first address above the thread's stack. 
> However, the "in stack" checks performed by Thread::on_local_stack and 
> Thread::is_in_stack allowed the checked address to be equal to the 
> stack_base() - which is not correct. Here's how this manifests as the 
> bug:
>
> - Let a JavaThread instance, T2, be allocated at the end of thread 
> T1's stack i.e. at T1->stack_base()
>   [This seems to be why this only reproduced with jemalloc.]
> - Let T2 lock an inflated monitor
> - Let T1 try to lock the same monitor
>   - T1 would consider the _owner field value (T2) as being in its 
> stack and so consider the monitor stack-locked by T1
>   - And so both T1 and T2 would have ownership of the monitor allowing 
> the monitor state (and application state) to be corrupted. This 
> results in a range of hangs and crashes depending on the exact 
> interleaving.

Ouch!

So I was wondering how this bug could happen with the thread alignment
logic that we have in place... search for the _real_malloc_address stuff...

And then I noticed that the logic only kicks in when UseBiasedLocking == 
true
(and this bug says it doesn't happen with -XX:-UseBiasedLocking):

src/hotspot/share/runtime/thread.cpp:

// ======= Thread ========
// Support for forcing alignment of thread objects for biased locking
void* Thread::allocate(size_t size, bool throw_excpt, MEMFLAGS flags) {
   if (UseBiasedLocking) {
     const size_t alignment = markWord::biased_lock_alignment;
     size_t aligned_size = size + (alignment - sizeof(intptr_t));
     void* real_malloc_addr = throw_excpt? AllocateHeap(aligned_size, 
flags, CURRENT_PC)
                                           : AllocateHeap(aligned_size, 
flags, CURRENT_PC,
AllocFailStrategy::RETURN_NULL);
     void* aligned_addr     = align_up(real_malloc_addr, alignment);
     assert(((uintptr_t) aligned_addr + (uintptr_t) size) <=
            ((uintptr_t) real_malloc_addr + (uintptr_t) aligned_size),
            "JavaThread alignment code overflowed allocated storage");
     if (aligned_addr != real_malloc_addr) {
       log_info(biasedlocking)("Aligned thread " INTPTR_FORMAT " to " 
INTPTR_FORMAT,
                               p2i(real_malloc_addr),
                               p2i(aligned_addr));
     }
     ((Thread*) aligned_addr)->_real_malloc_address = real_malloc_addr;
     return aligned_addr;
   } else {
     return throw_excpt? AllocateHeap(size, flags, CURRENT_PC)
                        : AllocateHeap(size, flags, CURRENT_PC, 
AllocFailStrategy::RETURN_NULL);
   }
}

The logging logic above:

     if (aligned_addr != real_malloc_addr) {
       log_info(biasedlocking)("Aligned thread " INTPTR_FORMAT " to " 
INTPTR_FORMAT,
                               p2i(real_malloc_addr),
                               p2i(aligned_addr));
     }

allows for real_malloc_addr to be the same as aligned_addr sometimes
(and no log message is issued), but I'm not sure from spelunking in
code whether it's really possible for:

     void* aligned_addr     = align_up(real_malloc_addr, alignment);

to return aligned_addr == real_malloc_addr. In other words, if
real_malloc_addr is already aligned perfectly, does align_up() still
change that value?

If it is possible for (aligned_addr == real_malloc_addr), then it is
possible for this bug to happen without jemalloc.

I've convinced myself that this is possible because of this line:

     size_t aligned_size = size + (alignment - sizeof(intptr_t));

If real_malloc_addr is already aligned perfectly and align_up()
always changed the input address, then the aligned_size would be
too small by sizeof(intptr_t) and we would have seen a buffer
overwrite like that over the many, many years.

So my conclusion is that it should be possible for this bug to
happen without jemalloc, but it would have to be rare.

> Interestingly Thread::is_in_usable_stack does not have this bug.

So we have Thread::is_in_usable_stack(), Thread::on_local_stack() and
Thread::is_in_stack()? I haven't compared all three side by side, but
there might be some cleanup work that can be done here (in a different
bug).

>
> The bug can be tracked way back to JDK-6699669 as explained in the bug 
> report. That issue also showed that the same bug existed in the SA 
> implementations of these "on stack" checks.

Ouch! JDK-6699669 was fixed in jdk7-B56 and looks like it was pushed
to the jdk6u train... so this bug goes back quite a ways...

Outstanding hunt David!

Dan

>
> Testing:
>   - The reproducer from the bug report, using jemalloc, ran over 5000 
> times without failing in any way.
>   - tiers 1-3 on all Oracle platforms
>   - serviceability/sa tests
>
> Thanks,
> David
> -----