RFR: 8215355: Object monitor deadlock with no threads holding the monitor (using jemalloc 5.1)

Mon Nov 18 13:25:48 UTC 2019

Hi Thomas,

Thanks for taking a look.

On 18/11/2019 9:58 pm, Thomas Stüfe wrote:
> This is evil :)
> 
> There might be more cases like this, e.g.
> 
> frame_x86.cpp  frame::is_interpreted_frame_valid():
> 
> if (locals > thread->stack_base() || locals < (address) fp()) return false;

Yes that might be a case where >= should be in use. I'll file another 
bug to check uses of stack_base().

> Also, I would have thought the little alloca() dance we do at the start 
> of thread_native_entry() would push the first real frame down the stack.

I know nothing of that code. :)

> The fix looks good.

Thanks!

David
-----

> Cheers, Thomas
> 
> 
> 
> On Mon, Nov 18, 2019 at 3:31 AM David Holmes <david.holmes at oracle.com 
> <mailto:david.holmes at oracle.com>> wrote:
> 
>     Bug: https://bugs.openjdk.java.net/browse/JDK-8215355
>     webrev: http://cr.openjdk.java.net/~dholmes/8215355/webrev/
> 
>     This was a very difficult bug to track down and I want to publicly
>     acknowledge and thank the jemalloc folk (users and developers) for
>     continuing to investigate this issue from their side. Without their
>     persistence this issue would have languished.
> 
>     The thread stack_base() is the first address above the thread's stack.
>     However, the "in stack" checks performed by Thread::on_local_stack and
>     Thread::is_in_stack allowed the checked address to be equal to the
>     stack_base() - which is not correct. Here's how this manifests as
>     the bug:
> 
>     - Let a JavaThread instance, T2, be allocated at the end of thread T1's
>     stack i.e. at T1->stack_base()
>         [This seems to be why this only reproduced with jemalloc.]
>     - Let T2 lock an inflated monitor
>     - Let T1 try to lock the same monitor
>         - T1 would consider the _owner field value (T2) as being in its
>     stack
>     and so consider the monitor stack-locked by T1
>         - And so both T1 and T2 would have ownership of the monitor
>     allowing
>     the monitor state (and application state) to be corrupted. This results
>     in a range of hangs and crashes depending on the exact interleaving.
> 
>     Interestingly Thread::is_in_usable_stack does not have this bug.
> 
>     The bug can be tracked way back to JDK-6699669 as explained in the bug
>     report. That issue also showed that the same bug existed in the SA
>     implementations of these "on stack" checks.
> 
>     Testing:
>         - The reproducer from the bug report, using jemalloc, ran over 5000
>     times without failing in any way.
>         - tiers 1-3 on all Oracle platforms
>         - serviceability/sa tests
> 
>     Thanks,
>     David
>     -----
>