RFR: 8247729: GetObjectMonitorUsage() might return inconsistent information

Wed Jun 17 15:42:04 UTC 2020

Hi Yasumasa,

This fix is not enough.
The function JvmtiEnvBase::get_object_monitor_usage works in two modes: 
in VMop and non-VMop.
The non-VMop mode has to be removed.

Thanks,
Serguei

On 6/17/20 02:18, Yasumasa Suenaga wrote:
> (Change subject for RFR)
>
> Hi,
>
> I filed it to JBS and upload a webrev for it.
> Could you review it?
>
>   JBS: https://bugs.openjdk.java.net/browse/JDK-8247729
>   webrev: http://cr.openjdk.java.net/~ysuenaga/JDK-8247729/webrev.00/
>
> This change has passed tests on submit repo.
> Also I tested it with serviceability/jvmti and vmTestbase/nsk/jvmti on 
> Linux x64.
>
>
> Thanks,
>
> Yasumasa
>
>
> On 2020/06/17 14:37, serguei.spitsyn at oracle.com wrote:
>> Yes. It seems we have a consensus.
>> Thank you for taking care about it.
>>
>> Thanks,
>> Serguei
>>
>>
>> On 6/16/20 18:34, David Holmes wrote:
>>>> Ok, may I file it to JBS and fix it? 
>>>
>>> Go for it! :)
>>>
>>> Cheers,
>>> David
>>>
>>> On 17/06/2020 10:23 am, Yasumasa Suenaga wrote:
>>>> On 2020/06/17 8:47, serguei.spitsyn at oracle.com wrote:
>>>>> Hi Dan, David and Yasumasa,
>>>>>
>>>>>
>>>>> On 6/16/20 07:39, Daniel D. Daugherty wrote:
>>>>>> On 6/15/20 9:28 PM, David Holmes wrote:
>>>>>>> On 16/06/2020 10:57 am, Daniel D. Daugherty wrote:
>>>>>>>> On 6/15/20 7:19 PM, David Holmes wrote:
>>>>>>>>> On 16/06/2020 8:40 am, Daniel D. Daugherty wrote:
>>>>>>>>>> On 6/15/20 6:14 PM, David Holmes wrote:
>>>>>>>>>>> Hi Dan,
>>>>>>>>>>>
>>>>>>>>>>> On 15/06/2020 11:38 pm, Daniel D. Daugherty wrote:
>>>>>>>>>>>> On 6/15/20 3:26 AM, David Holmes wrote:
>>>>>>>>>>>>> On 15/06/2020 4:02 pm, Yasumasa Suenaga wrote:
>>>>>>>>>>>>>> Hi David,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2020/06/15 14:15, David Holmes wrote:
>>>>>>>>>>>>>>> Hi Yasumasa,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 15/06/2020 2:49 pm, Yasumasa Suenaga wrote:
>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I wonder why JvmtiEnvBase::get_object_monitor_usage() 
>>>>>>>>>>>>>>>> (implementation of GetObjectMonitorUsage()) does not 
>>>>>>>>>>>>>>>> perform at safepoint.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> GetObjectMonitorUsage will use a safepoint if the target 
>>>>>>>>>>>>>>> is not suspended:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> jvmtiError
>>>>>>>>>>>>>>> JvmtiEnv::GetObjectMonitorUsage(jobject object, 
>>>>>>>>>>>>>>> jvmtiMonitorUsage* info_ptr) {
>>>>>>>>>>>>>>>    JavaThread* calling_thread = JavaThread::current();
>>>>>>>>>>>>>>>    jvmtiError err = 
>>>>>>>>>>>>>>> get_object_monitor_usage(calling_thread, object, info_ptr);
>>>>>>>>>>>>>>>    if (err == JVMTI_ERROR_THREAD_NOT_SUSPENDED) {
>>>>>>>>>>>>>>>      // Some of the critical threads were not suspended. 
>>>>>>>>>>>>>>> go to a safepoint and try again
>>>>>>>>>>>>>>>      VM_GetObjectMonitorUsage op(this, calling_thread, 
>>>>>>>>>>>>>>> object, info_ptr);
>>>>>>>>>>>>>>>      VMThread::execute(&op);
>>>>>>>>>>>>>>>      err = op.result();
>>>>>>>>>>>>>>>    }
>>>>>>>>>>>>>>>    return err;
>>>>>>>>>>>>>>> } /* end GetObject */
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I saw this code, so I guess there are some cases when 
>>>>>>>>>>>>>> JVMTI_ERROR_THREAD_NOT_SUSPENDED is not returned from 
>>>>>>>>>>>>>> get_object_monitor_usage().
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Monitor owner would be acquired from monitor object at 
>>>>>>>>>>>>>>>> first [1], but it would perform concurrently.
>>>>>>>>>>>>>>>> If owner thread is not suspended, the owner might be 
>>>>>>>>>>>>>>>> changed to others in subsequent code.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For example, the owner might release the monitor before 
>>>>>>>>>>>>>>>> [2].
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The expectation is that when we find an owner thread it 
>>>>>>>>>>>>>>> is either suspended or not. If it is suspended then it 
>>>>>>>>>>>>>>> cannot release the monitor. If it is not suspended we 
>>>>>>>>>>>>>>> detect that and redo the whole query at a safepoint.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think the owner thread might resume unfortunately after 
>>>>>>>>>>>>>> suspending check.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes you are right. I was thinking resuming also required a 
>>>>>>>>>>>>> safepoint but it only requires the Threads_lock. So yes 
>>>>>>>>>>>>> the code is wrong.
>>>>>>>>>>>>
>>>>>>>>>>>> Which code is wrong?
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, a rogue resume can happen when the 
>>>>>>>>>>>> GetObjectMonitorUsage() caller
>>>>>>>>>>>> has started the process of gathering the information while 
>>>>>>>>>>>> not at a
>>>>>>>>>>>> safepoint. Thus the information returned by 
>>>>>>>>>>>> GetObjectMonitorUsage()
>>>>>>>>>>>> might be stale, but that's a bug in the agent code.
>>>>>>>>>>>
>>>>>>>>>>> The code tries to make sure that it either collects data 
>>>>>>>>>>> about a monitor owned by a thread that is suspended, or else 
>>>>>>>>>>> it collects that data at a safepoint. But the owning thread 
>>>>>>>>>>> can be resumed just after the code determined it was 
>>>>>>>>>>> suspended. The monitor can then be released and the 
>>>>>>>>>>> information gathered not only stale but potentially 
>>>>>>>>>>> completely wrong as it could now be owned by a different 
>>>>>>>>>>> thread and will report that thread's entry count.
>>>>>>>>>>
>>>>>>>>>> If the agent is not using SuspendThread(), then as soon as
>>>>>>>>>> GetObjectMonitorUsage() returns to the caller the information
>>>>>>>>>> can be stale. In fact as soon as the implementation returns
>>>>>>>>>> from the safepoint that gathered the info, the target thread
>>>>>>>>>> could have moved on.
>>>>>>>>>
>>>>>>>>> That isn't the issue. That the info is stale is fine. But the 
>>>>>>>>> expectation is that the information was actually an accurate 
>>>>>>>>> snapshot of the state of the monitor at some point in time. 
>>>>>>>>> The current code does not ensure that.
>>>>>>>>
>>>>>>>> Please explain. I clearly don't understand why you think the info
>>>>>>>> returned isn't "an accurate snapshot of the state of the monitor
>>>>>>>> at some point in time".
>>>>>>>
>>>>>>> Because it may not be a "snapshot" at all. There is no 
>>>>>>> atomicity**. The reported owner thread may not own it any longer 
>>>>>>> when the entry count is read, so straight away you may have the 
>>>>>>> wrong entry count information. The set of threads trying to 
>>>>>>> acquire the monitor, or wait on the monitor can change in 
>>>>>>> unexpected ways. It would be possible for instance to report the 
>>>>>>> same thread as being the owner, being blocked trying to enter 
>>>>>>> the monitor, and being in the wait-set of the monitor - 
>>>>>>> apparently all at the same time!
>>>>>>>
>>>>>>> ** even if the owner is suspended we don't have complete 
>>>>>>> atomicity because threads can join the set of threads trying to 
>>>>>>> enter the monitor (unless they are all suspended).
>>>>>>
>>>>>> Consider the case when the monitor's owner is _not_ suspended:
>>>>>>
>>>>>>   - GetObjectMonitorUsage() uses a safepoint to gather the info 
>>>>>> about
>>>>>>     the object's monitor. Since we're at a safepoint, the info that
>>>>>>     we are gathering cannot change until we return from the 
>>>>>> safepoint.
>>>>>>     It is a snapshot and a valid one at that.
>>>>>>
>>>>>> Consider the case when the monitor's owner is suspended:
>>>>>>
>>>>>>   - GetObjectMonitorUsage() will gather info about the object's
>>>>>>     monitor while _not_ at a safepoint. Assuming that no other
>>>>>>     thread is suspended, then entry_count can change because
>>>>>>     another thread can block on entry while we are gathering
>>>>>>     info. waiter_count and waiters can change if a thread was
>>>>>>     in a timed wait that has timed out and now that thread is
>>>>>>     blocked on re-entry. I don't think that notify_waiter_count
>>>>>>     and notify_waiters can change.
>>>>>>
>>>>>>     So in this case, the owner info and notify info is stable,
>>>>>>     but the entry_count and waiter info is not stable.
>>>>>>
>>>>>> Consider the case when the monitor is not owned:
>>>>>>
>>>>>>   - GetObjectMonitorUsage() will start to gather info about the
>>>>>>     object's monitor while _not_ at a safepoint. If it finds a
>>>>>>     thread on the entry queue that is not suspended, then it will
>>>>>>     bail out and redo the info gather at a safepoint. I just
>>>>>>     noticed that it doesn't check for suspension for the threads
>>>>>>     on the waiters list so a timed Object.wait() call can cause
>>>>>>     some confusion here.
>>>>>>
>>>>>>     So in this case, the owner info is not stable if a thread
>>>>>>     comes out of a timed wait and reenters the monitor. This
>>>>>>     case is no different than if a "barger" thread comes in
>>>>>>     after the NULL owner field is observed and enters the
>>>>>>     monitor. We'll return that there is no owner, a list of
>>>>>>     suspended pending entry thread and a list of waiting
>>>>>>     threads. The reality is that the object's monitor is
>>>>>>     owned by the "barger" that completely bypassed the entry
>>>>>>     queue by virtue of seeing the NULL owner field at exactly
>>>>>>     the right time.
>>>>>>
>>>>>> So the owner field is only stable when we have an owner. If
>>>>>> that owner is not suspended, then the other fields are also
>>>>>> stable because we gathered the info at a safepoint. If the
>>>>>> owner is suspended, then the owner and notify info is stable,
>>>>>> but the entry_count and waiter info is not stable.
>>>>>>
>>>>>> If we have a NULL owner field, then the info is only stable
>>>>>> if you have a non-suspended thread on the entry list. Ouch!
>>>>>> That's deterministic, but not without some work.
>>>>>>
>>>>>>
>>>>>> Okay so only when we gather the info at a safepoint is all
>>>>>> of it a valid and stable snapshot. Unfortunately, we only
>>>>>> do that at a safepoint when the owner thread is not suspended
>>>>>> or if owner == NULL and one of the entry threads is not
>>>>>> suspended. If either of those conditions is not true, then
>>>>>> the different pieces of info is unstable to varying degrees.
>>>>>>
>>>>>> As for this claim:
>>>>>>
>>>>>>> It would be possible for instance to report the same thread
>>>>>>> as being the owner, being blocked trying to enter the monitor,
>>>>>>> and being in the wait-set of the monitor - apparently all at
>>>>>>> the same time! 
>>>>>>
>>>>>> I can't figure out a way to make that scenario work. If the
>>>>>> thread is seen as the owner and is not suspended, then we
>>>>>> gather info at a safepoint. If it is suspended, then it can't
>>>>>> then be seen as on the entry queue or on the wait queue since
>>>>>> it is suspended. If it is seen on the entry queue and is not
>>>>>> suspended, then we gather info at a safepoint. If it is
>>>>>> suspended on the entry queue, then it can't be seen on the
>>>>>> wait queue.
>>>>>>
>>>>>> So the info instability of this API is bad, but it's not
>>>>>> quite that bad. :-) (That is a small mercy.)
>>>>>>
>>>>>>
>>>>>> Handshaking is not going to make this situation any better
>>>>>> for GetObjectMonitorUsage(). If the monitor is owned and we
>>>>>> handshake with the owner, the stability or instability of
>>>>>> the other fields remains the same as when SuspendThread is
>>>>>> used. Handshaking with all threads won't make the data as
>>>>>> stable as when at a safepoint because individual threads
>>>>>> can resume execution after doing their handshake so there
>>>>>> will still be field instability.
>>>>>>
>>>>>>
>>>>>> Short version: GetObjectMonitorUsage() should only gather
>>>>>> data at a safepoint. Yes, I've changed my mind.
>>>>>
>>>>> I agree with this.
>>>>> The advantages are:
>>>>>   - the result is stable
>>>>>   - the implementation can be simplified
>>>>>
>>>>> Performance impact is not very clear but should not be that
>>>>> big as suspending all the threads has some overhead too.
>>>>> I'm not sure if using handshakes can make performance better.
>>>>
>>>> Ok, may I file it to JBS and fix it?
>>>>
>>>> Yasumasa
>>>>
>>>>
>>>>> Thanks,
>>>>> Serguei
>>>>>
>>>>>> Dan
>>>>>>
>>>>>>>
>>>>>>> David
>>>>>>> -----
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> The only way to make sure you don't have stale information is
>>>>>>>>>> to use SuspendThread(), but it's not required. Perhaps the doc
>>>>>>>>>> should have more clear about the possibility of returning stale
>>>>>>>>>> info. That's a question for Robert F.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> GetObjectMonitorUsage says nothing about thread's being 
>>>>>>>>>>> suspended so I can't see how this could be construed as an 
>>>>>>>>>>> agent bug.
>>>>>>>>>>
>>>>>>>>>> In your scenario above, you mention that the target thread was
>>>>>>>>>> suspended, GetObjectMonitorUsage() was called while the target
>>>>>>>>>> was suspended, and then the target thread was resumed after
>>>>>>>>>> GetObjectMonitorUsage() checked for suspension, but before
>>>>>>>>>> GetObjectMonitorUsage() was able to gather the info.
>>>>>>>>>>
>>>>>>>>>> All three of those calls: SuspendThread(), 
>>>>>>>>>> GetObjectMonitorUsage()
>>>>>>>>>> and ResumeThread() are made by the agent and the agent should 
>>>>>>>>>> not
>>>>>>>>>> resume the target thread while also calling 
>>>>>>>>>> GetObjectMonitorUsage().
>>>>>>>>>> The calls were allowed to be made out of order so agent bug.
>>>>>>>>>
>>>>>>>>> Perhaps. I was thinking more generally about an independent 
>>>>>>>>> resume, but you're right that doesn't really make a lot of 
>>>>>>>>> sense. But when the spec says nothing about suspension ...
>>>>>>>>
>>>>>>>> And it is intentional that suspension is not required. JVM/DI 
>>>>>>>> and JVM/PI
>>>>>>>> used to require suspension for these kinds of get-the-info 
>>>>>>>> APIs. JVM/TI
>>>>>>>> intentionally was designed to not require suspension.
>>>>>>>>
>>>>>>>> As I've said before, we could add a note about the data being 
>>>>>>>> potentially
>>>>>>>> stale unless SuspendThread is used. I think of it like stat(2). 
>>>>>>>> You can
>>>>>>>> fetch the file's info, but there's no guarantee that the info 
>>>>>>>> is current
>>>>>>>> by the time you process what you got back. Is it too much 
>>>>>>>> motherhood to
>>>>>>>> state that the data might be stale? I could go either way...
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> Using a handshake on the owner thread will allow this to be 
>>>>>>>>>>> fixed in the future without forcing/using any safepoints.
>>>>>>>>>>
>>>>>>>>>> I have to think about that which is why I'm avoiding talking 
>>>>>>>>>> about
>>>>>>>>>> handshakes in this thread.
>>>>>>>>>
>>>>>>>>> Effectively the handshake can "suspend" the thread whilst the 
>>>>>>>>> monitor is queried. In effect the operation would create a 
>>>>>>>>> per-thread safepoint.
>>>>>>>>
>>>>>>>> I "know" that, but I still need time to think about it and 
>>>>>>>> probably
>>>>>>>> see the code to see if there are holes...
>>>>>>>>
>>>>>>>>
>>>>>>>>> Semantically it is no different to the code actually 
>>>>>>>>> suspending the owner thread, but it can't actually do that 
>>>>>>>>> because suspends/resume don't nest.
>>>>>>>>
>>>>>>>> Yeah... we used have a suspend count back when we tracked 
>>>>>>>> internal and
>>>>>>>> external suspends separately. That was a nightmare...
>>>>>>>>
>>>>>>>> Dan
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> David
>>>>>>>>>
>>>>>>>>>> Dan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> David
>>>>>>>>>>>
>>>>>>>>>>>> Dan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> JavaThread::is_ext_suspend_completed() is used to check 
>>>>>>>>>>>>>> thread state, it returns `true` when the thread is 
>>>>>>>>>>>>>> sleeping [3], or when it performs in native [4].
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sure but if the thread is actually suspended it can't 
>>>>>>>>>>>>> continue execution in the VM or in Java code.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This appears to be an optimisation for the assumed 
>>>>>>>>>>>>>>> common case where threads are first suspended and then 
>>>>>>>>>>>>>>> the monitors are queried.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I agree with this, but I could find out it from JVMTI 
>>>>>>>>>>>>>> spec - it just says "Get information about the object's 
>>>>>>>>>>>>>> monitor."
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes it was just an implementation optimisation, nothing to 
>>>>>>>>>>>>> do with the spec.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> GetObjectMonitorUsage() might return incorrect 
>>>>>>>>>>>>>> information in some case.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It starts with finding owner thread, but the owner might 
>>>>>>>>>>>>>> be just before wakeup.
>>>>>>>>>>>>>> So I think it is more safe if GetObjectMonitorUsage() is 
>>>>>>>>>>>>>> called at safepoint in any case.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Except we're moving away from safepoints to using 
>>>>>>>>>>>>> Handshakes, so this particular operation will require that 
>>>>>>>>>>>>> the apparent owner is Handshake-safe (by entering a 
>>>>>>>>>>>>> handshake with it) before querying the monitor. This would 
>>>>>>>>>>>>> still be preferable I think to always using a safepoint 
>>>>>>>>>>>>> for the entire operation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> David
>>>>>>>>>>>>> -----
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yasumasa
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [3] 
>>>>>>>>>>>>>> http://hg.openjdk.java.net/jdk/jdk/file/76a17c8143d8/src/hotspot/share/runtime/thread.cpp#l671
>>>>>>>>>>>>>> [4] 
>>>>>>>>>>>>>> http://hg.openjdk.java.net/jdk/jdk/file/76a17c8143d8/src/hotspot/share/runtime/thread.cpp#l684
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However there is still a potential bug as the thread 
>>>>>>>>>>>>>>> reported as the owner may not be suspended at the time 
>>>>>>>>>>>>>>> we first see it, and may release the monitor, but then 
>>>>>>>>>>>>>>> it may get suspended before we call:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   owning_thread = 
>>>>>>>>>>>>>>> Threads::owning_thread_from_monitor_owner(tlh.list(), 
>>>>>>>>>>>>>>> owner);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> and so we think it is still the monitor owner and 
>>>>>>>>>>>>>>> proceed to query the monitor information in a racy way. 
>>>>>>>>>>>>>>> This can't happen when suspension itself requires a 
>>>>>>>>>>>>>>> safepoint as the current thread won't go to that 
>>>>>>>>>>>>>>> safepoint during this code. However, if suspension is 
>>>>>>>>>>>>>>> implemented via a direct handshake with the target 
>>>>>>>>>>>>>>> thread then we have a problem.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> David
>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yasumasa
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1] 
>>>>>>>>>>>>>>>> http://hg.openjdk.java.net/jdk/jdk/file/76a17c8143d8/src/hotspot/share/prims/jvmtiEnvBase.cpp#l973
>>>>>>>>>>>>>>>> [2] 
>>>>>>>>>>>>>>>> http://hg.openjdk.java.net/jdk/jdk/file/76a17c8143d8/src/hotspot/share/prims/jvmtiEnvBase.cpp#l996
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>