RFR: 8247729: GetObjectMonitorUsage() might return inconsistent information
serguei.spitsyn at oracle.com
serguei.spitsyn at oracle.com
Wed Jun 17 15:42:04 UTC 2020
Hi Yasumasa,
This fix is not enough.
The function JvmtiEnvBase::get_object_monitor_usage works in two modes:
in VMop and non-VMop.
The non-VMop mode has to be removed.
Thanks,
Serguei
On 6/17/20 02:18, Yasumasa Suenaga wrote:
> (Change subject for RFR)
>
> Hi,
>
> I filed it to JBS and upload a webrev for it.
> Could you review it?
>
> JBS: https://bugs.openjdk.java.net/browse/JDK-8247729
> webrev: http://cr.openjdk.java.net/~ysuenaga/JDK-8247729/webrev.00/
>
> This change has passed tests on submit repo.
> Also I tested it with serviceability/jvmti and vmTestbase/nsk/jvmti on
> Linux x64.
>
>
> Thanks,
>
> Yasumasa
>
>
> On 2020/06/17 14:37, serguei.spitsyn at oracle.com wrote:
>> Yes. It seems we have a consensus.
>> Thank you for taking care about it.
>>
>> Thanks,
>> Serguei
>>
>>
>> On 6/16/20 18:34, David Holmes wrote:
>>>> Ok, may I file it to JBS and fix it?
>>>
>>> Go for it! :)
>>>
>>> Cheers,
>>> David
>>>
>>> On 17/06/2020 10:23 am, Yasumasa Suenaga wrote:
>>>> On 2020/06/17 8:47, serguei.spitsyn at oracle.com wrote:
>>>>> Hi Dan, David and Yasumasa,
>>>>>
>>>>>
>>>>> On 6/16/20 07:39, Daniel D. Daugherty wrote:
>>>>>> On 6/15/20 9:28 PM, David Holmes wrote:
>>>>>>> On 16/06/2020 10:57 am, Daniel D. Daugherty wrote:
>>>>>>>> On 6/15/20 7:19 PM, David Holmes wrote:
>>>>>>>>> On 16/06/2020 8:40 am, Daniel D. Daugherty wrote:
>>>>>>>>>> On 6/15/20 6:14 PM, David Holmes wrote:
>>>>>>>>>>> Hi Dan,
>>>>>>>>>>>
>>>>>>>>>>> On 15/06/2020 11:38 pm, Daniel D. Daugherty wrote:
>>>>>>>>>>>> On 6/15/20 3:26 AM, David Holmes wrote:
>>>>>>>>>>>>> On 15/06/2020 4:02 pm, Yasumasa Suenaga wrote:
>>>>>>>>>>>>>> Hi David,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2020/06/15 14:15, David Holmes wrote:
>>>>>>>>>>>>>>> Hi Yasumasa,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 15/06/2020 2:49 pm, Yasumasa Suenaga wrote:
>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I wonder why JvmtiEnvBase::get_object_monitor_usage()
>>>>>>>>>>>>>>>> (implementation of GetObjectMonitorUsage()) does not
>>>>>>>>>>>>>>>> perform at safepoint.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> GetObjectMonitorUsage will use a safepoint if the target
>>>>>>>>>>>>>>> is not suspended:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> jvmtiError
>>>>>>>>>>>>>>> JvmtiEnv::GetObjectMonitorUsage(jobject object,
>>>>>>>>>>>>>>> jvmtiMonitorUsage* info_ptr) {
>>>>>>>>>>>>>>> JavaThread* calling_thread = JavaThread::current();
>>>>>>>>>>>>>>> jvmtiError err =
>>>>>>>>>>>>>>> get_object_monitor_usage(calling_thread, object, info_ptr);
>>>>>>>>>>>>>>> if (err == JVMTI_ERROR_THREAD_NOT_SUSPENDED) {
>>>>>>>>>>>>>>> // Some of the critical threads were not suspended.
>>>>>>>>>>>>>>> go to a safepoint and try again
>>>>>>>>>>>>>>> VM_GetObjectMonitorUsage op(this, calling_thread,
>>>>>>>>>>>>>>> object, info_ptr);
>>>>>>>>>>>>>>> VMThread::execute(&op);
>>>>>>>>>>>>>>> err = op.result();
>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>> return err;
>>>>>>>>>>>>>>> } /* end GetObject */
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I saw this code, so I guess there are some cases when
>>>>>>>>>>>>>> JVMTI_ERROR_THREAD_NOT_SUSPENDED is not returned from
>>>>>>>>>>>>>> get_object_monitor_usage().
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Monitor owner would be acquired from monitor object at
>>>>>>>>>>>>>>>> first [1], but it would perform concurrently.
>>>>>>>>>>>>>>>> If owner thread is not suspended, the owner might be
>>>>>>>>>>>>>>>> changed to others in subsequent code.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For example, the owner might release the monitor before
>>>>>>>>>>>>>>>> [2].
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The expectation is that when we find an owner thread it
>>>>>>>>>>>>>>> is either suspended or not. If it is suspended then it
>>>>>>>>>>>>>>> cannot release the monitor. If it is not suspended we
>>>>>>>>>>>>>>> detect that and redo the whole query at a safepoint.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think the owner thread might resume unfortunately after
>>>>>>>>>>>>>> suspending check.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes you are right. I was thinking resuming also required a
>>>>>>>>>>>>> safepoint but it only requires the Threads_lock. So yes
>>>>>>>>>>>>> the code is wrong.
>>>>>>>>>>>>
>>>>>>>>>>>> Which code is wrong?
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, a rogue resume can happen when the
>>>>>>>>>>>> GetObjectMonitorUsage() caller
>>>>>>>>>>>> has started the process of gathering the information while
>>>>>>>>>>>> not at a
>>>>>>>>>>>> safepoint. Thus the information returned by
>>>>>>>>>>>> GetObjectMonitorUsage()
>>>>>>>>>>>> might be stale, but that's a bug in the agent code.
>>>>>>>>>>>
>>>>>>>>>>> The code tries to make sure that it either collects data
>>>>>>>>>>> about a monitor owned by a thread that is suspended, or else
>>>>>>>>>>> it collects that data at a safepoint. But the owning thread
>>>>>>>>>>> can be resumed just after the code determined it was
>>>>>>>>>>> suspended. The monitor can then be released and the
>>>>>>>>>>> information gathered not only stale but potentially
>>>>>>>>>>> completely wrong as it could now be owned by a different
>>>>>>>>>>> thread and will report that thread's entry count.
>>>>>>>>>>
>>>>>>>>>> If the agent is not using SuspendThread(), then as soon as
>>>>>>>>>> GetObjectMonitorUsage() returns to the caller the information
>>>>>>>>>> can be stale. In fact as soon as the implementation returns
>>>>>>>>>> from the safepoint that gathered the info, the target thread
>>>>>>>>>> could have moved on.
>>>>>>>>>
>>>>>>>>> That isn't the issue. That the info is stale is fine. But the
>>>>>>>>> expectation is that the information was actually an accurate
>>>>>>>>> snapshot of the state of the monitor at some point in time.
>>>>>>>>> The current code does not ensure that.
>>>>>>>>
>>>>>>>> Please explain. I clearly don't understand why you think the info
>>>>>>>> returned isn't "an accurate snapshot of the state of the monitor
>>>>>>>> at some point in time".
>>>>>>>
>>>>>>> Because it may not be a "snapshot" at all. There is no
>>>>>>> atomicity**. The reported owner thread may not own it any longer
>>>>>>> when the entry count is read, so straight away you may have the
>>>>>>> wrong entry count information. The set of threads trying to
>>>>>>> acquire the monitor, or wait on the monitor can change in
>>>>>>> unexpected ways. It would be possible for instance to report the
>>>>>>> same thread as being the owner, being blocked trying to enter
>>>>>>> the monitor, and being in the wait-set of the monitor -
>>>>>>> apparently all at the same time!
>>>>>>>
>>>>>>> ** even if the owner is suspended we don't have complete
>>>>>>> atomicity because threads can join the set of threads trying to
>>>>>>> enter the monitor (unless they are all suspended).
>>>>>>
>>>>>> Consider the case when the monitor's owner is _not_ suspended:
>>>>>>
>>>>>> - GetObjectMonitorUsage() uses a safepoint to gather the info
>>>>>> about
>>>>>> the object's monitor. Since we're at a safepoint, the info that
>>>>>> we are gathering cannot change until we return from the
>>>>>> safepoint.
>>>>>> It is a snapshot and a valid one at that.
>>>>>>
>>>>>> Consider the case when the monitor's owner is suspended:
>>>>>>
>>>>>> - GetObjectMonitorUsage() will gather info about the object's
>>>>>> monitor while _not_ at a safepoint. Assuming that no other
>>>>>> thread is suspended, then entry_count can change because
>>>>>> another thread can block on entry while we are gathering
>>>>>> info. waiter_count and waiters can change if a thread was
>>>>>> in a timed wait that has timed out and now that thread is
>>>>>> blocked on re-entry. I don't think that notify_waiter_count
>>>>>> and notify_waiters can change.
>>>>>>
>>>>>> So in this case, the owner info and notify info is stable,
>>>>>> but the entry_count and waiter info is not stable.
>>>>>>
>>>>>> Consider the case when the monitor is not owned:
>>>>>>
>>>>>> - GetObjectMonitorUsage() will start to gather info about the
>>>>>> object's monitor while _not_ at a safepoint. If it finds a
>>>>>> thread on the entry queue that is not suspended, then it will
>>>>>> bail out and redo the info gather at a safepoint. I just
>>>>>> noticed that it doesn't check for suspension for the threads
>>>>>> on the waiters list so a timed Object.wait() call can cause
>>>>>> some confusion here.
>>>>>>
>>>>>> So in this case, the owner info is not stable if a thread
>>>>>> comes out of a timed wait and reenters the monitor. This
>>>>>> case is no different than if a "barger" thread comes in
>>>>>> after the NULL owner field is observed and enters the
>>>>>> monitor. We'll return that there is no owner, a list of
>>>>>> suspended pending entry thread and a list of waiting
>>>>>> threads. The reality is that the object's monitor is
>>>>>> owned by the "barger" that completely bypassed the entry
>>>>>> queue by virtue of seeing the NULL owner field at exactly
>>>>>> the right time.
>>>>>>
>>>>>> So the owner field is only stable when we have an owner. If
>>>>>> that owner is not suspended, then the other fields are also
>>>>>> stable because we gathered the info at a safepoint. If the
>>>>>> owner is suspended, then the owner and notify info is stable,
>>>>>> but the entry_count and waiter info is not stable.
>>>>>>
>>>>>> If we have a NULL owner field, then the info is only stable
>>>>>> if you have a non-suspended thread on the entry list. Ouch!
>>>>>> That's deterministic, but not without some work.
>>>>>>
>>>>>>
>>>>>> Okay so only when we gather the info at a safepoint is all
>>>>>> of it a valid and stable snapshot. Unfortunately, we only
>>>>>> do that at a safepoint when the owner thread is not suspended
>>>>>> or if owner == NULL and one of the entry threads is not
>>>>>> suspended. If either of those conditions is not true, then
>>>>>> the different pieces of info is unstable to varying degrees.
>>>>>>
>>>>>> As for this claim:
>>>>>>
>>>>>>> It would be possible for instance to report the same thread
>>>>>>> as being the owner, being blocked trying to enter the monitor,
>>>>>>> and being in the wait-set of the monitor - apparently all at
>>>>>>> the same time!
>>>>>>
>>>>>> I can't figure out a way to make that scenario work. If the
>>>>>> thread is seen as the owner and is not suspended, then we
>>>>>> gather info at a safepoint. If it is suspended, then it can't
>>>>>> then be seen as on the entry queue or on the wait queue since
>>>>>> it is suspended. If it is seen on the entry queue and is not
>>>>>> suspended, then we gather info at a safepoint. If it is
>>>>>> suspended on the entry queue, then it can't be seen on the
>>>>>> wait queue.
>>>>>>
>>>>>> So the info instability of this API is bad, but it's not
>>>>>> quite that bad. :-) (That is a small mercy.)
>>>>>>
>>>>>>
>>>>>> Handshaking is not going to make this situation any better
>>>>>> for GetObjectMonitorUsage(). If the monitor is owned and we
>>>>>> handshake with the owner, the stability or instability of
>>>>>> the other fields remains the same as when SuspendThread is
>>>>>> used. Handshaking with all threads won't make the data as
>>>>>> stable as when at a safepoint because individual threads
>>>>>> can resume execution after doing their handshake so there
>>>>>> will still be field instability.
>>>>>>
>>>>>>
>>>>>> Short version: GetObjectMonitorUsage() should only gather
>>>>>> data at a safepoint. Yes, I've changed my mind.
>>>>>
>>>>> I agree with this.
>>>>> The advantages are:
>>>>> - the result is stable
>>>>> - the implementation can be simplified
>>>>>
>>>>> Performance impact is not very clear but should not be that
>>>>> big as suspending all the threads has some overhead too.
>>>>> I'm not sure if using handshakes can make performance better.
>>>>
>>>> Ok, may I file it to JBS and fix it?
>>>>
>>>> Yasumasa
>>>>
>>>>
>>>>> Thanks,
>>>>> Serguei
>>>>>
>>>>>> Dan
>>>>>>
>>>>>>>
>>>>>>> David
>>>>>>> -----
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> The only way to make sure you don't have stale information is
>>>>>>>>>> to use SuspendThread(), but it's not required. Perhaps the doc
>>>>>>>>>> should have more clear about the possibility of returning stale
>>>>>>>>>> info. That's a question for Robert F.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> GetObjectMonitorUsage says nothing about thread's being
>>>>>>>>>>> suspended so I can't see how this could be construed as an
>>>>>>>>>>> agent bug.
>>>>>>>>>>
>>>>>>>>>> In your scenario above, you mention that the target thread was
>>>>>>>>>> suspended, GetObjectMonitorUsage() was called while the target
>>>>>>>>>> was suspended, and then the target thread was resumed after
>>>>>>>>>> GetObjectMonitorUsage() checked for suspension, but before
>>>>>>>>>> GetObjectMonitorUsage() was able to gather the info.
>>>>>>>>>>
>>>>>>>>>> All three of those calls: SuspendThread(),
>>>>>>>>>> GetObjectMonitorUsage()
>>>>>>>>>> and ResumeThread() are made by the agent and the agent should
>>>>>>>>>> not
>>>>>>>>>> resume the target thread while also calling
>>>>>>>>>> GetObjectMonitorUsage().
>>>>>>>>>> The calls were allowed to be made out of order so agent bug.
>>>>>>>>>
>>>>>>>>> Perhaps. I was thinking more generally about an independent
>>>>>>>>> resume, but you're right that doesn't really make a lot of
>>>>>>>>> sense. But when the spec says nothing about suspension ...
>>>>>>>>
>>>>>>>> And it is intentional that suspension is not required. JVM/DI
>>>>>>>> and JVM/PI
>>>>>>>> used to require suspension for these kinds of get-the-info
>>>>>>>> APIs. JVM/TI
>>>>>>>> intentionally was designed to not require suspension.
>>>>>>>>
>>>>>>>> As I've said before, we could add a note about the data being
>>>>>>>> potentially
>>>>>>>> stale unless SuspendThread is used. I think of it like stat(2).
>>>>>>>> You can
>>>>>>>> fetch the file's info, but there's no guarantee that the info
>>>>>>>> is current
>>>>>>>> by the time you process what you got back. Is it too much
>>>>>>>> motherhood to
>>>>>>>> state that the data might be stale? I could go either way...
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> Using a handshake on the owner thread will allow this to be
>>>>>>>>>>> fixed in the future without forcing/using any safepoints.
>>>>>>>>>>
>>>>>>>>>> I have to think about that which is why I'm avoiding talking
>>>>>>>>>> about
>>>>>>>>>> handshakes in this thread.
>>>>>>>>>
>>>>>>>>> Effectively the handshake can "suspend" the thread whilst the
>>>>>>>>> monitor is queried. In effect the operation would create a
>>>>>>>>> per-thread safepoint.
>>>>>>>>
>>>>>>>> I "know" that, but I still need time to think about it and
>>>>>>>> probably
>>>>>>>> see the code to see if there are holes...
>>>>>>>>
>>>>>>>>
>>>>>>>>> Semantically it is no different to the code actually
>>>>>>>>> suspending the owner thread, but it can't actually do that
>>>>>>>>> because suspends/resume don't nest.
>>>>>>>>
>>>>>>>> Yeah... we used have a suspend count back when we tracked
>>>>>>>> internal and
>>>>>>>> external suspends separately. That was a nightmare...
>>>>>>>>
>>>>>>>> Dan
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> David
>>>>>>>>>
>>>>>>>>>> Dan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> David
>>>>>>>>>>>
>>>>>>>>>>>> Dan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> JavaThread::is_ext_suspend_completed() is used to check
>>>>>>>>>>>>>> thread state, it returns `true` when the thread is
>>>>>>>>>>>>>> sleeping [3], or when it performs in native [4].
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sure but if the thread is actually suspended it can't
>>>>>>>>>>>>> continue execution in the VM or in Java code.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This appears to be an optimisation for the assumed
>>>>>>>>>>>>>>> common case where threads are first suspended and then
>>>>>>>>>>>>>>> the monitors are queried.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I agree with this, but I could find out it from JVMTI
>>>>>>>>>>>>>> spec - it just says "Get information about the object's
>>>>>>>>>>>>>> monitor."
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes it was just an implementation optimisation, nothing to
>>>>>>>>>>>>> do with the spec.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> GetObjectMonitorUsage() might return incorrect
>>>>>>>>>>>>>> information in some case.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It starts with finding owner thread, but the owner might
>>>>>>>>>>>>>> be just before wakeup.
>>>>>>>>>>>>>> So I think it is more safe if GetObjectMonitorUsage() is
>>>>>>>>>>>>>> called at safepoint in any case.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Except we're moving away from safepoints to using
>>>>>>>>>>>>> Handshakes, so this particular operation will require that
>>>>>>>>>>>>> the apparent owner is Handshake-safe (by entering a
>>>>>>>>>>>>> handshake with it) before querying the monitor. This would
>>>>>>>>>>>>> still be preferable I think to always using a safepoint
>>>>>>>>>>>>> for the entire operation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> David
>>>>>>>>>>>>> -----
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yasumasa
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [3]
>>>>>>>>>>>>>> http://hg.openjdk.java.net/jdk/jdk/file/76a17c8143d8/src/hotspot/share/runtime/thread.cpp#l671
>>>>>>>>>>>>>> [4]
>>>>>>>>>>>>>> http://hg.openjdk.java.net/jdk/jdk/file/76a17c8143d8/src/hotspot/share/runtime/thread.cpp#l684
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However there is still a potential bug as the thread
>>>>>>>>>>>>>>> reported as the owner may not be suspended at the time
>>>>>>>>>>>>>>> we first see it, and may release the monitor, but then
>>>>>>>>>>>>>>> it may get suspended before we call:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> owning_thread =
>>>>>>>>>>>>>>> Threads::owning_thread_from_monitor_owner(tlh.list(),
>>>>>>>>>>>>>>> owner);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> and so we think it is still the monitor owner and
>>>>>>>>>>>>>>> proceed to query the monitor information in a racy way.
>>>>>>>>>>>>>>> This can't happen when suspension itself requires a
>>>>>>>>>>>>>>> safepoint as the current thread won't go to that
>>>>>>>>>>>>>>> safepoint during this code. However, if suspension is
>>>>>>>>>>>>>>> implemented via a direct handshake with the target
>>>>>>>>>>>>>>> thread then we have a problem.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> David
>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yasumasa
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>> http://hg.openjdk.java.net/jdk/jdk/file/76a17c8143d8/src/hotspot/share/prims/jvmtiEnvBase.cpp#l973
>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>> http://hg.openjdk.java.net/jdk/jdk/file/76a17c8143d8/src/hotspot/share/prims/jvmtiEnvBase.cpp#l996
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>
More information about the serviceability-dev
mailing list