RFR (XS) 6471769: Error: assert(_cur_stack_depth == count_frames(), "cur_stack_depth out of sync")
serguei.spitsyn at oracle.com
serguei.spitsyn at oracle.com
Mon Feb 3 16:04:10 PST 2014
On 2/3/14 2:54 PM, Daniel D. Daugherty wrote:
> On 2/3/14 1:13 PM, serguei.spitsyn at oracle.com wrote:
>> Dan,
>>
>> Thank you for reviewing!
>> Some comments are below.
>>
>> On 2/3/14 7:51 AM, Daniel D. Daugherty wrote:
>>> On 1/31/14 7:58 PM, serguei.spitsyn at oracle.com wrote:
>>>> Please, review the fix for:
>>>> https://bugs.openjdk.java.net/browse/JDK-6471769
>>>>
>>>>
>>>> Open webrev:
>>>> http://cr.openjdk.java.net/~sspitsyn/webrevs/2014/hotspot/6471769-JVMTI-FRAME/
>>>>
>>>
>>> src/share/vm/prims/jvmtiEnv.cpp
>>> Thumbs up on the code change.
>>>
>>> I've never been a fan of the:
>>>
>>> if (is_thread_fully_suspended(...))
>>> do-something-direct
>>> else
>>> do-something-via-a-safepoint
>>>
>>> model. We don't have any mechanism in the VM to keep the
>>> is_thread_fully_suspended() constraint true over the
>>> execution time of the "do-something-direct" operation so
>>> we've always been exposed to races.
>>
>> Thank you for this confirmation.
>> So that, we are in sync here.
>>
>>
>>> Also, I think this
>>> model is used for several other JVM/TI calls so you might
>>> want to visit those also with an eye on safety.
>>
>> Yes, I'm looking to isolate and double check other places like this
>> in the code.
>> Another piece of work is to isolate other bugs on this topic.
>>
>>
>>>
>>> For this specific change, I think the comment would be
>>> better as:
>>>
>>> // It is only safe to perform the direct operation on the current
>>> // thread. All other usage needs to use a vm-op for safety.
>>
>> It is taken, thanks!
>>
>>> In particular, I don't agree with this part of the comment:
>>>
>>> // Performance-wise the only important case is when current
>>> thread is
>>> // the target thread.
>>>
>>> I'm not sure how you can state that performance on the current
>>> thread
>>> is the only important case, but maybe I don't understand what
>>> you're
>>> really trying to say here.
>>
>> I agree, the comment is not clear.
>> Probably, the comment had to say: "the only case where we can keep
>> performance
>> without loosing safety is the target thread is current ...".
>> In fact, collecting data at a safepoint is the only way to make it safe.
>> If the target thread is not current then:
>> - a) thread has been suspended: already slow path, so that
>> performance is not that critical
>> - b) thread has not been suspended: the information we provide is
>> transient,
>> safepoint is the only way to get it safe (is is correct?)
>
> Yes, the only safe way to get info for another thread is via
> a safepoint operation.
Ok, thanks.
>>>
>>> Now for the more general question with respect to this bug: 6471769
>>> I've having trouble connecting this change (which I like) to this
>>> particular bug report.
>>>
>>> The crash in the bug's description looks like this:
>>>
>>> # Internal Error
>>> (/net/prt-solamd64-q1-2/PrtBuildDir/workspace/src/share/vm/prims/jvmtiThreadState.cpp,
>>> 273), pid=9561, tid=2
>>> #
>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM
>>> (20060914135846.dcubed.service_hs_b02_merge.2-debug compiled mode)
>>> #
>>> # Error: assert(_cur_stack_depth == count_frames(),"cur_stack_depth
>>> out of sync")
>>>
>>> V [libjvm.so+0x1217206];; void VMError::report_and_die()+0x606
>>> V [libjvm.so+0x667841];; void report_assertion_failure(const
>>> char*,int,const char*)+0x61
>>> V [libjvm.so+0xca9557];; int JvmtiThreadState::cur_stack_depth()+0x6e7
>>> V [libjvm.so+0xc1caaa];; void
>>> JvmtiExport::post_method_exit(JavaThread*,methodOop,frame)+0x21ba
>>> V [libjvm.so+0x8385de];; void
>>> InterpreterRuntime::post_method_exit(JavaThread*)+0x21e
>>> j java.io.ObjectInputStream$BlockDataInputStream.getBlockDataMode()Z+4
>>>
>>> In the above crash, the current thread is trying to post a
>>> JVMTI_EVENT_METHOD_EXIT and during the event posting code, it
>>> calls cur_stack_depth() which fails an assert because the
>>> cached stack depth doesn't match the current queried value:
>>>
>>> 282 int JvmtiThreadState::cur_stack_depth() {
>>> 283 uint32_t debug_bits = 0;
>>> 284 guarantee(JavaThread::current() == get_thread() ||
>>> 285 JvmtiEnv::is_thread_fully_suspended(get_thread(), false,
>>> &debug_bits
>>> ),
>>> 286 "must be current thread or suspended");
>>> 287
>>> 288 if (!is_interp_only_mode() || _cur_stack_depth ==
>>> UNKNOWN_STACK_DEPTH)
>>> {
>>> 289 _cur_stack_depth = count_frames();
>>> 290 } else {
>>> 291 // heavy weight assert
>>> 292 assert(_cur_stack_depth == count_frames(),
>>> 293 "cur_stack_depth out of sync");
>>> 294 }
>>> 295 return _cur_stack_depth;
>>> 296 }
>>>
>>>
>>> This is not an unsafe query from one thread to another target thread
>>> so I don't see how the proposed change will address this failure mode.
>>> In particular, the assertion is troubling because it tells me that
>>> the cached stack depth for the current thread is either wrong or has
>>> been corrupted.
>>
>> The above are consequences of unsafe operations we are doing.
>> Current thread should not calculate frame count of target thread (if
>> different) other than at a safepoint.
>> We assume the thread is suspended but, in fact, it was not.
>> So that we lost safety here and all manifestation we observe are the
>> consequences.
>
> For at least the stack trace shown in the description of 6471769,
> it is not an unsafe case because the thread is operating on itself.
> All I'm saying here is that 6471769 is not the right bug to use for
> this fix...
Thank you for sharing the concern.
I was not able to reproduce the issue and so, did not prove my theory
(which can be wrong).
But you have a pretty valid point here.
I still think, the stack trace with the assert is a consequence of a
previous unsafe calculation on another thread.
But that calculation is probably different than that I'm fixing in this
webrev.
>> This bug covers the frames miscount issue.
>> In the fastdebug mode we may get the assert as in this bug report:
>> https://bugs.openjdk.java.net/browse/JDK-8032223
>
> This bug: JDK-8032223 seems to be a perfect match for code that you
> are proposing the change in your current webrev. You have a thread
> ("JDWP Transport Listener: dt_socket") that is trying to query the
> frame count on another thread... That target JavaThread met the
> is_thread_fully_suspended() criteria when execution went down the
> direct path, but later the target stopped meeting that criteria...
>
> And boom...
>
>> There is another potential race that I forgot to check and fix but
>> David pointed out.
>> I will try to cover it in the next version of webrev.
>
> See the separate reply that I sent on that part...
>
>
>>
>> It is still possible there can be other issues here (as you say).
>> But it is hard to make sure it is the case because it is extremely
>> hard to reproduce (you know it!).
>> I'd suggest to resolve it step-by-step:
>> - fix the known safety issues
>> - watch the nightly if anything else is left, file and fix new bugs
>
> This sounds like a good plan.
>
>
>>
>>>
>>> To reiterate:
>>>
>>> - I like the code change!
>>> - I would like to see the comment tweaked a bit
>>>
>>> You should consider going forward with this change using a different
>>> bug ID, perhaps something like:
>>>
>>> JDK-NNNNNNN improve safety of JVM/TI GetFrameCount()
>>
>> There is already bug on this topic (and there are even more dups):
>> https://bugs.openjdk.java.net/browse/JDK-8032223
>>
>> But I still believe, the issue is the same ...
>
> I don't think the issues in JDK-6471769 and JDK-8032223 are the same;
> well not the same for the crash in the description of JDK-6471769. I
> didn't look closely at the other crashes...
Ok.
I will reopen the JDK-8032223 and use it for this fix.
Thank you for helping to sort this out!
Serguei
>
>
> Dan
>
>
>>
>>
>> Thanks!
>> Serguei
>>
>>>
>>> Dan
>>>
>>>
>>>
>>>>
>>>> Summary:
>>>>
>>>> There is a general issue in the suspend equivalent condition
>>>> mechanism:
>>>> Two subsequent calls to the JvmtiEnv::is_thread_fully_suspended()
>>>> may return different results:
>>>> - 1-st: true
>>>> - 2-nd: false
>>>>
>>>> This more generic suspend equivalent issue is covered by another
>>>> bug:
>>>> https://bugs.openjdk.java.net/browse/JDK-6280037
>>>>
>>>> The bug to fix in this review is a specific manifestation of the
>>>> 6280037
>>>> in the JVMTI GetFrameCount() that has a big impact on the SQE
>>>> nightly.
>>>> It is on the Test Stabilization radar (as well as the 6280037).
>>>> There are many tests intermittently failing because of this.
>>>>
>>>> The webrev for review is a one-liner work around the 6280037 for
>>>> the GetFrameCount().
>>>>
>>>> The JVMTI GetFrameCount() spec tells:
>>>> "If this function is called for a thread actively executing
>>>> bytecodes (for example,
>>>> not the current thread and not suspended), the information
>>>> returned is transient."
>>>>
>>>> So, it is Ok to call the GetFrameCount() for non-suspended target
>>>> threads.
>>>> To achieve safety, the frame count for non-suspended threads is
>>>> calculated at a safepoint.
>>>> It should be Ok and more safe to do the same for suspended
>>>> threads as well.
>>>> There is no big performance impact because it is already on a
>>>> slow path.
>>>> It is still important to avoid safepointing when the target
>>>> thread is current.
>>>>
>>>> The bug 6280037 should go out of the Test Stabilization radar
>>>> (remove the svc-nightly label)
>>>> as the most of the impacted tests are covered by the 6471769.
>>>>
>>>>
>>>> Testing:
>>>> In progress: nsk.jvmti, nsk.jdi, nsk.jdwp, impacted JTreg tests
>>>>
>>>>
>>>> Thanks,
>>>> Serguei
>>>>
>>>
>>
>
More information about the serviceability-dev
mailing list