RFR (XS) 6471769: Error: assert(_cur_stack_depth == count_frames(), "cur_stack_depth out of sync")

Mon Feb 3 16:04:10 PST 2014

On 2/3/14 2:54 PM, Daniel D. Daugherty wrote:
> On 2/3/14 1:13 PM, serguei.spitsyn at oracle.com wrote:
>> Dan,
>>
>> Thank you for reviewing!
>> Some comments are below.
>>
>> On 2/3/14 7:51 AM, Daniel D. Daugherty wrote:
>>> On 1/31/14 7:58 PM, serguei.spitsyn at oracle.com wrote:
>>>> Please, review the fix for:
>>>>   https://bugs.openjdk.java.net/browse/JDK-6471769
>>>>
>>>>
>>>> Open webrev:
>>>> http://cr.openjdk.java.net/~sspitsyn/webrevs/2014/hotspot/6471769-JVMTI-FRAME/ 
>>>>
>>>
>>> src/share/vm/prims/jvmtiEnv.cpp
>>>     Thumbs up on the code change.
>>>
>>>     I've never been a fan of the:
>>>
>>>         if (is_thread_fully_suspended(...))
>>>             do-something-direct
>>>         else
>>>             do-something-via-a-safepoint
>>>
>>>     model. We don't have any mechanism in the VM to keep the
>>>     is_thread_fully_suspended() constraint true over the
>>>     execution time of the "do-something-direct" operation so
>>>     we've always been exposed to races.
>>
>> Thank you for this confirmation.
>> So that, we are in sync here.
>>
>>
>>> Also, I think this
>>>     model is used for several other JVM/TI calls so you might
>>>     want to visit those also with an eye on safety.
>>
>> Yes, I'm looking to isolate and double check other places like this 
>> in the code.
>> Another piece of work is to isolate other bugs on this topic.
>>
>>
>>>
>>>     For this specific change, I think the comment would be
>>>     better as:
>>>
>>>     // It is only safe to perform the direct operation on the current
>>>     // thread. All other usage needs to use a vm-op for safety.
>>
>> It is taken, thanks!
>>
>>> In particular, I don't agree with this part of the comment:
>>>
>>>     // Performance-wise the only important case is when current 
>>> thread is
>>>     // the target thread.
>>>
>>>     I'm not sure how you can state that performance on the current 
>>> thread
>>>     is the only important case, but maybe I don't understand what 
>>> you're
>>>     really trying to say here.
>>
>> I agree, the comment is not clear.
>> Probably, the comment had to say: "the only case where we can keep 
>> performance
>> without loosing safety is the target thread is current ...".
>> In fact, collecting data at a safepoint is the only way to make it safe.
>> If the target thread is not current then:
>>  - a) thread has been suspended: already slow path, so that 
>> performance is not that critical
>>  - b) thread has not been suspended: the information we provide is 
>> transient,
>>         safepoint is the only way to get it safe (is is correct?)
>
> Yes, the only safe way to get info for another thread is via
> a safepoint operation.

Ok, thanks.

>>>
>>> Now for the more general question with respect to this bug: 6471769
>>> I've having trouble connecting this change (which I like) to this
>>> particular bug report.
>>>
>>> The crash in the bug's description looks like this:
>>>
>>> # Internal Error 
>>> (/net/prt-solamd64-q1-2/PrtBuildDir/workspace/src/share/vm/prims/jvmtiThreadState.cpp, 
>>> 273), pid=9561, tid=2
>>> #
>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM 
>>> (20060914135846.dcubed.service_hs_b02_merge.2-debug compiled mode)
>>> #
>>> # Error: assert(_cur_stack_depth == count_frames(),"cur_stack_depth 
>>> out of sync")
>>>
>>> V [libjvm.so+0x1217206];; void VMError::report_and_die()+0x606
>>> V [libjvm.so+0x667841];; void report_assertion_failure(const 
>>> char*,int,const char*)+0x61
>>> V [libjvm.so+0xca9557];; int JvmtiThreadState::cur_stack_depth()+0x6e7
>>> V [libjvm.so+0xc1caaa];; void 
>>> JvmtiExport::post_method_exit(JavaThread*,methodOop,frame)+0x21ba
>>> V [libjvm.so+0x8385de];; void 
>>> InterpreterRuntime::post_method_exit(JavaThread*)+0x21e
>>> j java.io.ObjectInputStream$BlockDataInputStream.getBlockDataMode()Z+4
>>>
>>> In the above crash, the current thread is trying to post a
>>> JVMTI_EVENT_METHOD_EXIT and during the event posting code, it
>>> calls cur_stack_depth() which fails an assert because the
>>> cached stack depth doesn't match the current queried value:
>>>
>>>    282  int JvmtiThreadState::cur_stack_depth() {
>>>    283    uint32_t debug_bits = 0;
>>>    284    guarantee(JavaThread::current() == get_thread() ||
>>>    285      JvmtiEnv::is_thread_fully_suspended(get_thread(), false, 
>>> &debug_bits
>>> ),
>>>    286      "must be current thread or suspended");
>>>    287
>>>    288    if (!is_interp_only_mode() || _cur_stack_depth == 
>>> UNKNOWN_STACK_DEPTH)
>>>  {
>>>    289      _cur_stack_depth = count_frames();
>>>    290    } else {
>>>    291      // heavy weight assert
>>>    292      assert(_cur_stack_depth == count_frames(),
>>>    293             "cur_stack_depth out of sync");
>>>    294    }
>>>    295    return _cur_stack_depth;
>>>    296  }
>>>
>>>
>>> This is not an unsafe query from one thread to another target thread
>>> so I don't see how the proposed change will address this failure mode.
>>> In particular, the assertion is troubling because it tells me that
>>> the cached stack depth for the current thread is either wrong or has
>>> been corrupted.
>>
>> The above are consequences of unsafe operations we are doing.
>> Current thread should not calculate frame count of target thread (if 
>> different) other than at a safepoint.
>> We assume the thread is suspended but, in fact, it was not.
>> So that we lost safety here and all manifestation we observe are the 
>> consequences.
>
> For at least the stack trace shown in the description of 6471769,
> it is not an unsafe case because the thread is operating on itself.
> All I'm saying here is that 6471769 is not the right bug to use for
> this fix...

Thank you for sharing the concern.
I was not able to reproduce the issue and so, did not prove my theory 
(which can be wrong).
But you have a pretty valid point here.
I still think, the stack trace with the assert is a consequence of a 
previous unsafe calculation on another thread.
But that calculation is probably different than that I'm fixing in this 
webrev.

>> This bug covers the frames miscount issue.
>> In the fastdebug mode we may get the assert as in this bug report:
>>   https://bugs.openjdk.java.net/browse/JDK-8032223
>
> This bug: JDK-8032223 seems to be a perfect match for code that you
> are proposing the change in your current webrev. You have a thread
> ("JDWP Transport Listener: dt_socket") that is trying to query the
> frame count on another thread... That target JavaThread met the
> is_thread_fully_suspended() criteria when execution went down the
> direct path, but later the target stopped meeting that criteria...
>
> And boom...
>
>> There is another potential race that I forgot to check and fix but 
>> David pointed out.
>> I will try to cover it in the next version of webrev.
>
> See the separate reply that I sent on that part...
>
>
>>
>> It is still possible there can be other issues here (as you say).
>> But it is hard to make sure it is the case because it is extremely 
>> hard to reproduce (you know it!).
>> I'd suggest to resolve it step-by-step:
>>   - fix the known safety issues
>>   - watch the nightly if anything else is left, file and fix new bugs
>
> This sounds like a good plan.
>
>
>>
>>>
>>> To reiterate:
>>>
>>> - I like the code change!
>>> - I would like to see the comment tweaked a bit
>>>
>>> You should consider going forward with this change using a different
>>> bug ID, perhaps something like:
>>>
>>>    JDK-NNNNNNN improve safety of JVM/TI GetFrameCount()
>>
>> There is already bug on this topic (and there are even more dups):
>>   https://bugs.openjdk.java.net/browse/JDK-8032223
>>
>> But I still believe, the issue is the same ...
>
> I don't think the issues in JDK-6471769 and JDK-8032223 are the same;
> well not the same for the crash in the description of JDK-6471769. I
> didn't look closely at the other crashes...

Ok.
I will reopen the JDK-8032223 and use it for this fix.

Thank you for helping to sort this out!
Serguei

>
>
> Dan
>
>
>>
>>
>> Thanks!
>> Serguei
>>
>>>
>>> Dan
>>>
>>>
>>>
>>>>
>>>> Summary:
>>>>
>>>>   There is a general issue in the suspend equivalent condition 
>>>> mechanism:
>>>>   Two subsequent calls to the JvmtiEnv::is_thread_fully_suspended() 
>>>> may return different results:
>>>>     - 1-st: true
>>>>     - 2-nd: false
>>>>
>>>>   This more generic suspend equivalent issue is covered by another 
>>>> bug:
>>>>     https://bugs.openjdk.java.net/browse/JDK-6280037
>>>>
>>>>   The bug to fix in this review is a specific manifestation of the 
>>>> 6280037
>>>>   in the JVMTI GetFrameCount() that has a big impact on the SQE 
>>>> nightly.
>>>>   It is on the Test Stabilization radar (as well as the 6280037).
>>>>   There are many tests intermittently failing because of this.
>>>>
>>>>   The webrev for review is a one-liner work around the 6280037 for 
>>>> the GetFrameCount().
>>>>
>>>>   The JVMTI GetFrameCount() spec tells:
>>>>     "If this function is called for a thread actively executing 
>>>> bytecodes (for example,
>>>>      not the current thread and not suspended), the information 
>>>> returned is transient."
>>>>
>>>>   So, it is Ok to call the GetFrameCount() for non-suspended target 
>>>> threads.
>>>>   To achieve safety, the frame count for non-suspended threads is 
>>>> calculated at a safepoint.
>>>>   It should be Ok and more safe to do the same for suspended 
>>>> threads as well.
>>>>   There is no big performance impact because it is already on a 
>>>> slow path.
>>>>   It is still important to avoid safepointing when the target 
>>>> thread is current.
>>>>
>>>>   The bug 6280037 should go out of the Test Stabilization radar 
>>>> (remove the svc-nightly label)
>>>>   as the most of the impacted tests are covered by the 6471769.
>>>>
>>>>
>>>> Testing:
>>>>   In progress: nsk.jvmti, nsk.jdi, nsk.jdwp, impacted JTreg tests
>>>>
>>>>
>>>> Thanks,
>>>> Serguei
>>>>
>>>
>>
>