JDK-8034084: nsk.nsk/jvmti/ThreadStart/threadstart003 Wrong number of thread end events

Tue Aug 14 17:03:19 UTC 2018

I did 1000 testruns with these changes. There were 3 failures due to 
"executor timeout"
but those are not related to these changes (infrastructure).

   Webrev: http://cr.openjdk.java.net/~gadams/8034084/webrev.00/index.html
   Issue: https://bugs.openjdk.java.net/browse/JDK-8034084

On 8/12/18, 7:01 PM, David Holmes wrote:
> Hi Gary,
>
> Good analysis. I agree with your reasoning.
>
> I think these tests were written in a way that assumed that we have to 
> locally timeout to detect things not happening "quickly enough" but as 
> we well know "quickly enough" is highly dependent on a number of 
> factors. And now we rely on the test framework to manage timeouts.
>
> I think removing the timeout from the first wait() and simply allowing 
> the framework timeouts to take effect is a good solution.
>
> Repeating the wait in the second case with twice the timeout is less 
> clear. One second wasn't long enough, but perhaps 3 seconds will be. 
> And it probably will but is not guaranteed. The real guarantee is to 
> wait until you see the expected outcome (either by polling using the 
> timed-wait, or else by inserting a notify elsewhere - which itself may 
> be tricky). And once you start pushing on the wait/notify logic the 
> whole structure of the test potentially changes as you either wait 
> until you get the expected result, or else get timed out by the 
> framework.
>
> I think waiting a bit longer will probably suffice and cause minimum 
> disruption to the test logic, but I'd be included to put the wait into 
> a for-loop and use 1-second timeouts per iteration ie.
>
> // wait for up to 3 seconds
> for (int i = 0 ; i < 3 ; i++) {
>   err = (*jvmti)->RawMonitorWait(jvmti, wait_lock, (jlong)WAIT_TIME);
>   if (endsCount == endsExpected)
>     break;
> }
>
> Thanks,
> David
> On 11/08/2018 4:15 AM, Gary Adams wrote:
>> There's a fairly old bug concerning jvmti thread start and end events.
>> I believe there are 2 possible race conditions in the threadstart003
>> test based on some arbitrary timeout values used in the test itself.
>>
>> Here are a few snippets from the test to illustrate the problem area.
>>
>> In the initial sequence RunAgentThread is used to start a thread
>> with the agentThreadProc entry point.
>>
>>     188        err = (*jvmti)->RawMonitorEnter(jvmti, wait_lock);
>>     189        if (err != JVMTI_ERROR_NONE) {
>>     190            printf("(RawMonitorEnter) unexpected error: %s 
>> (%d)\n",
>>     191                   TranslateError(err), err);
>>     192            result = FAILED;
>>     193        }
>>
>> *   194        err = (*jvmti)->RunAgentThread(jvmti, thr, 
>> agentThreadProc,*
>>     195            NULL, JVMTI_THREAD_MAX_PRIORITY);
>>     196        if (err != JVMTI_ERROR_NONE) {
>>     197            printf("(RunAgentThread) unexpected error: %s 
>> (%d)\n",
>>     198                   TranslateError(err), err);
>>     199            result = FAILED;
>>     200        }
>>
>> *   201        err = (*jvmti)->RawMonitorWait(jvmti, wait_lock, 
>> (jlong)WAIT_TIME);*
>>     202        if (err != JVMTI_ERROR_NONE) {
>>     203            printf("(RawMonitorWait) unexpected error: %s 
>> (%d)\n",
>>     204                   TranslateError(err), err);
>>     205            result = FAILED;
>>     206        }
>>     207        err = (*jvmti)->RawMonitorExit(jvmti, wait_lock);
>>     208        if (err != JVMTI_ERROR_NONE) {
>>     209            printf("(RawMonitorExit) unexpected error: %s 
>> (%d)\n",
>>     210                   TranslateError(err), err);
>>     211            result = FAILED;
>>     212        }
>>
>> I believe the test expects to wake up based on the code in 
>> agentThreadProc
>> notification of the wait_lock.
>>
>>     121        err = (*jvmti)->RawMonitorEnter(jvmti, wait_lock);
>>     122        if (err != JVMTI_ERROR_NONE) {
>>     123            printf("(RawMonitorEnter) unexpected error: %s 
>> (%d)\n",
>>     124                   TranslateError(err), err);
>>     125            result = FAILED;
>>     126        }
>> *   127        err = (*jvmti)->RawMonitorNotify(jvmti, wait_lock);*
>>     128        if (err != JVMTI_ERROR_NONE) {
>>     129            printf("(RawMonitorNotify) unexpected error: %s 
>> (%d)\n",
>>     130                   TranslateError(err), err);
>>     131            result = FAILED;
>>     132        }
>>     133        err = (*jvmti)->RawMonitorExit(jvmti, wait_lock);
>>     134        if (err != JVMTI_ERROR_NONE) {
>>     135            printf("(RawMonitorExit) unexpected error: %s 
>> (%d)\n",
>>     136                   TranslateError(err), err);
>>     137            result = FAILED;
>>     138        }
>>
>> A second block for the wait_lock does not have a corresponding
>> notify call. I'm guessing the monitor is expected to time out,
>> so it allows for one second to complete the delivery of the
>> thread end event.
>>
>>    213
>>     214        err = (*jvmti)->RawMonitorEnter(jvmti, wait_lock);
>>     215        if (err != JVMTI_ERROR_NONE) {
>>     216            printf("(RawMonitorEnter) unexpected error: %s 
>> (%d)\n",
>>     217                   TranslateError(err), err);
>>     218            result = FAILED;
>>     219        }
>> *   220        err = (*jvmti)->RawMonitorWait(jvmti, wait_lock, 
>> (jlong)WAIT_TIME);*
>>     221        if (err != JVMTI_ERROR_NONE) {
>>     222            printf("(RawMonitorWait) unexpected error: %s 
>> (%d)\n",
>>     223                   TranslateError(err), err);
>>     224            result = FAILED;
>>     225        }
>>     226        err = (*jvmti)->RawMonitorExit(jvmti, wait_lock);
>>     227        if (err != JVMTI_ERROR_NONE) {
>>     228            printf("(RawMonitorExit) unexpected error: %s 
>> (%d)\n",
>>     229                   TranslateError(err), err);
>>     230            result = FAILED;
>>     231        }
>>
>> I see 2 possible race conditions here.
>>
>>    - if the thread is slow to start, then the first RawMonitorWait
>>      could timeout. The code would continue to the second RawMonitorWait
>>      which would wake up from the agentThreadProc call to 
>> RawMonitorNotify.
>>
>>    - if the thread ending is slow to complete then the second 
>> RawMonitorWait
>>      could timeout before the thread end event is processed.
>>
>> After the second RawMonitorWait completes, the event notification is 
>> disabled
>> for both thread start and end events. If we disable the end thread 
>> event before
>> the thread end event is delivered the counts will not match as expected.
>>
>>     233        err = (*jvmti)->SetEventNotificationMode(jvmti, 
>> JVMTI_DISABLE,
>>     234                JVMTI_EVENT_THREAD_START, NULL);
>>     235        if (err != JVMTI_ERROR_NONE) {
>>     236            printf("Failed to disable 
>> JVMTI_EVENT_THREAD_START: %s (%d)\n",
>>     237                   TranslateError(err), err);
>>     238            result = FAILED;
>>     239        }
>>     240 *
>>     241        err = (*jvmti)->SetEventNotificationMode(jvmti, 
>> JVMTI_DISABLE,
>>     242                JVMTI_EVENT_THREAD_END, NULL);*
>>     243        if (err != JVMTI_ERROR_NONE) {
>>     244            printf("Failed to disable JVMTI_EVENT_THREAD_END: 
>> %s (%d)\n",
>>     245                   TranslateError(err), err);
>>     246            result = FAILED;
>>     247        }
>>
>> My recommendation is to change the first RawMonitorWait to use 0
>> rather than WAIT_TIME. This means it will wait until notified.
>> For the second RawMonitorWait, we should retry the wait if
>> endsCount does not equal endsExpected.
>>
>>     201        err = (*jvmti)->RawMonitorWait(jvmti, wait_lock, 0);
>>
>>
>>     220        err = (*jvmti)->RawMonitorWait(jvmti, wait_lock, 
>> (jlong)WAIT_TIME);
>>                if (endsCount != endsExpected) {
>>                    err = (*jvmti)->RawMonitorWait(jvmti, wait_lock, 
>> (jlong)WAIT_TIME * 2);
>>                }
>>     221        if (err != JVMTI_ERROR_NONE) {
>>
>> So we remove the first race condition by removing the timeout on the 
>> first wait call.
>> And we accommodate a possible second race condition by retrying the 
>> second wait with
>> an extended delay.
>>
>> An alternative to the retry could be to add a RawMonitorNotify call 
>> in the thread
>> end event callback ThreadEnd() after the endsCount has been incremented.
>>
>> This test fails very infrequently, so it may be difficult to prove 
>> that the specific
>> problem has been resolved. I'll try some testing with the proposed 
>> changes
>> and see if we can work towards getting this test off the ProblemList 
>> for jdk12.
>>
>>