JDK-8034084: nsk.nsk/jvmti/ThreadStart/threadstart003 Wrong number of thread end events

Fri Aug 10 18:15:37 UTC 2018

There's a fairly old bug concerning jvmti thread start and end events.
I believe there are 2 possible race conditions in the threadstart003
test based on some arbitrary timeout values used in the test itself.

Here are a few snippets from the test to illustrate the problem area.

In the initial sequence RunAgentThread is used to start a thread
with the agentThreadProc entry point.

    188        err = (*jvmti)->RawMonitorEnter(jvmti, wait_lock);
    189        if (err != JVMTI_ERROR_NONE) {
    190            printf("(RawMonitorEnter) unexpected error: %s (%d)\n",
    191                   TranslateError(err), err);
    192            result = FAILED;
    193        }

*   194        err = (*jvmti)->RunAgentThread(jvmti, thr, agentThreadProc,*
    195            NULL, JVMTI_THREAD_MAX_PRIORITY);
    196        if (err != JVMTI_ERROR_NONE) {
    197            printf("(RunAgentThread) unexpected error: %s (%d)\n",
    198                   TranslateError(err), err);
    199            result = FAILED;
    200        }

*   201        err = (*jvmti)->RawMonitorWait(jvmti, wait_lock, 
(jlong)WAIT_TIME);*
    202        if (err != JVMTI_ERROR_NONE) {
    203            printf("(RawMonitorWait) unexpected error: %s (%d)\n",
    204                   TranslateError(err), err);
    205            result = FAILED;
    206        }
    207        err = (*jvmti)->RawMonitorExit(jvmti, wait_lock);
    208        if (err != JVMTI_ERROR_NONE) {
    209            printf("(RawMonitorExit) unexpected error: %s (%d)\n",
    210                   TranslateError(err), err);
    211            result = FAILED;
    212        }

I believe the test expects to wake up based on the code in agentThreadProc
notification of the wait_lock.

    121        err = (*jvmti)->RawMonitorEnter(jvmti, wait_lock);
    122        if (err != JVMTI_ERROR_NONE) {
    123            printf("(RawMonitorEnter) unexpected error: %s (%d)\n",
    124                   TranslateError(err), err);
    125            result = FAILED;
    126        }
*   127        err = (*jvmti)->RawMonitorNotify(jvmti, wait_lock);*
    128        if (err != JVMTI_ERROR_NONE) {
    129            printf("(RawMonitorNotify) unexpected error: %s (%d)\n",
    130                   TranslateError(err), err);
    131            result = FAILED;
    132        }
    133        err = (*jvmti)->RawMonitorExit(jvmti, wait_lock);
    134        if (err != JVMTI_ERROR_NONE) {
    135            printf("(RawMonitorExit) unexpected error: %s (%d)\n",
    136                   TranslateError(err), err);
    137            result = FAILED;
    138        }

A second block for the wait_lock does not have a corresponding
notify call. I'm guessing the monitor is expected to time out,
so it allows for one second to complete the delivery of the
thread end event.

   213
    214        err = (*jvmti)->RawMonitorEnter(jvmti, wait_lock);
    215        if (err != JVMTI_ERROR_NONE) {
    216            printf("(RawMonitorEnter) unexpected error: %s (%d)\n",
    217                   TranslateError(err), err);
    218            result = FAILED;
    219        }
*   220        err = (*jvmti)->RawMonitorWait(jvmti, wait_lock, 
(jlong)WAIT_TIME);*
    221        if (err != JVMTI_ERROR_NONE) {
    222            printf("(RawMonitorWait) unexpected error: %s (%d)\n",
    223                   TranslateError(err), err);
    224            result = FAILED;
    225        }
    226        err = (*jvmti)->RawMonitorExit(jvmti, wait_lock);
    227        if (err != JVMTI_ERROR_NONE) {
    228            printf("(RawMonitorExit) unexpected error: %s (%d)\n",
    229                   TranslateError(err), err);
    230            result = FAILED;
    231        }

I see 2 possible race conditions here.

   - if the thread is slow to start, then the first RawMonitorWait
     could timeout. The code would continue to the second RawMonitorWait
     which would wake up from the agentThreadProc call to RawMonitorNotify.

   - if the thread ending is slow to complete then the second RawMonitorWait
     could timeout before the thread end event is processed.

After the second RawMonitorWait completes, the event notification is 
disabled
for both thread start and end events. If we disable the end thread event 
before
the thread end event is delivered the counts will not match as expected.

    233        err = (*jvmti)->SetEventNotificationMode(jvmti, 
JVMTI_DISABLE,
    234                JVMTI_EVENT_THREAD_START, NULL);
    235        if (err != JVMTI_ERROR_NONE) {
    236            printf("Failed to disable JVMTI_EVENT_THREAD_START: 
%s (%d)\n",
    237                   TranslateError(err), err);
    238            result = FAILED;
    239        }
    240 *
    241        err = (*jvmti)->SetEventNotificationMode(jvmti, 
JVMTI_DISABLE,
    242                JVMTI_EVENT_THREAD_END, NULL);*
    243        if (err != JVMTI_ERROR_NONE) {
    244            printf("Failed to disable JVMTI_EVENT_THREAD_END: %s 
(%d)\n",
    245                   TranslateError(err), err);
    246            result = FAILED;
    247        }

My recommendation is to change the first RawMonitorWait to use 0
rather than WAIT_TIME. This means it will wait until notified.
For the second RawMonitorWait, we should retry the wait if
endsCount does not equal endsExpected.

    201        err = (*jvmti)->RawMonitorWait(jvmti, wait_lock, 0);

    220        err = (*jvmti)->RawMonitorWait(jvmti, wait_lock, 
(jlong)WAIT_TIME);
               if (endsCount != endsExpected) {
                   err = (*jvmti)->RawMonitorWait(jvmti, wait_lock, 
(jlong)WAIT_TIME * 2);
               }
    221        if (err != JVMTI_ERROR_NONE) {

So we remove the first race condition by removing the timeout on the 
first wait call.
And we accommodate a possible second race condition by retrying the 
second wait with
an extended delay.

An alternative to the retry could be to add a RawMonitorNotify call in 
the thread
end event callback ThreadEnd() after the endsCount has been incremented.

This test fails very infrequently, so it may be difficult to prove that 
the specific
problem has been resolved. I'll try some testing with the proposed changes
and see if we can work towards getting this test off the ProblemList for 
jdk12.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.openjdk.java.net/pipermail/serviceability-dev/attachments/20180810/35ed77c4/attachment-0001.html>