RFR: JDK-8208473: [TESTBUG] nsk/jdb/exclude/exclude001/exclude001.java is timing out on solaris-sparc again

Chris Plummer chris.plummer at oracle.com
Fri Sep 28 20:08:22 UTC 2018


That's not quite the reason why kill002 failed. When doing a kill 
command, there are two prompts printed. One is printed immediately after 
issuing the async kill command, and the other after the async kill 
command completes. We first wait for the "killed" message, and then wait 
for the prompt. The problem is that waitForPrompt() is only waiting for 
the first prompt when it should wait for the second. Normally the second 
has already arrived with the "killed" output, but sometimes it doesn't, 
so we return with the second prompt still pending, and later the second 
prompt mistakenly shows up in the reply of the next command.

https://bugs.openjdk.java.net/browse/JDK-8208473?focusedCommentId=14210269&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14210269

So we aren't dealing with a timeout in this case and we aren't dealing 
with receiveReply() being called for debug purposes and improperly 
waiting for the prompt. But there is a bit more going on than just this 
in the kill002 failure. After the last killed message is received 
(prematurely before the second prompt), the test does:

         reply = jdb.receiveReplyFor(JdbCommand.threads);
         reply = jdb.receiveReplyFor(JdbCommand.cont);

The "threads" command returns early because of the arrival of the second 
prompt from the last "kill" command, which means the "threads" command 
also returns before its own prompt is received. It then does the "cont" 
command, which then itself terminates early because it receives the 
prompt belonging to the "threads" command. Then we move on to the 
following code, which has a bug in it:

         // make sure the debugger is at a breakpoint
         if (!jdb.isAtBreakpoint(reply, LAST_BREAK)) {
             log.display("Expected breakpoint has not been hit yet");
             jdb.waitForMessage(0, LAST_BREAK);
         }
         log.display("Breakpoint has been hit");

The isAtBreakpoint() fails because the "cont" command has not actually 
completed yet. So we then do the waitForMessage() to wait for the 
breakpoint message. However, this code is buggy. It starts to look for 
LAST_BREAK starting from the very start of the stdout. It will find it 
immediately because it's there near the start of the log from when the 
breakpoint was setup:

Sending command: stop in nsk.jdb.kill.kill002.kill002a.breakHere
reply[0]: Set breakpoint nsk.jdb.kill.kill002.kill002a.breakHere
reply[1]: main[1]

So waitForMessage() does not do the job it was intended to do, and we 
move on to the next part of the test which is issuing the eval command. 
It fails to produce the proper output because the thread is not yet 
suspended at the breakpoint. The eval command output not only includes 
the indication of failure, it includes the simple prompt as proof that 
the thread was not suspended, and then also the breakpoint output, which 
should have arrived before the eval command was ever executed.

Sending command: eval nsk.jdb.kill.kill002.kill002a.notKilled
reply[0]: com.sun.tools.example.debug.expr.ParseException: Name unknown: 
nsk.jdb.kill.kill002.kill002a.notKilled
reply[1]:  nsk.jdb.kill.kill002.kill002a.notKilled = null
reply[2]: >
reply[3]: Breakpoint hit: "thread=main", 
nsk.jdb.kill.kill002.kill002a.breakHere(), line=38 bci=0
reply[4]: 38        static void breakHere () {}
reply[5]:
reply[6]: main[1]

So for kill002 we need to fix the handling of kill commands so they wait 
for the second prompt. I think once we do this, we no longer need the 
waitForMessage() error handling logic to wait for the breakpoint. We 
should be at the breakpoint after receiveReplyFor(JdbCommand.cont) 
returns, and we should just be verifying that we are there, not wait to 
get there.

Chris

On 9/28/18 11:29 AM, Gary Adams wrote:
> The kill002 logs I saw returned after the matching message e.g. "xxx 
> killed"
> When the pending output was dumped it included next prompt,
> but the next prompt was not returned. This could have been a
> similar situation to the exclude timeout, which then ended up
> doing a wait for prompt as part of gathering the pending
> output.
>
> On 9/28/18, 2:23 PM, Chris Plummer wrote:
>> I'm fine with keeping the Jdb changes out of this exclude001 fix, but 
>> I don't see what it has to do with kill002. IIRC the bug with that 
>> test is that it is only waiting for the first of two prompts that 
>> come in, so sometimes it returns the reply too early.
>>
>> Chris
>>
>> On 9/28/18 11:20 AM, Gary Adams wrote:
>>> I think I'd pref to just submit the exclude001 fix at this time 
>>> without the
>>> Jdb.java changes.
>>>
>>> We can revisit this wait for message diagnostic clean up along with 
>>> the kill002
>>> fix.
>>>
>>> On 9/28/18, 2:15 PM, Chris Plummer wrote:
>>>> Yep. My suggestion was to use this version of receiveReply(), but 
>>>> also have receiveReply() not call waitForPrompt() if count was 0. I 
>>>> think it would be good to test with this change in place, but with 
>>>> a smaller timeout so you can reproduce the timeout error and test 
>>>> this receiveReply() call. Other than that the changes look fine to me.
>>>>
>>>> thanks,
>>>>
>>>> Chris
>>>>
>>>> On 9/28/18 11:04 AM, Alex Menkov wrote:
>>>>> Hi Gary,
>>>>>
>>>>> receiveReply(startPos, false, 0)
>>>>> calls
>>>>> waitForPrompt(startPos, compoundPromptOnly, count);
>>>>>
>>>>> and waitForPrompt has:
>>>>>         if (count <= 0) {
>>>>>             throw new TestBug("Wrong number of prompts count in 
>>>>> Jdb.waitForPrompt(): " + count);
>>>>>         }
>>>>>
>>>>> So We will get "Wrong number of prompts count" failure?
>>>>>
>>>>> --alex
>>>>>
>>>>> On 09/28/2018 04:47, Gary Adams wrote:
>>>>>> Revised webrev:
>>>>>>
>>>>>>    Webrev: http://cr.openjdk.java.net/~gadams/8208473/webrev.01/
>>>>>>
>>>>>> The final fix includes
>>>>>>      - updated the timeout for the test (should handle sparc 
>>>>>> debug slowness)
>>>>>>      - wait for explicit prompts from cont command (avoids 
>>>>>> confusion from "int[2]")
>>>>>>      - fixed a typo in an exclude pattern ("jdk.*")
>>>>>>      - on wait for message timeout, don't wait for prompt
>>>>>>         when dumping current
>>>>>>
>>>>>> Should have another reviewer in addition to Chris.
>>>>>>
>>>>>> On 9/27/18, 3:12 PM, Chris Plummer wrote:
>>>>>>> The extra check after timing out doesn't seem like it should 
>>>>>>> help. You've already called findMessage() 2100 times at 200ms 
>>>>>>> intervals. Why would one more call after that help? I think it 
>>>>>>> might be the receiveReply() call that is fixing it. It does a 
>>>>>>> waitForPrompt(), so this probably gives us another 420000 ms for 
>>>>>>> the prompt to come in. This call to receiveReply() is actually a 
>>>>>>> bug itself since we are doing it just to print the current 
>>>>>>> buffer, not the buffer after waiting for a prompt to come in.
>>>>>>>
>>>>>>> In any case, looks like this prompt is taking more than 420200 
>>>>>>> milliseconds to come in, but does eventually come in, and extra 
>>>>>>> waiting in receiveReply() is what is causing you to eventually 
>>>>>>> see the prompt. I think bumping up the timeout to 600 and the 
>>>>>>> waittime to 10 is the proper fix here.
>>>>>>>
>>>>>>> And to address the receiveReply() issue, I'd suggest calling it 
>>>>>>> using receiveReply(startPos, false, 0), where 0 is the prompt 
>>>>>>> count, and have receiveReply() not wait for a prompt when the 
>>>>>>> count is 0.
>>>>>>>
>>>>>>> Chris
>>>>>>>
>>>>>>> On 9/27/18 11:44 AM, Gary Adams wrote:
>>>>>>>> Speaking of not being bullet proof, during testing of the fix to
>>>>>>>> wait for a specific prompt an intermittent failure was observed.
>>>>>>>> ...
>>>>>>>>
>>>>>>>> Sending command: trace methods 0x2a9
>>>>>>>> reply[0]: MyThread-0[1]
>>>>>>>> Sending command: cont
>>>>>>>> WARNING: message not recieved: MyThread-0[1]
>>>>>>>> Remaining debugger output follows:
>>>>>>>> reply[0]:>
>>>>>>>> reply[1]: Method exited: return value =<void value>, 
>>>>>>>> "thread=MyThread-0", nsk.jdb.exclude.exclude001.MyThread.run(), 
>>>>>>>> line=93 bci=14
>>>>>>>> reply[2]: 93        }
>>>>>>>> reply[3]:
>>>>>>>> reply[4]: MyThread-0[1]
>>>>>>>> # ERROR: Caught unexpected exception while executing the test: 
>>>>>>>> nsk.share.Failure: Expected message not received during 420200 
>>>>>>>> milliseconds:
>>>>>>>> ...
>>>>>>>>
>>>>>>>> The wait for message times out looking for "MyThread-0[1]".
>>>>>>>> A WARNING is printed and the "remaining debugger output"
>>>>>>>> shows that "MyThread-0[1]" is in the buffer.
>>>>>>>>
>>>>>>>> I'm still investigating why the message match is not found.
>>>>>>>>
>>>>>>>> Adding a final check before failing the wait for message
>>>>>>>> seems to workaround the problem.
>>>>>>>>
>>>>>>>> diff --git 
>>>>>>>> a/test/hotspot/jtreg/vmTestbase/nsk/share/jdb/Jdb.java 
>>>>>>>> b/test/hotspot/jtreg/vmTestbase/nsk/share/jdb/Jdb.java
>>>>>>>> --- a/test/hotspot/jtreg/vmTestbase/nsk/share/jdb/Jdb.java
>>>>>>>> +++ b/test/hotspot/jtreg/vmTestbase/nsk/share/jdb/Jdb.java
>>>>>>>> @@ -515,10 +515,11 @@
>>>>>>>>          long delta = 200; // time in milliseconds to wait at 
>>>>>>>> every iteration.
>>>>>>>>          long total = 0;    // total time has waited.
>>>>>>>>          long max = 
>>>>>>>> getLauncher().getJdbArgumentHandler().getWaitTime() * 60 * 
>>>>>>>> 1000; // maximum time to wait.
>>>>>>>> +        int found = 0;
>>>>>>>>
>>>>>>>>          Object dummy = new Object();
>>>>>>>>          while ((total += delta) <= max) {
>>>>>>>> -            int found = 0;
>>>>>>>> +            found = 0;
>>>>>>>>
>>>>>>>>              // search for message
>>>>>>>>              {
>>>>>>>> @@ -553,6 +554,12 @@
>>>>>>>>          log.display("WARNING: message not recieved: " + message);
>>>>>>>>          log.display("Remaining debugger output follows:");
>>>>>>>>          receiveReply(startPos);
>>>>>>>> +
>>>>>>>> +        // One last chance
>>>>>>>> +        found = findMessage(startPos, message);
>>>>>>>> +        if (found > 0) {
>>>>>>>> +            return found;
>>>>>>>> +        }
>>>>>>>>          throw new Failure("Expected message not received 
>>>>>>>> during " + total + " milliseconds:"
>>>>>>>>                              + "\n\t" + message);
>>>>>>>>      }
>>>>>>>>
>>>>>>>>
>>>>>>>> On 9/20/18, 5:47 PM, Chris Plummer wrote:
>>>>>>>>> Looks good. Still not bullet proof, but I'm not sure it's 
>>>>>>>>> possible to write tests like this in a way that will work no 
>>>>>>>>> matter what output is produced by the method enter/exit events.
>>>>>>>>>
>>>>>>>>> Chris
>>>>>>>>>
>>>>>>>>> On 9/20/18 10:59 AM, Gary Adams wrote:
>>>>>>>>>> The test failure has been identified due to the "int[2]"
>>>>>>>>>> being misrecognized as a compound prompt.  This caused a cont
>>>>>>>>>> command to be sent prematurely.
>>>>>>>>>>
>>>>>>>>>> The proposed fix waits for the correct prompt before
>>>>>>>>>> advancing to the next command.
>>>>>>>>>>
>>>>>>>>>>   Webrev: http://cr.openjdk.java.net/~gadams/8208473/webrev/
>>>>>>>>>>   Issue: https://bugs.openjdk.java.net/browse/JDK-8208473
>>>>>>>>>>
>>>>>>>>>> Testing is in progress.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>>
>>>
>>
>>
>




More information about the serviceability-dev mailing list