RFR: JDK-8210106: sun/tools/jps/TestJps.java timed out
Daniel D. Daugherty
daniel.daugherty at oracle.com
Mon Dec 10 16:24:41 UTC 2018
On 12/10/18 2:55 AM, David Holmes wrote:
> Hi Dan,
>
>> Gary saw a longest time value of 280 seconds in his testing with
>> fastdebug bits. The default timeout value is 120 seconds with a
>> default timeout factor of 4 in Mach5 gives a total timeout of
>> 480 seconds (8 minutes). With Gary's new default timeout of
>> 360 seconds, we'll have a total timeout value of 1440 seconds
>> (24 minutes).
>
> So based on that it seems Gary's observations are not relevant to what
> has been seen in mach5 because at 280 seconds we should never have
> been timing out!
Correct. Gary's test runs never ran into the same state as the
end-of-August run where the test ran into the 20 minute timeout.
Since those logs are now gone, we cannot discern any more information
about what happened back then...
Dan
>
> I guess these changes won't hurt.
>
> Thanks,
> David
>
> On 8/12/2018 1:12 am, Daniel D. Daugherty wrote:
>> On 12/7/18 7:58 AM, Gary Adams wrote:
>>> On 12/6/18, 7:52 PM, David Holmes wrote:
>>>> Hi Gary,
>>>>
>>>> On 6/12/2018 11:27 pm, Gary Adams wrote:
>>>>> On a local linux-x64-debug build this test consistently runs in
>>>>> less than 40 seconds.
>>>>> On the mach5 testing machines there was a large fluctuation in the
>>>>> time to complete this test.
>>>>>
>>>>> Since the test runs jps with different combinations of arguments,
>>>>> the total
>>>>> test time is dependent on the number of processes running java
>>>>> programs.
>>>>> Since the mach5 test infrastructure runs java programs I have seen
>>>>> a 3X
>>>>> in the amount of output the test produces compared to local test
>>>>> runs.
>>>>>
>>>>> I've run the test several hundred times through mach5 on the
>>>>> slower platforms
>>>>> and then examined the test logs using a 3X setting from the
>>>>> default 120 second
>>>>> jtreg timeout. The slowest reported elapse time from the test logs
>>>>> showed
>>>>> 280 seconds to complete.
>>>>>
>>>>> To improve the reliability of the test to complete, I'd like to
>>>>> increase the
>>>>> timeout to 360 seconds.
>>>>>
>>>>> Issue: https://bugs.openjdk.java.net/browse/JDK-8210106
>>>>>
>>>>> Proposed fix:
>>>>>
>>>>> diff --git a/test/jdk/sun/tools/jps/TestJps.java
>>>>> b/test/jdk/sun/tools/jps/TestJps.java
>>>>> --- a/test/jdk/sun/tools/jps/TestJps.java
>>>>> +++ b/test/jdk/sun/tools/jps/TestJps.java
>>>>> @@ -27,7 +27,7 @@
>>>>> * @modules jdk.jartool/sun.tools.jar
>>>>> * @build LingeredAppForJps
>>>>> * @build LingeredApp
>>>>> - * @run main/othervm TestJps
>>>>> + * @run main/othervm/timeout=360 TestJps
>>>>> */
>>>>
>>>> Doesn't that then get scaled by the timeout factor resulting in a
>>>> much longer timeout than the 360 seconds you intended?
>>>>
>>>> For other timeout adjustments the needed time has been divided by
>>>> the timeout factor to get the actual intended timeout.
>>>
>>> This bug was filed fairly recently in Aug 2018.
>>> At that time the timeout and timeout factor were not sufficient
>>> to avoid the test failing.
>>>
>>> The mach5 timeout factors were adjusted recently, so this test may
>>> no longer be an issue.
>>>
>>> If that is true, then we could simply close this bug as "cannot
>>> reproduce".
>>> An argument could be made that the change in timeout factor may be
>>> responsible for fixing a lot more of the intermittent bugs and that
>>> they
>>> should be closed in a similar manner.
>>>
>>> Historically, we could say this particular bug should have had timeouts
>>> reassessed when the infrastructure switched from jprt to mach5 testing
>>> where there were more visible Java processes running.
>>>
>>> Using a higher explicit timeout will not make the test run any longer
>>> than it needs. It will simply allow the test to not be terminated
>>> sooner
>>> in a hung test scenario.
>>>
>>> What is your preference for this particular issue:
>>> - increase the explicit timeout
>>> - close as cannot reproduce attributed to the timeout factor
>>> adjustments
>>>
>>> What would you recommend going forward for other similar issues:
>>> - determine a new explicit timeout
>>> - close if no timeout failures have been observed since the
>>> timeout factor
>>> was raised
>>
>> General guidance that I've been giving folks is that you bump the
>> default timeout when you see timeouts with 'release' bits with
>> a timeout factor of '1'. The default timeout factor in Mach5 is 4
>> so we run both 'release' and 'fastdebug' bits with that timeout
>> factor. That means our total timeout value for 'release' bits is
>> probably a little long and our total timeout value for 'fastdebug'
>> bits is probably just right for a concurrency factor of 12.
>>
>> On my personal servers I use the following timeout factors:
>>
>> release)
>> TIMEOUT_FACTOR="4"
>> ;;
>> fastdebug)
>> TIMEOUT_FACTOR="6"
>> ;;
>> slowdebug)
>> TIMEOUT_FACTOR="12"
>> ;;
>>
>> However, I run with a concurrency factor of 16. I rarely see
>> timeout failures.
>>
>> All that is well and good, but a 'jps' test is a different
>> beast since it is affected by factors outside of most tests.
>> Since the number of java processes running on the system can
>> affect the test, I'm okay with using a default timeout of '360'
>> for this test as a guard against increases in load or...
>>
>> Gary saw a longest time value of 280 seconds in his testing with
>> fastdebug bits. The default timeout value is 120 seconds with a
>> default timeout factor of 4 in Mach5 gives a total timeout of
>> 480 seconds (8 minutes). With Gary's new default timeout of
>> 360 seconds, we'll have a total timeout value of 1440 seconds
>> (24 minutes).
>>
>> Before the recent changes to Mach5, the default timeout factor
>> was 10 so we had a total timeout value of 1200 seconds (20
>> minutes). Gary's change only bumps the total timeout value
>> by 4 minutes from what we had back in August... when this bug
>> was filed... unfortunately, the log for that sighting is gone
>> so all we know is that the test timed out after 20 minutes
>>
>> > Timeout refired 1200 times
>>
>> We know that JTREG timeout handling kicks in at timeout expiration,
>> but without the log we don't know how much longer than 20 minutes
>> the test ran before JTREG killed it. 24 minutes might do it, but...
>>
>> Dan
>>
>>
>>>
>>>
>>>>
>>>> Cheers,
>>>> David
>>>>
>>>
>>>
>>
More information about the serviceability-dev
mailing list