RFR for bug JDK-8004807: java/util/Timer/Args.java failing intermittently in HS testing

Martin Buchholz martinrb at google.com
Sat Jun 7 01:32:41 UTC 2014


If you don't want to go with my rewrite, you can conservatively just check
in a 10x increase in all the constant durations and see whether the
flakiness goes away.
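A minimal sketch of the conservative option, assuming the test's durations are plain long constants. Every constant goes through one hypothetical scaled() helper, so the 10x increase becomes a one-line change. The property name timer.test.scale and the name TIMEOUT_MS are illustrative and are not taken from Args.java. The base values are the ones quoted in this thread: 10*1000 for DELAY_MS and a 500 ms timeout. With the default factor of 10, DELAY_MS becomes 100 seconds.

```java
// Hypothetical sketch: every duration constant goes through one scale factor,
// so "10x all the constant durations" is a single change.
public class ScaledDurations {
    // Illustrative property name; defaults to the suggested 10x factor.
    static final long SCALE = Long.getLong("timer.test.scale", 10L);

    /** Multiplies a base duration by SCALE, failing loudly on overflow. */
    static long scaled(long baseMillis) {
        return Math.multiplyExact(baseMillis, SCALE);
    }

    // Base values as quoted in this thread; TIMEOUT_MS is a made-up name.
    static final long DELAY_MS = scaled(10 * 1000);   // 100 s with the default factor
    static final long TIMEOUT_MS = scaled(500);       // 5 s with the default factor

    public static void main(String[] args) {
        System.out.println("DELAY_MS=" + DELAY_MS + ", TIMEOUT_MS=" + TIMEOUT_MS);
    }
}
```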


On Thu, Jun 5, 2014 at 9:46 PM, Martin Buchholz <martinrb at google.com> wrote:

> Like David, I find the cause of the failure mystifying.
> How can things fail when we stay below the timeout value of 500ms?
> There's a bug either in Timer or in my own understanding of what should be
> happening.
>
> Anyway, raising the timeout value (as I have done in my minor rewrite)
> seems prudent.  Fortunately, we can write this test in a way that doesn't
> require actually waiting for the timeout to elapse.
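Here is one way to get that property. It is a sketch built on java.util.concurrent.CountDownLatch and is not necessarily what Martin's rewrite does. Each task counts down a latch. CountDownLatch.await(long, TimeUnit) returns as soon as the count reaches zero, so the test only waits out a generous timeout when something has actually gone wrong. The class and method names are hypothetical.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch: a generous timeout that is only paid in full on failure.
public class LatchedTimerCheck {
    /** Returns true if all tasks ran within timeoutMillis; returns early on success. */
    static boolean runsWithin(int taskCount, long timeoutMillis) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(taskCount);
        Timer timer = new Timer(true); // daemon thread, so a failed run cannot hang the JVM
        try {
            for (int i = 0; i < taskCount; i++) {
                timer.schedule(new TimerTask() {
                    @Override public void run() { done.countDown(); }
                }, 0L); // zero delay: eligible to run immediately
            }
            // await returns as soon as the count reaches zero, so the full
            // timeout is only waited out if something is actually wrong.
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            timer.cancel();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.nanoTime();
        boolean ok = runsWithin(13, 100_000L);
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println("ok=" + ok + ", elapsedMs=" + elapsedMs);
    }
}
```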
>
>
> On Wed, Jun 4, 2014 at 1:23 PM, roger riggs <roger.riggs at oracle.com>
> wrote:
>
>> Hi Martin, Eric,
>>
>> Of several hundred failures of this test, most occurred in JRE runs with
>> -Xcomp set.  A few failures occurred with -Xmixed, none with -Xint.
>>
>> The printed "elapsed" times (not normalized to hardware or OS) range from
>> 24 to 132 (ms) with most falling into several buckets in the 30's, 40's,
>> 50's and 70's.
>>
>> I don't spot anything in the Timer.mainLoop code that might break when
>> highly optimized, but that's one possibility.
>>
>> Roger
>>
>>
>>
>> On 6/4/2014 3:25 PM, Martin Buchholz wrote:
>>
>>> Tests for Timer are inherently timing (!) dependent.
>>> It's reasonable for tests to assume that:
>>> - reasonable events like creating a thread and executing a simple task
>>> should complete in less than, say, 2500ms.
>>> - the system clock will not change by a significant amount (> 1 sec)
>>> during the test.  Yes, that means Timer tests are likely to fail during a
>>> daylight saving time switchover; we can live with that.  (We could even
>>> try to fix that by detecting deviations between clock time and elapsed
>>> time, but it's probably not worth it.)
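The clock-deviation check mentioned in the parenthetical could look roughly like the sketch below. The class name is hypothetical. System.currentTimeMillis() reports wall-clock time and can be stepped by adjustments to the system clock. System.nanoTime() is meant only for measuring elapsed time and is not tied to the wall clock. A Timer test could record both readings at the start and treat a large disagreement as an environmental problem rather than a Timer bug. The 1-second tolerance in the usage mirrors the assumption stated above.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper: detects wall-clock jumps during a test by comparing
// wall-clock progress with monotonic elapsed time.
public class ClockJumpDetector {
    private final long wallStartMillis = System.currentTimeMillis();
    private final long monoStartNanos = System.nanoTime();

    /** Wall-clock progress minus elapsed-time progress, in milliseconds. */
    public long driftMillis() {
        long wallElapsed = System.currentTimeMillis() - wallStartMillis;
        long monoElapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - monoStartNanos);
        return wallElapsed - monoElapsed;
    }

    /** True if the wall clock moved more than toleranceMillis away from elapsed time. */
    public boolean clockJumped(long toleranceMillis) {
        return Math.abs(driftMillis()) > toleranceMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        ClockJumpDetector detector = new ClockJumpDetector();
        Thread.sleep(200);
        System.out.println("drift=" + detector.driftMillis()
                + "ms, jumped=" + detector.clockJumped(1000));
    }
}
```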
>>>
>>> Can you detect any real-world unreliability in my latest version of the
>>> test, not counting a daylight saving time switch?
>>>
>>> I continue to resist your efforts to "fix" the test by removing chances
>>> for the SUT code to go wrong.
>>>
>>>
>>> On Tue, Jun 3, 2014 at 11:28 PM, Eric Wang <yiming.wang at oracle.com>
>>> wrote:
>>>
>>>> Hi Martin,
>>>>
>>>> Thanks for the explanation; now I understand why you set DELAY_MS to
>>>> 100 seconds.  It is true that this prevents failures on a slow host;
>>>> however, I still have some concerns.
>>>> Because the test schedules tasks at a time in the past, all 13 tasks
>>>> should be executed immediately and finish within a short time.
>>>> If the elapsed-time limit is set to 50s (DELAY_MS/2), the timer seems to
>>>> have plenty of time to finish the tasks, so doesn't the test lose the
>>>> point it is meant to check?
>>>>
>>>> Back to the original test: I think this is a test stabilization issue,
>>>> because the original test assumes that the timer is cancelled in less
>>>> than 1 second, before the 14th task execution runs.  This assumption
>>>> may not hold, for 2 reasons:
>>>> 1. the test may be executed in jtreg concurrent mode on a slow host.
>>>> 2. the system clock of a virtual machine may not be accurate (it may run
>>>> faster than the physical clock).
>>>>
>>>> To support this point, I changed the test (attached) to print the
>>>> execution times, to see whether the timer behaves as the API
>>>> documentation describes.  The result is as expected:
>>>>
>>>> The non-repeating task executed immediately: [1401855509336]
>>>> The repeating task executed immediately and then repeated every 1 second:
>>>> [1401855509337, 1401855510337, 1401855511338]
>>>> The fixed-rate task executed immediately and caught up on the delay:
>>>> [1401855509338, 1401855509338, 1401855509338, 1401855509338,
>>>> 1401855509338, 1401855509338, 1401855509338, 1401855509338,
>>>> 1401855509338, 1401855509338, 1401855509338, 1401855509836,
>>>> 1401855510836]
>>>>
>>>>
>>>> Thanks,
>>>> Eric
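The catch-up behavior in the fixed-rate output above follows from how Timer.scheduleAtFixedRate works. Each execution is scheduled relative to the scheduled time of the initial execution, so executions that are already overdue run in rapid succession. The minimal sketch below uses hypothetical class and method names and assumes a first time 10.5 periods in the past. The runs scheduled at first, first + P, ..., first + 10P are all due at once, which gives 11 catch-up executions. The next run is due half a period later. That matches the shape of Eric's output (eleven identical timestamps, then one about 500 ms later). It is only an illustration, not the attached test.

```java
import java.util.Date;
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: a fixed-rate task whose first time is in the past "catches up".
public class FixedRateCatchUp {
    /**
     * Schedules a fixed-rate task with its first time periodsInPast periods ago,
     * waits for the overdue executions, and returns how many runs were observed.
     */
    static int catchUpRuns(long periodMs, double periodsInPast) throws InterruptedException {
        // Runs at first, first + P, ..., up to now are all overdue.
        int overdue = (int) Math.floor(periodsInPast) + 1;
        AtomicInteger runs = new AtomicInteger();
        CountDownLatch caughtUp = new CountDownLatch(overdue);
        Timer timer = new Timer(true); // daemon: never keeps the JVM alive
        try {
            long pastMs = (long) (periodsInPast * periodMs);
            Date first = new Date(System.currentTimeMillis() - pastMs);
            timer.scheduleAtFixedRate(new TimerTask() {
                @Override public void run() {
                    runs.incrementAndGet(); // count before releasing the latch
                    caughtUp.countDown();
                }
            }, first, periodMs);
            if (!caughtUp.await(30, TimeUnit.SECONDS)) {
                throw new AssertionError("only " + runs.get() + " runs");
            }
            return runs.get();
        } finally {
            timer.cancel();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // 11 overdue runs; the 12th is due about 500 ms after scheduling.
        System.out.println("runs observed: " + catchUpRuns(1000, 10.5));
    }
}
```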
>>>> On 2014/6/4 9:16, Martin Buchholz wrote:
>>>>
>>>> On Tue, Jun 3, 2014 at 6:12 PM, Eric Wang <yiming.wang at oracle.com>
>>>> wrote:
>>>>
>>>>> Hi Martin,
>>>>>
>>>>> A sleep(1000) is not enough to reproduce the failure, because it is
>>>>> much shorter than the period DELAY_MS (10*1000) of the repeating task
>>>>> created by "scheduleAtFixedRate(t, counter(y3), past, DELAY_MS)".
>>>>>
>>>>> With sleep(DELAY_MS), the failure can be reproduced.
>>>>>
>>>> Well, sure: then the task is rescheduled, so I expect it to fail in
>>>> this case.
>>>>
>>>> But in my version I had set DELAY_MS to 100 seconds.  The point of
>>>> extending DELAY_MS is to prevent flaky failures on a slow machine.
>>>>
>>>> Again, how do we know that this test hasn't found a Timer bug?
>>>>
>>>>   I still can't reproduce it.
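This sketch shows why sleep(DELAY_MS) changes the outcome. It uses hypothetical names and a 1-second period instead of the test's DELAY_MS so that it runs quickly. After its overdue execution has run, a fixed-rate task is rescheduled one period after its previous scheduled time. Any check delayed by a full period will therefore see at least one more execution, which is the rescheduling described in the reply above.

```java
import java.util.Date;
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: sleeping for a full period lets a fixed-rate task run again.
public class SleepSeesReschedule {
    /** Returns {runs right after the first execution, runs after sleeping one period}. */
    static int[] runsBeforeAndAfterSleep(long periodMs) throws InterruptedException {
        AtomicInteger runs = new AtomicInteger();
        CountDownLatch firstRun = new CountDownLatch(1);
        Timer timer = new Timer(true); // daemon: never keeps the JVM alive
        try {
            // First time half a period ago: one run is overdue, and the next
            // is due about periodMs / 2 from now.
            Date first = new Date(System.currentTimeMillis() - periodMs / 2);
            timer.scheduleAtFixedRate(new TimerTask() {
                @Override public void run() {
                    runs.incrementAndGet();
                    firstRun.countDown();
                }
            }, first, periodMs);
            if (!firstRun.await(30, TimeUnit.SECONDS)) {
                throw new AssertionError("task never ran");
            }
            int before = runs.get();
            Thread.sleep(periodMs); // analogous to sleep(DELAY_MS) in the test
            int after = runs.get();
            return new int[] {before, after};
        } finally {
            timer.cancel();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int[] counts = runsBeforeAndAfterSleep(1000);
        System.out.println("before sleep: " + counts[0] + ", after sleep: " + counts[1]);
    }
}
```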
>>>>
>>>>
>>>>
>>>>
>>
>



More information about the core-libs-dev mailing list