RFR for bug JDK-8004807: java/util/Timer/Args.java failing intermittently in HS testing

Mon Jun 9 15:03:13 UTC 2014

Hi Eric, Martin,

I'm fine with the re-write.  I'm not sure why the re-ordering of y3 will
change the behavior of the test, but it will provide more debugging info.

Roger

On 6/6/2014 9:32 PM, Martin Buchholz wrote:
> If you don't want to go with my rewrite, you can conservatively just 
> check in a 10x increase in all the constant durations and see whether 
> the flakiness goes away.
>
>
> On Thu, Jun 5, 2014 at 9:46 PM, Martin Buchholz <martinrb at google.com 
> <mailto:martinrb at google.com>> wrote:
>
>     Like David, I find the cause of the failure mystifying.
>     How can things fail when we stay below the timeout value of 500ms?
>     There's a bug either in Timer or in my own understanding of what
>     should be happening.
>
>     Anyways, raising the timeout value (as I have done in my minor
>     rewrite) seems prudent.  Fortunately, we can write this test in a
>     way that doesn't require actually waiting for the timeout to elapse.
>
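
One way to read "doesn't require actually waiting for the timeout to
elapse" is to block on a latch with a generous upper bound, so the wait
ends as soon as the tasks have run.  The following is only a minimal
sketch under that assumption; the class and method names are hypothetical
and are not taken from Args.java or from Martin's rewrite.

```java
import java.util.Date;
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not part of the actual test.
class LatchWaitSketch {
    /**
     * Schedules taskCount one-shot tasks at a time in the past and waits
     * until all of them have run, or until timeoutMs has elapsed.
     * Returns true if every task ran before the timeout.
     */
    static boolean allRanWithin(int taskCount, long timeoutMs) {
        final CountDownLatch done = new CountDownLatch(taskCount);
        Timer timer = new Timer(true); // daemon thread, so the JVM can exit
        Date past = new Date(System.currentTimeMillis() - 1000L);
        try {
            for (int i = 0; i < taskCount; i++) {
                // A first time in the past means "run as soon as possible".
                timer.schedule(new TimerTask() {
                    @Override public void run() { done.countDown(); }
                }, past);
            }
            // Returns as soon as the count reaches zero; the timeout is
            // only an upper bound for a very slow host.
            return done.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            timer.cancel();
        }
    }
}
```

With this shape, a call such as allRanWithin(3, 10000) normally returns
within milliseconds, so the large bound costs nothing on a healthy host.
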
>
>     On Wed, Jun 4, 2014 at 1:23 PM, roger riggs
>     <roger.riggs at oracle.com <mailto:roger.riggs at oracle.com>> wrote:
>
>         Hi Martin, Eric,
>
>         Of several hundred failures of this test, most occurred in JRE
>         runs with -Xcomp set.  A few failures occurred with -Xmixed, none
>         with -Xint.
>
>         The printed "elapsed" times (not normalized to hardware or OS)
>         range from 24 to 132 (ms), with most falling into several buckets
>         in the 30's, 40's, 50's and 70's.
>
>         I don't spot anything in the Timer.mainLoop code that might break
>         when highly optimized, but that's one possibility.
>
>         Roger
>
>
>
>         On 6/4/2014 3:25 PM, Martin Buchholz wrote:
>
>             Tests for Timer are inherently timing (!) dependent.
>             It's reasonable for tests to assume that:
>             - reasonable events like creating a thread and executing a
>               simple task should complete in less than, say, 2500ms.
>             - system clock will not change by a significant amount
>               (> 1 sec) during the test.  Yes, that means Timer tests are
>               likely to fail during daylight saving time switchover - we
>               can live with that. (We could even try to fix that, by
>               detecting deviations between clock time and elapsed time,
>               but it's probably not worth it.)
>
>             Can you detect any real-world unreliability in my latest
>             version of the test, not counting daylight saving time switch?
>
>             I continue to resist your efforts to "fix" the test by
>             removing chances for the SUT code to go wrong.
>
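
The idea of "detecting deviations between clock time and elapsed time"
could be sketched as below, assuming System.currentTimeMillis() stands in
for clock time and System.nanoTime() for elapsed time.  The class name,
method names and tolerance parameter are hypothetical; nothing like this
is in the test under review.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not part of the actual test.
class ClockJumpSketch {
    // Both readings are taken when the object is created.
    private final long wallStartMs = System.currentTimeMillis();
    private final long monoStartNs = System.nanoTime();

    /** True if the two deltas disagree by more than toleranceMs. */
    static boolean deviates(long wallDeltaMs, long monoDeltaNs, long toleranceMs) {
        long monoDeltaMs = TimeUnit.NANOSECONDS.toMillis(monoDeltaNs);
        return Math.abs(wallDeltaMs - monoDeltaMs) > toleranceMs;
    }

    /** True if the wall clock appears to have been adjusted since creation. */
    boolean clockJumped(long toleranceMs) {
        long wallDeltaMs = System.currentTimeMillis() - wallStartMs;
        long monoDeltaNs = System.nanoTime() - monoStartNs;
        return deviates(wallDeltaMs, monoDeltaNs, toleranceMs);
    }
}
```

A test could create one ClockJumpSketch at the start and, on failure,
report clockJumped(1000) alongside the other diagnostics instead of
changing the pass/fail decision.
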
>
>             On Tue, Jun 3, 2014 at 11:28 PM, Eric Wang
>             <yiming.wang at oracle.com <mailto:yiming.wang at oracle.com>>
>             wrote:
>
>                   Hi Martin,
>
>                 Thanks for the explanation. Now I understand why you set
>                 DELAY_MS to 100 seconds: it prevents failure on a slow
>                 host. However, I still have some concerns.
>                 The test schedules tasks at a time in the past, so all 13
>                 tasks should be executed immediately and finish within a
>                 short time. If the elapsed time limit is set to 50s
>                 (DELAY_MS/2), the timer has plenty of time to finish the
>                 tasks, so I wonder whether that test point is lost.
>
>                 Back to the original test: I think this is a test
>                 stabilization issue, because the original test assumes
>                 that the timer is cancelled in less than 1 second, before
>                 the 14th task is called. This assumption may not hold for
>                 2 reasons:
>                 1. The test may be executed in jtreg concurrent mode on a
>                    slow host.
>                 2. The system clock of a virtual machine may not be
>                    accurate (it may run faster than the physical clock).
>
>                 To support this point, I changed the test as attached to
>                 print the execution times, to see whether the timer
>                 behaves as the API documentation describes. The result is
>                 as expected.
>
>                 The non-repeating task executed immediately: [1401855509336]
>                 The repeating task executed immediately and then repeated
>                 every 1 second:
>                 [1401855509337, 1401855510337, 1401855511338]
>                 The fixed-rate task executed immediately and caught up on
>                 the delay:
>                 [1401855509338, 1401855509338, 1401855509338,
>                 1401855509338, 1401855509338,
>                 1401855509338, 1401855509338, 1401855509338,
>                 1401855509338, 1401855509338,
>                 1401855509338, 1401855509836, 1401855510836]
>
>
>                 Thanks,
>                 Eric
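
The fixed-rate timestamps above (11 runs in the same millisecond, then
one per period) match the documented catch-up behavior of
Timer.scheduleAtFixedRate when the first time is in the past.  A minimal
sketch that counts those catch-up runs follows; the class name, method
name and parameters are hypothetical, and the period is assumed to be
much longer than the time the catch-up takes.

```java
import java.util.Date;
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper for illustration; not part of the actual test.
class FixedRateCatchUpSketch {
    /**
     * Schedules a fixed-rate task whose first time lies missedPeriods
     * periods in the past, waits (at most timeoutMs) for the catch-up
     * runs, cancels the timer, and returns how many runs happened.
     */
    static int countCatchUpRuns(long periodMs, int missedPeriods, long timeoutMs) {
        // Runs are due at past, past + period, ..., past + missed * period.
        final int expected = missedPeriods + 1;
        final AtomicInteger runs = new AtomicInteger();
        final CountDownLatch caughtUp = new CountDownLatch(expected);
        Timer timer = new Timer(true); // daemon thread, so the JVM can exit
        Date past = new Date(System.currentTimeMillis() - missedPeriods * periodMs);
        try {
            timer.scheduleAtFixedRate(new TimerTask() {
                @Override public void run() {
                    runs.incrementAndGet();
                    caughtUp.countDown();
                }
            }, past, periodMs);
            // Wakes up as soon as the catch-up runs are done.
            caughtUp.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            timer.cancel(); // stop before the next regular run is due
        }
        return runs.get();
    }
}
```

With a 60 second period and 10 missed periods, the call should report 11
runs, because the next run is not due until a full period later.
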
>                 On 2014/6/4 9:16, Martin Buchholz wrote:
>
>
>
>
>                 On Tue, Jun 3, 2014 at 6:12 PM, Eric Wang
>                 <yiming.wang at oracle.com
>                 <mailto:yiming.wang at oracle.com>> wrote:
>
>                     Hi Martin,
>
>                     A sleep(1000) is not enough to reproduce the
>                     failure, because it is much shorter than the period
>                     DELAY_MS (10*1000) of the repeated task created by
>                     "scheduleAtFixedRate(t, counter(y3), past, DELAY_MS)".
>
>                     Try sleep(DELAY_MS); then the failure can be
>                     reproduced.
>
>                 Well sure, then the task is rescheduled, so I expect it to
>                 fail in this case.
>
>                 But in my version I had set DELAY_MS to 100 seconds.  The
>                 point of extending the DELAY_MS is to prevent flaky failure
>                 on a slow machine.
>
>                 Again, how do we know that this test hasn't found a Timer
>                 bug?
>
>                 I still can't reproduce it.