Proposed new policies for JDK 9 regression tests: tiered testing, intermittent failures, and randomness

joe darcy joe.darcy at oracle.com
Thu Apr 30 00:12:31 UTC 2015


A follow-up,

The tiered testing definitions have been added to the jdk repository 
(JDK-8075544). Two new jtreg keywords were defined and added to 
appropriate jdk regression tests, "intermittent" (JDK-8075565) and 
"randomness" (JDK-8078334).

Tests that are observed to fail intermittently should be tagged with the 
"intermittent" keyword until the test's flaky behavior is resolved.
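
For example, the keyword is added alongside the test's other jtreg tags 
(a sketch; the surrounding tags are elided):

     /*
      * @test
      * ...
      * @key intermittent
      * ...
      */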

Going forward, when a new regression test is written or an existing test 
updated, the presence or absence of the "randomness" keyword should be 
kept up-to-date with the behavior of the test. As explained in the 
TEST.ROOT file,

> # The "randomness" keyword marks tests using randomness with test
> # cases differing from run to run. (A test using a fixed random seed
> # would not count as "randomness" by this definition.) Extra care
> # should be taken to handle test failures of intermittent or
> # randomness tests.

If a "randomness" test fails, and especially if it fails intermittently, 
the seed value in force during the failing run should be included in a 
bug report about the failure. One part of investigating the failure is 
to see if the intermittent failure becomes reproducible if the seed is 
set to the value observed during a failing run.

There is now a random number utility library for regression testing in 
the jdk repository (JDK-8078672); the facility is located at

     test/lib/testlibrary/jdk/testlibrary/RandomFactory.java

and can be accessed via the jtreg @library facility with tags like

     * ...
     * @library /lib/testlibrary
     * @build jdk.testlibrary.*
     * ...

Calls to

     new Random()

in regression tests can be replaced with

     RandomFactory.getRandom()

In brief, when a random number generator is requested from the factory, 
the factory prints the seed it is using, and a test run can be made to 
use a particular seed by passing a -Dseed=X option to the jtreg test run.
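
For illustration, a minimal sketch of such a test follows; the class 
name ExampleRandomTest and the test body are hypothetical, but the jtreg 
tags mirror those shown above and include the "randomness" keyword:

     /*
      * @test
      * @summary Illustrative use of RandomFactory (hypothetical test)
      * @key randomness
      * @library /lib/testlibrary
      * @build jdk.testlibrary.*
      * @run main ExampleRandomTest
      */

     import java.util.Random;
     import jdk.testlibrary.RandomFactory;

     public class ExampleRandomTest {
         public static void main(String... args) {
             // getRandom() prints the seed it is using, so a failing run
             // can later be replayed with the same seed.
             Random random = RandomFactory.getRandom();
             int input = random.nextInt();
             // ... exercise the area under test with the random input ...
         }
     }

If such a test fails, the printed seed should be recorded in the bug 
report and the failure re-run with that seed via -Dseed=X as described 
above.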

Cheers,

-Joe

On 3/18/2015 5:43 PM, joe darcy wrote:
> Hello,
>
> Over the last several years, there has been a background thread of 
> activity to improve the robustness and reliability of JDK regression 
> testing. In an effort to further improve testing of JDK 9, I have a few 
> proposals to put forward for discussion:
>
> * Tiered testing
> * Marking tests which fail intermittently
> * Marking tests which use randomness
>
> Some portions of the JDK regression tests are more stable than others; 
> that is, some portions are less prone to intermittent test failures. 
> Some aspects of the JDK are also more fundamental than others. For 
> example, javac and the core libraries are necessary for any kind of 
> Java development work while jaxp is not.
>
> With those considerations in mind and taking into account the graph of 
> integration forests for JDK 9 [1], I propose an initial definition of 
> two tiers of tests:
>
> * Tier 1: stable tests for fundamental parts of the platform. Failures 
> in tier 1 tests are treated as urgent issues to resolve, on par with a 
> build failure.
>
> * Tier 2: tests which may be less stable or which cover less 
> fundamental parts of the platform. Resolving failures in tier 2 tests 
> is important, but not as urgent as resolving a tier 1 failure.
>
> The initial proposed population of tier 1 and tier 2 regression tests is:
>
> Tier 1 tests:
>     jdk/test: jdk_lang, jdk_util, jdk_math
>     langtools/test
>
> Tier 2 tests:
>     jdk/test: jdk_io, jdk_nio, jdk_net, jdk_rmi, jdk_time, 
> jdk_security, jdk_text, core_tools, jdk_other, jdk_svc
>     nashorn/test
>     jaxp/test:jaxp_all
>
> The regression tests for client areas are not run as commonly as other 
> regression tests; client tests could be added as a third tier or 
> incorporated into tier 2 over time. Given how HotSpot integrates in 
> jdk9/dev after going through its own set of integration forests, the 
> current definition of tiered testing is aimed at langtools and 
> libraries work.
>
> Some of the areas included in tier 2 above are very fundamental, such 
> as jdk_io, but still have some testing issues. Once those issues are 
> resolved, a test set like jdk_io could be promoted from tier 2 to tier 1.
>
> These definitions of tiered tests can be implemented as entries in the 
> TEST.groups files used by jtreg in the various Hg component 
> repositories, jdk, langtools, jaxp, and nashorn.
>
> One goal of this explicit tiered testing policy is that all the tier 1 
> tests would always pass on the master. In other words, in the 
> steady-state situation, integrations from dev into master should not 
> introduce tier 1 test failures on mainline platforms.
>
> Resolving a new persistent test failure could be accomplished in 
> multiple ways. If there is a flaw in the new code or the new test, the 
> new code or test could be fixed. If developing a full fix would take a 
> while, the test could be @ignore-d or put on the problem list while 
> the full fix is being tracked in another bug. Finally, if the testing 
> situation is sufficiently bad, the changeset which introduced the 
> problem can be anti-delta-ed out.
>
> Currently it is difficult to know what set of JDK regression tests 
> intermittently fail. To make this determination easier, I propose 
> defining for use in the JDK repositories a new jtreg keyword, say 
> "intermittent-failure", that would be added to tests known or 
> suspected to fail intermittently. The jtreg harness supports defining 
> a set of keywords for a set of tests in the TEST.ROOT file. The 
> affected (or afflicted) tests would get a
>
>     @key intermittent-failure
>
> line as one of their jtreg tags. Besides documenting the problems of 
> the test in the test itself, a command like
>
>     jtreg -keywords:intermittent-failure ...
>
> could be used to run the intermittently failing tests as a group, such 
> as in a dedicated attempt to gather more failure information about the 
> tests.
>
> Some tests want to explore a large space of inputs, often a space of 
> inputs so large it is impractical or undesirable to exhaustively 
> explore the space in routine testing. One way to get better test 
> coverage in this kind of situation over time is for a test of the area 
> to use randomness to explore a different subset of the input space on 
> different runs. This use of randomness can be a valid testing 
> technique, but it must be used with care.
>
> If such a random-using test fails intermittently, it may be because it 
> is encountering a real flaw in the product, but a flaw that is only 
> visited rarely. If such a failure is observed, it should be added to 
> the bug database along with the seed value that can be used to 
> reproduce the situation. As a corollary, all such random-using tests 
> should print out the seed value they are using and provide a way to set 
> the seed value on a given run.
>
> To aid those analyzing the failure of a random-using test, a new jtreg 
> keyword like "uses-randomness" should be added.
>
> Thanks to Alan Bateman, Stuart Marks, and Stefan Särne for many 
> conversations leading up to these proposals.
>
> Comments?
>
> -Joe
>
> [1] "Proposal to revise forest graph and integration practices for JDK 
> 9,"
> http://mail.openjdk.java.net/pipermail/jdk9-dev/2013-November/000000.html


