Proposed new policies for JDK 9 regression tests: tiered testing, intermittent failures, and randomness

joe darcy joe.darcy at oracle.com
Thu Mar 19 00:43:43 UTC 2015


Hello,

Over the last several years, there has been a background thread of 
activity to improve the robustness and reliability of JDK regression 
testing. In an effort to further improve testing of JDK 9, I have a few 
proposals to put forward for discussion:

* Tiered testing
* Marking tests which fail intermittently
* Marking tests which use randomness

Some portions of the JDK regression tests are more stable than others; 
that is, some portions are less prone to intermittent test failures. 
Some aspects of the JDK are also more fundamental than others. For 
example, javac and the core libraries are necessary for any kind of Java 
development work while jaxp is not.

With those considerations in mind and taking into account the graph of 
integration forests for JDK 9 [1], I propose an initial definition of 
two tiers of tests:

* Tier 1: stable tests for fundamental parts of the platform. Failures 
in tier 1 tests are treated as urgent issues to resolve, on par with a 
build failure.

* Tier 2: tests which may be less stable or which cover less fundamental 
parts of the platform. Resolving failures in tier 2 tests is important, but not 
as urgent as resolving a tier 1 failure.

The initial proposed population of tier 1 and tier 2 regression tests is:

Tier 1 tests:
     jdk/test: jdk_lang, jdk_util, jdk_math
     langtools/test

Tier 2 tests:
     jdk/test: jdk_io, jdk_nio, jdk_net, jdk_rmi, jdk_time, 
jdk_security, jdk_text, core_tools, jdk_other, jdk_svc
     nashorn/test
     jaxp/test:jaxp_all

The regression tests for client areas are not run as commonly as other 
regression tests; client tests could be added as a third tier or 
incorporated into tier 2 over time. Given how HotSpot integrates into 
jdk9/dev after going through its own set of integration forests, the 
current definition of tiered testing is aimed at langtools and 
libraries work.

Some of the areas included in tier 2 above are very fundamental, such as 
jdk_io, but still have some testing issues. Once those issues are 
resolved, a test set like jdk_io could be promoted from tier 2 to tier 1.

These definitions of tiered tests can be implemented as entries in the 
TEST.groups files used by jtreg in the various Hg component repositories, 
jdk, langtools, jaxp, and nashorn.
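
As a rough sketch, the tier 1 and tier 2 entries in jdk/test/TEST.groups 
might look something like the following, using the same group syntax as 
the existing entries in that file; the exact names and membership shown 
here are illustrative and subject to refinement:

     tier1 = \
         :jdk_lang \
         :jdk_util \
         :jdk_math

     tier2 = \
         :jdk_io \
         :jdk_nio \
         :jdk_net \
         :jdk_rmi \
         :jdk_time \
         :jdk_security \
         :jdk_text \
         :core_tools \
         :jdk_other \
         :jdk_svc

The langtools, jaxp, and nashorn repositories would get analogous 
entries, and a tier 1 run could then be requested with an invocation 
along the lines of "jtreg ... :tier1" from the corresponding test root.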

One goal of this explicit tiered testing policy is that all the tier 1 
tests would always pass on the master. In other words, in the 
steady-state situation, integrations from dev into master should not 
introduce tier 1 test failures on mainline platforms.

Resolving a new persistent test failure could be accomplished in 
multiple ways. If there is a flaw in the new code or the new test, the 
new code or test could be fixed. If developing a full fix would take a 
while, the test could be @exclude-d or put on the problem list while the 
full fix is being tracked in another bug. Finally, if the testing 
situation is sufficiently bad, the changeset which introduced the problem 
can be anti-delta-ed out.
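
For example, problem-listing a test in the jdk repository amounts to 
adding a line to the ProblemList.txt file under jdk/test naming the 
test, the tracking bug, and the affected platforms; the test path and 
bug number below are placeholders:

     # Hypothetical entry: test path, tracking bug id, platforms
     java/foo/Bar.java                               8000000 generic-all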

Currently it is difficult to know what set of JDK regression tests 
intermittently fail. To make this determination easier, I propose 
defining for use in the JDK repositories a new jtreg keyword, say 
"intermittent-failure", that would be added to tests known or suspected 
to fail intermittently. The jtreg harness supports defining a set of 
keywords for a set of tests in the TEST.ROOT file. The affected (or 
afflicted) tests would get a

     @key intermittent-failure

line as one of their jtreg tags. Besides documenting the problems of the 
test in the test itself, a command like

     jtreg -keywords:intermittent-failure ...

could be used to run the intermittently failing tests as a group, such 
as in a dedicated attempt to gather more failure information about the 
tests.
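
For illustration, assuming the keyword is also declared in TEST.ROOT 
(for example by adding it to the existing "keys" entry), a marked test 
would simply gain the new tag alongside its other jtreg tags; the test 
name and summary below are placeholders:

     /*
      * @test
      * @key intermittent-failure
      * @summary hypothetical example of a test that fails intermittently
      * @run main SomeIntermittentTest
      */
     public class SomeIntermittentTest {
         public static void main(String... args) throws Exception {
             // Existing test logic is unchanged; only the @key tag
             // above is added to mark the test as intermittently failing.
         }
     }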

Some tests want to explore a large space of inputs, often a space of 
inputs so large it is impractical or undesirable to exhaustively explore 
the space in routine testing. One way to get better test coverage in 
this kind of situation over time is for a test of the area to use 
randomness to explore a different subset of the input space on different 
runs. This use of randomness can be a valid testing technique, but it 
must be used with care.

If such a random-using test fails intermittently, it may be because it 
is encountering a real flaw in the product, but a flaw that is only 
visited rarely. If such a failure is observed, it should be added to the 
bug database along with the seed value that can be used to reproduce the 
situation. As a corollary, all such random-using tests should print out 
the seed value they are using and provide a way to set the seed value on a 
given run.

To aid those analyzing the failure of a random-using test, a new jtreg 
keyword like "uses-randomness" should be added.

Thanks to Alan Bateman, Stuart Marks, and Stefan Särne for many 
conversations leading up to these proposals.

Comments?

-Joe

[1] "Proposal to revise forest graph and integration practices for JDK 9,"
http://mail.openjdk.java.net/pipermail/jdk9-dev/2013-November/000000.html

