Draft JEP: JDK Core Libraries Test Stabilization
Hi Stuart,

great proposal. You can count on me when it comes to testing on exotic platforms like for example AIX :)

Regards,
Volker

On Thu, May 1, 2014 at 2:08 AM, Stuart Marks <stuart.marks@oracle.com> wrote:
Hi all,
Here's a draft JEP for stabilizing the core libraries regression test suite, that is, fixing up the spuriously failing tests. Please review and comment.
Thanks!
s'marks
Title: JDK Core Libraries Test Stabilization
Author: Stuart Marks
Organization: Oracle
Discussion: core-libs-dev@openjdk.java.net
[...other metadata elided...]
Summary
-------
The JDK Regression Test Suite has several thousand fully automated tests. These tests are valuable and effective in that they serve to prevent bugs from entering the code base. However, they suffer from many intermittent failures. Many of these failures are "spurious" in that they are not caused by bugs in the product. Spurious failures add considerable noise to test reports; they make it impossible for developers to ascertain whether a particular change has introduced a bug; and they obscure actual failures.
The reliability of the regression suite has improved considerably over the past few years. However, there are perhaps still 100-200 tests that fail intermittently, and most of these failures are spurious. This project aims to reduce the number and frequency of spuriously failing tests to a level where they are no longer an impediment to development.
This project targets tests from the regression suite that cover the JDK Core Libraries, including the base packages (java.lang, io, nio, util), I18N, Networking, RMI, Security, and Serviceability. JAXP and CORBA are also included, although they have relatively few regression tests at present.
Non-Goals
---------
Regression tests for other areas, including Hotspot, Langtools, and Client areas, are not included in this project.
This project does not address operational issues that might cause builds or test runs to fail or for reports not to be delivered in a timely fashion.
This project is not focused on product bugs that cause test failures. Such test failures are "good" in that the test suite is providing valid information about the product.
Test runs on embedded platforms are not covered by this project.
Success Metrics
---------------
The rate of fully successful test runs (100% of tests passing) currently stands at approximately 0.5%. The goal is to improve this success rate to 98%, exclusive of true failures (i.e., those caused by bugs in the product). At a 98% success rate, a continuous build system that runs ten jobs per day, five days a week, would have one or fewer spurious failures per week.
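(To spell out the arithmetic: ten jobs per day over five days is 50 runs per week; at a 98% success rate roughly 2% of runs fail spuriously, or about 50 x 0.02 = 1 run per week.)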
Motivation
----------
Developers are continually hampered by the unreliability of the regression test suite. Intermittently failing tests add significant noise to the results of every test run. The consequence is that developers cannot tell whether test failures were caused by bugs introduced by a recent change or whether they are spurious failures. In addition, the intermittent failures mask actual failures in the product, slowing development and reducing quality. Developers should be able to rely on the test suite telling them accurate information: test failures should indicate the introduction of a bug into the system, and absence of test failures should be usable as evidence that changes are correct.
Description
-----------
Spurious test failures fall into two broad categories:
- test bugs
- environmental issues
Our working assumption for most intermittent test failures is that they are spurious, and further, that they are caused by bugs in the test itself. While it is possible for a product bug to cause an intermittent failure, this is relatively rare. The majority of intermittent failures encountered so far have indeed proven to be test bugs.
"Environmental" issues, such as misconfigured test machines, temporary dysfunction on the machine running the test job (e.g., filesystem full), or transient network failures, also contribute to spurious failures. Test should be made more robust, if possible. Environment issues should be fed back to the infrastructure team for resolution and future infrastructure improvements.
A variety of techniques will be employed to diagnose, track, and help develop fixes for intermittently failing tests:
- track all test failures in JBS
- repeated test runs against the same build
- gather statistics about failure rates and the number of tests with bugs, and track them continuously
- investigate pathologies for common test failure modes
- develop techniques for fixing common test bugs
- develop test library code to improve commonality across tests and to avoid typical failure modes (a sketch of one such utility follows this list)
- add instrumentation to tests (and to the test suite) to improve diagnosability
- exclude tests judiciously, preferably only as a last resort
- change reviews
- code inspections
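To illustrate the kind of shared test library code meant above, here is a minimal sketch of a deadline-based polling helper. The class and method names are hypothetical, not an existing JDK test-library API; the point is only that replacing fixed Thread.sleep() calls with polling against a generous deadline removes one typical failure mode (tests tuned for a faster machine than the one they happen to run on).

    import java.util.concurrent.TimeUnit;
    import java.util.function.BooleanSupplier;

    // Hypothetical shared test utility; names are illustrative only.
    public final class PollingUtils {

        private PollingUtils() { }

        /**
         * Polls condition until it becomes true or the timeout elapses.
         * Returns true if the condition held before the deadline.
         */
        public static boolean await(BooleanSupplier condition,
                                    long timeout, TimeUnit unit)
                throws InterruptedException {
            long deadline = System.nanoTime() + unit.toNanos(timeout);
            while (!condition.getAsBoolean()) {
                if (System.nanoTime() - deadline >= 0) {
                    return false;           // timed out
                }
                Thread.sleep(50);           // short poll interval before retrying
            }
            return true;
        }
    }

A test that currently does Thread.sleep(5000) before checking that a background server has come up could instead call PollingUtils.await(server::isStarted, 30, TimeUnit.SECONDS) and fail only if that returns false; on a fast machine the wait ends almost immediately, and on a slow one the generous deadline avoids a spurious failure.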
Alternatives
------------
The most likely alternative to diagnosing and fixing intermittent failures is to aggressively exclude intermittently failing tests from the test suite. This trades off code coverage in favor of test reliability, adding risk of undetected bug introduction.
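For reference, the JDK regression suite is run under the jtreg harness, which supports this kind of exclusion via "problem list" files supplied with its -exclude option. Each entry names a test, the tracking bug, and the platforms on which the test is excluded; the line below is purely illustrative (made-up test path, placeholder bug number):

    java/util/SomeFlakyTest.java    NNNNNNN    generic-all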
Testing
-------
The subject of the project is the test suite itself. The main "testing" of the test suite is running it repeatedly in a variety of environments, including continuous build-and-test systems, as well as recurring "same-binary" test runs on promoted builds. This will help flush out intermittent failures and detect newly introduced failures.
Risks and Assumptions
---------------------
We are working on a long tail of intermittent failures, which may become increasingly frustrating as time goes on, resulting in the project stalling out.
New intermittent failures may be introduced or discovered more quickly than they can be resolved.
The main work of fixing up the tests will be spread across several development groups. This requires good cross-group coordination and focus.
The culture in the development group has (mostly) been to ignore test failures, or to find ways to cope with them. As intermittent failures are removed, we hope to decrease the group's tolerance of test failures.
Dependences
-----------
No dependences on other JEPs or components.
Impact
------
No impact on specific parts of the platform or product, except for developer time and effort being spent on it, across various component teams.
==========