Draft JEP: JDK Core Libraries Test Stabilization

Stuart Marks stuart.marks at oracle.com
Thu May 1 00:08:37 UTC 2014


Hi all,

Here's a draft JEP for stabilizing the core libraries regression test suite, 
that is, fixing up the spuriously failing tests. Please review and comment.

Thanks!

s'marks




Title: JDK Core Libraries Test Stabilization
Author: Stuart Marks
Organization: Oracle
Discussion: core-libs-dev at openjdk.java.net
[...other metadata elided...]

Summary
-------

The JDK Regression Test Suite has several thousand fully automated
tests. These tests are valuable and effective in that they serve to
prevent bugs from entering the code base. However, they suffer from
many intermittent failures. Many of these failures are "spurious" in
that they are not caused by bugs in the product. Spurious failures add
considerable noise to test reports; they make it impossible for
developers to ascertain whether a particular change has introduced a
bug; and they obscure actual failures.

The reliability of the regression suite has improved considerably over
the past few years. However, there are perhaps still 100-200 tests
that fail intermittently, and most of these failures are spurious.
This project aims to reduce the number and frequency of spuriously
failing tests to a level where they are no longer an impediment to
development.

This project targets tests from the regression suite that cover the
JDK Core Libraries, including the base packages (java.lang, java.io,
java.nio, java.util), I18N, Networking, RMI, Security, and
Serviceability. JAXP and CORBA
are also included, although they have relatively few regression tests
at present.


Non-Goals
---------

Regression tests for other areas, including HotSpot, Langtools, and
the Client area, are not included in this project.

This project does not address operational issues that might cause
builds or test runs to fail, or that might prevent test reports from
being delivered in a timely fashion.

This project is not focused on product bugs that cause test
failures. Such test failures are "good" in that the test suite is
providing valid information about the product.

Test runs on embedded platforms are not covered by this project.


Success Metrics
---------------

The probability that a full test run passes completely (100% of tests
passing) currently stands at approximately 0.5%. The goal is to
improve this success rate to 98%, exclusive of true failures (i.e.,
those caused by bugs in the product). At a 98% success rate, a
continuous build system that runs ten jobs per day, five days a week
(50 runs per week) would have one or fewer spurious failures per week
(50 runs x 2% spurious-failure rate = 1 failure).


Motivation
----------

Developers are continually hampered by the unreliability of the
regression test suite. Intermittently failing tests add significant
noise to the results of every test run. As a consequence, developers
cannot tell whether a test failure was caused by a bug introduced by a
recent change or is merely spurious. In addition, intermittent
failures mask actual failures in the product, slowing development and
reducing quality. Developers should be able to rely on the test suite
for accurate information: a test failure should indicate the
introduction of a bug into the system, and the absence of test
failures should be usable as evidence that a change is correct.


Description
-----------

Spurious test failures fall into two broad categories:

  - test bugs
  - environmental issues

Our working assumption for most intermittent test failures is that
they are spurious, and further, that they are caused by bugs in the
test itself. While it is possible for a product bug to cause an
intermittent failure, this is relatively rare. The majority of
intermittent failures encountered so far have indeed proven to be
caused by test bugs.

"Environmental" issues, such as misconfigured test machines, temporary
dysfunction on the machine running the test job (e.g., filesystem
full), or transient network failures, also contribute to spurious
failures. Test should be made more robust, if possible. Environment
issues should be fed back to the infrastructure team for resolution
and future infrastructure improvements.
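
As a sketch of one such hardening technique (the class name below is
hypothetical), a networking test can bind to an ephemeral port rather
than a hard-coded one, so that it cannot collide with another process
on a shared test machine:

    import java.io.IOException;
    import java.net.InetAddress;
    import java.net.ServerSocket;

    // Hypothetical example: robust port selection for a networking test.
    public class EphemeralPortTest {
        public static void main(String[] args) throws IOException {
            // Fragile: new ServerSocket(5000) fails intermittently when
            // another process on the machine already holds port 5000.
            //
            // Robust: port 0 asks the OS for any free ephemeral port.
            try (ServerSocket ss =
                     new ServerSocket(0, 0, InetAddress.getLoopbackAddress())) {
                int port = ss.getLocalPort();  // the port actually assigned
                System.out.println("test server listening on port " + port);
                // ... point the client side of the test at 'port' ...
            }
        }
    }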

A variety of techniques will be employed to diagnose, track, and help
develop fixes for intermittently failing tests:

  - track all test failures in JBS
  - run tests repeatedly against the same build
  - gather statistics on failure rates and the number of tests with
    open bugs, and track them continuously
  - investigate the pathologies behind common test failure modes
  - develop techniques for fixing common test bugs (see the sketch
    after this list)
  - develop test library code to improve commonality across tests and to
    avoid typical failure modes
  - add instrumentation to tests (and to the test suite) to improve
    diagnosability
  - exclude tests judiciously, preferably only as a last resort
  - review changes to tests
  - inspect test code
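
A recurring test bug is timing-based synchronization: a test starts an
asynchronous operation and then sleeps for a fixed interval, guessing
how long the operation takes. Such a test fails intermittently on a
slow or heavily loaded machine. A minimal sketch of one common fix
(class and variable names are hypothetical) replaces the guess with an
explicit signal:

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;

    // Hypothetical example: explicit synchronization instead of sleeping.
    public class LatchInsteadOfSleep {
        public static void main(String[] args) throws InterruptedException {
            CountDownLatch started = new CountDownLatch(1);

            Thread worker = new Thread(() -> {
                started.countDown();  // signal: the worker is running
                // ... the operation under test ...
            });
            worker.start();

            // Fragile: Thread.sleep(1000) guesses how long startup takes
            // and fails intermittently under load.
            //
            // Robust: wait for the signal; the generous timeout only
            // bounds how long a genuinely broken test can hang.
            if (!started.await(30, TimeUnit.SECONDS)) {
                throw new AssertionError("worker failed to start within 30s");
            }
            worker.join();
        }
    }

Waiting on an explicit condition converts a load-dependent race into a
deterministic handshake; the timeout exists only to fail fast when the
code under test is actually broken.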


Alternatives
------------

The most likely alternative to diagnosing and fixing intermittent
failures is to aggressively exclude intermittently failing tests from
the test suite. This trades code coverage for test reliability,
increasing the risk that newly introduced bugs go undetected.
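
For reference, the JDK test suite already supports exclusion via
ProblemList files, where each entry pairs an excluded test with its
tracking bug ID and the affected platforms, keeping the exclusion
visible and tracked. An illustrative entry (the test name and bug ID
below are placeholders) looks roughly like:

    # test-name                             bug-id   platforms
    java/util/SomeArea/SomeFlakyTest.java   NNNNNNN  generic-all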


Testing
-------

The subject of the project is the test suite itself. The main
"testing" of the test suite is running it repeatedly in a variety of
environments, including continuous build-and-test systems, as well as
recurring "same-binary" test runs on promoted builds. This will help
flush out intermittent failures and detect newly introduced failures.


Risks and Assumptions
---------------------

We are working on a long tail of intermittent failures; this work may
become increasingly frustrating as time goes on, causing the project
to stall.

New intermittent failures may be introduced or discovered more quickly
than they can be resolved.

The main work of fixing up the tests will be spread across several
development groups. This requires good cross-group coordination and
focus.

The culture in the development group has (mostly) been to ignore test
failures, or to find ways to cope with them. As intermittent failures
are removed, we hope to decrease the group's tolerance of test failures.


Dependences
-----------

No dependences on other JEPs or components.


Impact
------

No impact on specific parts of the platform or product, aside from the
developer time and effort spent on this work across the various
component teams.

==========


