RFR: 8337111: Bad HTML checker for generated documentation

Hannes Wallnöfer hannesw at openjdk.org
Thu Nov 14 12:10:22 UTC 2024


On Mon, 4 Nov 2024 15:43:40 GMT, Nizar Benalla <nbenalla at openjdk.org> wrote:

> Doccheck's human-generated reports are great at previewing a "chessboard" of results. Giving reader a quick glimpse at the quality/health of the documentation. But these tests needed to be automated and they didn't easily translate to something that can be integrated into a CI.
> 
> This PR includes an HTML and internal link test on `api/java.base` and a BadChars and Doctype test on the entire generated documentation bundle.
> 
> Here is an example of the output after running all tests on `api/java.base`
> 
> Note: There is an active PR to fix the broken anchors left in `java.base` so this is not a blocker.
> 
> 
> 
> STDOUT:
> STDERR:
> test: test
> Tidy found errors in the generated HTML
> /Users/nizarbenalla/Work/jdk-repos/jdk1/build/macosx-aarch64/images/docs/api/java.base/java/lang/Class.html:323:87: Warning: <a> anchor "nest" already defined
> Tidy output end.
> 
> 
> api/java.base/java/util/concurrent/StructuredTaskScope.ShutdownOnFailure.html:245: id not found: api/java.base/java/util/concurrent/StructuredTaskScope.ShutdownOnFailure.html#TreeStructure
> api/java.base/java/util/concurrent/StructuredTaskScope.ShutdownOnSuccess.html:242: id not found: api/java.base/java/util/concurrent/StructuredTaskScope.ShutdownOnSuccess.html#TreeStructure
> api/java.base/java/lang/Class.html:323: name already declared: nest
> api/java.base/java/lang/Module.html:291: id not found: api/java.base/java/lang/foreign/package-summary.html#restricted
> api/java.base/java/lang/Module.html:434: id not found: api/java.base/java/lang/foreign/package-summary.html#restricted
> api/java.base/java/lang/foreign/MemorySegment.html:725: id not found: api/java.base/java/lang/foreign/package-summary.html#restricted
> 
> Link Checker Report
> Checked 3446 files.
> Found 445059 references to 48205 anchors in 5770 files and 64 other URIs.
>      1 duplicate ids
>      3 missing ids
> 
> Hosts
>     20 docs.oracle.com
>      1 tools.ietf.org
>      1 www.ietf.org
>      1 jcp.org
>      4 www.rfc-editor.org
>      7 unicode.org
>     10 www.unicode.org
>     20 www.w3.org
> Exception running test test: java.lang.Exception: One or more HTML checkers failed: [java.lang.RuntimeException: Tidy found errors in the generated HTML, java.lang.RuntimeException: LinkChecker encountered errors. Duplicate IDs: 1, Missing IDs: 3, Missing Files: 0, Bad Schemes: 0]
> java.lang.Exception: One or more HTML checkers failed: [java.lang.RuntimeException: Tidy found errors in the generated HTML, java.lang.RuntimeException: LinkChecker encountered errors. Duplicate IDs: 1, Missing IDs: 3, Mi...

I'm not quite done reviewing this, but here's a first batch of comments. By and large I like what I see, the comments are mostly about details.

test/docs/jdk/javadoc/doccheck/DocCheck.java line 104:

> 102:         if (!checks.isEmpty()) {
> 103:             if (checks.contains(",")) {
> 104:                 EXCLUDE_LIST.addAll(Arrays.asList(checks.split(",")));

It looks like the content of `EXCLUDE_LIST` is never used.

test/docs/jdk/javadoc/doccheck/DocCheck.java line 115:

> 113:             badchars = true;
> 114:             doctype = true;
> 115:         }else {

Missing space before `else` above, and in several places after `if` in the lines below.

test/docs/jdk/javadoc/doccheck/doccheckutils/HtmlChecker.java line 32:

> 30:  * <p>
> 31:  * For details on HTML syntax and the terms used in this API, see
> 32:  * W3C <a href="https://www.w3.org/TR/html5/syntax.html#syntax">The HTML syntax</a>.

This URL redirects to https://html.spec.whatwg.org/multipage/syntax.html#syntax. Should we use that URL instead?

test/docs/jdk/javadoc/doccheck/doccheckutils/Log.java line 43:

> 41: 
> 42:     public void log(Path path, int line, String message, Object... args) {
> 43:         errors.add(formatErrorMessage(path, line, message, args));

It's a strange that this class is called `Log` and it has several `log` methods, but they are also used to report and track errors. It seems like some checkers use this class to track errors, while others use it purely for logging. Maybe the two features should be separated, for example by adding a dedicated `logError` method?

test/docs/jdk/javadoc/doccheck/doccheckutils/Log.java line 88:

> 86:     }
> 87: 
> 88:     public Path againstBaseDir(Path path) {

Maybe call this `getRelativePath`, or `relativize` to be even shorter? 

I notice that this method is only used by `LinkChecker`, mostly before passing a `Path` argument to `log(...)`.  Should it be moved to `LinkChecker`, or maybe kept here and but invoked locally by `log` methods on `Path` arguments?

test/docs/jdk/javadoc/doccheck/doccheckutils/checkers/BadCharacterChecker.java line 122:

> 120:                 return Charset.forName(m2.group(1));
> 121:             }
> 122:             return html5 ? StandardCharsets.UTF_8 : StandardCharsets.ISO_8859_1;

What is the basis for assuming ISO-8859-1 for non-HTML5 files?

test/docs/jdk/javadoc/doccheck/doccheckutils/checkers/DocTypeChecker.java line 76:

> 74:                     + "\\s+html"
> 75:                     + "\\s+([a-z]+)"
> 76:                     + "\\s+\"([^\"]+)\""

Doctype legacy string can also use single quotes. Accepting that would make the test more robust, although making sure start and end quotes match could be difficult using a regex.

test/docs/jdk/javadoc/doccheck/doccheckutils/checkers/DocTypeChecker.java line 77:

> 75:                     + "\\s+([a-z]+)"
> 76:                     + "\\s+\"([^\"]+)\""
> 77:                     + "(?:\\s+\"([^\"]+)\")?"

Where is this part of the doctype specified?

test/docs/jdk/javadoc/doccheck/doccheckutils/checkers/DocTypeChecker.java line 114:

> 112:         log.log("DocType Report");
> 113:         if (xml > 0) {
> 114:             log.log("%6d: XHTML%n", xml);

All format strings in this method have a trailing newline `%n` which is not necessary, as `log` prints a line break.

What would be nice to have is an empty line before and after the report output.

test/docs/jdk/javadoc/doccheck/doccheckutils/checkers/DocTypeChecker.java line 155:

> 153:     public void close() throws IOException {
> 154:         if (!isOK()) {
> 155:             report();

I think `report()` should be always called here like in the other checkers, even if nothing fails.

test/docs/jdk/javadoc/doccheck/doccheckutils/checkers/LinkChecker.java line 45:

> 43:     private final Map<Path, IDTable> allFiles;
> 44:     private final Map<URI, IDTable> allURIs;
> 45:     private boolean checkInwardReferencesOnly;

This field is always `false`.

test/docs/jdk/javadoc/doccheck/doccheckutils/checkers/LinkChecker.java line 62:

> 60:     }
> 61: 
> 62:     public void setBaseDirWereChecking(Path dir) {

Is this supposed to be "Where"? I think just `setBaseDirectory` or `setBaseDir` would be fine.

test/docs/jdk/javadoc/doccheck/doccheckutils/checkers/LinkChecker.java line 101:

> 99:     @Override
> 100:     public void xml(int line, Map<String, String> attrs) {
> 101:         xml = true;

The `xml` field is never used.

-------------

PR Review: https://git.openjdk.org/jdk/pull/21879#pullrequestreview-2433154779
PR Review Comment: https://git.openjdk.org/jdk/pull/21879#discussion_r1840325296
PR Review Comment: https://git.openjdk.org/jdk/pull/21879#discussion_r1840293113
PR Review Comment: https://git.openjdk.org/jdk/pull/21879#discussion_r1840463728
PR Review Comment: https://git.openjdk.org/jdk/pull/21879#discussion_r1842101129
PR Review Comment: https://git.openjdk.org/jdk/pull/21879#discussion_r1840735356
PR Review Comment: https://git.openjdk.org/jdk/pull/21879#discussion_r1841728127
PR Review Comment: https://git.openjdk.org/jdk/pull/21879#discussion_r1841753792
PR Review Comment: https://git.openjdk.org/jdk/pull/21879#discussion_r1841754158
PR Review Comment: https://git.openjdk.org/jdk/pull/21879#discussion_r1840975250
PR Review Comment: https://git.openjdk.org/jdk/pull/21879#discussion_r1840938158
PR Review Comment: https://git.openjdk.org/jdk/pull/21879#discussion_r1842081816
PR Review Comment: https://git.openjdk.org/jdk/pull/21879#discussion_r1840727993
PR Review Comment: https://git.openjdk.org/jdk/pull/21879#discussion_r1840716194


More information about the javadoc-dev mailing list