RFR: 8346118: Improve whitespace normalization in preformatted text

Wed Mar 5 20:48:57 UTC 2025

On Mon, 3 Mar 2025 16:41:18 GMT, Hannes Wallnöfer <hannesw at openjdk.org> wrote:

> Please review an enhancement to make `DocCommentParser` normalize whitespace inside `<pre>` elements. The normalization is conceptually simple and and intended to be minimally invasive. Before parsing, `DocCommentParser` checks whether the text is a traditional doc comment and whether every line starts with a space character, which is commonly the case in traditional doc comments. If so, a single leading space is removed in block content (top level text and `{@code}`/`{@literal}` tags) when parsing within HTML `<pre>` tags.
> 
> This fixes the incidental one-space indentation in the vast majority of JDK code samples using `<pre>` alone or in combination with `<code>` or `{@code}`. In fact, I only found one code sample in JDK code that isn't solved by this change, for which I included a fix in this PR (it's in `String.startsWith(String, int)`, where I replaced the 10 char indentation and trailing line with a `<blockquote>`). 
> 
> The many added `boolean inBlockContent` arguments pased around in `DocCommentParser` are to make sure the removal is not applied to multiline inline content, which is maybe a bit fussy considering there is not a lot of multiline inline content in `<pre>` tags and it usually would not mind about removal of a non-essential space character, but I wanted to keep the change minimal. There are few javadoc tests that had to be adapted, most of the testing is done in `test/langtools/tools/javac/doctree`. 
> 
> If the exact number of leading whitespace in `<pre>` tags is important to any javadoc user the old output can be restored by increasing the indentation by 1. There will be a release note for this of course. 
> 
> Unfortunately, there is another whitespace problem that can't be solved as easily, and that is a leading blank line caused by `<pre><code>\n` open tags. Browsers will [ignore a newline immediately following a `<pre>` tag][1], but not if there is a `<code>` tag in between. There are hundreds of occurrences of this in JDK code, including variants with space characters mixed in. The fix in javadoc proper would be too complex, so I decided to solve it with 3 lines of JavaScript and a regex to reverse the order of `<code>\n` at the beginning of `<pre>` tags while removing any intermediary space. Script operation is indiscernible and it solves the problem.
> 
> [1]: https://html.spec.whatwg.org/#the-pre-element:the-pre-element

Thanks for reviewing this, Jon. I agree there are two issues that maybe shold be handled separately. 

Regarding the first issue, stripping leading space in `DocCommentParser`, I did in fact start out processing the whole doc comment. I was pretty far into fixing tests when I realized that this approach breaks our ability to get a diagnostic position from a string extracted from the comment. The source postion mapping is generated in `JavadocScanner`, and it would be quite complex to keep track of whitespace removed after the fact and correctly translate to a source position. 

The test that made me realize this problem was [TestDocTreeDiags.java], which tests reporting positions inside strings extracted from the comment (added in [JDK-8268420]) , but with wholesale whitespace removal also start positions are affected. I should have mentioned this in the PR description, but it was already long enough as it was.

[TestDocTreeDiags.java]: https://github.com/openjdk/jdk/blob/master/test/langtools/jdk/javadoc/doclet/testDocTreeDiags/TestDocTreeDiags.java
[JDK-8268420]: https://bugs.openjdk.org/browse/JDK-8268420

This is why I reverted to the proposed minimalistic solution, limiting space removal to text and `{@code}`/`{@literal}` tags within `<pre>` elements. These are all doctrees that are passed though without further processing, so it is highly unlikely that anybody would ever want to report a source position inside of them. The begin postions will be correct since we're parsing the unchanged doc comment, but positions inside the string after the first line break would report a wrong position. This is why I filed [a CSR to amend the spec for above mentioned method][CSR] with a note about this. 

[CSR]: https://bugs.openjdk.org/browse/JDK-8350428

Regarding the second issue, I agree the JavaScript solution is not ideal. But it's a small price to pay in order to get readable and good looking documentation. We have close to 1000 instances of `<pre>{@code}` in JDK source code alone, and they will never get fixed and continue to haunt our otherwise great documentation. The same problem affects most Java libraries out there. You see I'm quite passionate about this, but I agree to move it to a separate issue if you think that's the best way to handle this.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23868#issuecomment-2702033467