RFR: JDK-8288699: cleanup HTML tree in HtmlDocletWriter.commentTagsToContent [v2]

Fri Jul 8 09:56:49 UTC 2022

On Wed, 6 Jul 2022 15:24:53 GMT, Jonathan Gibbons <jjg at openjdk.org> wrote:

>> Please review a particularly satisfying review of a cleanup for
>> `HtmlDocletWriter.commentTagsToContent`, and the use of `RawHtml`.
>> 
>> The background for this is a separate cleanup for `CommentHelper.getText`,
>> during which it was observed that the code could be eliminated, with some
>> uses being replaced by `HtmlDocletWriter.commentTagsToContent`. That led
>> to more tags being processed, which in turn led to the observation that
>> `Content.charCount` was often giving incorrect results.
>> 
>> The primary root cause of the problem with `Content.charCount` was the
>> handling of `StartElementTree` in the visitor in `commentTagsToContent`.
>> The problem is that code there creates a sequence of text nodes for the
>> attributes, preceded by a `RawHtml` object containing `<TAGNAME` and followed
>> by another separate `RawHtml` node containing just `>`.  The net result
>> is that `charCount` is mistaken into including the size of the intervening
>> text nodes in the total, because it does not recognize the "unbalanced" `<`
>> and `>` in the surrounding nodes.
>> 
>> The general system fix for `commentTagsToContent` is to change the _visit..._
>> methods to always append to the `Content` object supplied as a parameter, instead
>> of to a fixed local variable (`result`) in the enclosing scope. This allows
>> `visitStartElement` to accumulate the attributes in a temporary buffer,
>> and then create a single `RawHtml` node as the translation of the
>> `StartElementTree`.  Changing the methodology of the visitors also
>> required disentangling the processing of `{@docRoot}`, which generally
>> always appears inside the `href` attribute, typically followed by
>> the rest of the URL. The complication there is that since forever there
>> has been special handling of `{@docRoot}/..` which can be replaced by
>> the value of the `--docrootparent` option. In reviewing that part of the code,
>> it may help to just ignore the old code and just review the new replacement
>> code.
>> 
>> Starting with the code in `commentTagsToContent`, it became worthwhile to
>> use factory methods on `RawHtml` for stylized uses of the class, for _start
>> element_, _end element_, and _comment_. These all have a `charCount` of zero,
>> so it only becomes necessary to compute the character count when given
>> arbitrary text.
>> 
>> Related, the `Text` and `TextBuilder` classes are changed so that they 
>> contain unescaped content, and only escape the HTML characters when they
>> are written out. This allows their `charCount` to simply return the `length()`
>> of the string content, instead of having to parse the content to account
>> of any entities that may be present.  This also means that newline characters
>> will be counted; previously, they were not.
>> 
>> Another noteworthy change. Somewhat usually, some resource strings for the
>> description of mandated methods, such as for enum and record classes, contain
>> simple HTML (`<i>...</i>`), which is modelled in a single `TextTree` and
>> subsequently converted to a non-standard use of `RawHtml`. To avoid
>> significantly changing the resource strings, the solution is to parse the
>> resources into a suitable `List<DocTree>`, using a relatively simple regular
>> expression. The goal is to _not_ extend the support any further than that
>> provided.
>> 
>> Finally, one more change/bug fix. We do not claim to support HTML CDATA
>> sections, but there is one in the JDK API docs (in `coll-index.html`). 
>> It "mostly worked" by accident, but dropped the leading `<`, which "broke"
>> the section in the generated output. I've put code into `DocCommentParser`
>> to handle CDATA sections, representing them (for now) as a `Text` node
>> containing just the section itself. This is then handled specially up in
>> `commentTagsToContent`, where it is translated to a new `RawHtml` node.
>> 
>> For reviewing, I suggest starting with `RawHtml` and then `HtmlDocletWriter`.
>> Many of the other changes are derivative changes, such as changing
>> `new RawHtml(...)` to `RawHtml.of(...)` and similar other changes.
>> 
>> All tests continue to pass without change.
>> 
>> There are minor changes to the generated docs, in two flavors.
>> 
>> 1. There is one instance, in `Collectors.html`, where the fixes to `charCount`
>>    are enough to change the separator after the type parameters in a 
>>    method summary table from ` ` to `<br>`. 
>>    See `AbstractMemberWriter` line 207. The current threshold is 10, and
>>    some of the methods in `Collectors.html` were less and are now more.
>> 
>> 2. A number of files use an unescaped `>` in doc comments. These are now
>>    translated to `>` in the output.  The change in behavior is because of
>>    changing to use `html.Text` nodes instead of `html.RawHtml` nodes to represent
>>    user-provided text in `TextTree` nodes.
>>    Previously, the generated output was inconsistent in this area: most
>>    instances used `>`, but user-provided `>` was passed through.
>>    Now, the output is consistent. 
>>    FWIW, `tidy` prefers the use of `>` as well.
>
> Jonathan Gibbons has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
> 
>  - Add missing resource and example of use
>    Fix Text and TextBuilder for Windows newlines
>  - Merge remote-tracking branch 'upstream/master' into 8288699.commentTagsToContent
>  - JDK-8288699: cleanup HTML tree in HtmlDocletWriter.commentTagsToContent

The changes in the additional commit c312880 look good to me.

-------------

Marked as reviewed by hannesw (Reviewer).

PR: https://git.openjdk.org/jdk/pull/9210