RFR: JDK-8298405: Support Markdown in Documentation Comments [v4]

Wed Nov 15 00:12:31 UTC 2023

On Wed, 8 Nov 2023 10:59:58 GMT, Pavel Rappo <prappo at openjdk.org> wrote:

>> Jonathan Gibbons has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Fix whitespace
>
> test/langtools/tools/javac/doctree/FirstSentenceTest.java line 423:
> 
>> 421: DocComment[DOC_COMMENT, pos:0
>> 422:   firstSentence: 1
>> 423:     RawText[MARKDOWN, pos:0, abc.|def.]
> 
> It took me a while to understand why this test expects the first sentence to include the second line:
> 
>     ///abc.
>     ///def.
>     void simpleMarkdown() { }
> 
> It's not because it's Markdown or the new `///` comment syntax. It's because the breakiterator thinks that `abc.\ndef.` is one sentence.
> 
> Now, in this same file, on [line 123](https://github.com/openjdk/jdk/blob/128363bf3b57dfa05b3807271b47851733c1afb9/test/langtools/tools/javac/doctree/FirstSentenceTest.java#L119-L142), there's this test:
> 
>     /**
>      * abc def ghi.
>      * jkl mno pqr
>      */
>     void dot_newline() { }
> 
> If you compare its expectation to that of simpleMarkdown(), you'll see that the first sentence consists of the first line only:
> 
>     BREAK_ITERATOR
>     DocComment[DOC_COMMENT, pos:1
>       firstSentence: 1
>         Text[TEXT, pos:1, abc_def_ghi.]
>       body: 1
>         Text[TEXT, pos:15, jkl_mno_pqr]
>       block tags: empty
>     ]
> 
> So, why does it seem to work differently for `///` and `/** */` comments? It turns out that a whitespace character immediately after newline is important for breakiterator to do its thing:
> ```    
>     abc def ghi.\n jkl mno pqr
>                   ^
> 
> The problem with the simpleMarkdown test is that we cannot just indent the comment. That indentation will be deemed incidental whitespace and, thus, will removed. For example, suppose we do this:
> 
>     /// abc.
>     /// def.
>     void simpleMarkdown() { }
> 
> The breakiterator will still see `abc.\ndef.`. Of course, we could do this:
> 
>     ///abc.
>     /// def.
>     void simpleMarkdown() { }
> 
> But it would be a different comment.
> 
> There's another test in this PR, [TestMarkdown.testFirstSentence](https://github.com/openjdk/jdk/blob/128363bf3b57dfa05b3807271b47851733c1afb9/test/langtools/jdk/javadoc/doclet/testMarkdown/TestMarkdown.java#L68-L83). There, the first sentence consists only of the first line. This is likely because the second line starts with a capital letter.

Good and interesting catch.

As a general matter, there are two scenarios to consider: using a system-provided `BreakIterator` and using the fallback "simple iterator" in DocTreeMaker.getSentenceBreak.  

The latter is just a period followed by a single whitespace, or a paragraph break if there is no period.

The system-provided `BreakIterator` is locale specific. I cannot easily find a public spec of the sentence breaker for the US locale, but experiments indicate it wants one of 

* period, single whitespace, uppercase letter
* period, multiple whitespace

Arguably, the test is lazy/broken for using all lower case, which is an atypical form for sentences and use of a sentence breaker. I'll modify at least the new parts of the test (using Markdown comments.). We can separately decide whether to likewise modify the original parts (using traditional comments) or whether to do that as a separate `noreg-self` bug fix for this test.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16388#discussion_r1393466027