Musings on 8232447: The javadoc parser ends the first sentence of a comment too soon

Wed May 13 21:01:05 UTC 2020

Agreed that we should have a "Guidelines for writing good doc comments" 
document, somewhere.
I'll leave it to others to decide if it is in scope for the proposed 
Developers Guide project.

-- Jon

On 5/13/20 12:50 PM, Roger Riggs wrote:
> Hi Pavel,
>
> I'd suggest that it is in the scope of the proposed Developers Guide 
> project
> to describe how to write specs and documentation for OpenJDK.
> Personally, I lean toward the "should" side of things, giving developers
> leeway to communicate effectively about their APIs.
>
> Roger
>
>
> On 5/13/20 2:41 PM, Pavel Rappo wrote:
>> Thanks for chiming in, Roger.
>>
>>> On 13 May 2020, at 18:30, Roger Riggs <Roger.Riggs at oracle.com> wrote:
>>>
>>> Hi,
>>>
>>> The first sentence is not just any old sentence.
>>> It has a very specific role to play in the javadoc: both to introduce
>>> the class, method, field, etc.
>>> AND to stand independently when used in a summary.
>>> That places a responsibility on the author to craft the sentence for
>>> those purposes.
>>> The author should review their work in the generated javadoc, the
>>> summary tables, etc.
>>> before feeling satisfied and moving on.
>>> IMHO the first sentence should be short and to the point and not
>>> include markup or
>>> extra explanatory phrases (such as e.g.).
>> 1. Just to be clear: does this fall into the "SHOULD" or the "MUST"
>> category? If the latter, then this MUST be specified, probably
>> differently than what we have today in the Documentation Comment
>> Specification for the Standard Doclet [^1]:
>>
>>> The first sentence of the initial description should be a summary 
>>> sentence that contains a concise but complete description of the 
>>> declared entity. Descriptive text may include HTML tags and 
>>> entities, and inline tags as described below.
>> If this is the former, then we need more guidance. Perhaps plenty of
>> examples, including DOs and DON'Ts, as summarizing a complete doc
>> comment into a single sentence can be challenging. Especially if we
>> disallow markup, restrict formatting, and disapprove of familiar tools,
>> such as abbreviations, which are freely used in written language.
>>
>> Come to think of it, if it is that important then we should think of 
>> teaching doclint (or some other tool) to check that.
>>
>> 2. We should think about what to do with doc comments not following 
>> those rules (conventions?) in the OpenJDK codebase.
>>
>>> I don't think the tools should try to be as understanding as
>>> the reader or to compensate for the shortcomings of the author.
>> Neither do I, and I believe I made my position clear in that text.
>>
>> -Pavel
>>
>> [^1]: 
>> https://docs.oracle.com/en/java/javase/14/docs/specs/javadoc/doc-comment-spec.html
>>
>>> $.02, Roger
>>>
>>>
>>> On 5/13/20 12:20 PM, Jonathan Gibbons wrote:
>>>> Pavel,
>>>>
>>>> Good write up.   You should link to this from 8232447.
>>>>
>>>> -- Jon
>>>>
>>>> On 5/13/20 7:44 AM, Pavel Rappo wrote:
>>>>> The issue:
>>>>>
>>>>>       https://bugs.openjdk.java.net/browse/JDK-8232447
>>>>>
>>>>> The more I think about this issue, the less I feel like solving
>>>>> it. On the one hand, that problem is more complicated than it
>>>>> looks. On the other hand, solving that problem doesn't seem to be
>>>>> that important, since it's about making a best effort to improve
>>>>> presentation. I'm leaning towards a solution that is good enough
>>>>> (possibly, the one that we already have) or reconsidering the
>>>>> problem altogether.
>>>>>
>>>>> Here's what the problem is about. JavaDoc extracts summaries from 
>>>>> doc comments to place them on documentation pages to assist quick 
>>>>> scans by humans (think Table of Contents with descriptive 
>>>>> headings). Since JavaDoc does not understand the meaning of doc 
>>>>> comments, to extract a summary it relies on a convention [^0] that 
>>>>> the first sentence of a doc comment is that doc comment's summary. 
>>>>> The problem is that sometimes JavaDoc gets that first sentence 
>>>>> wrong. For example, according to JavaDoc, the first sentence of 
>>>>> this doc comment for `GraphicsEnvironment.preferProportionalFonts` 
>>>>> [^1]
>>>>>
>>>>>> Indicates a preference for proportional over non-proportional 
>>>>>> (e.g. dual-spaced CJK fonts) fonts in the mapping of logical 
>>>>>> fonts to physical fonts. If the default mapping contains fonts 
>>>>>> for which proportional and non-proportional variants exist, then 
>>>>>> calling this method indicates the mapping should use a 
>>>>>> proportional variant.
>>>>> is
>>>>>
>>>>>> Indicates a preference for proportional over non-proportional (e.g.
>>>>> Now, why does this happen? Unless a more sophisticated mechanism
>>>>> is requested or the locale's language is not English, JavaDoc uses
>>>>> a simple "dot-space" algorithm to detect a sentence boundary. That
>>>>> algorithm scans input from left to right looking for a dot
>>>>> character followed by whitespace. While it looks reasonable, in
>>>>> the above case it is clearly inadequate: the dot that ends "e.g."
>>>>> is followed by a space, so the scan stops there.
>>>>>
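For readers following along, here is a minimal sketch of a dot-space scan
as described above. It is an illustration only, not JavaDoc's actual code;
the class and method names (`DotSpaceSketch`, `firstSentence`) are made up
for this example.

```java
class DotSpaceSketch {

    /**
     * Returns the text up to and including the first '.' that is
     * followed by whitespace, or the whole text if there is none.
     */
    static String firstSentence(String text) {
        for (int i = 0; i < text.length() - 1; i++) {
            if (text.charAt(i) == '.'
                    && Character.isWhitespace(text.charAt(i + 1))) {
                return text.substring(0, i + 1);
            }
        }
        return text;
    }

    public static void main(String[] args) {
        String comment =
                "Indicates a preference for proportional over non-proportional "
                + "(e.g. dual-spaced CJK fonts) fonts in the mapping of logical "
                + "fonts to physical fonts. If the default mapping contains ...";
        // The scan stops at the dot that ends "e.g.", not at "fonts."
        System.out.println(firstSentence(comment));
    }
}
```

Run on the `preferProportionalFonts` comment, the sketch returns the same
truncated text quoted above, ending in `(e.g.`.
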
>>>>> At this point, the reader might say: "Pfft. I know how to fix
>>>>> this." Please bear with me and I'll show you that the problem is
>>>>> actually multilayered. It involves not only a sentence
>>>>> segmentation algorithm [^2], but also the input that the algorithm
>>>>> is fed, as well as the structure and quality of the doc comments
>>>>> that input is created from.
>>>>>
>>>>> Instead of jumping head-first into augmenting the "dot-space" 
>>>>> algorithm with more heuristics, let's try one more thing. If 
>>>>> instructed to do so or the locale's language is not English, 
>>>>> JavaDoc uses `BreakIterator` [^3]. That `java.text` mechanism is 
>>>>> specifically designed to find various boundaries in text. When 
>>>>> `BreakIterator` is turned on (and after additional tweaking),
>>>>> JavaDoc gets that first sentence about "proportional fonts" right;
>>>>> however, other issues show up. Consider the following comment for
>>>>> `FocusTraversalPolicy.getComponentAfter` [^4]:
>>>>>
>>>>>> Returns the Component that should receive the focus after 
>>>>>> aComponent. aContainer must be a focus cycle root of aComponent 
>>>>>> or a focus traversal policy provider.
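To try this outside JavaDoc, a minimal sketch that feeds the raw comment
text straight to `java.text.BreakIterator`. This is an assumption-laden
approximation: JavaDoc works on parsed comment nodes and applies its own
tweaks, so its results may differ, and the exact boundary can depend on the
JDK's break rules. The class name `BreakIteratorSketch` is made up.

```java
import java.text.BreakIterator;
import java.util.Locale;

class BreakIteratorSketch {

    /** Returns the first sentence of the text according to BreakIterator. */
    static String firstSentence(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        int start = it.first();
        int end = it.next();
        // next() returns DONE when there is no further boundary.
        return end == BreakIterator.DONE ? text : text.substring(start, end);
    }

    public static void main(String[] args) {
        String comment =
                "Returns the Component that should receive the focus after "
                + "aComponent. aContainer must be a focus cycle root of "
                + "aComponent or a focus traversal policy provider.";
        // Brackets make trailing whitespace in the result visible.
        System.out.println("[" + firstSentence(comment, Locale.ENGLISH) + "]");
    }
}
```
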
>>>>> Here `BreakIterator` thinks that the whole paragraph is a single
>>>>> sentence. This is because in English sentences begin with capital
>>>>> letters, and the second sentence here begins with the lowercase
>>>>> identifier "aContainer". I should pause here. This is an important
>>>>> moment. While some doc comments may indeed have typos,
>>>>> irregularities, or quality issues, that doc comment about
>>>>> "aComponent" has none of those. It's genuine and consists of a
>>>>> couple of sentences that humans recognize easily but that do not
>>>>> strictly abide by the rules of English grammar. To me, this (and
>>>>> other experiments with `BreakIterator` I've done) shows that doc
>>>>> comments are not your regular prose. Unsurprisingly, even a
>>>>> specialized text tool doesn't grok them. (Which makes me wonder if
>>>>> that was one of the reasons why `BreakIterator` is turned off by
>>>>> default.) Add indentation and markup on top of that and you'll see
>>>>> why the ultimate form that JavaDoc has to work with is not a
>>>>> string but something like this:
>>>>>
>>>>>       list size = 10
>>>>>        0 = {DCTree$DCStartElement} "<code>"
>>>>>        1 = {DCTree$DCText} "DOMLocator"
>>>>>        2 = {DCTree$DCEndElement} "</code>"
>>>>>        3 = {DCTree$DCText} " is an interface that describes a location (e.g.\n where an error occurred).\n "
>>>>>        4 = {DCTree$DCStartElement} "<p>"
>>>>>        5 = {DCTree$DCText} "See also the "
>>>>>        6 = {DCTree$DCStartElement} "<a href='http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407'>"
>>>>>        7 = {DCTree$DCText} "Document Object Model (DOM) Level 3 Core Specification"
>>>>>        8 = {DCTree$DCEndElement} "</a>"
>>>>>        9 = {DCTree$DCText} "."
>>>>>
>>>>> Continuous text we see on a documentation page [^5] in a browser 
>>>>> comes from a representation such as the above, where the text can 
>>>>> be scattered across various AST nodes. This has interesting 
>>>>> implications. Consider the following doc comment (note the 
>>>>> whitespace after `comment.`):
>>>>>
>>>>>       /** This is the first sentence of this <i>comment. </i> This is the second sentence. */
>>>>>
>>>>> Both the simple "dot-space" algorithm and `BreakIterator` fail to
>>>>> extract the first sentence here, producing the exact same result
>>>>> consisting of both sentences. When `.` is moved immediately after
>>>>> the closing `</i>`, they both extract the first sentence
>>>>> correctly. However, the HTML output breaks (note the absence of
>>>>> the closing `</i>`):
>>>>>
>>>>>       <div class="block">This is the first sentence of this <i>comment.</div>
>>>>>
>>>>> This is partly because JavaDoc does not interpret HTML. Instead, 
>>>>> it uses a hybrid approach that applies a sentence segmentation 
>>>>> algorithm as an auxiliary step to individual text nodes (not 
>>>>> necessarily the whole text) while maintaining awareness of the 
>>>>> surrounding nodes. The fact that nodes preserve the indentation
>>>>> and formatting of the original doc comment makes things worse, as
>>>>> whitespace is significant in sentence segmentation. No wonder
>>>>> JavaDoc hardly sees the forest for the syntax trees! Perhaps a
>>>>> more careful way of doing that would be as follows:
>>>>>
>>>>>     1. Interpret markup as text.
>>>>>     2. Apply sentence segmentation to that text to find the first 
>>>>> sentence.
>>>>>     3. Map that first sentence back to markup to accurately 
>>>>> extract the corresponding portion.
>>>>>
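The three steps listed above can be sketched roughly as follows. This is a
toy under loud assumptions, not a proposal for JavaDoc's implementation:
tags are well nested, every `<` starts a tag, there are no void elements,
entities, or inline tags such as `{@code}`, and dot-space stands in for the
segmentation step. All names (`MarkupAwareSummarySketch`, `closingTagsFor`)
are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class MarkupAwareSummarySketch {

    /** Returns the markup of the first sentence, closing any open tags. */
    static String firstSentence(String markup) {
        // Step 1: interpret markup as text, remembering where each
        // plain-text character came from in the original markup.
        StringBuilder plain = new StringBuilder();
        int[] origin = new int[markup.length()];
        boolean inTag = false;
        for (int i = 0; i < markup.length(); i++) {
            char c = markup.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                origin[plain.length()] = i;
                plain.append(c);
            }
        }

        // Step 2: apply sentence segmentation (here: dot-space) to the text.
        int dot = -1;
        for (int k = 0; k < plain.length() - 1; k++) {
            if (plain.charAt(k) == '.'
                    && Character.isWhitespace(plain.charAt(k + 1))) {
                dot = k;
                break;
            }
        }
        if (dot < 0) {
            return markup; // no boundary found: keep everything
        }

        // Step 3: map the boundary back to the markup and cut there.
        String cut = markup.substring(0, origin[dot] + 1);
        return cut + closingTagsFor(cut);
    }

    /** Builds closing tags for elements still open at the end of fragment. */
    static String closingTagsFor(String fragment) {
        Deque<String> open = new ArrayDeque<>();
        int i = 0;
        while ((i = fragment.indexOf('<', i)) >= 0) {
            int end = fragment.indexOf('>', i);
            if (end < 0) {
                break;
            }
            String tag = fragment.substring(i + 1, end).trim();
            if (tag.startsWith("/")) {
                if (!open.isEmpty()) {
                    open.pop();
                }
            } else if (!tag.isEmpty() && !tag.endsWith("/")) {
                open.push(tag.split("\\s+")[0]); // element name only
            }
            i = end + 1;
        }
        StringBuilder closers = new StringBuilder();
        while (!open.isEmpty()) {
            closers.append("</").append(open.pop()).append('>');
        }
        return closers.toString();
    }
}
```

With the `<i>comment. </i>` example from earlier, this sketch finds the
boundary in the plain text and re-closes the `<i>` element it cut through.
It does nothing for cases where the sentence cannot be lifted out of its
context at all.
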
>>>>> But even that won't magically solve all the issues, as it's not
>>>>> possible to decompose arbitrary markup into independent
>>>>> components. Consider the following doc comment:
>>>>>
>>>>>       /**
>>>>>        * <table class="comment">
>>>>>        *     <tr>
>>>>>        *        <td><i>Is this the first sentence?</i></td>
>>>>>        *        <td>Is this the second sentence?</td>
>>>>>        *     </tr>
>>>>>        *     <tr>...</tr>
>>>>>        *  </table>
>>>>>        ...
>>>>>
>>>>> Even if we find that "first sentence", can we safely extract it 
>>>>> from its table-context? And all this is just the structure layer 
>>>>> of the problem.
>>>>>
>>>>> The next layer is ambiguities. Unless extreme measures are taken,
>>>>> those are only resolvable by a human, sometimes by an expert in
>>>>> the area the documentation relates to. Using abbreviations such as
>>>>> "etc.", "e.g.", "i.e.", and "vs." is part of the issue. Early
>>>>> guides [^6] on JavaDoc advised against using abbreviations. While
>>>>> I can now see one of the reasons for this advice, people use them
>>>>> anyway. Some might say that abbreviations can be more succinct and
>>>>> practical. For instance, "etc." is shorter than "and so on", "and
>>>>> so forth", or "and so on and so forth", and is even pronounced
>>>>> literally as "et cetera" in speech. Non-standard grammar in
>>>>> abbreviations aggravates the issue. For instance, is "ie" a
>>>>> misspelt "i.e.", an initialism of Internet Explorer, or the
>>>>> top-level domain name of the Republic of Ireland? Is "etc" a
>>>>> misspelt "etc." or rather the `/etc` directory from the UNIX
>>>>> Filesystem Hierarchy Standard? (When scanning the OpenJDK repo for
>>>>> occurrences of "etc." in comments, I found that it can be written
>>>>> with anywhere from 0 to 4 dots. The latter could be explained as
>>>>> an ellipsis `...` followed by a dot `.`, a faulty keyboard, or
>>>>> perhaps a muscle twitch.)
>>>>>
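To make the abbreviation problem concrete, here is a hypothetical
dot-space variant that skips a small, hard-coded list of abbreviations.
Nothing like this is claimed to exist in JavaDoc; the list and the names
(`AbbreviationAwareSketch`, `ABBREVIATIONS`) are assumptions made for
illustration.

```java
import java.util.Locale;
import java.util.Set;

class AbbreviationAwareSketch {

    // Illustrative list only; any real list would be incomplete.
    private static final Set<String> ABBREVIATIONS =
            Set.of("e.g.", "i.e.", "etc.", "vs.");

    /** Dot-space scan that ignores dots ending a known abbreviation. */
    static String firstSentence(String text) {
        for (int i = 0; i < text.length() - 1; i++) {
            if (text.charAt(i) != '.'
                    || !Character.isWhitespace(text.charAt(i + 1))) {
                continue;
            }
            // Walk back to the start of the word that ends with this dot.
            int wordStart = i;
            while (wordStart > 0
                    && !Character.isWhitespace(text.charAt(wordStart - 1))
                    && text.charAt(wordStart - 1) != '(') {
                wordStart--;
            }
            String word = text.substring(wordStart, i + 1);
            if (!ABBREVIATIONS.contains(word.toLowerCase(Locale.ROOT))) {
                return text.substring(0, i + 1);
            }
        }
        return text;
    }
}
```

This handles the "(e.g. dual-spaced CJK fonts)" case, but it now misses a
sentence that legitimately ends with "etc.", which is the ambiguity
described above: the heuristic trades one wrong answer for another.
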
>>>>> The final layer is typos and low-quality comments. What proportion
>>>>> of doc comments follow that convention about the first sentence?
>>>>> What proportion of comments respect grammar or have a meaningful
>>>>> structure? While we shouldn't aim for a solution that rights the
>>>>> wrongs of bad comments (i.e. Garbage In, Garbage Out), this is
>>>>> something to keep in mind:
>>>>>
>>>>>       /**
>>>>>        * this function draws the border around each tab
>>>>>        * note that this function does now draw the background of the tab.
>>>>>        * that is done elsewhere
>>>>>        ...
>>>>>        */
>>>>>        protected void paintTabBorder(Graphics g, int tabPlacement, ...
>>>>>
>>>>> There are things we can do to remediate that problem on the doc
>>>>> comments side of the equation: reasonable conventions that are
>>>>> adhered to, better structure of doc comments, or hints. For
>>>>> example, placing a newline or more than a single whitespace
>>>>> character after the first sentence. Or indicating the summary part
>>>>> of a doc comment with the relatively new `{@summary}` tag. That
>>>>> said, all of those might have problems of their own. They are
>>>>> intrusive and require re-documenting existing code, which is not
>>>>> always possible. In addition to that, `{@summary}` cannot contain
>>>>> nested markup, which is quite often used in the summary part. For
>>>>> example:
>>>>>
>>>>>       /**
>>>>>        * Returns the runtime class of this {@code Object}. The returned
>>>>>        * {@code Class} object is the object that is locked by {@code
>>>>>        * static synchronized} methods of the represented class.
>>>>>        ...
>>>>>        */
>>>>>        public final native Class<?> getClass();
>>>>>
>>>>> or
>>>>>
>>>>>       /**
>>>>>        * An ordered collection (also known as a <i>sequence</i>).
>>>>>        ...
>>>>>        */
>>>>>       public interface List<E> extends Collection<E> { ...
>>>>>
>>>>> Whatever solution we choose, there's a risk of playing a
>>>>> whac-a-mole game. Maybe we should aim for a solution that is
>>>>> good enough (possibly, the one that we already have) or reconsider
>>>>> the problem altogether. For instance, do not extract the first
>>>>> sentence (unless it can be done reliably). Instead, get the first
>>>>> N characters and indicate continuation (e.g. using an ellipsis
>>>>> `...`), or use the complete doc comment, whichever is shorter.
>>>>>
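The "first N characters or the whole comment, whichever is shorter" idea
can be sketched as below. This is a minimal illustration that treats the
comment as plain text; the method name `summary` and the parameter
`maxChars` are hypothetical, and cutting inside markup or in the middle of
a word is deliberately not addressed.

```java
class TruncatedSummarySketch {

    /**
     * Returns the whole comment if it fits within maxChars, otherwise
     * its first maxChars characters followed by "..." as a continuation
     * marker.
     */
    static String summary(String comment, int maxChars) {
        if (maxChars < 0) {
            throw new IllegalArgumentException("maxChars must be >= 0");
        }
        String text = comment.trim();
        if (text.length() <= maxChars) {
            return text;
        }
        return text.substring(0, maxChars) + "...";
    }

    public static void main(String[] args) {
        String comment = "Indicates a preference for proportional over "
                + "non-proportional (e.g. dual-spaced CJK fonts) fonts.";
        System.out.println(summary(comment, 40));
    }
}
```

No sentence detection is involved, so abbreviations cannot confuse it; the
price is that a summary may stop mid-thought.
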
>>>>>
>>>>>
>>>>>
>>>>> To sum up, extracting sentences from a text written in a natural 
>>>>> language is anything but trivial and might require human 
>>>>> judgement. When done programmatically, occasional mistakes are 
>>>>> inevitable. Doc comments are barely text. While they have some
>>>>> structure, they also use formatting, code, and markup. Hence,
>>>>> without pre-processing, text tools might not be applicable. Though
>>>>> JavaDoc could improve its algorithms and doc comments could be 
>>>>> more friendly, what we have today works surprisingly well on the 
>>>>> OpenJDK codebase. If this is not enough, we could find another way 
>>>>> of extracting a summary or eliminate the need for it completely. 
>>>>> That is, change the presentation in such a way that it won't 
>>>>> require summaries.
>>>>>
>>>>> -Pavel
>>>>>
>>>>> [^0]: 
>>>>> https://www.oracle.com/technical-resources/articles/java/javadoc-tool.html#format
>>>>> [^1]: 
>>>>> https://docs.oracle.com/en/java/javase/14/docs/api/java.desktop/java/awt/GraphicsEnvironment.html#preferProportionalFonts()
>>>>> [^2]: https://en.wikipedia.org/wiki/Sentence_boundary_disambiguation
>>>>> [^3]: 
>>>>> https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/text/BreakIterator.html
>>>>> [^4]: 
>>>>> https://docs.oracle.com/en/java/javase/14/docs/api/java.desktop/java/awt/FocusTraversalPolicy.html#getComponentAfter(java.awt.Container,java.awt.Component)
>>>>> [^5]: 
>>>>> https://docs.oracle.com/en/java/javase/14/docs/api/java.xml/org/w3c/dom/DOMLocator.html
>>>>> [^6]: 
>>>>> https://www.oracle.com/technical-resources/articles/java/javadoc-tool.html#styleguide
>>>>>
>
>