Musings on 8232447: The javadoc parser ends the first sentence of a comment too soon

Wed May 13 17:38:32 UTC 2020

Roger,

Generally, I agree with you, and I completely agree with you if we
regard your comments as guidelines for Java SE doc comments.  But we
also have to support folk with maybe different standards and, to a
lesser extent, existing comments.

-- Jon

On 5/13/20 10:30 AM, Roger Riggs wrote:
> Hi,
>
> The first sentence is not just any old sentence.
> It has a very specific role to play in the javadoc: both to introduce
> the class, method, field, etc.
> AND to stand independently when used in a summary.
> That places a responsibility on the author to craft the sentence for 
> those purposes.
> The author should review their work in the generated javadoc, the 
> summary tables, etc.
> before feeling satisfied and moving on.
>
> IMHO the first sentence should be short and to the point and not
> include markup or
> extra explanatory phrases (such as e.g.).
> I don't think the tools should try to be as understanding as
> the reader or to compensate for the shortcomings of the author.
>
> $.02, Roger
>
>
> On 5/13/20 12:20 PM, Jonathan Gibbons wrote:
>> Pavel,
>>
>> Good write up.   You should link to this from 8232447.
>>
>> -- Jon
>>
>> On 5/13/20 7:44 AM, Pavel Rappo wrote:
>>> The issue:
>>>
>>>      https://bugs.openjdk.java.net/browse/JDK-8232447
>>>
>>> The more I think about this issue, the less I feel like solving it. 
>>> On the one hand, that problem is more complicated than it looks. On 
>>> the other hand, solving that problem doesn’t seem to be that 
>>> important, since it’s about making our best effort to improve 
>>> presentation. I'm leaning towards a solution that is good enough 
>>> (possibly, the one that we already have) or reconsidering the 
>>> problem altogether.
>>>
>>> Here's what the problem is about. JavaDoc extracts summaries from 
>>> doc comments to place them on documentation pages to assist quick 
>>> scans by humans (think Table of Contents with descriptive headings). 
>>> Since JavaDoc does not understand the meaning of doc comments, to 
>>> extract a summary it relies on a convention [^0] that the first 
>>> sentence of a doc comment is that doc comment's summary. The problem 
>>> is that sometimes JavaDoc gets that first sentence wrong. For 
>>> example, according to JavaDoc, the first sentence of this doc 
>>> comment for `GraphicsEnvironment.preferProportionalFonts` [^1]
>>>
>>>> Indicates a preference for proportional over non-proportional (e.g. 
>>>> dual-spaced CJK fonts) fonts in the mapping of logical fonts to 
>>>> physical fonts. If the default mapping contains fonts for which 
>>>> proportional and non-proportional variants exist, then calling this 
>>>> method indicates the mapping should use a proportional variant.
>>>
>>> is
>>>
>>>> Indicates a preference for proportional over non-proportional (e.g.
>>>
>>> Now, why does this happen? Unless a more sophisticated mechanism is 
>>> requested or the locale's language is not English, JavaDoc uses a 
>>> simple "dot-space" algorithm to detect a sentence boundary. That 
>>> algorithm scans input from left to right looking for the dot 
>>> character followed by a whitespace. While it looks reasonable, in 
>>> the above case it is clearly inadequate.
>>>
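
For illustration, the "dot-space" rule described above can be sketched in a few lines of Java. This is only a minimal sketch of the rule as the message describes it, not JavaDoc's actual code; the class and method names are hypothetical.

```java
// Hypothetical sketch of the "dot-space" rule; not JavaDoc's actual implementation.
final class DotSpaceSketch {

    /** Returns the text up to and including the first '.' that is followed by whitespace. */
    static String firstSentence(String text) {
        for (int i = 0; i < text.length() - 1; i++) {
            if (text.charAt(i) == '.' && Character.isWhitespace(text.charAt(i + 1))) {
                return text.substring(0, i + 1);
            }
        }
        return text; // no boundary found: treat the whole text as one sentence
    }

    public static void main(String[] args) {
        String doc = "Indicates a preference for proportional over non-proportional (e.g. "
                + "dual-spaced CJK fonts) fonts in the mapping of logical fonts to "
                + "physical fonts. If the default mapping contains fonts for which "
                + "proportional and non-proportional variants exist, then calling this "
                + "method indicates the mapping should use a proportional variant.";
        System.out.println(firstSentence(doc));
        // prints: Indicates a preference for proportional over non-proportional (e.g.
    }
}
```

The second dot of "e.g." is followed by a space, so the scan stops there, which reproduces the truncated summary quoted above.
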
>>> At this point, the reader might say: "Pfft. I know how to fix this." 
>>> Please bear with me and I'll show you that the problem is actually 
>>> multilayered. It involves not only a sentence segmentation 
>>> algorithm [^2], but also the input that the algorithm is fed, as 
>>> well as the structure and quality of the doc comments that input 
>>> is created from.
>>>
>>> Instead of jumping head-first into augmenting the "dot-space" 
>>> algorithm with more heuristics, let's try one more thing. If 
>>> instructed to do so or the locale's language is not English, JavaDoc 
>>> uses `BreakIterator` [^3]. That `java.text` mechanism is 
>>> specifically designed to find various boundaries in text. When 
>>> `BreakIterator` is turned on (and after additional tweaking), 
>>> JavaDoc gets that first sentence about "proportional fonts" right; 
>>> however, other issues show up. Consider the following comment for 
>>> `FocusTraversalPolicy.getComponentAfter` [^4]:
>>>
>>>> Returns the Component that should receive the focus after 
>>>> aComponent. aContainer must be a focus cycle root of aComponent or 
>>>> a focus traversal policy provider.
>>>
>>> Here `BreakIterator` thinks that the whole paragraph is a single 
>>> sentence. This is because it expects an English sentence to begin 
>>> with a capital letter, and the second sentence begins with the 
>>> lowercase identifier "aContainer". I should pause here. This is an 
>>> important moment. While some doc comments may indeed have typos, 
>>> irregularities, or quality issues, that doc comment about 
>>> "aComponent" has none of those. It's genuine and consists of a 
>>> couple of sentences that are easily recognizable by humans but 
>>> that do not strictly abide by the rules of English grammar. To me, 
>>> this (and other experiments with 
>>> `BreakIterator` I've done) shows that doc comments are not your 
>>> regular prose. Unsurprisingly, even a specialized text tool doesn't 
>>> grok it. (Which makes me wonder if that was one of the reasons why 
>>> `BreakIterator` is turned off by default.) Add indentation and 
>>> markup on top of that and you'll see why the ultimate form that 
>>> JavaDoc has to work with is not a string but something like this:
>>>
>>>      list size = 10
>>>       0 = {DCTree$DCStartElement} "<code>"
>>>       1 = {DCTree$DCText} "DOMLocator"
>>>       2 = {DCTree$DCEndElement} "</code>"
>>>       3 = {DCTree$DCText} " is an interface that describes a location (e.g.\n where an error occurred).\n "
>>>       4 = {DCTree$DCStartElement} "<p>"
>>>       5 = {DCTree$DCText} "See also the "
>>>       6 = {DCTree$DCStartElement} "<a href='http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407'>"
>>>       7 = {DCTree$DCText} "Document Object Model (DOM) Level 3 Core Specification"
>>>       8 = {DCTree$DCEndElement} "</a>"
>>>       9 = {DCTree$DCText} "."
>>>
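
The `BreakIterator` experiment described above can be tried outside JavaDoc with a small program like the following. It is a sketch that uses only the public `java.text.BreakIterator` API; the class and method names are hypothetical, and the "additional tweaking" mentioned in the message is not reproduced here, so results may differ from what JavaDoc itself produces.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical sketch: asks java.text.BreakIterator for the first sentence of a string.
final class BreakIteratorSketch {

    static String firstSentence(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        int start = it.first();
        int end = it.next();
        // DONE means there is no boundary after the start, e.g. for an empty string.
        return end == BreakIterator.DONE ? text : text.substring(start, end).trim();
    }

    public static void main(String[] args) {
        String fonts = "Indicates a preference for proportional over non-proportional (e.g. "
                + "dual-spaced CJK fonts) fonts in the mapping of logical fonts to "
                + "physical fonts. If the default mapping contains fonts for which "
                + "proportional and non-proportional variants exist, then calling this "
                + "method indicates the mapping should use a proportional variant.";
        String focus = "Returns the Component that should receive the focus after "
                + "aComponent. aContainer must be a focus cycle root of aComponent or "
                + "a focus traversal policy provider.";
        // Print what the iterator considers the first sentence of each comment.
        System.out.println(firstSentence(fonts, Locale.ENGLISH));
        System.out.println(firstSentence(focus, Locale.ENGLISH));
    }
}
```

No expected output is shown on purpose: the message reports that the second comment comes back as a single sentence, and running the program is the way to check that on a given JDK.
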
>>> Continuous text we see on a documentation page [^5] in a browser 
>>> comes from a representation such as the above, where the text can be 
>>> scattered across various AST nodes. This has interesting 
>>> implications. Consider the following doc comment (note the 
>>> whitespace after `comment.`):
>>>
>>>      /** This is the first sentence of this <i>comment. </i> This is the second sentence. */
>>>
>>> Both the simple "dot-space" algorithm and `BreakIterator` fail to 
>>> extract the first sentence here, producing the exact same result 
>>> consisting of both sentences. When `.` is moved immediately after 
>>> the closing `</i>`, they both extract the first sentence correctly. 
>>> However, the HTML output breaks (note the absence of closing `</i>`):
>>>
>>>      <div class="block">This is the first sentence of this <i>comment.</div>
>>>
>>> This is partly because JavaDoc does not interpret HTML. Instead, it 
>>> uses a hybrid approach that applies a sentence segmentation 
>>> algorithm as an auxiliary step to individual text nodes (not 
>>> necessarily the whole text) while maintaining awareness of the 
>>> surrounding nodes. The fact that nodes preserve indentation and 
>>> formatting of the original doc comment makes things worse, as 
>>> whitespace is significant in sentence segmentation. No wonder 
>>> JavaDoc hardly sees the forest for the syntax trees! Perhaps, a more 
>>> careful way of doing that would be as follows:
>>>
>>>    1. Interpret markup as text.
>>>    2. Apply sentence segmentation to that text to find the first 
>>> sentence.
>>>    3. Map that first sentence back to markup to accurately extract 
>>> the corresponding portion.
>>>
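
The three steps above could be sketched roughly as follows. Everything here is an assumption made for illustration: `Node` is a simplified stand-in for the doc-comment tree nodes (it is not the real `DCTree` API), the segmentation step reuses the plain "dot-space" rule, and only simple inline elements are handled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch; Node is a stand-in, not the real DCTree API.
final class MarkupSummarySketch {

    enum Kind { TEXT, START, END }

    /** A simplified doc-comment node: plain text, or a start/end tag name. */
    static final class Node {
        final Kind kind;
        final String value;
        Node(Kind kind, String value) { this.kind = kind; this.value = value; }
    }

    /** Step 2 stand-in: offset just past the first '.' followed by whitespace. */
    static int firstSentenceEnd(String text) {
        for (int i = 0; i < text.length() - 1; i++) {
            if (text.charAt(i) == '.' && Character.isWhitespace(text.charAt(i + 1))) {
                return i + 1;
            }
        }
        return text.length();
    }

    static List<Node> firstSentence(List<Node> nodes) {
        // Step 1: interpret markup as text by concatenating the text nodes only.
        StringBuilder plain = new StringBuilder();
        for (Node n : nodes) {
            if (n.kind == Kind.TEXT) plain.append(n.value);
        }
        // Step 2: find the sentence boundary in the plain text.
        int remaining = firstSentenceEnd(plain.toString());
        // Step 3: map the boundary back, copying nodes until the text budget is used up.
        List<Node> out = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>();
        for (Node n : nodes) {
            if (remaining == 0) break;
            if (n.kind == Kind.TEXT) {
                int take = Math.min(remaining, n.value.length());
                out.add(new Node(Kind.TEXT, n.value.substring(0, take)));
                remaining -= take;
            } else {
                if (n.kind == Kind.START) open.push(n.value);
                else if (!open.isEmpty()) open.pop();
                out.add(n);
            }
        }
        // Close any elements still open so the extracted markup stays balanced.
        while (!open.isEmpty()) out.add(new Node(Kind.END, open.pop()));
        return out;
    }

    static String render(List<Node> nodes) {
        StringBuilder sb = new StringBuilder();
        for (Node n : nodes) {
            switch (n.kind) {
                case TEXT:  sb.append(n.value); break;
                case START: sb.append('<').append(n.value).append('>'); break;
                case END:   sb.append("</").append(n.value).append('>'); break;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Node> comment = Arrays.asList(
                new Node(Kind.TEXT, "This is the first sentence of this "),
                new Node(Kind.START, "i"),
                new Node(Kind.TEXT, "comment. "),
                new Node(Kind.END, "i"),
                new Node(Kind.TEXT, " This is the second sentence."));
        System.out.println(render(firstSentence(comment)));
        // prints: This is the first sentence of this <i>comment.</i>
    }
}
```

Because the boundary is found in the flattened text and the open elements are closed afterwards, the `<i>` example from earlier comes out balanced. The sketch makes no attempt to decide whether an element can be meaningfully cut out of its context.
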
>>> But even that won't magically solve all the issues, as it's not 
>>> possible to decompose arbitrary markup into independent 
>>> components. Consider the following doc comment:
>>>
>>>      /**
>>>       * <table class="comment">
>>>       *     <tr>
>>>       *        <td><i>Is this the first sentence?</i></td>
>>>       *        <td>Is this the second sentence?</td>
>>>       *     </tr>
>>>       *     <tr>...</tr>
>>>       *  </table>
>>>       ...
>>>
>>> Even if we find that "first sentence", can we safely extract it from 
>>> its table-context? And all this is just the structure layer of the 
>>> problem.
>>>
>>> The next layer is ambiguities. Unless extreme measures are taken, 
>>> those are only resolvable by a human, sometimes by an expert in the area 
>>> the documentation relates to. Using abbreviations such as "etc.", 
>>> "e.g.", "i.e.", and "vs." is part of the issue. Early guides [^6] on 
>>> JavaDoc advised against using abbreviations. While I can see now one 
>>> of the reasons for this advice, people use them anyway. Some might 
>>> say that abbreviations can be more succinct and practical. For 
>>> instance, "etc." is shorter than "and so on", "and so forth", or 
>>> "and so on and so forth", and even pronounced literally as "et 
>>> cetera" in speech. Non-standard grammar in abbreviations aggravates 
>>> the issue. For instance, is "ie" a misspelt "i.e.", an initialism of 
>>> Internet Explorer, or a top-level domain name of The Republic of 
>>> Ireland? Or is "etc" a misspelt "etc." or rather that `/etc` 
>>> directory from the UNIX Filesystem Hierarchy Standard? (When 
>>> scanning OpenJDK repo for occurrences of "etc." in comments, I found 
>>> that it can be written with the number of dots anywhere from 0 to 4. 
>>> The latter could be explained as ellipsis `...` followed by a dot 
>>> `.`, faulty keyboard, or perhaps a muscle twitch.)
>>>
>>> The final layer is typos and low-quality comments. What proportion 
>>> of doc comments follow that convention about the first sentence? What 
>>> proportion of comments respect grammar or have a meaningful 
>>> structure? While we shouldn't aim for a solution that rights the 
>>> wrongs of bad comments (i.e. Garbage In, Garbage Out), this is 
>>> something to keep in mind:
>>>
>>>      /**
>>>       * this function draws the border around each tab
>>>       * note that this function does now draw the background of the tab.
>>>       * that is done elsewhere
>>>       ...
>>>       */
>>>       protected void paintTabBorder(Graphics g, int tabPlacement, ...
>>>
>>> There are things we can do to remediate that problem on the doc 
>>> comments side of the equation: reasonable conventions that are 
>>> adhered to, better structure of doc comments, or hints. For example, 
>>> placing a newline or more than a single whitespace after the first 
>>> sentence. Or indicating the summary part of a doc comment with a 
>>> relatively new `{@summary}` tag. That said, all of those might have 
>>> problems of their own. They are intrusive and require re-documenting 
>>> the existing code, which is not always possible. In addition to 
>>> that, `{@summary}` cannot contain nested markup, which is quite 
>>> often used in the summary part. For example:
>>>
>>>      /**
>>>       * Returns the runtime class of this {@code Object}. The returned
>>>       * {@code Class} object is the object that is locked by {@code
>>>       * static synchronized} methods of the represented class.
>>>       ...
>>>       */
>>>       public final native Class<?> getClass();
>>>
>>> or
>>>
>>>      /**
>>>       * An ordered collection (also known as a <i>sequence</i>).
>>>       ...
>>>       */
>>>      public interface List<E> extends Collection<E> { ...
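
As an illustration of the `{@summary}` hint mentioned above, here is how the "proportional fonts" comment might be marked up so that the summary no longer depends on sentence detection. This is only a sketch: the enclosing class is hypothetical, the method body is a placeholder, and the summary text is kept free of nested markup.

```java
// Hypothetical class used only to host the doc comment; not the real GraphicsEnvironment.
final class SummaryTagSketch {

    /**
     * {@summary Indicates a preference for proportional over non-proportional
     * (e.g. dual-spaced CJK fonts) fonts in the mapping of logical fonts to
     * physical fonts.} If the default mapping contains fonts for which
     * proportional and non-proportional variants exist, then calling this
     * method indicates the mapping should use a proportional variant.
     */
    static void preferProportionalFonts() {
        // Placeholder body; only the doc comment matters for this example.
    }
}
```
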
>>>
>>> Whatever solution we choose, there's a risk of playing a 
>>> whac-a-mole game. Maybe we should aim for a solution that is 
>>> good-enough (possibly, the one that we already have) or reconsider 
>>> the problem altogether. For instance, do not extract the first 
>>> sentence (unless it can be done reliably). Instead, get the first N 
>>> characters and indicate continuation (e.g. using ellipsis `...`), or 
>>> use the complete doc-comment, whichever is shorter.
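
The "first N characters" alternative suggested above is simple enough to sketch directly. The names and the choice of N are hypothetical, and the sketch cuts at a raw character offset, so it may split a word or land inside markup.

```java
// Hypothetical sketch of the "first N characters" fallback; names are illustrative.
final class TruncatedSummarySketch {

    /** Returns the whole comment if it fits in maxChars, else a prefix plus "...". */
    static String summarize(String docComment, int maxChars) {
        String text = docComment.trim();
        if (text.length() <= maxChars) {
            return text; // the complete doc comment is the shorter option
        }
        return text.substring(0, maxChars) + "...";
    }

    public static void main(String[] args) {
        System.out.println(summarize("An ordered collection (also known as a sequence).", 20));
        // prints: An ordered collectio...
    }
}
```
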
>>>
>>> To sum up, extracting sentences from a text written in a natural 
>>> language is anything but trivial and might require human judgement. 
>>> When done programmatically, occasional mistakes are inevitable. Doc 
>>> comments are barely text. While they have some structure, they also 
>>> use formatting, code, and markup. Hence, without pre-processing, text 
>>> tools might not be applicable. Though JavaDoc could improve its 
>>> algorithms and doc comments could be more friendly, what we have 
>>> today works surprisingly well on the OpenJDK codebase. If this is 
>>> not enough, we could find another way of extracting a summary or 
>>> eliminate the need for it completely. That is, change the 
>>> presentation in such a way that it won't require summaries.
>>>
>>> -Pavel
>>>
>>> [^0]: 
>>> https://www.oracle.com/technical-resources/articles/java/javadoc-tool.html#format
>>> [^1]: 
>>> https://docs.oracle.com/en/java/javase/14/docs/api/java.desktop/java/awt/GraphicsEnvironment.html#preferProportionalFonts()
>>> [^2]: https://en.wikipedia.org/wiki/Sentence_boundary_disambiguation
>>> [^3]: 
>>> https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/text/BreakIterator.html
>>> [^4]: 
>>> https://docs.oracle.com/en/java/javase/14/docs/api/java.desktop/java/awt/FocusTraversalPolicy.html#getComponentAfter(java.awt.Container,java.awt.Component)
>>> [^5]: 
>>> https://docs.oracle.com/en/java/javase/14/docs/api/java.xml/org/w3c/dom/DOMLocator.html
>>> [^6]: 
>>> https://www.oracle.com/technical-resources/articles/java/javadoc-tool.html#styleguide
>>>
>>
>