[External] : Re: Snippet specification feedback

Fri Mar 31 10:41:02 UTC 2023

> On 30 Mar 2023, at 12:40, Tagir Valeev <amaembo at gmail.com> wrote:
> 
> Hello, Pavel!
> 
> Thank you for the detailed answer.
> 
> > 1. Attribute value syntax. The spec says "An attribute value may be an identifier, unsigned integer, or enclosed in either single or double quote characters; no escape characters are supported". I assume that {@snippet class=pkg.Class} is a malformed tag, as pkg.Class is not an identifier and not quoted. Nevertheless, it's parsed by the javadoc tool and displayed, as if it were {@snippet class="pkg.Class"}. Similarly {@snippet file=pkg/Class.java} also works, while according to the spec it should not. Is it an implementation problem (implementation is more permissive than required by spec) or a spec problem? If such tag should be accepted, could you specify which non-identifier symbols exactly are allowed in unquoted values?
> 
> That seems like a pure specification issue. I think the Standard Doclet Specification (spec) switches between Java, HTML and maybe some other type of identifiers too freely, without proper indication. You can get a hint of that when the snippet section suddenly starts talking about _simple_ identifiers: as far as I know, Java does not categorise identifiers. _Names_ can be qualified and simple, identifiers cannot [^1][^2].
> 
> So which behavior is correct? Should we assume that the qualified name like pkg.Class should be accepted without quotes? How should the specification be read here instead?

The implementation behaviour is correct: / and . don't need quotes. The spec will need to catch up on that. Until it does, the rule of thumb is this: if syntax is unambiguous, it is legal.

- Examples of legal syntax

  - Neither / nor . are special symbols in this context:

    name=/
    name=.

  - {@snippet} does not have escapes:

    name=\n

  - {@snippet} does not support HTML:

    name=&
    name=<a> (currently not implemented, as unquoted < or > terminates the value, see below)

  - The ' or " quote is enclosed in a pair of alternative quotes:

    name='"'
    name="'"

- Examples of illegal syntax

  - An unpaired quote

    name="
    name='

  - A breaking whitespace character

    name=a b.txt

  - Dangling characters

    name="".txt

      Note that the lack of whitespace after the second " is not a problem. For example, while unclear, it is still legal:

        name=""lang="java"

      Or, if we had a boolean or enumerated attribute with a default value, this would be legal too:

        name=""enabled

      name="".txt is illegal because an "identifier" cannot start with .

- Examples of illegal or ambiguous syntax, which might be revisited later

  - Neither quote starts a quoted value

    name=a"b
    name=a'

  - Semicolon :

    name=:

  - Curlies whose balance is not }

    name={{}{}{}}
    name={}{

    Currently, both of the above curly examples behave similarly in a snippet _body_ and in other Standard Doclet tags, but not in unquoted snippet attributes. My recollection and peering into implementation suggest that many of the non-letter symbols are treated as unquoted value terminators:

        * because snippet attributes were influenced by familiar HTML attributes
        * to be consistent with other Standard Doclet tags

    Here's the relevant implementation bit:

        protected boolean isUnquotedAttrValueTerminator(char ch) {
            switch (ch) {
                case '\f': case '\n': case '\r': case '\t':
                case ' ':
                case '"': case '\'': case '`':
                case '=': case '<': case '>':
                    return true;
                default:
                    return false;
            }
        }

    In particular, unquoted semicolon terminates the value to avoid ambiguity with a start of the inline/hybrid snippet body.

    Note that *markup* attributes have slightly different rules:

	// Similar to https://html.spec.whatwg.org/multipage/syntax.html#unquoted
        protected boolean isUnquotedAttrValueTerminator(char ch) {
            switch (ch) {
                case ':': // indicates that the instruction relates to the next line
                case ' ': case '\t':
                case '"': case '\'': case '`':
                case '=': case '<': case '>':
                    return true;
            default:
                return false;
            }
        }

@snippet is currently the only Standard Doclet tag that has attributes of this form, which we should clearly specify.

>  > 2. "region" attribute. It's not explicitly specified that such an attribute exists. One can only guess from existing samples and javadoc tool implementation. Only "id", "lang", "class", and "file" attributes are mentioned in the specification. It would be nice to specify the "region" attribute as well.
> 
> Not sure what you mean here. There's (i) a subsection on regions and also (ii) individual snippet tag subsections that mention this attribute.
> 
> Subsection on regions speaks about how to declare a region inside the snippet. It doesn't say a word about how exactly to select the region to be rendered. Individual snippet tag subsections don't say either. I checked all the 34 occurrences of the 'region' word inside the document but haven't found anything about the "region" attribute on the @snippet tag. Note that the argument of the markup tag (such as @start) is not the same thing as the attribute of the @snippet tag, as @snippet tag is not a markup tag.

Okay, I see what you mean. You mean the region attribute of the {@snippet} tag, which I agree must be specified.

> The closest thing is see is:
> 
> > The file for an external snippet may contain multiple regions, to be referenced in different snippet tags, appearing in different parts of the documentation.
> 
> But it's never specified how exactly should I reference to multiple regions. It's never said that the syntax is {@snippet region=<region_name>}. Please correct me if I'm wrong.

Each @snippet tag may reference at most one region. This should be specified too.

> > 4. Markup tags placement. It's specified: "They are placed in // comments (or the equivalent in other languages or formats)". To me, it's quite a vague statement. Apparently, parser cannot understand every single existing file format in the world, so it may have no idea how comments are represented in the target file format. Moreover, target file format may have no formal specification. For example, if we have external snippet with .txt extension, which kind of comment prefix should we use to define regions? I tried
> > 
> > outside
> > # @start region="hello"
> > mytest
> > # @end
> > 
> > The javadoc tool fails to find the region in this case. But I can argue that # represents a comment line in my text files. I strongly feel that this part should be specified more precisely: either list all possible preceding symbols, or provide another exact description about which preceding characters are recognized as comment start. Should the parser behavior actually depend on the language (specified by 'lang' attribute or file extension)?
> 
> If I recall correctly, initially, we didn't want to allow authors to choose the EOL-comment marker. Instead, the marker was and is inferred from the type (the lang attribute) of the snippet. I cannot remember the rationale behind it. It might be because we felt it was "too much too soon", or it might have had something to do with the fact that EOL comments aren't simple: for example, in .properties, # or ! means a comment line [^3], not an end-of-line comment.
> 
> Naturally, snippets whose lang attribute has value "java" or "properties" assume such markers. Inline snippet whose lang attribute is unspecified uses //.
> 
> By the way, it's not specified that "properties" is a valid language name which should be recognized (or is it actually at the implementation discretion?). Which other languages should be recognized by default?

The spec hints at that, but does not expand: "Code fragments are typically Java source code, but they may also be fragments of properties files, source code in other languages, or plain text." This should be fixed.

> It's interesting that if I create the following properties file, then the region is properly recognized by javadoc tool:
> 
> a=b\
> # @start region=x
> c=d
> # @end
> 
> But in fact it's not a comment but part of a multiline string literal.

Right. This fact was noted early in the development. It's not just properties, Java has it too:

    String s = """
        // @highlight
    """;

While ambiguous markup is also possible with a string literal, it will likely be reported as broken (one cannot escape from trailing \n or ";) .

> Which is completely expected, as we cannot assume that the javadoc tool will actually lex every snippet file. But it also illustrates that the understanding of 'comment' by the javadoc tool differs from the actual understanding in the target language.

Right.

> And it really raises the question about special support of some files (# is recognized in *.properties but not in *.txt) which is an unspecified implementation detail.   Eventually, someone will need to parse an external or hybrid snippet that does not use any of those. We should carefuly think about it; I'm not ready to propose anything at this time.
> 
> Yes, exactly. I would just specify a possible list of recognized prefixes, like #, !, //, maybe 'rem' or something else. And the same prefixes should work in any file, regardless of the extension or specified language. This will be simple. If you want to specify the markup inside something which is not a valid comment in the target language (e.g., using # in Java), it's up to you.

Simple? Yes. Won't have unexpected or adverse effects? I'm not so sure.

> > 5. Markup tag arguments format. It's not specified completely. There is a sample `@start region=name` which implies that "name=value" format is used for arguments, but it's completely unclear which characters are allowed, which are not, whether the quotation is supported, are there any escape characters, etc. This is especially important, as arguments may contain regular expressions which are known to contain non-trivial characters. One may guess that markup tag arguments are formatted exactly like snippet tag attributes, but it would be nice to specify this explicitly.

This question has a lot in common with the question number 1, which I have already answered.

> Generally, what the snippet parser wants to avoid is ambiguities related to these symbols: ", ', }. Aside from those and the unicode escapes [^4], there are no escapes in snippets and only one special character combination to avoid in inline snippets, */, which wouldn't be an issue if doc comments were hosted in // instead of /* ... */ comments; but that's a discussion for another day.
> 
> This doesn't clarify much, even adds confusion. You've mentioned unicode escapes. I'm ok with them in the inline snippets but are they supported in external snippets (either Java, or non-Java)?

No, both the */ character sequence and unicode escapes are something imposed on an inline snippet only. Neither is an issue for external snippets.

> Can I use @start region="multi word region name"?

Yes, quotes strips whitespace off their token-separating power.

> Nothing is said in the spec, but it looks like this works. What if I want to use a double quotation character inside the multi-word value? Is it possible?

Use alternative quotes: quote ' with " and vice versa. Sorry, you cannot have both (remember "no escapes"). We'll address this question and the question number 1 in the spec.

>  > 6. Whitespace rendering. While it's said that "Markup comments do not appear in the generated output", the spec does not say anything about preceding whitespace. E.g., consider the following snippet:
> > 
> > /**
> >  * {@snippet lang=Java :
> >  * System.out.println(1);
> >  * // @replace substring=2 replacement=3:
> >  * System.out.println(2);
> >  * }
> >  */
> > 
> > We exclude the // comment from the rendering. However, there are four spaces before it. Should they be rendered? The javadoc tool does not render them. It would be nice to specify this behavior.
> 
> It spawned (an internal?) discussion just before the feature was integrated. Early experiments suggested that authors do expect standalone markup to disappear without a trace. So not only should the markup comment and any preceding whitespace go away, but the freed-up empty line should too.
> 
> In fact, the current implementation looks somewhat inconsistent, but it's unclear whether it's correct or not, as we have no specification. For example, consider the external snippet:
> 
> class Test {
> // @start region=r0
> // @start region=r1
> void one() {}
> // @end
> void two() {}
> // @end
> void three() {}
> }
> 
> I want to render the region r0. I write {@snippet file="Test.java" region=r0} and get in rendered HTML
> void one() {}
> void two() {}
> No empty line between one and two. Ok, it was stripped. But if I use a hybrid snippet and want to remove markup from the inline version, the javadoc tool requires me to write this:
> 
> /**
> * {@snippet file="Test.java" region=r0:
> *
> *
> * void one() {}
> *
> * void two() {}
> * }
> */
> 
> So, in this case empty lines are not stripped (including even the line that starts the r0 region). Moreover, now the rendering is spoiled, as it's rendered with empty lines as well. It looks like, for rendering an inline version is preferred over the external version. This is not specified, and these versions may produce different result.

Please file a bug for this.

> 
> > 7. Common indentation. It looks like the common indentation is stripped from the rendered snippet, similarly to text blocks. Is it an implementation detail or should be specified?
> 
> This is intentional and was spelled in JEP, but somehow went missing from the spec. This behavior is to facilitate pasting snippet content into a documentation comment without the need to reindent that content afterwards, which might be painful in some code editors.
> 
> Well, I was not entirely correct. This is spelled in the 'Inline snippets' subsection:
> 
> Surrounding whitespace is stripped from the content using String::stripIndent.
> 
> However, the same rule is applied to the external snippets. Moreover, stripping de-facto applied after the region is selected. stripIndent input is the contents of the selected region, not the contents of the whole snippet, both for external and inline snippets. So there's definitely a room to improve the specification. It's also not specified whether the stripping is applied before or after replacements. Looks like before. Example:
> 
> /**
> * {@snippet :
> * void test() { // @replace substring="void" replacement=" "
> * System.out.println("Hello");
> * }
> * }
> */
> 
> The rendered version is indented by two spaces. So one cannot say that stripping is applied before any markup processing or after any markup processing. De-facto it's applied in-between, after extracting the region but before applying everything else. This is not specified.

This needs to be specified better.

> There are more tricky questions like in which order several replacements or highlightings are applied, but probably it's ok to keep this part unspecified...

Generally, snippet markup was supposed to be used for simple and straightforward cases. Complex markup, such as overlapping regions with replacements, quickly becomes hard to predict and diagnose. Additionally, we have not rigorously checked the markup model: is it sound or can it be contradictory in some cases? So there's also that.

It's good that you are looking into this, because it pushes us to improve the spec. That said, even javadoc is qui We might want an API, such as the one described in this comment https://bugs.openjdk.org/browse/JDK-8304408?focusedCommentId=14570485&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14570485

It would be good if IDEs, REPLs or editors didn't have to maintain their own implementations of the spec and instead focused on a much easier task: having an up-to-date version of a library that provides jdk.javadoc services, such as (i) "render documentation for this symbol" or (ii) "lint/parse this doc comment". Think Language Server Protocol, but simpler and specifically for javadoc. I know that such a library will not eliminate the need for good spec (if nothing else, authors need spec to build their mental model of what's going on), but at least it will provide a great deal of uniformity and reduce maintenance effort.

I've filed a bug on spec issues you've discovered: https://bugs.openjdk.org/browse/JDK-8305353

-Pavel