Raw string literals -- restarting the discussion

Wed Jan 16 11:25:24 UTC 2019

Sorry for being slow to follow up on this; your reply didn't make it to my
inbox. I may have neglected to subscribe to amber-dev at the time...

>> From: elias vasylenko <eliasvasylenko at gmail.com>
>> Subject: Re: Raw string literals -- restarting the discussion
>> Date: January 7, 2019 at 8:10:57 AM EST
>> To: amber-spec-comments at openjdk.java.net
>>
>>> At first blush, the simplicity of the Rust approach is attractive; just
>>> let strings span multiple lines, with no new syntax.  The obvious
>>> counter-arguments are pretty weak in the current age; if you code in an IDE,
>>> as most developers do, it is not easy to accidentally leave off a closing
>>> quote, and the syntax highlighting will make this obvious in the event we
>>> do so anyway.  But, if we look through the lens of our use cases -- such as
>>> JSON snippets -- we see that this approach fails almost completely, because
>>> you _still_ have to escape the quotes, and almost all multi-line snippets
>>> will have quotes.  So, let's cross this off too.  The same applies to using
>>> a letter prefix for multi-line strings; it doesn't address the primary use
>>> case.
>>
>> I'm a little confused about the argument to cross this off. Is this not
>> dismissing a solution to the multi-line string problem on the basis that it
>> doesn't also solve the raw string problem? Within the exploration of raw
>> strings and multi-line strings as separate features I think this reasoning
>> bears a little extra scrutiny.
>
> I think you’re confusing the Rust multi-line syntax with the Rust raw syntax.  What was crossed off here is the choice to simply let single-quoted string literals span multiple lines.

Not at all, I was indeed talking about letting singly-quoted literals
span multiple lines there. My email did meander around a little
between topics and I did *also* mention the Rust rawness syntax in
other places. I'll try to stay more focused this time.

*Regarding single-quote multi-line strings.*

> “Raw” is not a very well defined term, and the degree of “raw-ness” in so-called raw strings varies dramatically across languages.  So instead, we tried to frame this in terms of use cases.  The most important use case here is: embedded snippets of HTML, JSON, SQL, or XML.  And the “just let single-quote strings cross lines” approach fails dramatically here, because these are all expected to have many embedded double-quote characters.

But allowing single-quote strings to span multiple lines doesn't
preclude those use-cases from being addressed; it just suggests that
they should be addressed elsewhere.

There will be single-line strings containing lots of quotes where the
user doesn't want to escape them all. And there will be multi-line
strings with no quotes where the user wouldn't need any special new
syntax.
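
To make the two cases concrete, here is a minimal sketch in current
Java. The class name, field names and string contents are made up for
illustration. The first literal is a single line with many quotes, so
every quote needs escaping. The second spans several lines but has no
quotes, so the only cost is spelling out the line breaks.

```java
final class StatusQuoLiterals {

    // One line, many quotes: each embedded quote needs a backslash.
    static final String ATTR = "<a href=\"index.html\" class=\"nav\">home</a>";

    // Several lines, no quotes: nothing to escape, but every line break
    // has to be written as an escape and the pieces concatenated.
    static final String SQL =
        "SELECT id, name\n" +
        "FROM users\n" +
        "WHERE active = 1";

    public static void main(String[] args) {
        System.out.println(ATTR);
        System.out.println(SQL);
    }
}
```

Only the first literal would benefit from relaxed quote escaping, and
only the second from being allowed to span lines.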

So rather than asking "is it sufficiently useful to allow the existing
string literal notation to span multiple lines", I want to ask "is it
sufficiently useful to continue to restrict the existing notation from
spanning multiple lines". The answer to that question may still be
yes; to be clear, I have no particular objection to forbidding existing
string notation from spanning multiple lines. I only had a problem
with the given justification for it, but I don't think it's worth
bogging down the discussion over.

In other words, as you have said before "This is about *simplifying*
the language model by removing gratuitous interactions between
features." ;)

*Regarding the definition of rawness.*

>> I'd argue that the requirement for
>> unescaped quotes falls more naturally within the scope of the raw string
>> feature than the multi-line string feature:
>
> I understand why you would make this argument -- initially, we fell into this subjective interpretation of “raw” as well.  But if you dig deeper, you’ll see that the reasons why various characters need escaping vary.  There are at least three:

> - Concerns over representation in source (tabs, newlines, non-ascii characters)
> - Concerns over conflict with the escape mechanism (backslash)
> - Concerns over conflict with the delimiter (quotes)

Yes I noticed this distinction also. Although the way I saw it is:

A) the escape character *bestows* special meaning on the following
sequence, e.g. embedding Unicode in ASCII, newlines, etc.
B) the escape character *removes* the special meaning of the following
sequence, e.g. the escape character itself, the delimiter, newlines in
the properties file format

But this distinction isn't fundamental to the escaping mechanism, it's
a (rather sensible) choice that was made. We can define rawness in
terms that remove this distinction by putting everything into category
A), as I will try to do below.
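
For reference, a small sketch of the two categories using escapes that
exist in Java today. The class and field names are made up. Note that
Java processes Unicode escapes in an earlier translation step than the
other escapes; it is grouped under A) here only because the effect is
the same kind.

```java
final class EscapeCategories {

    // A) The backslash bestows meaning: 'n' becomes a line feed, and the
    //    four hex digits become the character they encode.
    static final String NEWLINE = "\n";
    static final String LETTER_A = "\u0041";

    // B) The backslash removes meaning: the next character would otherwise
    //    start an escape or end the literal.
    static final String BACKSLASH = "\\";
    static final String QUOTE = "\"";

    public static void main(String[] args) {
        System.out.println(LETTER_A + BACKSLASH + QUOTE + NEWLINE.length());
    }
}
```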

(FWIW I expect that to most people *"raw"* will simply mean *"I want
to paste/write this code-snippet/regex in a string literal without
having to faff around trying to figure out how to properly escape
everything".* Do we agree that this is a reasonable barometer for a
rawness feature? Whatever formal definition of rawness is selected I
hope that it captures this informal expectation.)

*A different approach.*

So bringing all this together I have a serious proposal for a
formalisation of rawness which I think is fairly unique, but which is
internally consistent and imo easy to understand and satisfies all
your use-cases.

There are only three rules:

- The escape-marker for a string literal can be designated by way of a
variable-length sequence of backslashes before its opening delimiter.
- A raw string is denoted by designating an escape-marker.
- In a raw string, all characters appearing in the source are an exact
representation of string content unless they are escaped, *including
the delimiter character*.

So in other words, to avoid collisions with escape sequences, rather
than a variable-length delimiter, we choose a variable-length
escape-marker. And the delimiter is technically unchanged, but as per
the other rules it must be escaped in order to delimit the string.

    var s = \"Hello, World\"; // a raw string!

    var s = \"\"; // empty string

    var s = \""Hello, World"\"; // string starting and ending with quotes

    var s = \\\"[complicated regex with lots of escapes]\\\";

It's worth noting that choosing a single backslash as the escape
marker (i.e. \") gives almost exactly the same semantics as the
proposed """.

    var s = \"
      {
        "hello": "world"
      }
    \";

And if we want to avoid collisions with the escape marker we simply change it.

    var s = \\\"
      {
        "newline": "\n",
        "backslash": "\\"
      }
    \\\";

And in some hypothetical future where we have e.g. string
interpolation via escape sequences, we may still have access to this
feature without sacrificing the "rawness" of the rest of the string.

    var i = getMagicNumber();
    var s = \\\"
      {
        "newline": "\n",
        "backslash": "\\",
        "magic-number": "\\\$(i)"
      }
    \\\";

The rules of this scheme are imo straightforward for both humans and
parsers. Everything is raw by default (i.e. unless escaped).

Escaping the escape marker is obviously no longer necessary, since we
can just change it. So we can say that it is legal to precede the
escape marker with any sequence of backslashes. This I think addresses
the remaining edge-cases and makes any string representable in raw
form.

    var s = \"\\"; // a string containing a single backslash
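
To check that the rules hang together, here is a rough sketch of a
decoder for the notation, written in current Java. Everything in it is
an assumption made for illustration: the class and method names, the
set of recognised escapes (only the delimiter, `n` and `t`), and the
choice to reject an escape marker followed by any other character. It
takes the source text of one literal and returns the string it denotes.

```java
final class EscapeMarkerLiteral {

    /** Decodes the source text of one literal, e.g. the characters \"Hello\". */
    static String decode(String src) {
        // The run of backslashes before the opening quote is the escape marker.
        int marker = 0;
        while (marker < src.length() && src.charAt(marker) == '\\') marker++;
        if (marker == 0 || marker == src.length() || src.charAt(marker) != '"') {
            throw new IllegalArgumentException("expected backslashes followed by a quote");
        }
        StringBuilder out = new StringBuilder();
        int i = marker + 1;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c != '\\') {              // unescaped characters, quotes included, are content
                out.append(c);
                i++;
                continue;
            }
            int start = i;
            while (i < src.length() && src.charAt(i) == '\\') i++;
            int run = i - start;
            if (run < marker || i == src.length()) {
                appendBackslashes(out, run);      // too short to be the marker: content
                continue;
            }
            // Backslashes in front of the marker are content; the rest is the marker.
            appendBackslashes(out, run - marker);
            char escaped = src.charAt(i++);
            switch (escaped) {
                case '"':                 // an escaped delimiter closes the literal
                    if (i != src.length()) {
                        throw new IllegalArgumentException("text after closing delimiter");
                    }
                    return out.toString();
                case 'n': out.append('\n'); break;
                case 't': out.append('\t'); break;
                default:
                    throw new IllegalArgumentException("unknown escape: " + escaped);
            }
        }
        throw new IllegalArgumentException("unterminated literal");
    }

    private static void appendBackslashes(StringBuilder out, int count) {
        for (int k = 0; k < count; k++) out.append('\\');
    }

    public static void main(String[] args) {
        // Today's Java needs its own escaping just to spell the proposed notation.
        System.out.println(decode("\\\"Hello, World\\\""));
        System.out.println(decode("\\\\\\\"{ \"newline\": \"\\n\" }\\\\\\\""));
    }
}
```

With a three-backslash marker, the shorter backslash runs inside the
JSON snippet pass through untouched, which is the point of letting the
marker vary.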

Thoughts? I think it's fairly "Java-like", doesn't introduce too many
new concepts, and looks familiar due to adapting existing concepts and
notation. Hopefully not deceptively familiar.

On Mon, 7 Jan 2019 at 13:10, elias vasylenko <eliasvasylenko at gmail.com>
wrote:

> > At first blush, the simplicity of the Rust approach is attractive; just
> > let strings span multiple lines, with no new syntax.  The obvious
> > counter-arguments are pretty weak in the current age; if you code in an IDE,
> > as most developers do, it is not easy to accidentally leave off a closing
> > quote, and the syntax highlighting will make this obvious in the event we
> > do so anyway.  But, if we look through the lens of our use cases -- such as
> > JSON snippets -- we see that this approach fails almost completely, because
> > you _still_ have to escape the quotes, and almost all multi-line snippets
> > will have quotes.  So, let's cross this off too.  The same applies to using
> > a letter prefix for multi-line strings; it doesn't address the primary
> > use case.
>
> I'm a little confused about the argument to cross this off. Is this not
> dismissing a solution to the multi-line string problem on the basis that it
> doesn't also solve the raw string problem? Within the exploration of raw
> strings and multi-line strings as separate features I think this reasoning
> bears a little extra scrutiny.
>
> Contrast, for example, using triple quote for multi-line and `r` prefix
> for raw:
>
>     var s1 = """
>       <xml>
>         <example />
>       </xml>
>     """;
>
>     var s2 = """
>       {
>         "json" : "example"
>       }
>     """;
>
>     var s3 = r"""
>       {
>         "quote" : "\"",
>         "backslash" : "\\"
>       }
>     """;
>
> I don't see what the triple quotes buy us over single quotes other than
> that they also serve the secondary purpose of a sort of poor-man's raw
> string. Is that really worth the extra inconsistency given that we also
> wish to have *actual* raw strings? I'd argue that the requirement for
> unescaped quotes falls more naturally within the scope of the raw string
> feature than the multi-line string feature:
>
>     var s1 = "
>       <xml>
>         <example />
>       </xml>
>     ";
>
>     var s2 = \"
>       {
>         "json" : "example"
>       }
>     "\;
>
>     // or with a variable-length component to the delimiter...
>     var s3 = \\\"
>       {
>         "quote" : "\"",
>         "backslash" : "\\"
>       }
>     "\\\;
>
> The \""\ syntax is just an example, the above arguments can equally be
> applied to e.g. the \+ \- proposal.
>
> That said I think there's also a minor danger of implying some sort of
> distinction between nonce-based delimiters and variable-length delimiters
> which doesn't necessarily exist. Isn't the latter just an example of the
> former but with a restricted format? Surely the reason the nonce-based
> approaches and e.g. the Rust approach avoid most of the edge cases suffered
> by the original backtick proposal is that the delimiters have both a
> variable portion *and* a single inner character.
>
> On Wed, 2 Jan 2019 at 18:22, Brian Goetz <brian.goetz at oracle.com> wrote:
>
>> As many of you saw, we pulled back the Raw String Literals feature from
>> JDK 12.  The public statement is here:
>>
>>
>> http://mail.openjdk.java.net/pipermail/jdk-dev/2018-December/002402.html
>>
>> So, let's restart the design discussion.  First, I want to enumerate some
>> of the process errors I think we made.
>>
>>  - We never really explored the full design space.  The initial proposal
>> had a reasonable syntactic strawman, and rather than explore the entire
>> space, we mostly followed the path of refining the initial strawman, and
>> stopped there.
>>  - We got caught in the "linear thinking" trap with respect to the design
>> center.  We started off thinking of this feature as "raw strings", of which
>> multi-line strings are an important sub-case, but in reality most of the
>> user pain is over dealing with multi-line snippets of HTML, JSON, XML, or
>> SQL, and raw-ness is secondary.  We never really made this turn.
>>  - We were too focused on getting the last 2% rather than the first 98%.
>> (Note that for many, perhaps most language features, the last 2% is
>> critical; for this one, which is entirely about syntactic convenience, it
>> is not.)
>>  Specifically, by focusing on self-embedding as a test of fitness rather
>> than more typical use cases, we ended up in a place that was both more
>> complex than necessary, and at the same time, still had prominent
>> anomalies.  (Anomalies are unavoidable if we are unwilling to take on a
>> super-ugly syntax, but we do have some control over how obvious and
>> prominent they are.)
>>
>> From my "language steward" perspective, my main problem is that the two
>> forms of string literals in the current proposal are gratuitously
>> unrelated.  They are syntactically unrelated (different delimiters and
>> delimiter arity rules), and semantically unrelated (one must be raw and
>> permits multiple lines; the other cannot be raw and cannot be multiple
>> line.)  I would prefer to have a single string literal feature, with some
>> sub-options for controlling raw-ness and/or line spanning -- with bonus
>> points if these are orthogonal aspects.  (As a sub-concern, I would
>> strongly prefer we not burn the backtick character as a delimiter; it
>> should be entirely possible to avoid this by building on the existing
>> string literal mechanism.)
>>
>> So, how should we evaluate success here?  This feature doesn't improve
>> the expressiveness or abstractive ability of the language at all -- it's
>> purely about syntactic convenience.  And, given that we've limped along for
>> 20+ years without it, its lack can't be all _that_ problematic.  So let's
>> identify the use cases we care about most, and evaluate the feature through
>> the lens of how it helps those use cases.  In my opinion, these are:
>>
>>  - Multi-line snippets of JSON, HTML, XML, and SQL embedded in Java code
>> as string literals. (Other languages are used too, but these constitute the
>> majority.)  These currently require escaping for quotes and for newlines,
>> which means every such snippet requires substantial surgery.  This is
>> painful for code writers (though IDEs can do most of the lifting here), but
>> more importantly, is harder to read, and it is really easy to leave out a
>> `\n` and get the wrong result, and not have it be immediately noticeable.
>> We would like for most such snippets to be simply pastable without
>> modification.
>>  - Regular expressions and Windows paths routinely require escaping,
>> which again is easy to get wrong and hard to read.  (Regular expressions
>> are hard enough to read, we don't need to make it harder.)  These are
>> typically a single line.
>>
>> Given that this feature is pure convenience, we'd also like to avoid
>> excessive spending of our complexity budgets -- either language complexity
>> or teachability.  Grabbing for that last 2% at the expense of either of
>> these is not a good trade.
>>
>> Note too that there is no ideal answer here; we can see this quite
>> clearly by looking at the variety of choices other languages have made, and
>> each still has anomalies (e.g., python raw strings can't end with a
>> backslash) or forces ugly complexity on the reader (e.g., user-selected
>> nonces in C++ raw strings, or Rust's `#` characters).  This is truly a
>> "pick your poison" game.
>>
>> Let's remind ourselves of what other languages do in this area.  In all
>> these languages, raw strings can contain newlines; some have separate
>> features for multi-line escaped strings and multi-line raw strings.
>>
>>  - C simulates multi-line strings by having a continuation character
>> (backslash) in the last column, or by implicitly concatenating adjacent
>> string literals (`"raw" "string"`).  It does not support raw strings,
>> though there is a gcc extension that emulates C++ raw strings.
>>  - C++ supports multi-line strings through raw strings.  It denotes raw
>> strings with an `R` prefix before the quotes, and a user-selected nonce and
>> parentheses inside the quotes: `R"NONCE(raw string)NONCE"`.  The nonce may
>> be empty, but the parens are required.
>>  - Rust supports multi-line strings by simply allowing newline characters
>> in an ordinary string literal.  It separately supports raw string literals
>> with an `r` prefix, followed by a variable (can be zero) number of `#`
>> characters, a double quote, the raw string, a double quote, and the same
>> number of `#` characters: `r##"raw string"##`.
>>  - Python allows string literals to span multiple lines by using a
>> three-quote (`"""`) delimiter.  It allows raw string literals by prefixing
>> the string literal with `r`.  Its escaping rules for quotes in raw strings
>> are unusual; a quote preceded by a backslash is escaped, but the backslash
>> stays in the string.  (Accordingly, a raw string cannot end with a
>> backslash.)
>>  - Ruby supports multi-line strings with here-docs, and raw strings using
>> the `%q()` construct: `%q(raw string)`.
>>  - C#, like C++, supports multi-line strings through raw strings.  A raw
>> string precedes the string literal with an `@` character: `@"raw string"`.
>>  - Scala and Kotlin, like C++ and C#, support multi-line strings through
>> raw strings.  A raw string is delimited with triple quotes: `"""raw
>> string"""`.
>>
>> Note too that there is also room for interpretation on the meaning of
>> "raw"; Python permits some escaping in raw strings, and Kotlin permits
>> interpolation in raw strings.
>>
>> We can divide the approaches roughly into three categories:
>>  - Those that use user-supplied nonces (C++, here-docs).  These can
>> render 100% of embedded strings, with the costs that come with nonces:
>> annoying to write, and imposing cognitive load to read (as nearly any
>> sequence can be a nonce.)
>>  - Those that use variable-sized delimiters (Rust, and our previous
>> proposal).  These are simpler, but will invariably have some anomalies.
>>  - Those that use fixed delimiters (C#, Scala).  These are simpler still,
>> and will have more anomalies.
>>
>> So, recapping our starting point and guidance:
>>
>>  - The primary use case is multi-line snippets of JSON, HTML, XML, and
>> SQL.  It is rare that these require true-raw-ness, but they all commonly
>> have embedded quote characters.
>>  - The secondary use case is truly raw strings, of which the most common
>> offenders are small-ish -- regular expressions and Windows paths.
>>  - We should start by trying to extend existing string literals to
>> support raw and/or multi-line strings.
>>
>> Some questions we need to answer:
>>
>>  - What are reasonable delimiter choices for raw and/or multi-line
>> strings?
>>  - Should the default treatment of multi-line strings be raw or escaped
>> (alternately, is this one feature or two)?
>>  - Is raw-ness a property of a string literal, or a state that can change
>> within the literal (i.e., with embedded start-raw/end-raw escape sequences)?
>>  - How do we embed delimiters in raw strings (escaping, doubling up,
>> concatenation)?
>>  - How far do we want to go to support embedding of delimiters?
>>
>> Let's start by asking how we might extend the current string literal
>> feature to support multi-line strings.  Currently, a string literal starts
>> with a double-quote, can span only a single line of source, and ends at the
>> first unescaped double quote.  How could we extend this to a multi-line
>> string literal?  Some possibilities include:
>>
>>  - Simply remove the constraint of "can only span a single line"; no
>> other change to delimiters is required (the Rust approach.)
>>  - Choose a different fixed delimiter, such as tripled quotes ("""),
>> doubled single-quotes (''...''), or a multi-character quote token
>> (`/"..."/`).
>>  - Use a modifier on the opening quote, such as `R"..."` or `@"..."`
>>  - Use an embedded escape sequence, such as `"\M..."`, to opt into
>> multi-line treatment
>>  - Use here-docs, with a fixed or user-providable nonce
>>
>> I think it's reasonable to eliminate here-docs from consideration as these
>> are more typically associated with scripting languages.
>>
>> At first blush, the simplicity of the Rust approach is attractive; just
>> let strings span multiple lines, with no new syntax.  The obvious
>> counter-arguments are pretty weak in the current age; if you code in an IDE,
>> as most developers do, it is not easy to accidentally leave off a closing
>> quote, and the syntax highlighting will make this obvious in the event we
>> do so anyway.  But, if we look through the lens of our use cases -- such as
>> JSON snippets -- we see that this approach fails almost completely, because
>> you _still_ have to escape the quotes, and almost all multi-line snippets
>> will have quotes.  So, let's cross this off too.  The same applies to using
>> a letter prefix for multi-line strings; it doesn't address the primary use
>> case.
>>
>> Note too that our primary use case admits a middle-ground option:
>> multi-line strings are not raw, but quotes need not be escaped.  This is a
>> possibility if the delimiter is anything other than a single double-quote
>> (`"`).
>>
>> So, some reasonable starting points on this front include:
>>
>>  - Just follow C#/Scala/Kotlin, where there's a single mechanism for both
>> raw and multi-line, delimited by triple-quotes.  Here, a single (or double)
>> embedded quote does not necessarily need to be escaped.
>>  - Use triple-quotes for non-raw multi-line string literals, and some
>> sort of additional way to select raw-ness for either single- or
>> triple-quoted string literals.  (Same comment about embedded quotes.)
>>  - Same, but use doubled or tripled single-quotes.
>>
>> Within the "multiple quote" options, we can separately choose between a
>> fixed number of quotes (e.g., 3) or a variable number (e.g., 3 or more, odd
>> only, etc.)  The trade-off here is about where the anomalies go; with the
>> variable-number approaches, it gets harder to start or end with the
>> delimiter character (while this is not necessarily a serious anomaly, it
>> is a prominent one), and with the fixed approach, there is more need to
>> do something (escaping, concatenating, etc.) about the delimiter character (though
>> embedding triple-quotes is not all that common in our primary use cases).
>> Also, our IDE friends have pointed out that even numbers of quotes put the
>> IDE in a quandary as to whether the user has just typed the opening
>> delimiter, or both the opening and closing delimiters.
>>
>> Now, raw-ness.
>>
>> One option is to just say that multi-line strings are also raw.  We have
>> evidence that this is not totally unworkable, as several languages have
>> gone this way, but it does mean that for the use cases where the user wants
>> multi-line but not raw, they must resort either to concatenation, or
>> explicit escape processing (e.g., `"""foo""".escape()`).
>>
>> Another is to allow a prefix character to indicate raw-ness; `R"foo"` or
>> `R"""foo"""`.  The prefix character approach is more extensible to other
>> kinds of modes to string processing.
>>
>> Another option is to use a different delimiter, as the current proposal
>> does.  If we were to go this way, I'd suggest we consider double or triple
>> single-quote (which are currently illegal), rather than continuing with
>> backtick.
>>
>> A fourth option, one that has not yet been considered, is to say that
>> raw-ness is a _state_ of processing a string literal; string literals start
>> out escaped, but can drop into (and out of) raw-ness as they like:
>>
>>     String s = "This part is escaped\n, but this part\- is raw, and this part\+ is escaped again.";
>>     String path = "\-C:\bin\putty";
>>
>> This gets us where multi-line-ness and raw-ness are orthogonal properties
>> of string literals -- without requiring any new delimiters.
>>
>> So, how to proceed?  First, let's try to avoid focusing on our own
>> personal preferences, or be distracted by unfamiliarity, and remember that
>> our job here is to get to a design that's best for _tomorrow's_ Java
>> developers and source base.  (That means that, for example, we can't allow
>> ourselves to be distracted by the fact that, say, embedded "\-" or `R"..."`
>> is unfamiliar today.  It will be familiar tomorrow, if we decide that's
>> what would be best.)
>>
>> Here's what would be super-useful:
>>
>>  - Data that supports or refutes the claim that our primary use cases are
>> embedded JSON, HTML, XML, and SQL.
>>  - Use cases we've left out, for which we can discuss whether we want to
>> incorporate them into our goals.
>>  - Data (either Java or non-Java) on the use of various flavors of
>> strings (raw, multi-line, etc) in real codebases, which might be useful to
>> help determine, for example, whether raw and multi-line should be lumped
>> into the same bucket or not.
>>
>> The bike shed is open (but please show up with structural members, not
>> just paint.)
>>
>>
>>