Raw string literals -- restarting the discussion

Mon Jan 7 13:10:57 UTC 2019

> At first blush, the simplicity of the Rust approach is attractive; just
> let strings span multiple lines, with no new syntax.  The obvious
> counter-arguments are pretty weak in the current age; if you code in an IDE,
> as most developers do, it is not easy to accidentally leave off a closing
> quote, and the syntax highlighting will make this obvious in the event we
> do so anyway.  But, if we look through the lens of our use cases -- such as
> JSON snippets -- we see that this approach fails almost completely, because
> you _still_ have to escape the quotes, and almost all multi-line snippets
> will have quotes.  So, let's cross this off too.  The same applies to using
> a letter prefix for multi-line strings; it doesn't address the primary use
> case.

I'm a little confused about the argument for crossing this off. Is this not
dismissing a solution to the multi-line string problem on the basis that it
doesn't also solve the raw string problem? If we are exploring raw strings
and multi-line strings as separate features, I think this reasoning deserves
a little extra scrutiny.

Contrast, for example, using triple quotes for multi-line and an `r` prefix
for raw:

    var s1 = """
      <xml>
        <example />
      </xml>
    """;

    var s2 = """
      {
        "json" : "example"
      }
    """;

    var s3 = r"""
      {
        "quote" : "\"",
        "backslash" : "\\"
      }
    """;

I don't see what the triple quotes buy us over ordinary double quotes, other
than serving the secondary purpose of a sort of poor man's raw string. Is
that really worth the extra inconsistency, given that we also wish to have
*actual* raw strings? I'd argue that the requirement for unescaped quotes
falls more naturally within the scope of the raw string feature than the
multi-line string feature:

    var s1 = "
      <xml>
        <example />
      </xml>
    ";

    var s2 = \"
      {
        "json" : "example"
      }
    "\;

    // or with a variable-length component to the delimiter...
    var s3 = \\\"
      {
        "quote" : "\"",
        "backslash" : "\\"
      }
    "\\\;

The \""\ syntax is just an example; the arguments above apply equally to,
e.g., the \+ \- proposal.

That said, I think there's also a minor danger of implying a distinction
between nonce-based delimiters and variable-length delimiters which doesn't
necessarily exist. Isn't the latter just an instance of the former with a
restricted format? Surely the reason the nonce-based approaches and, e.g.,
the Rust approach avoid most of the edge cases suffered by the original
backtick proposal is that the delimiters have both a variable portion *and*
a single inner character.
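
To illustrate the point about a variable portion plus a single inner
character, here is a minimal sketch of how a scanner might find the end of
a literal written in the example \""\ style above. Everything in it is my
own assumption for illustration: the class name, the `body` helper, and the
rule that the closing delimiter is one double quote followed by the same
number of backslashes that opened the literal.

```java
final class VariableDelimiterSketch {
    static final String BS = "\\"; // one backslash character
    static final String Q = "\"";  // one double-quote character

    /**
     * Hypothetical helper: source starts with n backslashes and a double quote;
     * the body ends at the first double quote followed by n backslashes.
     */
    static String body(String source) {
        int n = 0;
        while (n < source.length() && source.charAt(n) == '\\') {
            n++;
        }
        if (n == source.length() || source.charAt(n) != '"') {
            throw new IllegalArgumentException("expected a double quote after the backslashes");
        }
        // Closing delimiter: the single inner character plus the variable portion.
        String closing = Q + source.substring(0, n);
        int start = n + 1;
        int end = source.indexOf(closing, start);
        if (end < 0) {
            throw new IllegalArgumentException("unterminated literal");
        }
        return source.substring(start, end);
    }

    public static void main(String[] args) {
        // The characters: { "backslash" : "\\" }
        String json = "{ " + Q + "backslash" + Q + " : " + Q + BS + BS + Q + " }";
        String two = BS + BS;
        String three = BS + BS + BS;

        // Three backslashes: the body never contains a quote followed by three
        // backslashes, so the whole snippet is recovered.
        System.out.println(body(three + Q + json + Q + three + ";"));

        // Two backslashes: the body contains a quote followed by two backslashes,
        // so the literal is cut short after the colon.
        System.out.println(body(two + Q + json + Q + two + ";"));
    }
}
```

The second call shows why the writer has to pick a delimiter length that the
content does not happen to contain, which is the same job a nonce does.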

On Wed, 2 Jan 2019 at 18:22, Brian Goetz <brian.goetz at oracle.com> wrote:

> As many of you saw, we pulled back the Raw String Literals feature from
> JDK 12.  The public statement is here:
>
>
> http://mail.openjdk.java.net/pipermail/jdk-dev/2018-December/002402.html
>
> So, let's restart the design discussion.  First, I want to enumerate some
> of the process errors I think we made.
>
>  - We never really explored the full design space.  The initial proposal
> had a reasonable syntactic strawman, and rather than explore the entire
> space, we mostly followed the path of refining the initial strawman, and
> stopped there.
>  - We got caught in the "linear thinking" trap with respect to the design
> center.  We started off thinking of this feature as "raw strings", of which
> multi-line strings are an important sub-case, but in reality most of the
> user pain is over dealing with multi-line snippets of HTML, JSON, XML, or
> SQL, and raw-ness is secondary.  We never really made this turn.
>  - We were too focused on getting the last 2% rather than the first 98%.
> (Note that for many, perhaps most language features, the last 2% is
> critical; for this one, which is entirely about syntactic convenience, it
> is not.)
>  Specifically, by focusing on self-embedding as a test of fitness rather
> than more typical use cases, we ended up in a place that was both more
> complex than necessary, and at the same time, still had prominent
> anomalies.  (Anomalies are unavoidable if we are unwilling to take on a
> super-ugly syntax, but we do have some control over how obvious and
> prominent they are.)
>
> From my "language steward" perspective, my main problem is that the two
> forms of string literals in the current proposal are gratuitously
> unrelated.  They are syntactically unrelated (different delimiters and
> delimiter arity rules), and semantically unrelated (one must be raw and
> permits multiple lines; the other cannot be raw and cannot be multiple
> line.)  I would prefer to have a single string literal feature, with some
> sub-options for controlling raw-ness and/or line spanning -- with bonus
> points if these are orthogonal aspects.  (As a sub-concern, I would
> strongly prefer we not burn the backtick character as a delimiter; it
> should be entirely possible to avoid this by building on the existing
> string literal mechanism.)
>
> So, how should we evaluate success here?  This feature doesn't improve the
> expressiveness or abstractive ability of the language at all -- it's purely
> about syntactic convenience.  And, given that we've limped along for 20+
> years without it, its lack can't be all _that_ problematic.  So let's
> identify the use cases we care about most, and evaluate the feature through
> the lens of how it helps those use cases.  In my opinion, these are:
>
>  - Multi-line snippets of JSON, HTML, XML, and SQL embedded in Java code
> as string literals. (Other languages are used too, but these constitute the
> majority.)  These currently require escaping for quotes and for newlines,
> which means every such snippet requires substantial surgery.  This is
> painful for code writers (though IDEs can do most of the lifting here), but
> more importantly, is harder to read, and it is really easy to leave out a
> `\n` and get the wrong result, and not have it be immediately noticeable.
> We would like for most such snippets to be simply pastable without
> modification.
>  - Regular expressions and Windows paths routinely require escaping, which
> again is easy to get wrong and hard to read.  (Regular expressions are hard
> enough to read, we don't need to make it harder.)  These are typically a
> single line.
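
To make the two use cases above concrete, here is roughly what they look
like in Java as it stands today. This is only an illustrative sketch; the
class and constant names are made up, and the SQL example is an assumed
instance of the "left out a `\n`" mistake described above.

```java
final class EscapingToday {
    // Multi-line JSON: every embedded quote needs a backslash and every
    // line needs an explicit \n.
    static final String JSON =
        "{\n" +
        "  \"json\" : \"example\"\n" +
        "}\n";

    // Forgetting one \n still compiles, but silently fuses two lines.
    static final String BROKEN_SQL =
        "SELECT name" +
        "FROM users\n";

    // Single-line cases: each backslash has to be doubled.
    static final String WINDOWS_PATH = "C:\\bin\\putty";
    static final String DIGITS_REGEX = "\\d+";

    public static void main(String[] args) {
        System.out.print(JSON);
        System.out.print(BROKEN_SQL);
        System.out.println(WINDOWS_PATH);
        System.out.println("42".matches(DIGITS_REGEX));
    }
}
```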
>
> Given that this feature is pure convenience, we'd also like to avoid
> excessive spending of our complexity budgets -- either language complexity
> or teachability.  Grabbing for that last 2% at the expense of either of
> these is not a good trade.
>
> Note too that there is no ideal answer here; we can see this quite clearly
> by looking at the variety of choices other languages have made, and each
> still has anomalies (e.g., Python raw strings can't end with a backslash)
> or forces ugly complexity on the reader (e.g., user-selected nonces in C++
> raw strings, or Rust's `#` characters).  This is truly a "pick your poison"
> game.
>
> Let's remind ourselves of what other languages do in this area.  In all
> these languages, raw strings can contain newlines; some have separate
> features for multi-line escaped strings and multi-line raw strings.
>
>  - C simulates multi-line strings by having a continuation character
> (backslash) in the last column, or by implicitly concatenating adjacent
> string literals (`"raw" "string"`).  It does not support raw strings,
> though there is a gcc extension that emulates C++ raw strings.
>  - C++ supports multi-line strings through raw strings.  It denotes raw
> strings with an `R` prefix before the quotes, and a user-selected nonce and
> parentheses inside the quotes: `R"NONCE(raw string)NONCE"`.  The nonce may
> be empty, but the parens are required.
>  - Rust supports multi-line strings by simply allowing newline characters
> in an ordinary string literal.  It separately supports raw string literals
> with an `r` prefix, followed by a variable (can be zero) number of `#`
> characters, a double quote, the raw string, a double quote, and the same
> number of `#` characters: `r##"raw string"##`.
>  - Python allows string literals to span multiple lines by using a
> three-quote (`"""`) delimiter.  It allows raw string literals by prefixing
> the string literal with `r`.  Its escaping rules for quotes in raw strings
> are unusual; a backslash followed by a quote escapes the quote, but leaves
> the backslash in the string.  (Accordingly, a raw string cannot end with a
> backslash.)
>  - Ruby supports multi-line strings with here-docs, and raw strings using
> the `%q()` construct: `%q(raw string)`.
>  - C#, like C++, supports multi-line strings through raw strings.  A raw
> string is written by prefixing the string literal with an `@` character:
> `@"raw string"`.
>  - Scala and Kotlin, like C++ and C#, support multi-line strings through
> raw strings.  A raw string is delimited with triple quotes: `"""raw
> string"""`.
>
> Note too that there is also room for interpretation on the meaning of
> "raw"; Python permits some escaping in raw strings, and Kotlin permits
> interpolation in raw strings.
>
> We can divide the approaches roughly into three categories:
>  - Those that use user-supplied nonces (C++, here-docs).  These can render
> 100% of embedded strings, with the costs that come with nonces: annoying to
> write, and imposing cognitive load to read (as nearly any sequence can be a
> nonce.)
>  - Those that use variable-sized delimiters (Rust, and our previous
> proposal).  These are simpler, but will invariably have some anomalies.
>  - Those that use fixed delimiters (C#, Scala).  These are simpler still,
> and will have more anomalies.
>
> So, recapping our starting point and guidance:
>
>  - The primary use case is multi-line snippets of JSON, HTML, XML, and
> SQL.  It is rare that these require true-raw-ness, but they all commonly
> have embedded quote characters.
>  - The secondary use case is truly raw strings, of which the most common
> offenders are small-ish -- regular expressions and Windows paths.
>  - We should start by trying to extend existing string literals to support
> raw and/or multi-line strings.
>
> Some questions we need to answer:
>
>  - What are reasonable delimiter choices for raw and/or multi-line strings?
>  - Should the default treatment of multi-line strings be raw or escaped
> (alternately, is this one feature or two)?
>  - Is raw-ness a property of a string literal, or a state that can change
> within the literal (i.e., with embedded start-raw/end-raw escape sequences)?
>  - How do we embed delimiters in raw strings (escaping, doubling up,
> concatenation)?
>  - How far do we want to go to support embedding of delimiters?
>
> Let's start by asking how we might extend the current string literal
> feature to support multi-line strings.  Currently, a string literal starts
> with a double-quote, can span only a single line of source, and ends at the
> first unescaped double quote.  How could we extend this to a multi-line
> string literal?  Some possibilities include:
>
>  - Simply remove the constraint of "can only span a single line"; no other
> change to delimiters is required (the Rust approach.)
>  - Choose a different fixed delimiter, such as tripled quotes ("""),
> doubled single-quotes (''...''), or a multi-character quote token
> (`/"..."/`).
>  - Use a modifier on the opening quote, such as `R"..."` or `@"..."`
>  - Use an embedded escape sequence, such as `"\M..."`, to opt into
> multi-line treatment
>  - Use here-docs, with a fixed or user-providable nonce
>
> I think it's reasonable to eliminate here-docs from consideration as these
> are more typically associated with scripting languages.
>
> At first blush, the simplicity of the Rust approach is attractive; just
> let strings span multiple lines, with no new syntax.  The obvious
> counter-arguments are pretty weak in the current age; if you code in an IDE,
> as most developers do, it is not easy to accidentally leave off a closing
> quote, and the syntax highlighting will make this obvious in the event we
> do so anyway.  But, if we look through the lens of our use cases -- such as
> JSON snippets -- we see that this approach fails almost completely, because
> you _still_ have to escape the quotes, and almost all multi-line snippets
> will have quotes.  So, let's cross this off too.  The same applies to using
> a letter prefix for multi-line strings; it doesn't address the primary use
> case.
>
> Note too that our primary use case admits a middle-ground option:
> multi-line strings are not raw, but quotes need not be escaped.  This is a
> possibility if the delimiter is anything other than a single double-quote
> (`"`).
>
> So, some reasonable starting points on this front include:
>
>  - Just follow C#/Scala/Kotlin, where there's a single mechanism for both
> raw and multi-line, delimited by triple-quotes.  Here, a single (or double)
> embedded quote does not necessarily need to be escaped.
>  - Use triple-quotes for non-raw multi-line string literals, and some sort
> of additional way to select raw-ness for either single- or triple-quoted
> string literals.  (Same comment about embedded quotes.)
>  - Same, but use doubled or tripled single-quotes.
>
> Within the "multiple quote" options, we can separately choose between a
> fixed number of quotes (e.g., 3) or a variable number (e.g., 3 or more, odd
> only, etc.)  The trade-off here is about where the anomalies go; with the
> variable-number approaches, it gets harder to start or end with the
> delimiter character (while this is not necessarily a serious anomaly, it
> is a prominent one), and with the fixed approach, there is more need to
> do something (escaping, concatenating, etc.) to embed the delimiter (though
> embedding triple-quotes is not all that common in our primary use cases).
> Also, our IDE friends have pointed out that even numbers of quotes put the
> IDE in a quandary as to whether the user has just typed the opening
> delimiter, or both the opening and closing delimiters.
>
> Now, raw-ness.
>
> One option is to just say that multi-line strings are also raw.  We have
> evidence that this is not totally unworkable, as several languages have
> gone this way, but it does mean that for the use cases where the user wants
> multi-line but not raw, they must resort either to concatenation, or
> explicit escape processing (e.g., `"""foo""".escape()`)
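
As a rough illustration of what such explicit escape processing could
involve, here is a minimal sketch. The `processEscapes` helper is
hypothetical (it is not the `escape()` method mentioned above, nor an
existing JDK API), and it assumes only four escapes are recognised: `\n`,
`\t`, `\"` and `\\`.

```java
final class EscapeProcessing {
    /** Hypothetical helper: interprets \n, \t, \" and \\ in otherwise raw text. */
    static String processEscapes(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c != '\\') {
                out.append(c);
                continue;
            }
            if (i + 1 >= raw.length()) {
                throw new IllegalArgumentException("dangling backslash at end of input");
            }
            char next = raw.charAt(++i); // consume the character after the backslash
            switch (next) {
                case 'n':  out.append('\n'); break;
                case 't':  out.append('\t'); break;
                case '"':  out.append('"');  break;
                case '\\': out.append('\\'); break;
                default:
                    throw new IllegalArgumentException("unsupported escape: \\" + next);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stands in for raw content holding a backslash followed by the letter n.
        String raw = "line one\\nline two";
        System.out.println(processEscapes(raw)); // printed on two lines
    }
}
```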
>
> Another is to allow a prefix character to indicate raw-ness; `R"foo"` or
> `R"""foo"""`.  The prefix character approach is more extensible to other
> kinds of modes to string processing.
>
> Another option is to use a different delimiter, as the current proposal
> does.  If we were to go this way, I'd suggest we consider double or triple
> single-quote (which are currently illegal), rather than continuing with
> backtick.
>
> A fourth option, one that has not yet been considered, is to say that
> raw-ness is a _state_ of processing a string literal; string literals start
> out escaped, but can drop into (and out of) raw-ness as they like:
>
>     String s = "This part is escaped\n, but this part\- is raw, and this part\+ is escaped again.";
>     String path = "\-C:\bin\putty";
>
> This gets us where multi-line-ness and raw-ness are orthogonal properties
> of string literals -- without requiring any new delimiters.
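
A minimal sketch of how such stateful processing might work, under stated
assumptions: `interpret` is a hypothetical helper, `\-` enters raw mode,
`\+` leaves it, and the escaped state recognises only `\n`, `\t`, `\"` and
`\\`. The input is the body of the literal as it would appear in source,
written here as an ordinary Java string, so the backslashes are doubled.

```java
final class RawStateSketch {
    /** Hypothetical helper: processes a literal body with \- and \+ raw-mode switches. */
    static String interpret(String body) {
        StringBuilder out = new StringBuilder(body.length());
        boolean raw = false; // literals start out escaped
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i);
            if (c != '\\') {
                out.append(c);
                i++;
                continue;
            }
            // '\0' stands for "no character follows the backslash".
            char next = (i + 1 < body.length()) ? body.charAt(i + 1) : '\0';
            if (raw) {
                if (next == '+') {
                    raw = false;   // leave raw mode, drop the marker
                    i += 2;
                } else {
                    out.append(c); // any other backslash is kept literally
                    i++;
                }
                continue;
            }
            switch (next) {
                case '-':  raw = true;       break; // enter raw mode, drop the marker
                case 'n':  out.append('\n'); break;
                case 't':  out.append('\t'); break;
                case '"':  out.append('"');  break;
                case '\\': out.append('\\'); break;
                default:
                    throw new IllegalArgumentException("unsupported escape at index " + i);
            }
            i += 2;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Source body: \-C:\bin\putty
        System.out.println(interpret("\\-C:\\bin\\putty")); // prints C:\bin\putty
    }
}
```

One visible anomaly of this sketch is that the raw portion cannot itself
contain the two characters `\+`, since they always end raw mode.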
>
> So, how to proceed?  First, let's try to avoid focusing on our own
> personal preferences, or be distracted by unfamiliarity, and remember that
> our job here is to get to a design that's best for _tomorrow's_ Java
> developers and source base.  (That means that, for example, we can't allow
> ourselves to be distracted by the fact that, say, embedded "\-" or `R"..."`
> is unfamiliar today.  It will be familiar tomorrow, if we decide that's
> what would be best.)
>
> Here's what would be super-useful:
>
>  - Data that supports or refutes the claim that our primary use cases are
> embedded JSON, HTML, XML, and SQL.
>  - Use cases we've left out, for which we can discuss whether we want to
> incorporate them into our goals.
>  - Data (either Java or non-Java) on the use of various flavors of strings
> (raw, multi-line, etc) in real codebases, which might be useful to help
> determine, for example, whether raw and multi-line should be lumped into
> the same bucket or not.
>
> The bike shed is open (but please show up with structural members, not
> just paint.)
>
>
>