Raw string literals -- restarting the discussion

Brian Goetz brian.goetz at oracle.com
Mon Jan 7 16:36:05 UTC 2019


>> From: elias vasylenko <eliasvasylenko at gmail.com>
>> Subject: Re: Raw string literals -- restarting the discussion
>> Date: January 7, 2019 at 8:10:57 AM EST
>> To: amber-spec-comments at openjdk.java.net
>> 
>>> At first blush, the simplicity of the Rust approach is attractive; just
>>> let strings span multiple lines, with no new syntax.  The obvious
>>> counter-arguments are pretty weak in the current age; if you code in IDE,
>>> as most developers do, it is not easy to accidentally leave off a closing
>>> quote, and the syntax highlighting will make this obvious in the event we
>>> do so anyway.  But, if we look through the lens of our use cases -- such as
>>> JSON snippets -- we see that this approach fails almost completely, because
>>> you _still_ have to escape the quotes, and almost all multi-line snippets
>>> will have quotes.  So, let's cross this off too.  The same applies to using
>>> a letter prefix for multi-line strings; it doesn't address the primary use
>>> case.
>> 
>> I'm a little confused about the argument to cross this off. Is this not
>> dismissing a solution to the multi-line string problem on the basis that it
>> doesn't also solve the raw string problem? Within the exploration of raw
>> strings and multi-line strings as separate features I think this reasoning
>> bears a little extra scrutiny.

I think you’re confusing the Rust multi-line syntax with the Rust raw syntax.  What was crossed off here is the choice to simply let single-quoted string literals span multiple lines.  

“Raw” is not a very well-defined term, and the degree of “raw-ness” in so-called raw strings varies dramatically across languages.  So instead, we tried to frame this in terms of use cases.  The most important use case here is embedded snippets of HTML, JSON, SQL, or XML.  And the “just let single-quote strings cross lines” approach fails dramatically here, because these are all expected to have many embedded double-quote characters.
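
To make the use case concrete, here is a rough sketch in present-day Java (no new syntax assumed) of a small JSON snippet embedded as a string.  The class name, the helper method, and the JSON content are invented for illustration.

```java
public class EmbeddedJson {
    // A small JSON snippet written with today's string literals.
    // Every embedded double quote needs a backslash, and every line
    // needs an explicit \n plus concatenation.
    public static final String JSON =
        "{\n" +
        "  \"name\": \"amber\",\n" +
        "  \"raw\": true\n" +
        "}";

    // Counts the double-quote characters that had to be escaped above.
    public static long quoteCount(String s) {
        return s.chars().filter(c -> c == '"').count();
    }

    public static void main(String[] args) {
        System.out.println(JSON);
        System.out.println("embedded quotes: " + quoteCount(JSON));
    }
}
```

Letting the literal span lines would remove the \n and + noise, but each of the escaped quotes would still need its backslash, which is the part that hurts most for this kind of snippet.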

>> I'd argue that the requirement for
>> unescaped quotes falls more naturally within the scope of the raw string
>> feature than the multi-line string feature:

I understand why you would make this argument; initially, we fell into this subjective interpretation of “raw” as well.  But if you dig deeper, you’ll see that the reasons why various characters need escaping vary.  There are at least three:

 - Concerns over representation in source (tabs, newlines, non-ascii characters)
 - Concerns over conflict with the escape mechanism (backslash)
 - Concerns over conflict with the delimiter (quotes)

Only the first really belongs in the province of raw-ness; we’d not be concerned about quotes if our delimiter was something other than a quote. When you change the delimiter, the need to escape quotes goes away.  
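
As a minimal illustration of the three categories in today's Java (the class and constant names are made up for this sketch), note that the quote needs escaping only where it collides with the delimiter; inside a char literal, whose delimiter is different, it does not:

```java
public class EscapeReasons {
    // 1. Representation in source: a tab is hard to see or preserve in an editor.
    public static final String TAB = "a\tb";

    // 2. Conflict with the escape mechanism: a literal backslash must be doubled.
    public static final String BACKSLASH = "C:\\temp";

    // 3. Conflict with the delimiter: a literal double quote must be escaped.
    public static final String QUOTE = "say \"hi\"";

    // Different delimiter, so the double quote needs no escape here.
    public static final char QUOTE_CHAR = '"';

    public static void main(String[] args) {
        System.out.println(TAB.length());        // a, tab, b
        System.out.println(BACKSLASH);           // one backslash in the value
        System.out.println(QUOTE.charAt(4) == QUOTE_CHAR);
    }
}
```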

And, if you observe actual usage, you’ll see that quotes show up considerably more often in multi-line strings than other characters that might want escaping.  A mechanism that supports spanning lines and quotes is actually what most users need most of the time. 

In any case, I think this is the essence of your comment: that you think that quoting should be handled as part of raw strings, not multi-line strings.  Which is a fine perspective, but if you follow it a little further, you get to “multi-line strings are mostly useless” (for reasons already explained), at which point you get to “let’s just merge the features” (as some languages have chosen to do).  Which is also a possibility here.

>> That said I think there's also a minor danger of implying some sort of
>> distinction between nonce-based delimiters and variable-length delimiters
>> which doesn't necessarily exist. Isn't the latter just an example of the
>> former but with a restricted format?

Again, it depends whether you’re asking parsers or humans.  From a grammar perspective, a variable-length delimiter is mostly just a restricted nonce.  But from the perspective of humans who have to read code that includes human-generated nonces, the perception and cognitive load are quite different.  

I say “mostly”, though, because if you restrict the form enough (such as “any number of backticks”), you start to implicitly exclude some representable strings, such as those that start with backticks.  So now you’re paying both the complexity price of a more complex delimiter, and not even getting the benefit of being able to represent all strings.  Which was a balance that made us reconsider the previous proposal.
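
The representability gap can be sketched with a toy scanner.  This is not the actual compiler logic or any real API; it is a hypothetical model of an “any number of backticks” rule in which an opening run of n backticks is closed by the next run of exactly n.  All names here are assumptions made for the example.

```java
import java.util.Optional;

public class BacktickDelimiter {
    // Hypothetical scanner: reads the opening run of n backticks, then looks
    // for the next run of exactly n backticks and returns the text in between.
    // Returns empty if there is no opening run or no matching closing run.
    public static Optional<String> scan(String source) {
        int n = 0;
        while (n < source.length() && source.charAt(n) == '`') n++;
        if (n == 0) return Optional.empty();
        int i = n;
        while (i < source.length()) {
            if (source.charAt(i) != '`') { i++; continue; }
            int start = i;
            while (i < source.length() && source.charAt(i) == '`') i++;
            if (i - start == n) return Optional.of(source.substring(n, start));
        }
        return Optional.empty();
    }

    // Wraps content in a delimiter made of the given number of backticks.
    public static String wrap(String content, int ticks) {
        StringBuilder d = new StringBuilder();
        for (int k = 0; k < ticks; k++) d.append('`');
        return d + content + d;
    }

    public static void main(String[] args) {
        // A backtick in the middle round-trips with a longer delimiter.
        System.out.println(scan(wrap("a`b", 2)));
        // A leading backtick merges into the opening delimiter, so the
        // closing run no longer matches, whatever length is chosen.
        System.out.println(scan(wrap("`a", 2)));
    }
}
```

Under this model, content with an interior backtick is fine, but content that starts with a backtick cannot be written at all: the leading backtick is absorbed into the opening run, and no choice of delimiter length fixes that.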



