Raw string literals -- where we are, how we got here

Tue Mar 27 19:15:24 UTC 2018

Now that things have largely stabilized with raw string literals, let me 
summarize where we are, and how we got here.

## The proposal

Where we are now is that a raw string literal consists of an opening 
delimiter which is a sequence of N consecutive backticks, for some N > 
0, a body which may contain any characters (including newlines) except 
for a sequence of N consecutive backticks, and a closing delimiter of N 
consecutive backticks. Any line-end sequences (CR, LF, CRLF) are 
normalized to a single newline (LF), and the remainder of the body is 
treated without any further transformation (including without unicode 
escape processing), and placed in a String.  No other processing is done 
on the contents.
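
To make the lexical rule concrete, here is a minimal sketch of a scanner for it, written as ordinary Java. The names `RawStringScanner` and `scan` are made up for illustration and are not part of the proposal or of any compiler. The sketch also assumes that the closing delimiter is a run of exactly N ticks, so that shorter or longer runs of ticks stay in the body.

```java
final class RawStringScanner {
    /**
     * Scans a raw string literal that starts at the first character of src
     * and returns its body. Hypothetical helper, for illustration only.
     */
    static String scan(String src) {
        // The opening delimiter is the leading run of N ticks, N > 0.
        int n = 0;
        while (n < src.length() && src.charAt(n) == '`') {
            n++;
        }
        if (n == 0) {
            throw new IllegalArgumentException("not a raw string literal");
        }

        StringBuilder body = new StringBuilder();
        int i = n;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '`') {
                // Measure the whole run of ticks; only a run of exactly N closes.
                int run = 0;
                while (i + run < src.length() && src.charAt(i + run) == '`') {
                    run++;
                }
                if (run == n) {
                    return body.toString();
                }
                body.append(src, i, i + run);
                i += run;
            } else if (c == '\r') {
                // CR and CRLF are both normalized to a single LF.
                body.append('\n');
                i += (i + 1 < src.length() && src.charAt(i + 1) == '\n') ? 2 : 1;
            } else {
                // Everything else, backslashes included, is copied untouched.
                body.append(c);
                i++;
            }
        }
        throw new IllegalArgumentException("unterminated raw string literal");
    }
}
```

Under these assumptions, an input consisting of just two ticks is rejected as unterminated, because the scanner reads it as an opening two-tick delimiter with nothing after it.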

A raw string literal has type String, just like a traditional string 
literal, and can be used anywhere an expression of type String can be 
used (assignment, concatenation, etc.)

Examples:

     String s = `Doesn't have a \n newline character in it`;
     String ss = `a multi-
         line-string`;
     String sss = ``a string with a single tick (`) character in it``;
     String ssss = `a string with two ticks (``) in it`;
     String sssss = `````a string literal with gratuitously many ticks in its delimiter`````;

Note that the delimiter need not be _more_ ticks than the longest tick 
sequence in the body; if the body contains sequences of two ticks and 
three ticks, it can be delimited by one tick, four ticks, five ticks, 
etc.  This makes it possible to choose a minimal delimiter that doesn't 
interfere with the body.
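
As a rough sketch of how a tool (say, an IDE paste handler) might pick such a minimal delimiter: collect the lengths of all tick runs in the body and take the smallest positive length that is not among them. The names `TickDelimiter`, `shortestDelimiterLength` and `quote` are hypothetical, and the sketch assumes Java 11 or later for `String.repeat`. It refuses bodies that are empty or that begin or end with a tick, since those would run into the delimiter.

```java
import java.util.HashSet;
import java.util.Set;

final class TickDelimiter {
    /** Smallest N > 0 such that the body has no run of exactly N ticks. */
    static int shortestDelimiterLength(String body) {
        Set<Integer> runLengths = new HashSet<>();
        int i = 0;
        while (i < body.length()) {
            if (body.charAt(i) == '`') {
                int start = i;
                while (i < body.length() && body.charAt(i) == '`') {
                    i++;
                }
                runLengths.add(i - start);
            } else {
                i++;
            }
        }
        int n = 1;
        while (runLengths.contains(n)) {
            n++;
        }
        return n;
    }

    /** Wraps the body in the shortest non-conflicting tick delimiter. */
    static String quote(String body) {
        if (body.isEmpty()
                || body.charAt(0) == '`'
                || body.charAt(body.length() - 1) == '`') {
            // Such bodies would merge with the delimiter itself.
            throw new IllegalArgumentException("body cannot be quoted directly");
        }
        String ticks = "`".repeat(shortestDelimiterLength(body));
        return ticks + body + ticks;
    }
}
```

For a body containing runs of two and three ticks, `shortestDelimiterLength` returns 1, matching the observation that a single tick suffices in that case.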

## Design Center

The design center for this feature is _raw string literals_.  Not 
multi-line strings (though this is well handled), not interpolated 
strings (though this can be considered in the future.)  It turns off all 
inline escaping, even unicode escaping (which is usually handled by the 
lexer before the production even sees the characters.) We stay as true 
as we can to this principle: raw means raw, not 99% raw with a little 
bit of escaping.  (The single exception is normalizing of carriage 
control, the absence of which would just be too surprising.)

The primary use case addressed by raw string literals is snippets of 
code from other languages embedded in Java source files.  Here we 
interpret "languages" broadly; they could be traditional programming 
languages, specialized languages like regular expressions or SQL, or 
human languages.  We want Java lexing not to interfere at all; 
given a suitable O(1) incantation (picking a non-conflicting delimiter), 
you can freely cut and paste the foreign string to and from Java.  Being 
able to do this is not only convenient, but it reduces errors due to 
hand-mangling the string, and enhances readability because the embedded 
snippet is free of interference from Java.

Choosing raw-ness as a design center leads to a simpler design, which is 
good, but it also is _more stable_, because it leads us away from the 
temptation to tweak the rules here and there in ways that might be 
subjectively attractive, but that further increase the complexity of the 
feature.  This design choice reflects a priority choice: the high-order 
bit is _no embedding anomalies_. Users don't have to reason about 
whether they need to hand-mangle a snippet to avoid it being mangled by 
the compiler or runtime; given a suitable choice of delimiter, there's 
nothing else to think about.  (IDEs can help with the "writing code" 
part of this.)

The various additional features we might be tempted to put in (special 
processing for leading or trailing blank lines, leading white space, 
trimming to markers, etc) can instead be handled via library 
functionality.  Since raw string literals are Strings, we can further 
process them with library code -- both JDK code and user code (though 
methods on String have the advantage that they can be chained, rather 
than wrapped, which most users will prefer).  Adding new string 
manipulation features via libraries rather than through the language is 
easier, can be done by users, and is not constrained by the demands of 
consistency (you can have seven different trimming methods, each with 
their own definition of whitespace, if you like), whereas a language 
feature has to be one-size-fits-all.  Moving this complexity to the 
library where possible leads to a simpler feature and more choices for 
users.

#### A road not taken

We choose to divide the world of string literals first into raw and 
non-raw literals; from this, multi-line strings fall out for free as we 
can treat line breaks in the source file as just more raw characters.

We could have chosen, instead, to first divide the world into single and 
multi-line strings, and then into raw and non-raw; this would have left 
us with four choices (raw single line, raw multi-line, cooked 
single-line, cooked multi-line.)  This also would have been a defensible 
position, but seemed to add lexical complexity for little gain.

#### The exception that proves the rule

The one exception to raw-ness is that we normalize the line terminators 
to the most common (*nix) choice of a single newline, rather than using 
the platform-specific line terminator on the system that happens to have 
compiled the classfile.  The alternative would have just been too 
surprising.

## Syntax

Given that this feature has such a high syntax-to-substance ratio, we 
should expect more than the usual number of syntax opinions. Let's start 
with some consequences of our chosen design center.

#### No fixed delimiter

From the design choice above, it is a forced move to accept variable 
delimiters.  Otherwise, one cannot represent a string with the delimiter 
in a raw string, without inventing an escaping mechanism, and subverting 
our "raw means raw" goal.

The "self-embedding test" is not a mere theoretical goal.  Since the 
snippets we expect to paste into Java source are not randomly chosen 
strings of characters, but meaningful snippets of some language, the 
likelihood of wanting to represent a string that contains the chosen 
delimiter goes up.  Even if you are willing to dismiss "embed Java in 
Java" as a serious use case (we're not), people also want a familiar 
delimiter, which means something that looks like the delimiter in other 
languages, further increasing the chance of collision.  (For example, if 
we'd picked a fixed triple quote delimiter, then you couldn't embed 
Groovy or Python code, among others -- surely a real use case).  Fixed 
delimiters (of any length) and "raw means raw" are not compatible goals, 
and we choose "raw means raw".

The credible options for variable delimiters are using a repeating 
delimiter sequence (say, any number of ticks), or some sort of 
user-provided nonce ("here" docs), or both.  Nonces impose a higher 
cognitive load on readers, and their benefit accrues mostly to corner 
cases, so the more constrained option of repeating delimiters seems 
preferable.

#### Why not 'just' use triple quotes

People's syntax preferences are guided by familiarity, so we should 
expect suggestions to be biased towards what "similar" languages already 
do.  So the suggestion of using """triple quotes""" should be expected.

We've already discussed how a fixed delimiter is not acceptable. So at a 
minimum, this would have to be adjusted to "three or more."  While some 
people find triple quotes natural (or at least familiar), others find it 
offensively heavyweight.  Neither crowd is going to convince the other.

#### But ticks are too light

The opposite of the "triple quotes are too heavy" argument is "ticks are 
too light"; that a single tick is a lightweight character, and could go 
unnoticed, especially if your monitor hasn't been cleaned for a while.  
Unfortunately the quote-like delimiters in the middle of the weight 
range are taken by other activities.  Again, we can't satisfy the "too 
light" and "too heavy" crowd at the same time; whichever we do will make 
some people unhappy.

#### Why do you have to always do something new?

The quoting scheme chosen -- any number of ticks -- is actually taken 
from something we all use: Markdown 
(https://daringfireball.net/projects/markdown/syntax), which permits any 
number of ticks to be used for infix sequences, and any different number 
of ticks to be embedded.  (Where we depart from Markdown is that 
Markdown strips any leading and trailing newlines from multi-line tick 
blocks, an appropriate trick for a page presentation language, but not 
consistent with the design goal of "raw".)

#### But I want indentation stripping

When embedding a snippet of one language in another, both of which 
support indentation, we are left with two choices: indent the enclosed 
block exactly, which has the effect of the code "jutting out to the 
left", or indent the enclosed block relative to the enclosing block, 
which has the effect of having more indentation than you might want for 
the enclosed block.  Sometimes this doesn't matter, but sometimes it 
does. Whatever we do, one of these crowds will be unhappy.  When in 
doubt, we stick to the principle of "raw means raw", and provide 
indentation stripping via new instance methods on `String` to allow a 
range of trimming options, such as `trimIndent()`.
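
To illustrate that this kind of processing needs nothing from the language, here is a minimal sketch of one possible trimming policy as plain library code. The helper name `stripCommonIndent` is hypothetical, and the policy (remove the smallest leading run of spaces and tabs found on any non-blank line) is just one of many reasonable definitions, not the one the JDK will necessarily adopt.

```java
final class Indentation {
    /** Removes the indentation shared by all non-blank lines of s. */
    static String stripCommonIndent(String s) {
        String[] lines = s.split("\n", -1);

        // Find the smallest indentation among non-blank lines.
        int min = Integer.MAX_VALUE;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // blank lines do not constrain the indentation
            }
            int indent = 0;
            while (indent < line.length()
                    && (line.charAt(indent) == ' ' || line.charAt(indent) == '\t')) {
                indent++;
            }
            min = Math.min(min, indent);
        }
        if (min == Integer.MAX_VALUE) {
            min = 0; // every line was blank
        }

        // Drop that many leading characters from every line.
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                out.append('\n');
            }
            String line = lines[i];
            out.append(line.length() >= min ? line.substring(min) : "");
        }
        return out.toString();
    }
}
```

A user who prefers a different definition of whitespace, or who wants relative indentation preserved differently, can write a variant of this helper without waiting for a language change.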

#### But I want leading / trailing empty lines

Some people would like for the language to strip off leading and 
trailing blank lines.  Like indentation stripping, this is going to be 
what people want sometimes, and sometimes not.  Given that, again, we 
can't do both, we are again guided by "raw means raw", and provide 
library means to strip the extraneous newlines.
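
As a sketch of what such a library means could look like, the following helper drops leading and trailing lines that contain only spaces or tabs. The name `stripOuterBlankLines` and the exact definition of "blank" are assumptions made for this example.

```java
final class BlankLines {
    /** Removes blank lines from the start and the end of s. */
    static String stripOuterBlankLines(String s) {
        // Skip leading lines that are blank.
        int start = 0;
        while (true) {
            int nl = s.indexOf('\n', start);
            if (nl < 0 || !isBlank(s, start, nl)) {
                break;
            }
            start = nl + 1;
        }

        // Skip trailing lines that are blank.
        int end = s.length();
        while (true) {
            int nl = s.lastIndexOf('\n', end - 1);
            if (nl < start || !isBlank(s, nl + 1, end)) {
                break;
            }
            end = nl;
        }
        return s.substring(start, end);
    }

    /** True if s[from, to) contains only spaces and tabs (or is empty). */
    private static boolean isBlank(String s, int from, int to) {
        for (int i = from; i < to; i++) {
            char c = s.charAt(i);
            if (c != ' ' && c != '\t') {
                return false;
            }
        }
        return true;
    }
}
```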

#### But I want a marker character to make it obvious

Some people would like a margin marker character, so they can manage 
margins like this:

     foo(`This is a long string
         >the characters up to, and
         >including, the bracket are stripped
         >by the compiler
         >    and this line is indented`)

(Others would argue the marker character should be "|".)  Again, we 
believe these sorts of transforms are the purview of libraries, not 
language, and will be provided.
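
For illustration, a margin-marker transform of this kind fits in a few lines of library code. The helper below is a sketch with a hypothetical name (`stripMargin`) and takes the marker as a parameter, so both the ">" and the "|" camps can be served. Lines that do not start with optional whitespace followed by the marker are left alone.

```java
final class Margins {
    /**
     * For every line that starts with optional spaces or tabs followed by
     * the marker, removes everything up to and including the marker.
     */
    static String stripMargin(String s, char marker) {
        String[] lines = s.split("\n", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                out.append('\n');
            }
            String line = lines[i];
            int j = 0;
            while (j < line.length()
                    && (line.charAt(j) == ' ' || line.charAt(j) == '\t')) {
                j++;
            }
            boolean hasMarker = j < line.length() && line.charAt(j) == marker;
            out.append(hasMarker ? line.substring(j + 1) : line);
        }
        return out.toString();
    }
}
```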

#### But people will make ASCII art

     ``````````````````
     `Yes, they might.`
     ``````````````````

#### But I want to use unicode escaping

There will be library support for explicitly processing Unicode escape 
sequences, or backslash escape sequences, or both.
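
The shape of that library support is not fixed here, but as a rough sketch, explicit Unicode escape processing could look like the helper below. The name `unescape` is hypothetical. The sketch handles only Unicode escapes (a backslash, one or more `u` characters, then four hex digits), copies a pair of backslashes through unchanged so that an escaped backslash never starts a Unicode escape, and leaves every other backslash sequence alone.

```java
final class UnicodeEscapes {
    /** Replaces Unicode escapes in s with the characters they denote. */
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                char next = s.charAt(i + 1);
                if (next == '\\') {
                    // A backslash pair is copied as is and cannot begin an escape.
                    out.append("\\\\");
                    i += 2;
                    continue;
                }
                if (next == 'u') {
                    int j = i + 1;
                    while (j < s.length() && s.charAt(j) == 'u') {
                        j++; // any number of 'u' characters is allowed
                    }
                    int value = decode(s, j);
                    if (value >= 0) {
                        out.append((char) value);
                        i = j + 4;
                        continue;
                    }
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    /** Decodes four hex digits starting at from, or returns -1 if malformed. */
    private static int decode(String s, int from) {
        if (from + 4 > s.length()) {
            return -1;
        }
        int value = 0;
        for (int k = from; k < from + 4; k++) {
            int d = hex(s.charAt(k));
            if (d < 0) {
                return -1;
            }
            value = value * 16 + d;
        }
        return value;
    }

    private static int hex(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }
}
```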

#### But calling library methods like `longString`.trim() is ugly

You say ugly; I say simple and transparent.

#### But doing these things in libraries has to be slower and yield more bloated bytecode

No, it doesn't.

## Anomalies and puzzlers

While the proposed scheme is lexically very simple, it does have at 
least one surprising consequence, as well as at least one restriction:
  - The empty string cannot be represented by a raw string literal 
(because two consecutive ticks will be interpreted as a double-tick 
delimiter, not a starting and ending delimiter);
  - Strings containing line delimiters other than \n cannot be 
represented directly by a raw string literal.

The latter anomaly is true for any scheme that is free of embedding 
anomalies (escaping) and that normalizes newlines.  If we chose to not 
normalize newlines, we'd arguably have a worse anomaly, which is that 
the carriage control of a raw string depends on the platform you 
compiled it on.

The empty-string anomaly is scary at first, but, in my opinion, is much 
less of a concern than the initial surprise makes it appear. Once you 
learn it, you won't forget it -- and IDEs and compilers will provide 
feedback that help you learn it.  It is also easily avoided: use 
traditional string literals unless you have a specific need for 
raw-ness.  There already is a perfectly valid way to denote the empty 
string.

#### Can't these be fixed?

These anomalies can be moved around by tweaking the rules, but the 
result is going to be more complicated rules and the same number (or 
more) of anomalies, just in different places -- and sometimes in worse 
places.  While there is room to subjectively differ on which anomalies 
are worse than others, we believe that the simplicity of this scheme, 
and its freedom from embedding anomalies, makes it the winner.

Because we start with such a simple rule (any number of consecutive 
ticks), pretty much any tweak is going to be complexity-increasing.  It 
seems a poor tradeoff to make the feature more complex and less 
convenient for everyone, just to cater to empty strings.
