Raw string literals and Unicode escapes
John Rose
john.r.rose at oracle.com
Mon Feb 26 20:17:13 UTC 2018
On Feb 26, 2018, at 10:43 AM, Alex Buckley <alex.buckley at oracle.com> wrote:
>
> On 2/25/2018 4:19 AM, Remi Forax wrote:
>> I'm late in the game but why not using the same system as Perl, PHP,
>> Ruby to solve the Lts [1], i.e
>> you have a sequence that says this is the starts of a raw string (%Q,
>> qq, m) then a character (in a predefined list), the raw string and at
>> the end of the raw string the same character as at the beginning (or its
>> mirror).
>>
>> By example, this 'raw' as prefix for a raw string
>> raw`this is a raw string`
>> raw'this is another raw string'
>> raw[yet another raw string]
>
> See "Choice of Delimiters" in the "Alternatives" section of the JEP.
The JEP doesn't clearly call out the goal of *no* escapes in the bulk
of the raw string, but that requirement (which we have adopted)
affects the choice of quotes in a decisive manner. Let me try to
lay out the "string physics" that underlie this decision.
*Any* single-character end-quote will have a significant probability
of showing up inside the bulk of a (randomly selected) raw string.
How significant? Well, let's say conservatively that raw strings
can have all possible characters, but the end-quote sequence
only shows up one out of a hundred times, per character position,
in raw strings. If you are using a series of ten-character raw
strings (to say nothing of bigger ones), you have about a 10%
chance for any given raw string to contain an inconvenient
end-quote.
That percentage is significant, especially given that in some
cases strings will be longer and quote characters will be more
common, both factors increasing the failure rate beyond 10%.
But even a 0.1% failure rate is noticeable to users, making a
feature feel unreliable.
This generalizes to any fixed multi-character end-quote, with a
reduction of probability exponential in the length of the end-quote,
but still with a non-zero probability, of occurring in the bulk of
a randomly selected string. A two-character end-quote might
have a probability of 10^-4 per character position, and that means
you have a more modest but still significant chance of failure of
about 10% across a suite of 100 random 10-character strings, or for
one random 1000-character string.
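The two estimates above can be checked with a few lines of Java. This is only a sketch: the class and method names (EndQuoteOdds, collisionChance) are made up for illustration, and it assumes each character position is an independent trial with the stated per-position probability, which real strings will not obey exactly.

```java
// Hypothetical helper, not part of any JDK API: estimates how often a
// fixed end-quote turns up somewhere inside a randomly selected body.
final class EndQuoteOdds {

    // perPosition: assumed probability that the end-quote starts at a
    //              given character position (0.01 and 1e-4 in the text).
    // positions:   number of character positions examined.
    // Returns the chance of at least one occurrence, 1 - (1 - p)^n,
    // treating positions as independent trials.
    static double collisionChance(double perPosition, int positions) {
        return 1.0 - Math.pow(1.0 - perPosition, positions);
    }

    public static void main(String[] args) {
        // One-character end-quote, one 10-character string.
        System.out.println(collisionChance(0.01, 10));
        // Two-character end-quote, 1000 positions in total.
        System.out.println(collisionChance(1e-4, 1000));
    }
}
```

Both calls come out a little under 0.1, which is where the round "10%" figures come from.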
Any *finite choice* of end-quotes has the same problem, with
a non-zero probability that decreases (but does not vanish)
with the number of available end-quotes. The only way to
break out of the box is to allow the user an unlimited range
of successively "stronger" end-quotes (i.e., less likely ones).
(Randomly selected raw strings are easy to model, although
the round numbers used above only approximate the exact binomial
figures. In fact, though, strings which show up non-randomly
in real code are *more* likely to mention end-quotes, since their
contents are somehow correlated to the enclosing language.)
You can easily demonstrate this issue by nesting Java code
which uses raw quotes inside of a containing raw quote. An
easy first test of a proposed quoting mechanism is, "will it
nest?" If not, then the quoting mechanism does not meet
a key requirement for raw quotes.
This key requirement is unconstrained pasting *without* fixups
(escape sequences embedded in the bulk of the quote).
Anything else, with some epsilon probability of requiring escapes,
is not truly raw, just "mostly raw".
In the case you propose, Remi, the probability of having an
un-quotable bulk string is quite high, since all of the end-quotes
are single characters.
Only a convention with an end-quote of arbitrary length is strong
enough to "fence in" arbitrary raw strings. The simplest possible
such convention is to allow replication of a single character to
serve as the end-quote. This decision toward simplicity
influences other features in Java raw strings, including the
decision to use a new character and to disallow certain
edge cases, notably null strings.
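As a rough illustration of the replicated-character convention, here is a sketch in Java of how a tool might pick a fence for a given body. The names (StretchyQuotes, longestRun, fence) are hypothetical and this is not the JEP's specification. It assumes backtick as the quote character, uses String.repeat (Java 11 or later), and simply rejects empty bodies and bodies that begin or end with a backtick, since those would blur into the fence.

```java
// Hypothetical sketch, not the JEP's algorithm: choose a backtick fence
// that is guaranteed not to occur inside the body.
final class StretchyQuotes {

    // Length of the longest run of consecutive `quote` characters in body.
    static int longestRun(String body, char quote) {
        int best = 0;
        int current = 0;
        for (int i = 0; i < body.length(); i++) {
            current = (body.charAt(i) == quote) ? current + 1 : 0;
            best = Math.max(best, current);
        }
        return best;
    }

    // Wraps body in a fence one backtick longer than any run inside it,
    // so the body needs no escapes or other fixups.
    static String fence(String body) {
        // Assumed edge-case policy for this sketch: refuse bodies that
        // cannot be told apart from their own fence.
        if (body.isEmpty() || body.startsWith("`") || body.endsWith("`")) {
            throw new IllegalArgumentException("body cannot be fenced unambiguously");
        }
        String delimiter = "`".repeat(longestRun(body, '`') + 1);
        return delimiter + body + delimiter;
    }

    public static void main(String[] args) {
        // The "will it nest?" test: a body that itself contains a raw string.
        System.out.println(fence("String s = `x`;"));
    }
}
```

Because the fence can always grow by one more backtick, no body that passes the edge-case check can defeat it, which is the property a fixed or finite set of end-quotes lacks.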
— John
P.S. I expect IDE vendors will quickly supply useful "stretchy quotes"
which will resize themselves to contain whatever users throw into
the raw string body. At that point backticks will feel like magic tokens
that never accidentally match raw string bodies.