String reboot (plain text)
John Rose
john.r.rose at oracle.com
Sat Mar 16 00:54:30 UTC 2019
OK, I responded to one corner by pointing out a principle that tends to
align rawness more strongly with multi-line-ness. I guess I should lay
all my cards on the table FTR, and will do so by responding to Brian's
restacking Email and Jim's reboot Email. (I guess today's String-day.)
TL;DR: I agree substantially with Jim's analysis and Brian's staging,
especially the earlier and simpler parts.
Our order #1 should keep classic escapes, instead of eliminating them (raw)
or strengthening them (strong escapes, like strong delimiters). Later orders
should have a place for such things (raw and/or strong escapes/quotes).
(Side note: The term "escape" always makes me think of a two character
sequence, the first of which is probably reverse solidus, like "\x".
I'd like to use a neutral term like "interruptor" coupled with "quote" to refer
to the more general feature of "a visible notation which interrupts a string
rather than terminates it like a quote does". And now I realize that Jim's
term "delimiter" does the same thing for "quote". So I'll try to tilt toward
"delimiter" and "interruptor" instead of "quote" and "escape".)
Classic escapes and single quotes are both too tiny to see well inside multi-line
strings, but they are also familiar and people will get used to "squinting" for them,
at least the escapes. Our take is that we'd all rather "squint" (in the first order)
than add complexity to the first feature.
I'm fine with a two- or three-order stacking, as long as there is a credible
story for the final course of the meal, if we are still hungry, which includes
strong delimiters and (some sort of) strong escapes that are (a) not easy to
collide with and (b) not hard to "squint" for. IMO strong delimiters will often
be associated somehow with strong interruptors. In fact (see digression
below in context) I think rawness is maybe not exactly the right concept;
the concept of "escape strength" may be more fruitful for us.
> On Mar 13, 2019, at 10:52 AM, Brian Goetz <brian.goetz at oracle.com> wrote:
>
> Lots of good discussion so far. Let me gather the threads.
>
> - The primary use case is embedding multi-line chunks of foreign code or data in Java, with minimal need to cruft it up with escaping. This says to me that _multi-line strings_ are actually the high-order bit here, and raw strings are the next bit. Let’s address these in order.
+1
> - Multi-line-ness and raw-ness are orthogonal concepts. Some languages merge them, and we might consider doing that too, but we shouldn’t start there.
+0.6
(As I implied previously, a number less than one is more representative of
orthogonality, sine-of-the-angle-between, of the two features.)
But also, I'm fine with not starting with raw-ness, as long as it's on the
menu somewhere.
> - For multi-line strings, a stronger delimiter (e.g., """) seems to be preferred on readability grounds, because people don't want to have to squint to see where the embedded code ends and the Java code resumes.
Yes. The same point applies to escapes ("string interruptors", not "string delimiters"),
but since escapes are clearly less common than string boundaries, I'm content to
just note the point, and accept a design which requires users to squint for escapes,
on the grounds that they will be both rare and usually safe to disregard on first reading.
> To which I'll add the following observations:
>
> - Most multi-line string candidates (JSON, XML, SQL, etc) do not require characters that have to be escaped, as long as we don't have conflicts with the quote character. Which suggests further that ML-ness and raw-ness are solving separate problems.
Jim notes this in passing in the "75%" section, but I'll call it out here too:
"Characters that have to be escaped" also include Java's escape. A JSON
string will have a puzzling problem if it contains a JSON escape sequence that
is processed by Java, rather than by the JSON parser. I don't see how to avoid
this easily in the first course on the menu, but I want to note the design
heuristic that design vectors for delimiters are correlated with interruptors.
(The problem with JSON escapes is like the problem with regexp escapes.
In both cases we have both Java and the foreign notation competing for
ownership of the reverse solidus. I think a proper notion of strong interruptors
will allow Java to gracefully give the foreign notation precedence, within
certain of Java's envelopes, just as strong delimiters do so with quotes.)
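(To make the collision concrete, here is a minimal sketch using today's classic string literals. The class and constant names are made up for illustration; nothing here depends on any of the proposed syntaxes.)

```java
// Sketch: who gets the backslash, Java or JSON?
class JsonEscapeDemo {
    // One backslash in the source: javac turns \n into a real line feed, so the
    // JSON parser later sees a raw control character inside a JSON string
    // (which JSON does not allow unescaped).
    static final String COOKED_BY_JAVA = "{\"msg\": \"line1\nline2\"}";

    // Two backslashes in the source: Java yields one backslash followed by 'n',
    // leaving the two-character escape for the JSON parser to interpret.
    static final String LEFT_FOR_JSON = "{\"msg\": \"line1\\nline2\"}";

    public static void main(String[] args) {
        System.out.println(COOKED_BY_JAVA.indexOf('\n') >= 0); // line feed present
        System.out.println(LEFT_FOR_JSON.indexOf('\n') >= 0);  // no line feed
        System.out.println(LEFT_FOR_JSON.contains("\\n"));     // backslash + 'n' present
    }
}
```

The same doubling is what makes regexp literals noisy today: every backslash meant for the foreign notation has to be written twice so that one survives Java.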
If you have to escape foreign delimiters, chances are you'll have to escape foreign
interruptors. Another use of the heuristic: If you found yourself tripling the quotes
to avoid collisions, there's probably a related use case for strengthening
(tripling???) the escapes, to avoid the same (but rarer) sort of collisions.
(I'm thinking of Python and JavaScript also, for script fragments, but we choose
to place scripting lower on the menu, along with quoted-Java-in-Java nesting.)
> - Once we separate multi-line from raw, the idea of automatically reflowing indentation starts to become a sensible option on non-raw, multi-line strings.
+100 Yes, this is the nugget of gold that we mine out of the decision to defer rawness.
> - Repeating delimiters are slightly more powerful than fixed delimiters, but also have additional cognitive load, and can still lead to anomalies that are easily encountered.
That said, they pay for themselves as visual cues for multi-line thingies, and we
immediately put them back into the shopping cart, with length set at three.
This helps us properly size the "cognitive load" argument. Once you learn about
jumbo delimiters, you learn to spot them, and you are paid for the effort because
you only learned once, but you can spot them quicker every time you look.
The same point readily applies to replacing "a count of three" with "a count of
three or more", although with sharply diminished returns, since three is almost
always enough. (What about quote counting? Well, programmers shouldn't be
writing puzzlers in their code. So use extra, enough to make it obvious, and don't
trick your reader with one-off counts unless you are writing a puzzler book.
Or find another solution instead of quote counting to make the quotes look
(a) like the quotes they are, and (b) different enough from competing would-be
quotes.)
None of these ideas apply to the first course, IMO. I'm realizing how apt it is
for Jim to call it an appetizer; it is very thin but tasty, as an appetizer should be.
And Brian will say, "wait until you see how filling it is!" We certainly want to avoid
unhealthy gorging…
> With that said, let's reorder the dishes a bit.
>
> For our first course, we could have multi-line strings, delimited by the fixed delimiter """. These would be escaped strings, just like existing string literals, but because the single-quote is no longer the delimiter, the most common source of escaping (embedded quotes) is removed. Most multi-line strings will require no escaping at all.
+1 (for most definitions of "most")
> Note that if we stopped here _and never ordered anything else_, we would still be in a much better place than we are now (most snippets could just be cut and pasted without mangling), and what we've introduced is dead-simple! So the cost-benefit ratio here is high; it’s a simple addition that addresses a significant fraction of the pain points. I think we should at least order this.
+100
> Now, maybe we're still a little hungry, and the above doesn't help with those strings that are most polluted by escapes, such as regular expressions. So, we might additionally order the ability to layer a way to say "no escape mangling" atop both our " strings and our """ strings. Jim proposes we use a delimiter of \".."\ for such strings (\""" ... """\ for the multi-line version). This has a nice connotation; it is as if the backslash is “distributed over” the whole string.
+1; it wins the beauty contest.
It needs simplicity as well as beauty. By simplicity I mean
it resists unintentional creation of puzzlers, and we think intentional
puzzlers have a limited effect. The jury is out IMO; puzzle on.
Also, the second course (tweaking escapes) needs IMO to be plausibly
followable (if not followed in fact) by a third course, which allows fullest
control of syntax (nonces, repeats, whatever). I think Jim's syntax passes
that test, since there are ways to increase the number of escapes, or
lengthen the token in other ways to achieve strong delimiters. It seems
to me there may be a good course #3 design which pins the quotes
at three and allows larger and larger numbers of escapes.
(Hmm, idea of the moment: We could allow any *whole* delimiter
sequence to be *tripled* in order to strengthen it. Not just little old
double-quote " gets the tripling treatment. But now I'm puzzling way
outside the box.)
> This does, unfortunately, bring us back into Delimiter Hell; what if we want our string to contain the quote + backslash combination? One way is to dive back into repeating delimiters (e.g., using multiple backslashes in the delimiter). Having a non-homogeneous repeating delimiter leaves us in a slightly better place than the original proposal, as we’ve eliminated the “empty string” anomaly as well as the “starting with backtick” anomaly. So this seems a workable direction, though the cost-benefit here is less than with the first course — in both directions (higher cost, lower benefit.)
>
>
> So, in the spirit of “keep ordering until sated, but stop there”, here are some reasonable choices.
>
> 1. Do multi-line (escaped) strings with a “”” fixed delimiter. Large benefit, small cost. Most embedded snippets don’t need any escaping. Low cost, big payoff.
>
> 1a. Do 1, but automatically reflow multi-line strings using the equivalent of String::align. There have been reasonable proposals on how to do this; where they fell apart is the interaction with raw-ness, but if we separate ML and raw, these become reasonable again. Higher cost, but higher payoff; having separated the interaction with raw strings, this is more defensible.
I like this; it will make ML-string code more readable, and coders can use
indentation to guide the eye. This almost (not quite) removes the need for
tripling the quote. (Not quite because it would mandate indentation, and
because of JSON quotes. Heuristic comment: Remember JSON escapes also.)
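(For readers who have not followed the reflow proposals, here is a rough sketch of the kind of thing "the equivalent of String::align" might do. It is an assumption-laden stand-in, not the proposed method: the helper name `align`, the treatment of tabs as one column, and the handling of blank lines are all choices made only for this illustration.)

```java
class AlignSketch {
    // Hypothetical stand-in for a reflow step: remove the indentation that is
    // common to all non-blank lines, so the snippet can be indented in the
    // Java source without that indentation leaking into the string.
    static String align(String s) {
        String[] lines = s.split("\n", -1);
        int min = Integer.MAX_VALUE;
        for (String line : lines) {
            if (line.trim().isEmpty()) continue;   // blank lines do not vote
            int i = 0;
            while (i < line.length()
                    && (line.charAt(i) == ' ' || line.charAt(i) == '\t')) i++;
            min = Math.min(min, i);
        }
        if (min == Integer.MAX_VALUE) min = 0;     // all lines were blank
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < lines.length; k++) {
            if (k > 0) sb.append('\n');
            String line = lines[k];
            if (line.trim().isEmpty()) continue;   // blank lines become empty
            sb.append(line.substring(min));
        }
        return sb.toString();
    }
}
```

With this sketch, `align("    <a>\n      <b/>\n    </a>")` keeps the relative indentation of `<b/>` while dropping the four columns that belong to the surrounding Java code.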
1a'. As part of 1a., add one or two new escape sequences to control
string body layout, in straightforward ways, as part of the reflow story.
Discussion on request; one way is to allow a "white space gobbler" escape
which eats the backslash and all whitespace plus a final newline if any.
I'm mentioning that now here because it has several uses.
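(A sketch of one possible reading of the "white space gobbler", simulated as a library method over the source text of a string body. The assumptions are mine for illustration: the gobbler is a backslash followed by spaces or tabs and at most one newline, it stops after that newline, and every other escape is copied through untouched.)

```java
class GobblerSketch {
    // Hypothetical "white space gobbler": a backslash that is immediately
    // followed by a space, tab, or newline is removed together with the run
    // of spaces/tabs after it and at most one newline.
    static String gobble(String src) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c != '\\' || i + 1 == src.length()) {
                out.append(c);                      // ordinary character
                i++;
                continue;
            }
            char next = src.charAt(i + 1);
            if (next == ' ' || next == '\t' || next == '\n') {
                int j = i + 1;
                while (j < src.length()
                        && (src.charAt(j) == ' ' || src.charAt(j) == '\t')) j++;
                if (j < src.length() && src.charAt(j) == '\n') j++;
                i = j;                              // skip the gobbled span
            } else {
                out.append(c).append(next);         // some other escape: keep both chars
                i += 2;
            }
        }
        return out.toString();
    }
}
```

One use is joining a long logical line that has been broken for readability in the source: the body text `SELECT *\` followed by a newline and `FROM t` comes out as `SELECT *FROM t`, so the author decides where any separating space goes.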
> 2. Do (1) or (1a), and add: single-line raw string literals delimited by \”…”\.
This course (#2) raises the issue of controlling delimiters and interruptors separately
instead of together. I think it's fine to control them separately, in different courses.
If quotes and escapes (delimiters and interruptors) were equally common in today's
workloads I think we'd choose to control them together, but they are not, so it's
more important to tweak the delimiters than tweak the interruptors.
This proposal can be understood in either of two ways: The contents of the string
are absolutely raw except for the occurrences of end-delimiters, or they are "more
strongly raw", in that some stronger interruptor is sufficient to bring in today's
rules for escapes, just as some stronger delimiter is sufficient to delimit the
end of the string.
I think Jim anticipated the idea of stronger interruptors when he said:
> Even with escaping off, we still might have to escape delimiters.
> Repeated backslashes (or repeated delimiters) is the typical out.
The idea of stronger escapes conflicts with absolute "escaping
off", which Jim also calls for, so I think order #2 needs a little
more simmering. Which is fine; let's eat order #1 first.
My overall take is, if a strong-enough (repeated?) escape can escape a
strong delimiter, let's also allow such a strong-enough escape to do
other chores as well; that leads me to a proper concept of "strong
interruptor". This means that if you have a raw string that has a very
rare need for an escape sequence, then you just strengthen the escape,
rather than cook the whole string or concatenate it. Use the right rawness
for the job, certainly, and maybe there's a way to do this on the whole-string
level. In any case I think we can improve here on the previous proposals for
"regional rawness". More details later; that's enough for now.
<digression>
Rawness is proportional to escape strength.
No single string syntax is truly 100.000% raw, because the raw string
cannot include a copy of its delimiter. Adjust that viewpoint to embrace
interruptors as well and you get: A very raw string is one which is difficult,
but not impossible, to end with a delimiter token, or to interrupt with an
interruptor token. What does "difficult" mean? Simple, it means using
more characters, until the subject string gives up and says, "don't have
one of those, go fish".
So the quest for ever stronger delimiters has a flip side: It is also a quest for
ever rawer string notations. There is no such thing as an absolutely raw string,
just one that is "raw enough". In those terms, I'd like to reserve, for an
optional final course, a scheme for making strings as raw as you please,
so that a quoted-and-escaped-five-times-raw string can be quoted inside
of quoted-and-escaped-six-times-raw string. A corner case for purists?
Yes. A real need for real users? We'll see; let's keep something brewing in
the kitchen, just in case.
</digression>
> 2a. Do (1) or (1a), and also support multi-line raw string literals (where we _don’t_ automatically apply String::align; this can be done manually). Note that this creates anomalies for multi-line raw string literals starting with quotes (this can be handled with concatenation, and having separated ML and raw, this is less of a problem than before).
+1
If we allow stronger interruptors in rawer strings, we can easily disrupt would-be
delimiters by escaping them, so we wouldn't need concatenation. The stronger
escapes could be part of 2 (controversially complex) or 3 (slightly inconsistent
with absolute rawness of simple 2 syntax).
> 3. Do (2) and (2a), and also support a repeating compound delimiter with multiple backslashes and a quote.
>
> Note that we can start with 1 or 1a now, and move on to 2/2a later, and same for 3.
Order #3 is where we would have a full and decisive conversation about not
only strong delimiters but also strong interruptors. I bring it up with order #2
above because #2 is where interruptor control first appears as a possibility.
> As we evaluate these options, note that:
>
> - Having separated ML-ness from raw-ness, doing automatic reflow becomes more defensible for the common (ML, non-raw) case.
This is a very important point. It wasn't apparent when we started, and that's
why we go slowly on these things.
> - The intersection of ML and raw seems pretty small, so doing 1a + 2, while asymmetric, is defensible.
Our experience will bear out how truly small this intersection is; you and I perhaps
differ on that call. But after doing 1a (1a' please!) we will certainly know more.
> - What we don’t order now, we can add later.
Yes, if we are careful not to get ourselves thrown out of the restaurant
by making poor choices during the early courses. That's why I'm being
all picky and theoretical here.
Now for some brief responses to Jim's points, if they are not already
noted above:
On Feb 10, 2019, at 7:43 AM, Jim Laskey <james.laskey at oracle.com> wrote:
>
>> ...50% solution
>>
>> Where we keep running into trouble is that a choice for one part of the lexicon spreads into the other parts. That is, use of certain characters in the delimiter affect which characters require escaping and which characters can be used for escaping.
(Good insight; leads to independent control for delimiter.)
>> ...
>>
>> 75% solution, almost
>>
>> …
>> • Even with escaping off, we still might have to escape delimiters. Repeated backslashes (or repeated delimiters) is the typical out.
(Yes, this got me going, maybe more than you intended, see above.)
>>
>> String html = \"<html>
>> <body style="width: 100vw">
>> <p>Hello World.</p>
>> </body>
>> <script>console.log("\nloaded")</script>
>> </html>"\;
(I'm starting to call these Jim-quotes. They are growing on me.)
>> … Captain we need more sequences.
>
>> And, this is the crux of all the debate around strings. Fixed delimiters imply a requirement for escape sequences, otherwise there is content you cannot express as a string.
(My work is almost done here! Now if we apply that reasoning to
interruptors also, we get the idea of adjustable rawness, without
losing the benefits of escape sequences.)
>> ...
>> Fixed delimiter
>>
>> If we go with a fixed delimiter then we limit the content that can be expressed without escape sequences. This is not totally left field. There are floating point values we can not express in Java and types we can express but not denote, such as anonymous class types, intersection types or capture types.
(Sure, but strings are much more "free" mathematically than those other things.
One character shouldn't have to care (char?) what its neighbors are doing.)
>> ...
>> Once you take away conflicts with the delimiter, most strings do not require escaping.
…Always excepting strings which have the audacity to mention
the New, Improved Delimiter. If Java picks one that nobody else
would ever dream of, we'll still have one remaining case of
embedding Java inside of Java. For me failure to nest is a smell
indicating possible rats, for others it's a trade-off.
>> …
>> Summary: All strings can be expressed with fixed plus escaping, but can not express strings containing the fixed delimiter (""") with escaping off.
True. Three points related to that:
A. If we have to escape the fixed delimiter, then we place an escape
before it, and all is well. If we are happy that users can easily spot
our delimiter without "squinting", then they can probably spot the
escaped copy of the same delimiter.
B. But, once we allow delimiters to run through the string, there is
another cost: Little sequences like \\ and \n and \0 can be anywhere
in the bulk of the ML string, and users *must squint* for those.
This is a cost, and we wish we could make those more visible also,
or just make the rest of the string raw.
C. The observations of A and B can be balanced if we use strong
interruptors instead of the "little squinty sequences", and maybe
also for the escaped delimiter. There are various ways to do this,
all of which suppress short escape sequences in favor of longer ones.
>> Jumping ahead: I think that stating that traditional " strings must be single-line will be a popular restriction, even if it is not needed. Then they will think of """ as meaning multi-line.
+1
>>
>> Structured delimiter
(AKA periodic or partially periodic string.)
>> …
>> Summary: Can express all strings with and without escaping. If the delimiter length is limited then there is still a (smaller) set of strings that can not be expressed.
Yep. And put "structured interruptor" in the kitchen also.
>> Nonce delimiter
>>
>> ...
>> Summary: Can express all strings with and without escaping, but nonce can affect readability.
I agree. There's too much "noise" in a nonce, and it's easy to misuse.
Alternative (stated elsewhere): Indexed delimiter. Here, the role of the nonce is
played by a small number which is not the length of the delimiter but rather an
actual numeral placed in the delimiter. Such things can be made deterministic,
so that, if you are going to quote a string S which has apparent delimiters in it,
there is a unique smallest non-conflicting index which may be used for the
indexed delimiter of the quoted string. (And the indexed interruptor, if you
want one.)
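(A sketch of the determinism claim. The concrete shape of the delimiter is an assumption made only for this example: an opener of backslash, numeral, three quotes, and a mirrored closer of three quotes, numeral, backslash. The method names are likewise hypothetical.)

```java
class IndexedDelimiterSketch {
    // Assumed shapes, for illustration only:  \1""" ... """1\
    static String opener(int n) { return "\\" + n + "\"\"\""; }
    static String closer(int n) { return "\"\"\"" + n + "\\"; }

    // The unique smallest index whose closer does not occur in the body.
    // The loop terminates because a finite body contains finitely many closers.
    static int smallestIndex(String body) {
        int n = 1;
        while (body.contains(closer(n))) n++;
        return n;
    }

    // Quote a body with the smallest non-conflicting indexed delimiter.
    static String quote(String body) {
        int n = smallestIndex(body);
        return opener(n) + body + closer(n);
    }

    public static void main(String[] args) {
        String inner = quote("hi");
        System.out.println(inner);          // index 1 suffices for plain text
        System.out.println(quote(inner));   // nesting bumps the index to 2
    }
}
```

Because the index is computed rather than chosen, two people quoting the same string get the same token, which is the property a free-form nonce lacks.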
>>
>> Multi-line formatting
>>
>> I left this out of the main discussion, but I think we can all agree that formatting rules should separate the delimiters from the content.
+1 (This is an instance of user control over the form of the source program
containing the string. I don't know what is the right mix of mechanism and
policy to get it all right, but I agree format control is an important issue.)
>> Other details can be refined after choice of delimiter(s).
>> ...
>> Entrees and desserts
>>
>> If we make good choices now (stay away from the oysters) we can still move on to other courses later.
>>
>> For instance; if we got up from the table with the ", """, ", """ set of delimiters, we could still introduce structured delimiters in the future;
This is often true, but not always, so we have to keep our eyes open.
Purely periodic strings don't extend, as structured delimiters, as well
as non-periodic or (some) partially-periodic ones. Consider:
var s = \"""""…
Does that begin today's three-quote-delimited string, which has two more
quotes in it, or tomorrow's five-quote-delimited string? (This takes me back
to the crazy idea of going with 1, 3, 9, 27 quotes. "I'll have a triple.")
If I allow up to N quotes in my delimiter today, then coders will write strings which
begin with more quotes in the string body. Either I have to somehow outlaw that,
or else I am forbidden from using longer strings of N+1 quotes for future delimiters.
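(The compatibility trap can be shown with a toy maximal-munch reading of the opening quote run. This is a deliberately simplified sketch: it ignores the leading backslash and the closing delimiter, and the method names are invented for the example.)

```java
class QuoteRunSketch {
    // Toy lexer step: consume leading '"' characters as the opening
    // delimiter, but never more than maxQuotes of them.
    static int openerLength(String src, int maxQuotes) {
        int n = 0;
        while (n < src.length() && n < maxQuotes && src.charAt(n) == '"') n++;
        return n;
    }

    // Whatever follows the opener is the start of the string body.
    static String bodyStart(String src, int maxQuotes) {
        return src.substring(openerLength(src, maxQuotes));
    }

    public static void main(String[] args) {
        String src = "\"\"\"\"\"abc";                 // five quotes, then abc
        System.out.println(bodyStart(src, 3));        // body begins with two quotes
        System.out.println(bodyStart(src, 5));        // body begins with abc
    }
}
```

The same source text yields two different bodies depending on the limit, so raising the limit later would silently change the meaning of code written under the old one.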
Adding more escapes on the front is another matter, and I think that would work
fine, especially if the "extra" escapes on the front somehow strengthened the
string's interruptor and delimiter in a consistent manner.
So we could enumerate ", \", """, \""", \\""", \\\""", \\\\""" etc.
Or ", \", """, \""", \1""", \2""", \3""" etc.
No need for more than three quotes (or more than one, for that matter,
but there are other reasons to like three).
>> either with repeated (see Swift) or repeated ". We could also follow a suggestion John made to use a pseudo nonce like " for \\" or """"".
Yep, see above.
>> Point being, we can work with a 85% solution now that we can supplement later when we're not so hangry.
+100
HTH
— John