[ string literals ] Extending the escape language (was: String literals: some principles)

Tue May 7 23:36:21 UTC 2019

On May 7, 2019, at 3:14 PM, Brian Goetz <brian.goetz at oracle.com> wrote:
> 
> 
>> TL;DR: Good framework; must also account for the
>> rectangle extraction rule (RER).  A unified escape
>> sublanguage (ESL) is highly desirable, and I propose
>> adding <\ > and <\ LT WS*> as escapes for space
>> and for null string.  The existing \ char is OK, and
>> should be "fattened" as a separate feature.  I note
>> some issues with <\ u X X X X>.
> 
> Agree in general with the desire to extend ESL with some whitespace sequences, though I take some issue with the syntax for \<nl> and \<space>.  Some alternate ideas regarding \uxxxx.
> 
> First, unicode escapes.  Alex pointed out offline that we had worked our way into a linear thinking trap (again).  In the first round, because we were focused on raw strings, we turned off \uxxxx processing in the body of a raw string, which raised the question of “how do we turn it back on.”  And also that, while we use the same escape character for both, they occupy very different places in the language; the ESL is purely about string literals, whereas \uxxxx is purely a lexing concern.  

I don't think that's the trap we are in.  The trap is
the Language Experts Designing User Model trap,
where LEs say "we don't need to deal with \u because
it's not the part of the JLS we are working on",
and the user says, "they are all just escapes, right?"
The reason it's a trap is we think the user will be happy
to learn and apply the geeky-fine distinctions between
the two superficially similar syntaxes.

One good way out of this particular trap is to
carefully restrict the allowed \uxxxx patterns
in strings, so that the phase order becomes
irrelevant, and then move those patterns
forward in the phase order along with the
other escapes.
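
To make the phase-order point concrete, here is a small
sketch of how \uxxxx behaves in Java today.  The class
and method names are made up for illustration; the
behavior shown is just the existing rule that Unicode
escapes are translated before string literals are
tokenized.

```java
class UnicodePhaseOrder {
    // The escape for U+0022 is translated to a double quote before the
    // lexer looks for string literals, so this line is really compiled as:
    //     return "a".length() + "b";
    static String earlyTranslation() {
        return "a\u0022.length() + \u0022b";
    }

    // An ordinary escape is interpreted inside the literal, so both
    // quotes stay part of the string's content.
    static String ordinaryEscape() {
        return "a\".length() + \"b";
    }

    // A backslash preceded by an odd number of backslashes cannot start
    // a Unicode escape, so this literal holds six characters:
    // backslash, u, 0, 0, 2, 2.
    static String escapedBackslash() {
        return "\\u0022";
    }

    public static void main(String[] args) {
        System.out.println(earlyTranslation());           // prints 1b
        System.out.println(ordinaryEscape().length());    // prints 16
        System.out.println(escapedBackslash().length());  // prints 6
    }
}
```

The first two methods look nearly identical to a reader
who thinks "they are all just escapes", yet they produce
different strings.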

We can also do as you are recommending, and
ignore the problem.  The only difficulty there
is occasionally having to ask the user to ignore
the problem also, by saying things like "yes,
that's an escape sequence but \u sequences
break the rule you are trying to apply".
Such as using "\u0020" to escape a space.
How frequent is "occasionally"?  I don't know;
if it's very infrequent then, yes, we can ignore
this problem.  It will give puzzler authors some
extra scope for their hobby.
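
A sketch of the kind of surprise I mean, under a loud
assumption: it presumes a multi-line literal that strips
trailing white space before interpreting escapes, which
is how text blocks behave on JDK 15 or later.  The class
and method names are hypothetical.

```java
class UnicodeSpaceDemo {
    // Assumption: compiled with JDK 15 or later, where trailing white
    // space on each text block line is stripped before ordinary escapes
    // are interpreted.

    // The Unicode escape for U+0020 is already a plain space by the time
    // the text block is tokenized, so it is stripped like any other
    // trailing space.
    static String unicodeSpace() {
        return """
            a\u0020
            b""";
    }

    // The octal escape is still an escape when stripping happens, so the
    // space it denotes survives.
    static String octalSpace() {
        return """
            a\040
            b""";
    }

    public static void main(String[] args) {
        System.out.println(unicodeSpace().length());  // prints 3
        System.out.println(octalSpace().length());    // prints 4
    }
}
```

Both lines read as "a, then an escaped space", but only
the second one keeps the space.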

> His recommendation, which (now that it's been explained to me) I strongly agree with, is: let’s not have this feature touch unicode processing at all.  Let’s just leave unicode processing as is, using \uxxxx, whether in code, SLSLs, MLSLs, and any future “raw” SLs.  The similarity between \n and \uxxxx is purely coincidental.

(That's why it's a LEDUM trap.)

> And if we really want the characters "\u0000” in a string literal, well, we know how to escape the \.  
> 
> Which brings us to \<eol> and \<space>.  My main complaint here is that I am really uncomfortable using \<space> for “literal space”, because at the end of the line, one cannot differentiate between \<eol> and \<space> when reading the code.  Alternatives include \_, or \s, or \., or … many others.  

Personally, I'm fine with those.  By analogy with \n I
suppose \s will be unsurprising; I don't care about
this corner of the bikeshed, though.  I certainly agree
that having more than one "\ whitespace" sequence
creates visual ambiguities; that's a good catch.
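
For concreteness, a sketch of the two escapes with the
\s spelling, assuming they behave as text blocks accept
them on JDK 15 or later: \s stands for a single space
and \<eol> stands for the empty string.  The class and
method names are hypothetical.

```java
class SpaceEscapeDemo {
    // Assumption: compiled with JDK 15 or later.

    // \s denotes one space and is visible at the end of the line,
    // unlike a backslash followed by a literal space.
    static String explicitSpace() {
        return """
            a\s
            b""";
    }

    // A backslash immediately before the line terminator contributes
    // nothing, so the two source lines are joined.
    static String joinedLines() {
        return """
            a\
            b""";
    }

    // \s is also accepted in an ordinary string literal.
    static String inOrdinaryLiteral() {
        return "a\sb";
    }

    public static void main(String[] args) {
        System.out.println(explicitSpace().length());  // prints 4
        System.out.println(joinedLines());             // prints ab
        System.out.println(inOrdinaryLiteral());       // prints a b
    }
}
```

With only one "\ whitespace" sequence left, a backslash
at the end of a line can be read in just one way.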

— John