String literals: some principles

Fri May 3 22:25:49 UTC 2019

TL;DR: Good framework; must also account for the
rectangle extraction rule (RER).  A unified escape
sublanguage (ESL) is highly desirable, and I propose
adding <\ > and <\ LT WS*> as escapes for space
and for null string.  The existing \ char is OK, and
should be "fattened" as a separate feature.  I note
some issues with <\ u X X X X>.

On Apr 28, 2019, at 1:32 PM, Brian Goetz <brian.goetz at oracle.com> wrote:
> - Opening delimiter
> - Closing delimiter
> - Escape characters, if any
> - Escape sublanguages, if any

Yes, this is a useful way to break down the syntax.

You left out padding conventions as a degree of freedom.
Padding conventions give the programmer detailed control
over the format of the program by associating non-payload
characters with the string literal.  Whitespace rectangle
extraction is the only padding convention we are discussing,
plus occasional suggestions that we remove horizontal
space in one-line fat strings.

If we denote today's escape sublanguage as ESL and the
rectangle extraction rule as RER, then today's literals are:
  ThinString=SL[open=close=", escape=\, esl=ESL, pc=none]

Tomorrow's fat strings will be something like:
  FatString=SL[open=close=""", escape=\, esl=ESL, pc=RER]

Another aspect of defining a string literal is the *phasing*
of the different features.  I think we have good consensus
that padding should be stripped *before* escape interpretation,
so that escaped characters are not mistaken for padding
characters.

> I bring this up not because I want to talk about raw-ness now (getting the hint?), but because I want to keep all the variations of string literals as lightly-varying projections of the same basic feature.

Understanding the variations is important.  It also gives
me hope that we could parlay this framework, later on,
into something strong.

<digression>
In the future (not now) we might add a parameterized
range of these schemes:
  StrongString<N>=SL[open=close=F(N), escape=G(N), esl=ESL, pc=RER]
for some functions F, G that enumerate quote and escape
tokens.  This would be a strong quoting scheme that could
(with care) allow any given payload string S to be embedded
without the need for escapes, by choosing an N for which
F(N) and G(N) do not occur in S.
</digression>
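
To make the "choose an N" step concrete, here is a minimal
sketch.  The particular F and G are my assumptions for the sake
of illustration (F(N) is N double quotes, G(N) is a backslash
followed by N-1 dashes); nothing above commits to them, and the
class and method names are hypothetical.

```java
final class StrongStringSketch {
    static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < n; k++) sb.append(c);
        return sb.toString();
    }

    // Hypothetical F(N): a quote token made of N double quotes.
    static String quoteToken(int n) { return repeat('"', n); }

    // Hypothetical G(N): a backslash followed by N-1 dashes.
    static String escapeToken(int n) { return "\\" + repeat('-', n - 1); }

    // Smallest N >= 1 for which neither F(N) nor G(N) occurs in the payload.
    static int chooseN(String payload) {
        int n = 1;
        while (payload.contains(quoteToken(n)) || payload.contains(escapeToken(n))) {
            n++;
        }
        return n;
    }
}
```

The "with care" part is not modeled here: a payload that begins
or ends with a quote character would still collide with the
adjacent delimiter, even when F(N) does not occur inside it.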

Getting back to today, I want to talk about escapes.

First, I'll remind us all that the RER is part of fat strings and
that therefore the newline and space characters are no longer
just passive string body characters, but rather play a role in
the string syntax.

This means that the ESL needs to be upgraded so that
occurrences of spaces and newlines which otherwise
would play a role in syntax can be escaped.  I think this
at a minimum means that the ESL needs to add support
for the two character escape sequence <\ space>.
There is already an escape sequence for a line terminator;
it is <\ n>.  A similar point holds for <\ t>.  These
three escapes (one new, two old) are enough to allow
a programmer to tell the RER to stay away from a
particular bit of white-space.

(Note that if the RER were to happen *after* escape
processing, we'd be in a pickle:  There'd be no way
to use the existing ESL to control the RER, and we'd
have to put some sort of extra control feature into the
RER itself, or settle for an uncontrollable RER.)
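
Here is a minimal sketch of why the order matters.  Both helpers
are hypothetical stand-ins: stripRectangle is a much-simplified
RER (it only drops the common run of leading spaces), and
unescape handles a tiny slice of the ESL plus the proposed
<\ space>.

```java
final class PhasingSketch {
    // Simplified stand-in for the RER: drop the run of leading spaces
    // that all lines have in common. The real rule has more cases.
    static String stripRectangle(String body) {
        String[] lines = body.split("\n", -1);
        int margin = Integer.MAX_VALUE;
        for (String line : lines) {
            margin = Math.min(margin, leadingSpaces(line));
        }
        StringBuilder out = new StringBuilder();
        for (int k = 0; k < lines.length; k++) {
            if (k > 0) out.append('\n');
            out.append(lines[k].substring(margin));
        }
        return out.toString();
    }

    static int leadingSpaces(String line) {
        int n = 0;
        while (n < line.length() && line.charAt(n) == ' ') n++;
        return n;
    }

    // Tiny slice of the ESL: \n and \t are translated; any other
    // escaped character (backslash, quote, the proposed escaped
    // space) stands for itself.
    static String unescape(String body) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == '\\' && i + 1 < body.length()) {
                char code = body.charAt(++i);
                out.append(code == 'n' ? '\n' : code == 't' ? '\t' : code);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Take a two-line body whose first line is two spaces, two
escaped spaces, and "a", and whose second line is four spaces
and "b".  Stripping first removes two columns and then the
escapes expand, so both lines keep two leading spaces.
Unescaping first turns the escaped spaces into ordinary ones,
and the stripper then removes all four columns from both lines.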

> It has come up, for example, that we might treat \<newline> differently in ML strings as in classic strings,

My own suggestions in this vein have nothing to do with
making a new ESL but with extending the old one so it
works well with fat strings.

> but I would prefer it we could not tinker with the escape language in nonuniform ways — as this minimizes the variations between the various sub-features.

I agree that we should have only one ESL; there's no
reason to have different "dialects" of it in different
types of strings.

So <\ space> should be added to the ESL, not because
it's particularly useful for thin strings, but because it
escapes otherwise strippable padding in fat strings.

Here's an interesting feature of the JLS:  It defines a
uniform ESL for both string and character literals.
This means that <\ '> can occur in both kinds of
literals, even though it is only needed for character
literals.  Same point in reverse for <\ ">.  Since
the ESL is uniform, if *one* kind of literal needs
a particular escape sequence, then *all* the literals
have it.  (See where I'm going?)  Now, the upcoming
features of fat strings include a padding convention,
ergo the common ESL needs a way to escape the
now-syntactic padding characters.

About <\ LT> (an escaped LineTerminator), a similar
point holds:  Sure it's useful only in string literals with
line terminators, but if there is a legitimate reason to
add extra control over LTs, then <\ LT> gets bundled
into the common escape sublanguage of the JLS.

There are two interesting questions about positioning
<\ LT> as an escape sequence:

1. What does <\ LT> mean, if it is legal and not just an
alias for <\ n>?

2. Is <\ LT> allowed in a thin string, given that (currently)
the thin string syntax rejects LT?

For 1. I'm already on record as proposing that <\ LT WS*>
is an escape sequence for the null string.  (WS is horizontal
whitespace.)

For 2., if we say "no" then we seem to come close to forking
the ESL, which Brian and I want to avoid.  A thin string body
is a sequence of regular non-LT chars plus escape sequences,
except <\ LT>.  A fat string body can include <\ LT> as well
as other escape sequences.

But that is not really a fork of the ESL.  The difference between
fat and thin strings is a structural constraint on their bodies,
before escape processing:  A fat string can contain LT in its
pre-escape-processed body, and so in fact can contain <\ LT>.
A thin string cannot contain LT at all, so the presence of <\ LT>
in the ESL is moot for a thin string.  (Also moot for a char
literal.)

The parsing of a string literal (either kind) consists of
gathering an escaped string body while looking for
the close-quote.  The close-quote interrupts the body
and terminates the string.  For the case of a thin string,
an LT also interrupts the body, but causes parsing to fail.

So we could answer "no" to 2 and keep a unified ESL,
simply by asserting that thin string tokens never contain LT,
while fat string tokens contain LT (always?  different question).

We could also answer "yes" to 2, and I think it's worth
a discussion.  What I'm suggesting here is that the
thin strings are allowed to contain *escaped* LTs
in a new version of the JLS (that also contains fat
strings).  The pre-escape-processed body of either
kind of string can contain escaped LTs, and fat
strings can *also* contain *unescaped* LTs.
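
The two answers to question 2 can be sketched as a single
scanner with a switch.  This is only a sketch under my own
assumptions: the class, the method name, and the allowEscapedLT
flag are hypothetical, and it ignores \uXXXX entirely.

```java
final class ThinStringScanSketch {
    static boolean isLT(char c) { return c == '\n' || c == '\r'; }

    // Returns the pre-escape-processed body of the thin string whose
    // opening quote sits at src.charAt(open).
    static String scanBody(String src, int open, boolean allowEscapedLT) {
        int i = open + 1;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '"') {
                return src.substring(open + 1, i);   // close-quote ends the body
            }
            if (isLT(c)) {
                throw new IllegalArgumentException("line terminator in thin string");
            }
            if (c == '\\' && i + 1 < src.length()) {
                char next = src.charAt(i + 1);
                if (isLT(next) && !allowEscapedLT) {
                    throw new IllegalArgumentException("escaped line terminator not allowed");
                }
                // Treat CR LF after a backslash as one escaped terminator.
                boolean crlf = next == '\r' && i + 2 < src.length()
                        && src.charAt(i + 2) == '\n';
                i += crlf ? 3 : 2;                   // skip the escaped pair
            } else {
                i++;
            }
        }
        throw new IllegalArgumentException("unterminated string");
    }
}
```

With the flag off this models the "no" answer (any LT, escaped
or not, fails the parse); with it on, only an unescaped LT
fails.  Either way the ESL itself is untouched, since the
scanner only collects the raw body.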

Example:

    var ts = "hel\
                lo\
                ";
    assert ts == "hello";
    var fs = """
                hel\
                lo\
                """;
    assert fs == "hello";

In the latter case, the RER strips most or all of the
whitespace.  In any case <\ LT WS*> sops up the
rest.
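
A minimal sketch of the "sops up" step, assuming <\ LT WS*>
means the null string.  The helper name is hypothetical, and it
deliberately leaves every other escape sequence uninterpreted so
that an escaped backslash followed by a real LT is not
misread.

```java
final class LineJoinSketch {
    // Deletes each backslash + line terminator + trailing horizontal
    // whitespace; copies every other escaped pair through unchanged.
    static String dropEscapedLineTerminators(String body) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i);
            if (c != '\\' || i + 1 == body.length()) {
                out.append(c);
                i++;
                continue;
            }
            char next = body.charAt(i + 1);
            if (next == '\n' || next == '\r') {
                i += 2;
                // CR LF counts as a single line terminator.
                if (next == '\r' && i < body.length() && body.charAt(i) == '\n') i++;
                // WS*: swallow spaces and tabs that follow the terminator.
                while (i < body.length()
                        && (body.charAt(i) == ' ' || body.charAt(i) == '\t')) i++;
            } else {
                out.append(c).append(next);   // some other escape; keep as-is
                i += 2;
            }
        }
        return out.toString();
    }
}
```

Fed the raw body of the thin-string example above, this yields
"hello" without any help from the RER.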

The reason we are discussing <\ LT> is that there
are plenty of reasons why programmers would wish
to control the format of their programs by breaking
up long logical lines into shorter physical lines.
Such use cases are not specific to payloads with
or without newlines.  If your payload has newlines,
use a fat string *and* break up long logical lines
into shorter physical ones.  If your payload has
no newlines (maybe it's a very long hex number),
then use a thin string, and break it up.

The RER of fat strings (which I like!) prompts the
discussion of breaking up logical lines into physical
ones, more than thin strings.  After all, with thin
strings, if you break one line into two, it's a
given that you are going to write two literals,
and then the + sign (for concatenation) adds
no additional overhead.  The break-up sequence
is something like <" LT WS + ">.

But if you have a large MLS with a few very long
logical lines, suddenly you have an invidious
choice between keeping your nice rectangle,
or disrupting it totally by adding <" LT WS + ">.
Breaking a long line in this case drops you off
a syntax cliff.  Supporting <\ LT WS*> lets you
down easy, by breaking the logical lines without
disrupting the enclosing padding of the rectangle
extraction rule.

> Soliciting discussion on the pros and cons of keeping \ as our escape character.

Well, \ makes a very fine escape character, except for
particular payloads when it doesn't.  Any payload which
is a program in some little language that uses \ for
escaping is going to get confusing very fast.  Nobody
wants to count a train of escapes, and layers of escaping
cause escape trains to lengthen fast (doubling with each layer).
Regular expressions are the poster child, and I'll just
pretend that they are the key use case, since they
are the worst-behaved.
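
The doubling is easy to see in a small sketch.  The names are
hypothetical, and "adding a layer" here means only doubling
each backslash, which is the common part of RE escaping and
thin-string escaping.

```java
final class EscapeTrainSketch {
    // One layer of escaping: every backslash becomes two.
    static String addLayer(String s) {
        return s.replace("\\", "\\\\");
    }

    static int countBackslashes(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '\\') n++;
        }
        return n;
    }
}
```

Starting from a payload of one backslash, the RE that matches
it has two, and the thin string literal that spells that RE has
four.  A third layer (say, that literal embedded in generated
source) would need eight.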

Fattening \ to \\\ helps a little with REs.  But it would
make long trains even longer, with the result that you
would need even more help keeping count.   The eye
can only count a small number of repeated characters
at a glance.

    var re = "\\\\\\[";  // train wreck for /\\\[/
    assert ('\\'+"[").matches(re);

A non-repeating escape is much easier on the eye.
Choosing at random, I'll suggest <\ -> as a fattened
escape sequence, with the standard ESL from the JLS
(as amended with <\ space> etc).  As long as that
particular pair of characters is rare in REs (and other
similar venues), there won't be any long trains
of backslashes.

    var re = *"\\\[";
    assert ('\\'+"[").matches(re);
    var s6 = *"\-\- \-" \";
    assert s6 == '\\'+"- \" "+'\\';

The star shows that I'm talking about some non-standard
string syntax:

  FatEscString=SL[open=*", close=", escape=\-, esl=ESL, pc=none]

I think it would be reasonable to fatten escapes as a separate
feature, but not in tandem with the current multi-line string
proposal.

<digression>
Straw man, separate from the MLS proposal.

If a string literal (either fat or thin) is immediately preceded
by <\ ->, the body of the string uses that sequence for its
escapes instead of \.  The ESL is unchanged.

If stronger escapes are also desired, the feature can be
extended simply by allowing any number of - characters,
e.g. \--"x\-y\z" and \--"\--n" (for "x\\-y\\z" and "\n").
</digression>
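
To check the straw man against the examples above, here is a
minimal sketch of an unescaper that takes the escape token as a
parameter.  The class and method names are hypothetical, and
the decode table is a small slice of the ESL (plus the proposed
<\ space>), not the whole thing.

```java
final class FatEscapeSketch {
    // Interprets the body using esc as the escape token. The character
    // right after each occurrence of esc is decoded by the (unchanged) ESL.
    static String unescape(String body, String esc) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < body.length()) {
            boolean escaped = body.startsWith(esc, i)
                    && i + esc.length() < body.length();
            if (escaped) {
                out.append(decode(body.charAt(i + esc.length())));
                i += esc.length() + 1;
            } else {
                out.append(body.charAt(i));   // ordinary payload character
                i++;
            }
        }
        return out.toString();
    }

    // Slice of the ESL: n and t are translated; backslash, quotes and
    // the proposed escaped space stand for themselves. As a sketch
    // simplification, so does any other character.
    static char decode(char code) {
        switch (code) {
            case 'n': return '\n';
            case 't': return '\t';
            default:  return code;
        }
    }
}
```

With the token <\ ->, the RE body \\\[ passes through
untouched, and the s6 body decodes to backslash, dash, space,
quote, space, backslash.  With the token <\ - ->, the body
x\-y\z is untouched and \--n decodes to a newline.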

We are leaving \uXXXX escapes out of the accounting.  This
is understandable, because they are not a regular part of
the ESL, and hard to treat as part of it.  But we should try.
In particular, we can and should find a way to treat most
or all of the \uXXXX escapes *in a string body* as being
expanded as part of the ESL, rather than a pre-pass.
This will make \uXXXX escapes more complicated, but
it may profitably simplify their effect on the user model.

One idea is simple:  In the body of a string, any \uXXXX
which doesn't denote a controlling part of the string syntax
(quote or backslash) is collected into the string body as
an unexpanded character sequence <\ u X X X X>.
This sequence is then supported by the ESL.

The effect is that padding removal (rectangle extraction)
happens before \u replacement *in a string body*.

A second idea could be adopted either with the first
or separately:  As a structural constraint on string bodies,
unicode sequences which would expand to whitespace,
quote, or backslash are forbidden.

And here's a draconian one:  Forbid <\ u X X X X>
where the code point is 007F or lower.  That would
blow up some stupid test cases and puzzlers; user
code that does this should be fixed.  If we can't do
this everywhere, do it inside string bodies.
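
A minimal sketch of the draconian check, applied to a string
body that is assumed to still contain its unicode escapes in
unexpanded form (today the pre-pass would already have replaced
them, so this models the proposal, not current behavior).  The
names are hypothetical.

```java
final class UnicodeEscapeCheckSketch {
    // Value of the four hex digits starting at index at, or -1 if
    // there are not four hex digits there.
    static int hex4(String s, int at) {
        if (at + 4 > s.length()) return -1;
        int v = 0;
        for (int k = 0; k < 4; k++) {
            int d = Character.digit(s.charAt(at + k), 16);
            if (d < 0) return -1;
            v = v * 16 + d;
        }
        return v;
    }

    // True if the body holds a unicode escape whose value is 007F or lower.
    static boolean hasLowUnicodeEscape(String body) {
        int i = 0;
        while (i < body.length()) {
            if (body.charAt(i) != '\\') { i++; continue; }
            int j = i + 1;
            if (j < body.length() && body.charAt(j) == 'u') {
                while (j < body.length() && body.charAt(j) == 'u') j++;  // extra u's are legal
                int cp = hex4(body, j);
                if (cp >= 0 && cp <= 0x7F) return true;
                i = (cp >= 0) ? j + 4 : j;
            } else {
                i += 2;   // an ordinary escaped pair, e.g. a doubled backslash
            }
        }
        return false;
    }
}
```

Skipping escaped pairs two characters at a time is what keeps a
doubled backslash followed by "u0041" from being flagged, which
matches the existing rule that only a backslash preceded by an
even number of backslashes can begin a unicode escape.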

We may be limited by backward compatibility on the
application of these ideas to thin strings, but they should
be considered at least for fat strings.

There are two benefits to taming \uXXXX:

1. Fewer puzzlers involving hidden syntax (\ " etc.)

2. The processing of \uXXXX for string bodies can
be documented and aligned with an "unescape" method
on String, which is useful in its own right.