String literals: some principles

Sat May 4 00:43:45 UTC 2019

On May 3, 2019, at 3:25 PM, John Rose <john.r.rose at oracle.com> wrote:
> 
> And here's a draconian one:  Forbid <\ u X X X X>
> where the code point is 007F or lower.  That would
> blow up some stupid test cases and puzzlers; user
> code that does this should be fixed.  If we can't do
> this everywhere, do it inside string bodies.
> 
> We may be limited by backward compatibility on the
> application of these ideas to thin strings, but they should
> be considered at least for fat strings.

Here's an example of how \uXXXX escapes could be
brought into alignment with the escape sublanguage:

https://docs.oracle.com/javase/specs/jls/se12/html/jls-3.html#jls-3.10.6
> 3.10.6. Escape Sequences for Character and String Literals
…
> It is a compile-time error if the character following a backslash in an escape sequence is not an ASCII b, t, n, f, r, ", ', \,
+  [?space, LineTerminator,?]
> 0, 1, 2, 3, 4, 5, 6, or 7. The Unicode escape \u is processed earlier (§3.3).

+In a [?fat?] string literal, no part of the open or closing quote, or of
+any escape sequence, or of any stripped whitespace, may contain
+a character that was derived (in the earlier processing) from
+a Unicode escape
+[?, unless the first character of the literal, a ", was also derived
+from a Unicode escape?]
+.

> Octal escapes are provided for compatibility with C, but can express only Unicode values \u0000 through \u00FF, so Unicode escapes are usually preferred.
+(In a string literal we forbid Unicode escapes for characters which
+steer the lexical syntax of the literal.  This makes it easier to
+read.  [?The exception allows Java programs to be encoded with
+dense use of Unicode escapes, as long as the open-quotes are
+so encoded.?])

If we omit the [?fat?] qualifier in the above, we get an
incompatible change to thin strings.  But I think it would
actually be the right move.

Here's a puzzler I just thought of:

var puz = "\1\u0032";
// puz = '\1'+"2" or '\12'+""?

This is the one-character string "\n":  \u0032 is expanded
to 2 before the literal is lexed, so the body reads as the
octal escape \12.  If \u escapes were a proper part of the
escape sub-language, then puz would be a two-character
string.
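
The puzzler can be checked directly.  This is only a small
sketch; the class and method names are made up for the
illustration.  It compares the literal as written with the
two-character reading spelled out by hand:

```java
final class Puzzler {

    // The literal from the puzzler. The Unicode escape is expanded to the
    // digit 2 before the literal is lexed, so the body is one octal escape.
    static String puz() {
        return "\1\u0032";
    }

    // The other reading, spelled out explicitly: character 1 followed by "2".
    static String twoChars() {
        return "\1" + "2";
    }

    public static void main(String[] args) {
        System.out.println(puz().length());        // 1
        System.out.println(puz().equals("\n"));    // true
        System.out.println(twoChars().length());   // 2
    }
}
```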

Here's a place where prior-expansion of \u escapes
interferes with the structure of fat strings:

var fat = """
       \u0020 hello
       """;
// fat = "hello\n" or "  hello\n"?
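
To see both answers side by side, here is a hedged sketch.
It assumes the text-block syntax as it later shipped (Java 15
or newer), which may differ from the design being discussed
here, and it moves the closing delimiter further right than
in the example above so that the content line alone decides
how much indentation is stripped.  The octal escape \040
stands in for "a space written in the escape sub-language".
The class and method names are hypothetical.

```java
final class FatStringPhasing {

    // The Unicode escape is already a plain space when the block is analysed,
    // so it counts as indentation and is stripped along with the rest.
    static String unicodeEscape() {
        return """
            \u0020 hello
                """;
    }

    // The octal escape for the same space is interpreted after stripping,
    // so the space survives as content.
    static String octalEscape() {
        return """
            \040 hello
                """;
    }

    public static void main(String[] args) {
        System.out.println("[" + unicodeEscape() + "]");
        System.out.println("[" + octalEscape() + "]");
    }
}
```

Under those assumptions the first method yields "hello\n"
and the second yields "  hello\n".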

We can stop caring about the awkward phasing of \u
escapes if and only if we make a restriction that \u
escapes can't mix with other parts of string syntax,
as above.  This goes for the new syntax as well as
the old.  It's easier to impose such a rule on new
syntax, of course.
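
As a rough illustration of what such a restriction would
check, here is a minimal sketch for one-line (thin) literals.
Everything in it is an assumption made for the example: the
class and method names are invented, the bracketed exception
for literals whose opening quote is itself escaped is left
out, malformed \u sequences are passed through rather than
reported, and stripped whitespace is not modeled.  It expands
Unicode escapes while remembering which characters they
produced, then rejects the literal if a quote or any part of
an escape sequence is one of those characters.

```java
import java.util.BitSet;

final class UnicodeEscapeRule {

    // Hypothetical helper: true if the raw source text of a one-line string
    // literal (quotes included) satisfies the sketched restriction.
    static boolean obeysRule(String raw) {
        StringBuilder out = new StringBuilder();
        BitSet derived = new BitSet(); // positions in out produced by a Unicode escape
        translate(raw, out, derived);
        int n = out.length();
        if (n < 2 || out.charAt(0) != '"' || out.charAt(n - 1) != '"') {
            throw new IllegalArgumentException("not a string literal");
        }
        if (derived.get(0) || derived.get(n - 1)) {
            return false; // a quote was spelled as a Unicode escape
        }
        int k = 1;
        while (k < n - 1) {
            if (out.charAt(k) != '\\') { k++; continue; }
            int end = k + 1; // last index of this escape sequence
            char first = out.charAt(end);
            if (isOctal(first)) {
                int maxDigits = (first <= '3') ? 3 : 2;
                int digits = 1;
                while (digits < maxDigits && end + 1 < n - 1 && isOctal(out.charAt(end + 1))) {
                    end++;
                    digits++;
                }
            }
            int hit = derived.nextSetBit(k);
            if (hit >= 0 && hit <= end) {
                return false; // part of an escape sequence came from a Unicode escape
            }
            k = end + 1;
        }
        return true;
    }

    // Simplified early translation step: a backslash preceded by an even number
    // of backslashes, then one or more 'u', then four hex digits.
    static void translate(String raw, StringBuilder out, BitSet derived) {
        int i = 0;
        int backslashRun = 0; // contiguous raw backslashes just before position i
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c == '\\' && backslashRun % 2 == 0) {
                int j = i + 1;
                while (j < raw.length() && raw.charAt(j) == 'u') j++;
                if (j > i + 1 && j + 4 <= raw.length() && isHex4(raw, j)) {
                    derived.set(out.length());
                    out.append((char) Integer.parseInt(raw.substring(j, j + 4), 16));
                    i = j + 4;
                    backslashRun = 0;
                    continue;
                }
            }
            backslashRun = (c == '\\') ? backslashRun + 1 : 0;
            out.append(c);
            i++;
        }
    }

    static boolean isOctal(char c) {
        return c >= '0' && c <= '7';
    }

    static boolean isHex4(String s, int from) {
        for (int p = from; p < from + 4; p++) {
            char c = s.charAt(p);
            boolean hex = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
            if (!hex) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(obeysRule("\"\\1\\u0032\"")); // false: the puzzler literal
        System.out.println(obeysRule("\"a\\u0062c\""));  // true: escape used for an ordinary letter
    }
}
```

With this check the puzzler literal is rejected, while a
Unicode escape that only spells an ordinary character in the
string body is still accepted.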

This sort of thing makes me want to put the restriction
on all string (and character) literals.  It seems to me that
only deliberately obfuscated code would fall afoul of it.

If that's really true, this feature is completely separable
from fat strings or any other menu items, as long as we
are willing to apply it after the fact, incompatibly with
obfuscated code.

— John