multiline: Problems with backslash-u

Wed Feb 28 06:01:44 UTC 2018

Some feedback on multiline string literals. Where 'proposal' is referenced,
it refers to: https://bugs.openjdk.java.net/browse/JDK-8196004

Each bit of feedback is rather specific and complicated; to keep threading
intact and useful, I'll post the feedback in 5 separate items.

# Problems with backslash-u #

The proposal plans to treat \u as being 'raw' inside backticks. This seems
near impossible to achieve; how would the parser even figure it out? Right
now the parser applies \u unescaping in a separate step at the very
beginning, before any tokenization. It knows to leave, say, \\u1234 alone
because there is an even number of backslashes in front of the u (put
differently: the backslash directly before the u is itself preceded by an
odd number of backslashes). Crucially, it does this regardless of where
the sequence appears.

Let's test:

//This will not compile because the backslash-u escape is an enter \u000A making this part illegal java
class Test {}

versus:

//But now it WILL compile \\u000A because the double backslash stops the backslash-u decoding
class Test {}

In other words: the mechanism that lets the backslash-u system 'skip' \\u
in string literals doesn't actually understand the language itself; it only
counts backslashes.
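
To make the backslash counting concrete, here is a minimal sketch of such a
pre-lexing pass. It is not javac's code: the class name UnicodeEscapePass and
the method translate are made up for illustration, and it only approximates
the rule as I understand it from the language spec. Note that it never looks
at quotes, comments or backticks.

```java
public class UnicodeEscapePass {

    // Hypothetical helper: approximates the translation step that runs
    // before the lexer sees the file. It knows nothing about strings,
    // comments or backticks; it only counts backslashes.
    static String translate(String src) {
        StringBuilder out = new StringBuilder();
        int run = 0; // contiguous raw backslashes directly before index i
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            boolean eligible = c == '\\' && run % 2 == 0
                    && i + 1 < src.length() && src.charAt(i + 1) == 'u';
            if (eligible) {
                int j = i + 1;
                // More than one 'u' is allowed after the backslash.
                while (j < src.length() && src.charAt(j) == 'u') {
                    j++;
                }
                int code = 0;
                for (int k = j; k < j + 4; k++) {
                    int d = k < src.length() ? Character.digit(src.charAt(k), 16) : -1;
                    if (d < 0) {
                        throw new IllegalArgumentException("malformed escape at index " + i);
                    }
                    code = code * 16 + d;
                }
                out.append((char) code);
                i = j + 4;
                run = 0; // the last raw character consumed was a hex digit
                continue;
            }
            run = (c == '\\') ? run + 1 : 0;
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // One backslash: the escape is decoded into a real line break.
        System.out.println(translate("//comment \\u000A class Test {}"));
        // Two backslashes: the text is left exactly as it was.
        System.out.println(translate("//comment \\\\u000A class Test {}"));
    }
}
```

The first call prints two lines because the escape became a line terminator;
the second prints the input unchanged. Nothing in the loop could tell whether
it is currently inside a raw string literal.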

I see no feasible way to apply the idea that just `\u000a` would actually
make a string with 6 characters in it. The part that applies backslash-u
escaping must run before the file is even parsed, therefore it cannot know
that you are inside a raw string literal. It can't count backticks either,
because:

String x = "`\u002f`"; // This should be a 3 character string, not an 8 character string.
// `\u000a` this test file should in fact fail to compile!

Tracking comments or strings is hard because backslash-u escapes can
themselves be part of the delimiters:

/\u002F This is actually a comment
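
Here is a small self-contained demonstration of the points above as Java
behaves today. The class and method names (EscapeDemo, joined, value,
backticks) are made up for this example; the behaviour follows from the
escapes being decoded before the lexer runs.

```java
public class EscapeDemo {

    // The two escapes below are double-quote characters, so after decoding
    // this line holds two string literals joined by a plus sign.
    static String joined() {
        return "a\u0022 + \u0022b";
    }

    // The escape after the first slash is itself a slash, so the middle
    // line becomes an ordinary line comment and the assignment never runs.
    static int value() {
        int v = 1;
        /\u002F v = 2; this whole line is a comment
        return v;
    }

    // Backticks mean nothing special inside a normal string literal; the
    // escape between them is decoded to a single slash character.
    static int backticks() {
        String x = "`\u002f`";
        return x.length();
    }

    public static void main(String[] args) {
        System.out.println(joined());    // prints ab
        System.out.println(value());     // prints 1
        System.out.println(backticks()); // prints 3
    }
}
```

A pass that wants to know whether it is inside a string, a comment or a raw
literal would have to decode escapes first just to find the delimiters, which
is exactly the step the proposal wants to switch off.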

In theory you could build a system that can do it, but it would be
extremely complicated. One side effect is that other Java-parsing code,
such as ecj or IntelliJ's parser, will go one of two ways. Either it takes
an extremely long time to adapt to JDK10, which fragments the ecosystem
(and the IMO dubious choices being made about the speed at which Java
releases come out and break de-facto Java usage are already not helping
with the fragmentation issue!), or, probably more likely, its maintainers
opt out of rewriting a significant chunk of their parser and simply say
that they break with the spec on this hard-to-implement corner case.

Solution: I don't actually know. The easy way out is simply to NOT disable
backslash-u escapes in these literals, but an 'almost totally raw' string
literal syntax isn't quite as useful as an 'actually raw' one. Another
trivial way out is to dump the notion that backslash-u escapes work
anywhere. Totally anecdotal, but every code shop whose source file encoding
I've ever bothered to look at encodes its source in UTF-8 and calls it a
day. If backslash-u escapes show up at all, they show up inside string
literals, so a backwards-incompatible change that makes backslash-u escapes
behave just like octal escapes (they work in string literals, not anywhere
else) wouldn't actually break anything in practice. That seems far less
impactful to me than backwards-incompatibly turning the lone underscore
character from a valid identifier into effectively a keyword, and that has
gone over without too much whining from the community. Unfortunately these
things aren't similar enough to handwave away the concern, but
nevertheless, I strongly advise reconsidering whether backslash-u escapes
should be valid anywhere, if the alternative is to add quite a convoluted
system to try to disable backslash-u escaping in raw literals.

 --Reinier Zwitserloot