Raw string literals and Unicode escapes

Tue Feb 13 22:19:03 UTC 2018

10a. String s = `abc`;
10b. String s = \u0060abc`;

As it stands, both are legal. The decision has mostly been taken out of our hands because the lookahead for the previous token has already “consumed” the character, so there is little hope of finding out which form the backtick was derived from. That is not strictly true in javac, since we can sift back through the input buffer, but other tools may differ. I’m going to ignore this remark in a second.
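To make the "which form was it derived from" problem concrete, here is a minimal sketch of a character reader that translates Unicode escapes and also remembers whether the last character came from an escape. It is not javac's reader; the class EscapeAwareReader and its methods are hypothetical names, and it deliberately ignores the finer rules (repeated 'u' characters, an odd number of preceding backslashes).

```java
// Hypothetical sketch, not javac code: translates Unicode escapes but keeps
// a flag saying whether the most recent character was spelled as an escape.
final class EscapeAwareReader {
    private final String src;
    private int pos = 0;
    private boolean lastFromEscape = false;

    EscapeAwareReader(String src) {
        this.src = src;
    }

    boolean hasNext() {
        return pos < src.length();
    }

    // Returns the next translated character; callers should check hasNext() first.
    char next() {
        int value = escapeValueAt(pos);
        if (value >= 0) {
            pos += 6;               // backslash, 'u', four hex digits
            lastFromEscape = true;
            return (char) value;
        }
        lastFromEscape = false;
        return src.charAt(pos++);
    }

    // True if the character most recently returned by next() came from an escape.
    boolean lastFromEscape() {
        return lastFromEscape;
    }

    // Value of a simple six-character escape starting at p, or -1 if there is none.
    private int escapeValueAt(int p) {
        if (p + 6 > src.length() || src.charAt(p) != '\\' || src.charAt(p + 1) != 'u') {
            return -1;
        }
        int value = 0;
        for (int i = p + 2; i < p + 6; i++) {
            int digit = Character.digit(src.charAt(i), 16);
            if (digit < 0) {
                return -1;
            }
            value = value * 16 + digit;
        }
        return value;
    }
}
```

A scanner built on a reader like this could tell 10a from 10b by asking lastFromEscape() right after reading the opening backtick. A reader that only hands back translated characters cannot.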

Choice: do we turn off escape processing on the first open backtick or the last open backtick? It doesn’t really matter as long as we do it before consuming the first non-backtick character.

Choice: do we turn on escape processing on the first close backtick or the last close backtick? It doesn’t matter as long as we do it before consuming the next non-backtick character. If we have an aborted close sequence (too few or too many backticks) then we have to turn it off again.

What about embedding \u0060 in a raw string? If we treat it the same as a backtick, then the user is limited in the ways to express untranslated escapes. Note: We can always convert manually in the scanner by looking ahead for ‘\’, ‘u’, ‘0’, ‘0’, ‘6’, ‘0’.
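The manual lookahead mentioned in the note could look something like the sketch below. The class and method names are hypothetical. Beyond the exact six characters, it also accepts repeated 'u' characters and requires an even number of preceding backslashes; those two rules come from the JLS definition of Unicode escapes (section 3.3), not from this thread.

```java
// Hypothetical sketch of the manual lookahead: is there a Unicode escape for
// a backtick starting at position pos of the untranslated input?
final class EscapedBacktickLookahead {
    static boolean isEscapedBacktickAt(CharSequence src, int pos) {
        if (pos < 0 || pos >= src.length() || src.charAt(pos) != '\\') {
            return false;
        }
        // A backslash can only start an escape if it is preceded by an even
        // number of contiguous backslashes (JLS 3.3).
        int precedingBackslashes = 0;
        for (int j = pos - 1; j >= 0 && src.charAt(j) == '\\'; j--) {
            precedingBackslashes++;
        }
        if (precedingBackslashes % 2 != 0) {
            return false;
        }
        // One or more 'u' characters must follow the backslash.
        int i = pos + 1;
        if (i >= src.length() || src.charAt(i) != 'u') {
            return false;
        }
        while (i < src.length() && src.charAt(i) == 'u') {
            i++;
        }
        // Then exactly the four hex digits for the backtick code point.
        return i + 4 <= src.length()
                && src.subSequence(i, i + 4).toString().equals("0060");
    }
}
```

Whether the scanner should then treat a match as a delimiter character is the policy question discussed next; the sketch only detects the spelling.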

That all said, I think we should not allow \u0060 to represent a backtick in a raw string literal, ever. It complicates things unnecessarily and limits what the user can embed in the raw string.

So, change the scanner to:

A) Peek back to make sure the first open backtick was exactly a backtick.
B) Turn off Unicode escapes immediately so that only backtick characters can be part of the delimiter.
C) Turn on Unicode escapes only after a valid closing delimiter is encountered.

Based on this, all your examples are illegal, since each one opens with \u0060 rather than a literal backtick.
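Rules A to C above can be sketched roughly as follows. This is only an illustration over an untranslated source string, not the javac scanner; the class RawStringSketch and the method lexRawString are hypothetical names. It assumes that a run of backticks whose length differs from the opening delimiter is simply part of the body.

```java
import java.util.Optional;

// Hypothetical sketch of rules A to C, working on untranslated source text.
final class RawStringSketch {
    // Tries to lex a raw string literal starting at src[start].
    // Returns the body, or empty if the literal is illegal or unterminated.
    static Optional<String> lexRawString(String src, int start) {
        // (A) The first opening character must be a literal backtick,
        //     not a Unicode escape that merely translates to one.
        if (start < 0 || start >= src.length() || src.charAt(start) != '`') {
            return Optional.empty();
        }
        // (B) Escapes are off from here on, so only literal backticks
        //     can contribute to the opening delimiter.
        int i = start;
        while (i < src.length() && src.charAt(i) == '`') {
            i++;
        }
        int delimiterLength = i - start;
        int bodyStart = i;

        while (i < src.length()) {
            if (src.charAt(i) != '`') {
                i++;
                continue;
            }
            int runStart = i;
            while (i < src.length() && src.charAt(i) == '`') {
                i++;
            }
            if (i - runStart == delimiterLength) {
                // (C) A valid closing delimiter: a real scanner would turn
                //     Unicode escape processing back on at this point.
                return Optional.of(src.substring(bodyStart, runStart));
            }
            // Too few or too many backticks: an aborted close sequence,
            // so the run stays in the body and escapes stay off.
        }
        return Optional.empty(); // no closing delimiter before end of input
    }
}
```

With this sketch, 10a yields the body abc, while 10b is rejected at step A. An escape spelled out inside the body is returned untranslated, which is the as-is behavior the user expects from a raw string.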

— Jim

> On Feb 13, 2018, at 1:58 PM, Alex Buckley <Alex.Buckley at oracle.com> wrote:
> 
> I suspect the trickiest part of specifying raw string literals will be the lexer's modal behavior for Unicode escapes. As such, I am going to put the behavior under the microscope. Here is what the JEP has to say:
> 
> -----
> Unicode escapes, in the form \uxxxx, are processed as part of character input prior to interpretation by the lexer. To support the raw string literal as-is requirement, Unicode escape processing is disabled when the lexer encounters an opening backtick and reenabled when encountering a closing backtick.
> -----
> 
> I would like to assume that if the lexer comes across the six tokens \ u 0 0 6 0  then it should interpret them as a Unicode escape representing a backtick _and then continue as if consuming the tokens of a raw string literal_. However, the mention of _an_ opening backtick and _a_ closing backtick gave me pause, given that repeated backticks can serve as the opening delimiter and the closing delimiter. For absolute clarity, let's write out examples to confirm intent: (Jim, please confirm or deny as you see fit!)
> 
> 1.  String s = \u0060`;
> 
> Illegal. The RHS is lexed as ``;   which is disallowed by the grammar.
> 
> 2.  String s = \u0060Hello\u0060;
> 
> Illegal. The RHS is lexed as `Hello\u0060;   and so on for the rest of the compilation unit -- the six tokens \ u 0 0 6 0 are not treated as a Unicode escape since we're lexing a raw string literal. And without a closing delimiter before the end of the compilation unit, a compile-time error occurs.
> 
> 3a.  String s = \u0060Hello`;
> 
> Legal. The RHS is lexed as `Hello`;   which is well formed.
> 
> 3b.  String s = \u0060\u0060Hello`;
> 
> Depends! If you take the JEP literally, then just the Unicode escape which serves as the first opening backtick ("_an_ opening backtick") is enough to enter raw-string mode. That makes the code legal: the RHS is lexed as `\u0060Hello`;   which is well formed. On the other hand, you might think that we shouldn't enter raw-string mode until the lexer in traditional mode has lexed the opening delimiter fully (i.e. ALL the opening backticks). Then, the code in 3b is illegal, because the opening delimiter (``) and the closing delimiter (`) are not symmetric.
> 
> I think we should take the JEP literally, so that 3b is legal. And then, some more examples:
> 
> 4a.  String s = \u0060`Hello``;
> 
> Legal. The RHS is lexed as ``Hello``;   which is well formed.
> 
> 4b.  String s = \u0060\u0060Hello``;
> 
> Illegal. The RHS is lexed as `\u0060Hello``;   which is disallowed by the grammar. A raw string literal containing 11 tokens is immediately followed by a ` token and a ; token which are not expected.
> 
> 4c.  String s = \u0060\u0060Hello`\u0060;
> 
> Depends! If you take the JEP literally, where _a_ closing backtick is enough to re-enable Unicode escape processing, then the RHS is lexed as `\u0060Hello``;  which is illegal per 4b. On the other hand, if you think that we shouldn't re-enter traditional mode until the lexer in raw-string mode has lexed the closing delimiter fully (i.e. ALL the closing backticks), then presumably you think analogously about the opening delimiter, so the RHS would be lexed as ``Hello`\u0060;   which is illegal per 2 (no closing delimiter `` before the end of the compilation unit).
> 
> 5.  String s = \u0060`Hello`\u0060;
> 
> I put this here because it looks nice. It hits the same issues as 3b and 4c.
> 
> Alex