Raw string literals and Unicode escapes

Alex Buckley alex.buckley at oracle.com
Tue Feb 13 17:58:55 UTC 2018


I suspect the trickiest part of specifying raw string literals will be 
the lexer's modal behavior for Unicode escapes. As such, I am going to 
put the behavior under the microscope. Here is what the JEP has to say:

-----
Unicode escapes, in the form \uxxxx, are processed as part of character 
input prior to interpretation by the lexer. To support the raw string 
literal as-is requirement, Unicode escape processing is disabled when 
the lexer encounters an opening backtick and reenabled when encountering 
a closing backtick.
-----
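
As a starting point, here is a minimal Java sketch of the modal behavior that paragraph describes. It is only an illustration, not how javac does it: the class and method names are made up, it assumes delimiters are single backticks, it assumes a backtick produced by an escape counts as an opening backtick, and it recognizes only escapes of the form backslash, one u, four hex digits.

```java
class ModalEscapeSketch {
    // Hypothetical helper: true if a backslash, 'u' and four hex digits start at index i.
    static boolean isEscapeAt(String src, int i) {
        if (!src.startsWith("\\u", i) || i + 6 > src.length()) {
            return false;
        }
        for (int k = i + 2; k < i + 6; k++) {
            if (Character.digit(src.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    // Translates Unicode escapes, except while inside a raw string literal.
    // Assumption: every backtick toggles raw mode (single-backtick delimiters only).
    static String translate(String src) {
        StringBuilder out = new StringBuilder();
        boolean raw = false;
        int i = 0;
        while (i < src.length()) {
            char c;
            if (!raw && isEscapeAt(src, i)) {
                // Traditional mode: replace the escape with the character it denotes.
                c = (char) Integer.parseInt(src.substring(i + 2, i + 6), 16);
                i += 6;
            } else {
                // Raw mode, or an ordinary character: copy it through as-is.
                c = src.charAt(i++);
            }
            if (c == '`') {
                raw = !raw;
            }
            out.append(c);
        }
        return out.toString();
    }
}
```

With this sketch, an escape outside a raw string literal is translated, while the same six characters between backticks are passed through untouched. What it cannot answer is what to do when the delimiter is more than one backtick long, which is the question the rest of this message is about.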

I would like to assume that if the lexer comes across the six characters \ u 
0 0 6 0 then it should interpret them as a Unicode escape representing 
a backtick _and then continue as if consuming the tokens of a raw string 
literal_. However, the mention of _an_ opening backtick and _a_ closing 
backtick gave me pause, given that repeated backticks can serve as the 
opening delimiter and the closing delimiter. For absolute clarity, let's 
write out examples to confirm intent. (Jim, please confirm or deny as 
you see fit!)

1.  String s = \u0060`;

Illegal. The RHS is lexed as ``;   which is disallowed by the grammar.

2.  String s = \u0060Hello\u0060;

Illegal. The RHS is lexed as `Hello\u0060;   and so on for the rest of 
the compilation unit -- the six characters \ u 0 0 6 0 are not treated as a 
Unicode escape since we're lexing a raw string literal. And without a 
closing delimiter before the end of the compilation unit, a compile-time 
error occurs.

3a.  String s = \u0060Hello`;

Legal. The RHS is lexed as `Hello`;   which is well formed.

3b.  String s = \u0060\u0060Hello`;

Depends! If you take the JEP literally, then just the Unicode escape 
which serves as the first opening backtick ("_an_ opening backtick") is 
enough to enter raw-string mode. That makes the code legal: the RHS is 
lexed as `\u0060Hello`;   which is well formed. On the other hand, you 
might think that we shouldn't enter raw-string mode until the lexer in 
traditional mode has lexed the opening delimiter fully (i.e. ALL the 
opening backticks). Then, the code in 3b is illegal, because the opening 
delimiter (``) and the closing delimiter (`) are not symmetric.

I think we should take the JEP literally, so that 3b is legal. And then, 
some more examples:

4a.  String s = \u0060`Hello``;

Legal. The RHS is lexed as ``Hello``;   which is well formed.

4b.  String s = \u0060\u0060Hello``;

Illegal. The RHS is lexed as `\u0060Hello``;   which is disallowed by 
the grammar. A raw string literal containing 11 characters is immediately 
followed by a ` token and a ; token which are not expected.

4c.  String s = \u0060\u0060Hello`\u0060;

Depends! If you take the JEP literally, where _a_ closing backtick is 
enough to re-enable Unicode escape processing, then the RHS is lexed as 
`\u0060Hello``;  which is illegal per 4b. On the other hand, if you 
think that we shouldn't re-enter traditional mode until the lexer in 
raw-string mode has lexed the closing delimiter fully (i.e. ALL the 
closing backticks), then presumably you think analogously about the 
opening delimiter, so the RHS would be lexed as ``Hello`\u0060;   which 
is illegal per 2 (no closing delimiter `` before the end of the 
compilation unit).

5.  String s = \u0060`Hello`\u0060;

I put this here because it looks nice. It hits the same issues as 3b and 4c.
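
To make the two readings easier to compare, here is a rough Java sketch that performs only the escape translation step under each of them. Everything in it is hypothetical (the class DelimiterReadings, the Reading enum, the translate method), and it bakes in some assumptions of mine: escapes are exactly a backslash, one u and four hex digits; under the literal reading, the backticks of the closing delimiter after the first one may be written as escapes; under the full-delimiter reading, a run of backticks whose length differs from the opening delimiter is treated as content. It does not check the grammar, so it says nothing about legality.

```java
class DelimiterReadings {
    enum Reading { LITERAL, FULL_DELIMITER }

    private final String src;
    private final StringBuilder out = new StringBuilder();
    private int pos = 0;

    private DelimiterReadings(String src) { this.src = src; }

    static String translate(String src, Reading reading) {
        DelimiterReadings t = new DelimiterReadings(src);
        t.translateAll(reading == Reading.LITERAL);
        return t.out.toString();
    }

    // True if a backslash, 'u' and four hex digits start at index i.
    private boolean isEscapeAt(int i) {
        return src.startsWith("\\u", i) && i + 6 <= src.length()
            && src.substring(i + 2, i + 6).chars().allMatch(c -> Character.digit(c, 16) >= 0);
    }

    // Reads one character at pos, translating an escape only when enabled.
    private char read(boolean escapes) {
        if (escapes && isEscapeAt(pos)) {
            char c = (char) Integer.parseInt(src.substring(pos + 2, pos + 6), 16);
            pos += 6;
            return c;
        }
        return src.charAt(pos++);
    }

    // Looks ahead one character without consuming it.
    private boolean backtickAhead(boolean escapes) {
        if (pos >= src.length()) return false;
        int saved = pos;
        char c = read(escapes);
        pos = saved;
        return c == '`';
    }

    // Consumes up to max consecutive backticks, copies them out, returns the count.
    private int run(boolean escapes, int max) {
        int n = 0;
        while (n < max && backtickAhead(escapes)) {
            read(escapes);
            out.append('`');
            n++;
        }
        return n;
    }

    private void translateAll(boolean literal) {
        while (pos < src.length()) {
            if (!backtickAhead(true)) {       // traditional mode, escapes enabled
                out.append(read(true));
                continue;
            }
            int open = run(true, 1);          // the first opening backtick
            // LITERAL: escapes are already off for the rest of the opening run.
            // FULL_DELIMITER: escapes stay on until the whole run is consumed.
            open += run(!literal, Integer.MAX_VALUE);
            while (pos < src.length()) {      // raw-string body, escapes disabled
                if (!backtickAhead(false)) {
                    out.append(read(false));
                    continue;
                }
                if (literal) {
                    run(false, 1);            // one closing backtick re-enables escapes
                    run(true, open - 1);      // the rest of the closing delimiter
                    break;
                }
                if (run(false, Integer.MAX_VALUE) == open) break; // full closing delimiter
            }
        }
    }

    public static void main(String[] args) {
        String[] examples = {
            "String s = \\u0060\\u0060Hello`;",        // 3b
            "String s = \\u0060\\u0060Hello`\\u0060;", // 4c
            "String s = \\u0060`Hello`\\u0060;"        // 5
        };
        for (String e : examples) {
            for (Reading r : Reading.values()) {
                System.out.println(r + ": " + translate(e, r));
            }
        }
    }
}
```

For 3b and 4c the two readings reproduce the lexed forms given above. For 5, under these assumptions, the literal reading comes out with two backticks on each side of Hello, while the full-delimiter reading leaves the trailing escape untranslated and so never finds a closing delimiter.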

Alex

