Raw string literals and Unicode escapes
Alex Buckley
alex.buckley at oracle.com
Wed Feb 14 20:24:36 UTC 2018
On 2/13/2018 2:19 PM, Jim Laskey wrote:
> 10a. String s = `abc`;
> 10b. String s = \u0060abc`;
>...
> So, change the scanner to
>
> A) Peek back to make sure the first open backtick was exactly a
>    backtick.
> B) Turn off Unicode escapes immediately so that only backtick
>    characters can be part of the delimiter.
> C) Turn on Unicode escapes only after a valid closing delimiter is
>    encountered.
>
> Based on this all your examples are illegal.

I am not opposed to saying that a delimiter must be constructed from
actual ` characters (that is, the RawInputCharacter ` rather than the
UnicodeEscape \u0060). It would be silly if the opening delimiter were
\u0060, because the closing delimiter could not be written the same
way -- and that hurts readability. (Inside a raw string literal, the six
characters \ u 0 0 6 0 get no special processing, so they cannot close
it.)

Unfortunately, there is nothing in the lexical grammar that prevents
\u0060Hello` or \u0060Hello\u0060 or in fact any of the examples below
from being lexed as a RawStringLiteral. The JLS will need a semantic
rule to force each RawStringDelimiter to be composed of actual `
characters. As you say, this will make all the examples below illegal.
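
To make the proposed rule concrete, here is a minimal Java sketch of a
scanner step that accepts a raw string literal only when its delimiters
are written with actual ` characters. It is an illustration, not javac's
scanner: the names RawDelimiterSketch and lexRawString are hypothetical,
and it assumes the closing delimiter is the first run of as many
consecutive backticks as the opening delimiter has.

```java
import java.util.Optional;

// Hypothetical sketch; not taken from javac or the JLS.
class RawDelimiterSketch {

    /**
     * Tries to lex a raw string literal starting at src[start], looking only
     * at the characters as written (no Unicode escape translation).
     * Returns the literal's body, or empty if no well-formed literal starts here.
     */
    static Optional<String> lexRawString(String src, int start) {
        // (A) The opening delimiter must begin with an actual backtick.
        if (start >= src.length() || src.charAt(start) != '`') {
            return Optional.empty();
        }
        // (B) Only actual backticks can extend the opening delimiter.
        int bodyStart = start;
        while (bodyStart < src.length() && src.charAt(bodyStart) == '`') {
            bodyStart++;
        }
        String delimiter = src.substring(start, bodyStart);
        // (C) Assumed closing rule: the first later occurrence of the same
        //     number of actual backticks ends the literal.
        int close = src.indexOf(delimiter, bodyStart);
        if (close < 0) {
            return Optional.empty(); // no closing delimiter found
        }
        return Optional.of(src.substring(bodyStart, close));
    }

    public static void main(String[] args) {
        // 10a: opens with an actual backtick, so it is accepted.
        System.out.println(lexRawString("`abc`;", 0));
        // 10b: opens with the six-character escape, so it is rejected.
        System.out.println(lexRawString("\\u0060abc`;", 0));
    }
}
```

Because the sketch never translates escapes, an escaped backtick can
neither open nor close a literal, which is the effect the semantic rule
is meant to have.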

There is plenty of precedent for semantic rules ("It is a compile-time
error ...") in the interpretation of Literal tokens, so that's fine. In
fact, JLS 3.10.4 already has a semantic rule that appears to constrain a
delimiter in a CharacterLiteral token:

   It is a compile-time error for the character following the
   SingleCharacter or EscapeSequence to be other than a '.

although it doesn't mean to force an actual ' character (that is, the
RawInputCharacter ' and not the UnicodeEscape \u0027). It means:

   It is a compile-time error for the character following the
   SingleCharacter or EscapeSequence to be other than a ' (or the
   Unicode escape thereof).
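
A small Java example may help show why the existing rule cannot mean an
actual ' character: Unicode escapes are translated before tokens are
formed, so today's compilers accept an escaped delimiter on ordinary
character and string literals. The class and field names below are only
for illustration.

```java
// Illustrative only: delimiters of ordinary literals written as Unicode escapes.
class EscapedDelimiterDemo {

    // The closing delimiter is the Unicode escape for an apostrophe.
    static final char C = 'a\u0027;

    // Both delimiters are the Unicode escape for a double quote.
    static final String S = \u0022hi\u0022;

    public static void main(String[] args) {
        System.out.println(C);
        System.out.println(S);
    }
}
```

Both declarations compile, because the lexer sees the translated
characters rather than the escapes as written.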
Alex
>> On Feb 13, 2018, at 1:58 PM, Alex Buckley <Alex.Buckley at oracle.com>
>> wrote:
>>
>> I suspect the trickiest part of specifying raw string literals will
>> be the lexer's modal behavior for Unicode escapes. As such, I am
>> going to put the behavior under the microscope. Here is what the
>> JEP has to say:
>>
>> -----
>> Unicode escapes, in the form \uxxxx, are processed as part of
>> character input prior to interpretation by the lexer. To support
>> the raw string literal as-is requirement, Unicode escape processing
>> is disabled when the lexer encounters an opening backtick and
>> reenabled when encountering a closing backtick.
>> -----
>>
>> I would like to assume that if the lexer comes across the six
>> characters \ u 0 0 6 0 then it should interpret them as a Unicode
>> escape representing a backtick _and then continue as if consuming
>> the characters of a raw string literal_. However, the mention of _an_
>> opening backtick and _a_ closing backtick gave me pause, given that
>> repeated backticks can serve as the opening delimiter and the
>> closing delimiter. For absolute clarity, let's write out examples
>> to confirm intent: (Jim, please confirm or deny as you see fit!)
>>
>> 1. String s = \u0060`;
>>
>> Illegal. The RHS is lexed as ``; which is disallowed by the
>> grammar.
>>
>> 2. String s = \u0060Hello\u0060;
>>
>> Illegal. The RHS is lexed as `Hello\u0060; and so on for the rest
>> of the compilation unit -- the six characters \ u 0 0 6 0 are not
>> treated as a Unicode escape since we're lexing a raw string
>> literal. And without a closing delimiter before the end of the
>> compilation unit, a compile-time error occurs.
>>
>> 3a. String s = \u0060Hello`;
>>
>> Legal. The RHS is lexed as `Hello`; which is well formed.
>>
>> 3b. String s = \u0060\u0060Hello`;
>>
>> Depends! If you take the JEP literally, then just the Unicode
>> escape which serves as the first opening backtick ("_an_ opening
>> backtick") is enough to enter raw-string mode. That makes the code
>> legal: the RHS is lexed as `\u0060Hello`; which is well formed.
>> On the other hand, you might think that we shouldn't enter
>> raw-string mode until the lexer in traditional mode has lexed the
>> opening delimiter fully (i.e. ALL the opening backticks). Then, the
>> code in 3b is illegal, because the opening delimiter (``) and the
>> closing delimiter (`) are not symmetric.
>>
>> I think we should take the JEP literally, so that 3b is legal. And
>> then, some more examples:
>>
>> 4a. String s = \u0060`Hello``;
>>
>> Legal. The RHS is lexed as ``Hello``; which is well formed.
>>
>> 4b. String s = \u0060\u0060Hello``;
>>
>> Illegal. The RHS is lexed as `\u0060Hello``; which is disallowed
>> by the grammar. A raw string literal containing 11 characters is
>> immediately followed by a ` token and a ; token which are not
>> expected.
>>
>> 4c. String s = \u0060\u0060Hello`\u0060;
>>
>> Depends! If you take the JEP literally, where _a_ closing backtick
>> is enough to re-enable Unicode escape processing, then the RHS is
>> lexed as `\u0060Hello``; which is illegal per 4b. On the other
>> hand, if you think that we shouldn't re-enter traditional mode
>> until the lexer in raw-string mode has lexed the closing delimiter
>> fully (i.e. ALL the closing backticks), then presumably you think
>> analogously about the opening delimiter, so the RHS would be lexed
>> as ``Hello`\u0060; which is illegal per 2 (no closing delimiter
>> `` before the end of the compilation unit).
>>
>> 5. String s = \u0060`Hello`\u0060;
>>
>> I put this here because it looks nice. It hits the same issues as
>> 3b and 4c.
>>
>> Alex
>
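
The two readings of the opening delimiter in example 3b can be compared
with a rough Java sketch. Everything named here (OpeningPolicySketch,
Policy, Lexed, lex) is hypothetical. The sketch models only the
opening-side question: under FIRST_BACKTICK an escape may stand in for
the first backtick only, while under WHOLE_DELIMITER it may stand in for
any backtick of the opening delimiter. As a simplifying assumption, the
closing delimiter is always matched against actual backticks, so the
closing-side ambiguity of examples 4c and 5 is not modelled.

```java
import java.util.Optional;

// Hypothetical sketch of the two opening-delimiter readings; not javac code.
class OpeningPolicySketch {

    // The six characters backslash, u, 0, 0, 6, 0 as they appear in source.
    static final String ESC = "\\u0060";

    enum Policy { FIRST_BACKTICK, WHOLE_DELIMITER }

    /** Body of the lexed literal plus the source text that follows it. */
    static final class Lexed {
        final String body;
        final String rest;
        Lexed(String body, String rest) { this.body = body; this.rest = rest; }
    }

    static Optional<Lexed> lex(String src, Policy policy) {
        int i = 0;   // current position in src
        int n = 0;   // number of backticks in the opening delimiter
        while (i < src.length()) {
            // Escapes are honoured for the whole delimiter, or only for its
            // first backtick, depending on the policy.
            boolean escapesOn = policy == Policy.WHOLE_DELIMITER || n == 0;
            if (src.charAt(i) == '`') {
                i++;
                n++;
            } else if (escapesOn && src.startsWith(ESC, i)) {
                i += ESC.length();
                n++;
            } else {
                break;
            }
        }
        if (n == 0) {
            return Optional.empty(); // src does not start with a delimiter
        }
        StringBuilder delimiter = new StringBuilder();
        for (int k = 0; k < n; k++) {
            delimiter.append('`');
        }
        // Assumption: the literal closes at the first run of n actual backticks.
        int close = src.indexOf(delimiter.toString(), i);
        if (close < 0) {
            return Optional.empty(); // unterminated literal
        }
        return Optional.of(new Lexed(src.substring(i, close), src.substring(close + n)));
    }

    public static void main(String[] args) {
        String example3b = "\\u0060\\u0060Hello`;";
        for (Policy p : Policy.values()) {
            Optional<Lexed> r = lex(example3b, p);
            System.out.println(p + ": " + (r.isPresent() ? "body=" + r.get().body : "no literal"));
        }
    }
}
```

Under these assumptions, example 3b produces a literal whose body is the
six escape characters followed by Hello when the policy is
FIRST_BACKTICK, and no literal at all when it is WHOLE_DELIMITER, which
mirrors the two outcomes described for 3b above.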