Raw string literals and Unicode escapes

Alex Buckley alex.buckley at oracle.com
Wed Feb 14 19:46:23 UTC 2018


On 2/13/2018 2:11 PM, John Rose wrote:
> On Feb 13, 2018, at 9:58 AM, Alex Buckley <alex.buckley at oracle.com> wrote:
>>
>> I suspect the trickiest part of specifying raw string literals will be
>> the lexer's modal behavior for Unicode escapes. As such, I am going to
>> put the behavior under the microscope.
>
> For an approach to this see:
> http://cr.openjdk.java.net/~jrose/jls/raw-string-pages-v4.pdf
>
> In short:  We define a so-called "preimage" for each token,
> which is the unambiguously defined sequence of UTF-16
> code points that translate to that token via \u substitution
> and line terminator normalization.
>
> For raw strings (only) the preimage of a token is significant.
> The backticks of a raw string (both opening and closing)
> are required to be their own preimage (no \u0060 allowed).
> And the raw string body contents are the preimage of the
> string token, not the normal token image.
>
> I think preimage is the trick we need here, and it settles
> a number of questions, such as those you raised.
> All of the tricky examples you raised are uniformly illegal,
> under the preimage rule for raw-string quotes.

I agree that holding on to the preimage of each InputElement (JLS 3.5) 
is necessary because ` can legitimately appear in some kinds of 
InputElement as an ordinary InputCharacter (derived from either the 
RawInputCharacter ` or the UnicodeEscape \u0060):

1.  Comment

     // This Markdown processor treats ` specially.
     /* This Markdown processor treats \u0060 specially. */

2.  Token (and more specifically, StringLiteral)

     "Hi `Bob`"
     "Hi \u0060Bob\u0060"

Only if the InputElement is a Token, and more specifically a 
RawStringLiteral, do we need to take the sequence of InputCharacters and 
LineTerminators that constitute its RawStringBody and replace that 
sequence with its preimage.
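
As a rough illustration of what "holding on to the preimage" could look like, here is a sketch under my own assumptions; it is not how javac or any proposed specification text does it. The `Unit` type and the `translate`, `image`, `preimage`, and `isOwnPreimage` names are hypothetical. The sketch handles only Unicode escape translation (using the JLS 3.3 rule that a backslash begins an escape only when preceded by an even number of contiguous backslashes) and leaves line terminator normalization out.

```java
import java.util.ArrayList;
import java.util.List;

public class PreimageDemo {
    /** One translated character plus the exact source text it came from (hypothetical type). */
    static final class Unit {
        final char ch;
        final String preimage;
        Unit(char ch, String preimage) { this.ch = ch; this.preimage = preimage; }
    }

    /** Translates Unicode escapes, remembering the preimage of every resulting character. */
    static List<Unit> translate(String source) {
        List<Unit> out = new ArrayList<>();
        int backslashes = 0; // contiguous raw backslashes immediately before position i
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            boolean eligible = c == '\\' && backslashes % 2 == 0
                    && i + 1 < source.length() && source.charAt(i + 1) == 'u';
            if (!eligible) {
                out.add(new Unit(c, String.valueOf(c)));
                backslashes = (c == '\\') ? backslashes + 1 : 0;
                i++;
                continue;
            }
            int j = i + 1;
            while (j < source.length() && source.charAt(j) == 'u') j++; // one or more 'u' markers
            if (j + 4 > source.length()) throw new IllegalArgumentException("truncated escape");
            int value = 0;
            for (int k = j; k < j + 4; k++) {
                int digit = Character.digit(source.charAt(k), 16);
                if (digit < 0) throw new IllegalArgumentException("bad hex digit");
                value = value * 16 + digit;
            }
            out.add(new Unit((char) value, source.substring(i, j + 4)));
            backslashes = 0;
            i = j + 4;
        }
        return out;
    }

    /** The ordinary token image: the translated characters. */
    static String image(List<Unit> units) {
        StringBuilder sb = new StringBuilder();
        for (Unit u : units) sb.append(u.ch);
        return sb.toString();
    }

    /** The preimage: the original source text, escapes left untranslated. */
    static String preimage(List<Unit> units) {
        StringBuilder sb = new StringBuilder();
        for (Unit u : units) sb.append(u.preimage);
        return sb.toString();
    }

    /** True when the character was written directly rather than via an escape. */
    static boolean isOwnPreimage(Unit u) {
        return u.preimage.length() == 1 && u.preimage.charAt(0) == u.ch;
    }

    public static void main(String[] args) {
        // The Java string below holds the six raw characters of an escaped backtick between a and b.
        List<Unit> body = translate("a\\u0060b");
        System.out.println(image(body));
        System.out.println(preimage(body));
        System.out.println(isOwnPreimage(body.get(1)));
    }
}
```

Under this sketch, an ordinary StringLiteral would use `image`, a RawStringBody would use `preimage`, and John's rule for delimiters amounts to requiring `isOwnPreimage` for each backtick.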

I want to say something about the delimiters of the raw string literal 
now, but I'll do that in response to Jim's mail.

Alex


More information about the amber-spec-experts mailing list