multiline: Problems with backslash-u
Remi Forax
forax at univ-mlv.fr
Wed Feb 28 13:21:53 UTC 2018
Hi Reinier,
it's 'easy' to implement: when the lexer sees the prefix that indicates a raw string, it flips a bit that disables \u recognition, and re-enables it once the end of the raw string has been matched.
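
A minimal sketch of that idea, assuming the \u translation is done lazily by the character reader that feeds the lexer instead of in a separate pass over the whole file. The names RawAwareReader, unicodeEscapes and firstRawBody are hypothetical, and this is not javac's code. The sketch treats a single backtick as the delimiter and ignores ordinary string literals and comments, which a real lexer would also have to track.

```java
final class RawAwareReader {
    private final String src;
    private int pos = 0;
    private int backslashRun = 0;   // contiguous backslashes just before pos
    boolean unicodeEscapes = true;  // the "bit" the lexer flips (hypothetical name)

    RawAwareReader(String src) { this.src = src; }

    boolean hasNext() { return pos < src.length(); }

    /** Returns the next character, decoding a backslash-u escape only while the bit is on. */
    char next() {
        char c = src.charAt(pos);
        boolean eligible = unicodeEscapes && c == '\\' && backslashRun % 2 == 0
                && pos + 1 < src.length() && src.charAt(pos + 1) == 'u';
        if (eligible) {
            int j = pos + 1;
            while (j < src.length() && src.charAt(j) == 'u') j++; // one or more 'u'
            if (j + 4 > src.length()) throw new IllegalArgumentException("truncated escape");
            int code = 0;
            for (int k = j; k < j + 4; k++) {
                int d = Character.digit(src.charAt(k), 16);
                if (d < 0) throw new IllegalArgumentException("bad hex digit");
                code = code * 16 + d;
            }
            pos = j + 4;
            backslashRun = 0;
            return (char) code;
        }
        backslashRun = (c == '\\') ? backslashRun + 1 : 0;
        pos++;
        return c;
    }

    /** Toy lexer loop: returns the body of the first backtick-delimited literal, or null. */
    static String firstRawBody(String source) {
        RawAwareReader in = new RawAwareReader(source);
        while (in.hasNext()) {
            if (in.next() == '`') {
                in.unicodeEscapes = false;          // entering the raw string
                StringBuilder body = new StringBuilder();
                while (in.hasNext()) {
                    char c = in.next();
                    if (c == '`') break;            // closing delimiter
                    body.append(c);
                }
                in.unicodeEscapes = true;           // back to normal decoding
                return body.toString();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // The literal body keeps all six characters of the escape.
        System.out.println(firstRawBody("String s = `\\u000a`;").length()); // 6
    }
}
```

The design point is that the decision to decode is taken per character at read time, so the lexer's own state can influence it.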
Rémi
----- Original Message -----
> From: "Reinier Zwitserloot" <reinier at zwitserloot.com>
> To: "amber-dev" <amber-dev at openjdk.java.net>
> Sent: Wednesday, 28 February 2018 07:01:44
> Subject: multiline: Problems with backslash-u
> Some feedback on multiline string literals. Where 'proposal' is referenced,
> it refers to: https://bugs.openjdk.java.net/browse/JDK-8196004
>
> Each bit of feedback is rather specific and complicated; to keep threading
> intact and useful, I'll post the feedback in 5 separate items.
>
> # Problems with backslash-u #
>
> The proposal plans to treat \u as being 'raw' inside backticks. This seems
> near impossible to achieve; how would the parser even figure it out? Right
> now the parser will apply \u unescaping in a separate step at the
> beginning, and it knows to leave, say, \\u1234 alone based on the fact that
> there are an even number of backslashes in front of it, and crucially it'll
> do this regardless of where the sequence appears.
>
> Let's test:
>
> //This will not compile because the backslash-u escape is a newline \u000A making this part illegal java
> class Test {}
>
>
>
> versus:
>
>
>
> //But now it WILL compile \\u000A because the double backslash stops the backslash-u decoding
> class Test {}
>
>
> In other words: the mechanism that lets the backslash-u system 'skip' \\u
> in string literals doesn't actually understand the language itself; it only
> counts backslashes.
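
The backslash-counting rule described above can be sketched as a standalone pass. This is a minimal illustration modeled on the description of Unicode escapes in the Java Language Specification (section 3.3), not the code of any real compiler. The class name UnicodePretranslate and the method translate are hypothetical.

```java
final class UnicodePretranslate {
    /**
     * Context-free translation of backslash-u escapes: a backslash starts an
     * escape only if an even number of backslashes immediately precede it.
     * The pass knows nothing about comments, strings, or any other syntax.
     */
    static String translate(String src) {
        StringBuilder out = new StringBuilder();
        int backslashRun = 0; // contiguous raw backslashes just before position i
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            boolean eligible = c == '\\' && backslashRun % 2 == 0
                    && i + 1 < src.length() && src.charAt(i + 1) == 'u';
            if (eligible) {
                int j = i + 1;
                while (j < src.length() && src.charAt(j) == 'u') j++; // one or more 'u'
                if (j + 4 > src.length()) throw new IllegalArgumentException("truncated escape");
                int code = 0;
                for (int k = j; k < j + 4; k++) {
                    int d = Character.digit(src.charAt(k), 16);
                    if (d < 0) throw new IllegalArgumentException("bad hex digit");
                    code = code * 16 + d;
                }
                out.append((char) code);
                i = j + 4;
                backslashRun = 0; // the decoded character takes no part in further escapes
                continue;
            }
            backslashRun = (c == '\\') ? backslashRun + 1 : 0;
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String single = "// comment \\u000A class Test {}";
        String doubled = "// comment \\\\u000A class Test {}";
        System.out.println(translate(single).contains("\n"));    // true: the comment is split
        System.out.println(translate(doubled).equals(doubled));  // true: left alone
    }
}
```

Because the pass never looks at anything except backslashes, the same input decodes the same way whether it sits in a comment, a string, or ordinary code.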
>
> I see no feasible way to apply the idea that just `\u000a` would actually
> make a string with 6 characters in it. The part that applies backslash-u
> escaping must run before the file is even parsed, therefore it cannot know
> that you are inside a raw string literal. It can't count backticks either,
> because:
>
> String x = "`\u002f`"; // This should be a 3 character string, not an 8 character string.
> // `\u000a` this test file should in fact fail to compile!
>
> Tracking comments or strings is hard because backslash u escapes themselves
> can be part of it:
>
> /\u002F This is actually a comment
>
> In theory you could build a system that can do it, but it would be
> extremely complicated. One side effect is that other Java-parsing code,
> such as ecj or IntelliJ's parser, will either take an extremely long time
> to adapt to JDK10, which fragments the ecosystem (and the IMO dubious pace
> at which Java releases ship and break de-facto Java usage is already not
> helping with the fragmentation issue!), or, probably more likely, they'll
> opt out of rewriting a significant chunk of their parser and will just say
> that they break with the spec on this hard-to-implement corner case.
>
> Solution: I don't actually know. The easy way out is simply to NOT disable
> backslash-u escapes in these literals, but an 'almost totally raw' string
> literal syntax still isn't quite as useful as an 'actually raw' string
> literal syntax. One trivial way out is to dump the notion that backslash-u
> escapes work anywhere. Totally anecdotal, but every code shop whose source
> file encoding I've ever bothered to look at encodes its files in UTF-8 and
> calls it a day. If backslash-u escapes show up at all, they show up inside
> string literals, so a backwards-incompatible move to make backslash-u
> escapes behave just like octal escapes (they work in string literals, not
> anywhere else) wouldn't actually break anything. That seems far less
> impactful to me than backwards-incompatibly turning the lone underscore
> character from a valid identifier into effectively a keyword, and that
> went over without too much whining from the community. Unfortunately these
> things aren't similar enough to handwave away the concern, but
> nevertheless, I strongly advise reconsidering whether backslash-u escapes
> should be valid anywhere, if the alternative is to add quite a convoluted
> system to try to disable backslash-u escaping in raw literals.
>
> --Reinier Zwitserloot