JDK-8254073, unicode escape preprocessing, and \u005C
Alex Buckley
alex.buckley at oracle.com
Thu Jul 15 18:16:59 UTC 2021
On 7/15/2021 10:43 AM, Jim Laskey wrote:
> The fallout from the discussion here and via the CSR
> (https://bugs.openjdk.java.net/browse/JDK-8269290) is that we have two
> choices (noting that today is RD2):
>
> 1) Proceed with proposed bug fix and strengthen the existing
> Interpretation in the JLS.
>
> 2) Withdraw the CSR, fix the bug to replicate the behaviour seen prior
> to JDK 16 and rework the JLS to reflect that behaviour.
>
> At this point, Alex and I feel the correct choice is 2). This choice has
> the least risk and is likely the least disruptive.
For the record, in support of #2, here's the rework of JLS 3.3 that
makes it fully describe the behavior of javac 15. This behavior is not
completely intuitive, either to specify or to implement, but it is the
historical precedent.
-----
~In addition to the processing implied by the grammar,~
+The UnicodeInputCharacter production is ambiguous because an ASCII \
character in the input stream could be reduced to either a
RawInputCharacter or to the \ of a UnicodeEscape (to be followed by an
ASCII u). To avoid ambiguity,+
[All new text follows]
for each ASCII \ character in the input stream, input processing must
consider the most recent raw input characters that resulted from this
translation step:
- If the most recent raw input character was itself translated from a
Unicode escape in the input stream, then the ASCII \ character is
eligible to begin a Unicode escape. (For example, if the most recent
raw input character in the result was a backslash that arose from a
Unicode escape \u005c in the input stream, then an ASCII \ character
in the input stream is eligible to begin another Unicode escape.)
- Otherwise, consider how many backslashes appeared contiguously as
raw input characters in the result, back to a non-backslash
character or the start of the result. (It is immaterial whether any
such backslash arose from an ASCII \ character in the input stream
or from a Unicode escape \u005c in the input stream.) If this number
is even, then the ASCII \ character is eligible to begin a Unicode
escape; if the number is odd, then the ASCII \ character is not
eligible to begin a Unicode escape.
-----
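As a rough illustration only (not javac's actual code), the eligibility
rule in the text above could be sketched in Java as follows. The class
UnicodeEscapeSketch and its methods are hypothetical names, and throwing
IllegalArgumentException is an assumption standing in for the
compile-time error on a malformed escape.

```java
final class UnicodeEscapeSketch {

    /** Translates Unicode escapes in the input, applying the eligibility rule. */
    static String translate(String input) {
        StringBuilder result = new StringBuilder();
        // True when the most recent result character came from a Unicode escape.
        boolean lastFromEscape = false;
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            // Position just past any run of 'u' characters after a backslash.
            int j = i + 1;
            while (c == '\\' && j < input.length() && input.charAt(j) == 'u') {
                j++;
            }
            boolean startsEscape =
                c == '\\' && j > i + 1 && eligible(result, lastFromEscape);
            if (!startsEscape) {
                // Ordinary raw input character (possibly a backslash).
                result.append(c);
                lastFromEscape = false;
                i++;
                continue;
            }
            if (j + 4 > input.length()) {
                throw new IllegalArgumentException("truncated escape at index " + i);
            }
            int value = 0;
            for (int k = j; k < j + 4; k++) {
                int digit = hexDigit(input.charAt(k));
                if (digit < 0) {
                    throw new IllegalArgumentException("bad hex digit at index " + k);
                }
                value = value * 16 + digit;
            }
            result.append((char) value);
            lastFromEscape = true;
            i = j + 4;
        }
        return result.toString();
    }

    /** Decides whether an ASCII backslash may begin a Unicode escape. */
    static boolean eligible(CharSequence result, boolean lastFromEscape) {
        if (lastFromEscape) {
            return true; // first bullet: previous character came from an escape
        }
        // Second bullet: count contiguous backslashes at the end of the result.
        int count = 0;
        for (int k = result.length() - 1; k >= 0 && result.charAt(k) == '\\'; k--) {
            count++;
        }
        return count % 2 == 0;
    }

    /** Returns the value of an ASCII hex digit, or -1 for anything else. */
    private static int hexDigit(char c) {
        return c < 128 ? Character.digit(c, 16) : -1;
    }

    public static void main(String[] args) {
        // Input is an escaped backslash followed by an escaped 'A'.
        System.out.println(translate("\\u005c\\u0041")); // prints \A
    }
}
```

Under this sketch, an input of \u005c\u0041 translates to \A, since the
second escape follows a character that itself came from an escape. An
input of two ASCII backslashes followed by u0041 is left unchanged,
since the second backslash is preceded by an odd number of backslashes.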
Alex