JDK-8254073, unicode escape preprocessing, and \u005C
John Rose
john.r.rose at oracle.com
Wed Jun 23 20:59:59 UTC 2021
Yup, no urgency!
And yes, "-Xlint:unicode ... default on" would do the trick.
(…Assuming that -Xlint:unicode would not give false
positives, for non-ASCII encodings. The most legitimate
uses of \uXXYY are when XX > 01. Those uses allow
Java source code to be written in 7-bit ASCII.)
> On Jun 23, 2021, at 8:13 AM, Jim Laskey <james.laskey at oracle.com> wrote:
>
> I assume y'all are suggesting this for 18. I want to make sure the main fix makes it into LTS 17.
>
>> On Jun 23, 2021, at 12:10 PM, Jonathan Gibbons <jonathan.gibbons at oracle.com> wrote:
>>
>> How about -Xlint:unicode ... default on.
>>
>> -- Jon
>>
>> On 6/22/21 7:33 PM, John Rose wrote:
>>> It would be good if javac gave a warning when fed
>>> highly questionable puzzlers like the sequence of
>>> code points \ u 0 0 5 C. There’s no excuse for using
>>> it, and although the JLS tolerates it, it is almost
>>> certainly a mark of someone confusing themselves,
>>> or trying to confuse others.
>>>
>>> The deepest problems are with the unicode escape
>>> for the character (005C) which introduces the
>>> unicode escape. But I would also welcome a wider
>>> warning, which would report any use of a unicode
>>> escape which decodes to a legitimate token constituent
>>> in the basic ASCII set.
>>>
>>> For safety’s sake, I would want to warn on any printable
>>> (non-control) code point between 0020 and 007E inclusive,
>>> plus line terminators 000A and 000D.
>>>
>>> Such warnings would help train users away from
>>> writing obfuscated code, even if they thought they
>>> had a reason to do so, and it would also help users
>>> detect maliciously obfuscated code.
>>>
>>> Supposedly it’s useful to (once in a blue moon)
>>> re-encode everything in a Java source file using
>>> unicode escapes (maybe for blank-free URLs?)
>>> but in such cases the warnings can be disabled
>>> and disregarded. Apart from blue moons,
>>> nobody ever, ever wants to get confused by
>>> unicode escapes which make a program less
>>> readable.
>>>
>>> — John
>>>
>>>> On Jun 21, 2021, at 2:56 PM, Jim Laskey <james.laskey at oracle.com> wrote:
>>>>
>>>> "\u005C" should have been treated as a backslash. Will check into it.
>>>>
>>>> Cheers,
>>>>
>>>> — Jim
>>>>
>>>>
>>>>
>>>>> On Jun 21, 2021, at 6:28 PM, Liam Miller-Cushon <cushon at google.com> wrote:
>>>>>
>>>>>
>>>>> class T {
>>>>>   public static void main(String[] args) {
>>>>>     System.err.println("\u005C\\u005D");
>>>>>   }
>>>>> }
>>>>>
>>>>> Before JDK-8254073, this prints `\]`.
>>>>>
>>>>> After JDK-8254073, unicode escape processing results in `\\\u005D`, which results in an 'invalid escape' error for `\u`. Was that deliberate?
>>>>>
>>>>> JLS 3.3 says
>>>>>
>>>>>> for each raw input character that is a backslash \, input processing must consider how many other \ characters contiguously precede it, separating it from a non-\ character or the start of the input stream. If this number is even, then the \ is eligible to begin a Unicode escape; if the number is odd, then the \ is not eligible to begin a Unicode escape.
>>>>>
>>>>> The difference is in whether `\u005C` (the unicode escape for `\`) counts as one of the `\` preceding a valid unicode escape.
>>>>>
>>>>> Does "how many other \ characters contiguously precede it" refer to preceding raw input characters, or does it refer to preceding characters after unicode escape processing is performed on them?
>>>>>
>>>>> JLS 3.3 also mentions that a "character produced by a Unicode escape does not participate in further Unicode escapes", but I'm not sure if that applies here, since in the pre-JDK-8254073 interpretation the unicode-escaped backslash isn't really 'participating' in the second unicode escape.
>>>>>
>>>>> Thanks,
>>>>> Liam
>
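The two readings of JLS 3.3 that Liam asks about can be modeled with a small sketch. This is not javac's implementation; the class name EscapeReadings, the method translate, and the flag countDecodedBackslash are hypothetical names chosen for illustration. The sketch also simplifies the JLS by passing malformed escapes through unchanged instead of reporting an error. The backslash is built from a char constant so that the example itself contains no unicode escapes.

```java
// Sketch only: models two readings of the JLS 3.3 backslash-counting rule.
// All names here are hypothetical; this is not how javac is written.
class EscapeReadings {
    // Built from a char so this file contains no literal unicode escapes.
    static final char BS = '\\';

    // True if the four characters starting at 'from' are all hex digits.
    static boolean isHex4(String s, int from) {
        if (from + 4 > s.length()) return false;
        for (int k = from; k < from + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) return false;
        }
        return true;
    }

    // Translates unicode escapes in 'raw'.
    // countDecodedBackslash == true:  a backslash produced by an escape counts
    //   toward the run of backslashes preceding the next backslash.
    // countDecodedBackslash == false: only raw backslashes are counted.
    static String translate(String raw, boolean countDecodedBackslash) {
        StringBuilder out = new StringBuilder();
        int run = 0; // contiguous backslashes immediately before position i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c != BS) {
                out.append(c);
                run = 0;
                i++;
                continue;
            }
            if (run % 2 == 0) { // eligible to begin an escape
                int j = i + 1;
                while (j < raw.length() && raw.charAt(j) == 'u') j++;
                if (j > i + 1 && isHex4(raw, j)) {
                    char decoded = (char) Integer.parseInt(raw.substring(j, j + 4), 16);
                    out.append(decoded);
                    run = (decoded == BS && countDecodedBackslash) ? run + 1 : 0;
                    i = j + 4;
                    continue;
                }
            }
            out.append(c); // ordinary backslash, not the start of an escape
            run++;
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String bs = String.valueOf(BS);
        // The contents of the string literal in Liam's example.
        String raw = bs + "u005C" + bs + bs + "u005D";
        System.out.println(translate(raw, true));
        System.out.println(translate(raw, false));
    }
}
```

With the flag set to true, the sketch yields two backslashes followed by `]`, which a string literal then reads as `\]` (the behavior Liam reports before JDK-8254073). With the flag set to false, it yields three backslashes followed by u005D, which matches the "invalid escape" result he reports after the change.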
More information about the compiler-dev
mailing list