JDK-8254073, unicode escape preprocessing, and \u005C

Wed Jun 23 15:10:34 UTC 2021

How about -Xlint:unicode ... default on.

-- Jon

On 6/22/21 7:33 PM, John Rose wrote:
> It would be good if javac gave a warning when fed
> highly questionable puzzlers like the sequence of
> code points \ u 0 0 5 C.  There’s no excuse for using
> it, and although the JLS tolerates it, it is almost
> certainly a mark of someone confusing themselves,
> or trying to confuse others.
>
> The deepest problems are with the unicode escape
> for the character (005C) which introduces the
> unicode escape.  But I would also welcome a wider
> warning, which would report any use of a unicode
> escape which decodes to a legitimate token constituent
> in the basic ASCII set.
>
> For safety’s sake, I would want to warn on any printable
> (non-control) code point between 0020 and 007E inclusive,
> plus line terminators 000A and 000D.
>
> Such warnings would help train users away from
> writing obfuscated code, even if they thought they
> had a reason to do so, and it would also help users
> detect maliciously obfuscated code.
>
> Supposedly it’s useful to (once in a blue moon)
> re-encode everything in a Java source file using
> unicode escapes (maybe for blank-free URLs?)
> but in such cases the warnings can be disabled
> and disregarded.  Apart from blue moons,
> nobody ever, ever wants to get confused by
> unicode escapes which make a program less
> readable.
>
> — John
>
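The warning rule proposed above (flag any unicode escape that decodes to a printable ASCII code point in 0020..007E, or to the line terminators 000A and 000D) can be sketched as a small scanner. This is only an illustration under stated assumptions: the class AsciiEscapeLint and its methods are hypothetical names, not part of javac, and the scanner counts only raw backslashes when deciding whether an escape is eligible, which is one of the two readings of JLS 3.3 debated in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed warning; not javac's implementation.
public class AsciiEscapeLint {
    // Backslash written numerically so this file contains no escapes itself.
    private static final char BS = 92;

    // The range from the proposal: 0020..007E inclusive, plus 000A and 000D.
    static boolean isSuspicious(int codePoint) {
        return (codePoint >= 0x20 && codePoint <= 0x7E)
                || codePoint == 0x0A || codePoint == 0x0D;
    }

    private static boolean isHex(String s, int start) {
        for (int k = start; k < start + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    // Returns the raw offsets of escapes that would trigger the warning.
    // Assumption: eligibility counts raw backslashes only.
    static List<Integer> findSuspiciousEscapes(String raw) {
        List<Integer> offsets = new ArrayList<>();
        int run = 0; // contiguous raw backslashes just before position i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c == BS && run % 2 == 0) {
                int j = i + 1;
                while (j < raw.length() && raw.charAt(j) == 'u') {
                    j++; // the escape may contain one or more 'u' markers
                }
                if (j > i + 1 && j + 4 <= raw.length() && isHex(raw, j)) {
                    int codePoint = Integer.parseInt(raw.substring(j, j + 4), 16);
                    if (isSuspicious(codePoint)) {
                        offsets.add(i);
                    }
                    run = 0; // the last raw character was a hex digit
                    i = j + 4;
                    continue;
                }
            }
            run = (c == BS) ? run + 1 : 0;
            i++;
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Escapes for a backslash, an accented letter, and a line feed.
        String sample = BS + "u005C x " + BS + "u00E9" + BS + "u000A";
        System.out.println(findSuspiciousEscapes(sample));
    }
}
```

In the sample, the escapes for the backslash and the line feed are reported, while the escape for the non-ASCII letter is left alone, which matches the intent of keeping escapes usable for characters that cannot be typed directly.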
>> On Jun 21, 2021, at 2:56 PM, Jim Laskey <james.laskey at oracle.com> wrote:
>>
>> "\u005C" should have been treated as a backslash. Will check into it.
>>
>> Cheers,
>>
>> — Jim
>>
>>> On Jun 21, 2021, at 6:28 PM, Liam Miller-Cushon <cushon at google.com> wrote:
>>>
>>> 
>>> class T {
>>>   public static void main(String[] args) {
>>>     System.err.println("\u005C\\u005D");
>>>   }
>>> }
>>>
>>> Before JDK-8254073, this prints `\]`.
>>>
>>> After JDK-8254073, unicode escape processing results in `\\\u005D`, which results in an 'invalid escape' error for `\u`. Was that deliberate?
>>>
>>> JLS 3.3 says
>>>
>>>> for each raw input character that is a backslash \, input processing must consider how many other \ characters contiguously precede it, separating it from a non-\ character or the start of the input stream. If this number is even, then the \ is eligible to begin a Unicode escape; if the number is odd, then the \ is not eligible to begin a Unicode escape.
>>>
>>> The difference is in whether `\u005C` (the unicode escape for `\`) counts as one of the `\` preceding a valid unicode escape.
>>>
>>> Does "how many other \ characters contiguously precede it" refer to preceding raw input characters, or does it refer to preceding characters after unicode escape processing is performed on them?
>>>
>>> JLS 3.3 also mentions that a "character produced by a Unicode escape does not participate in further Unicode escapes", but I'm not sure if that applies here, since in the pre-JDK-8254073 interpretation the unicode-escaped backslash isn't really 'participating' in the second unicode escape.
>>>
>>> Thanks,
>>> Liam
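
The two readings of JLS 3.3 in question can be compared with a minimal sketch. It is not javac's actual implementation; the class UnicodeEscapeDemo, the method translate, and the flag countDecoded are hypothetical names introduced for illustration. With countDecoded set to true, a backslash produced by an escape counts toward the run of preceding backslashes, and the example literal translates to `\\]` (which prints `\]`, the result reported before JDK-8254073). With countDecoded set to false, only raw backslashes count, and the literal translates to `\\\u005D` (the result reported after JDK-8254073).

```java
// Hypothetical sketch comparing two ways to count preceding backslashes.
public class UnicodeEscapeDemo {
    // Backslash written numerically so this file contains no escapes itself.
    private static final char BS = 92;

    private static boolean isHex(String s, int start) {
        for (int k = start; k < start + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    // Translates unicode escapes in raw input.
    // countDecoded == true:  a backslash produced by an escape extends the run.
    // countDecoded == false: only raw backslashes are counted.
    static String translate(String raw, boolean countDecoded) {
        StringBuilder out = new StringBuilder();
        int run = 0; // contiguous backslashes counted just before position i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c == BS && run % 2 == 0) { // even run: eligible to begin an escape
                int j = i + 1;
                while (j < raw.length() && raw.charAt(j) == 'u') {
                    j++; // the escape may contain one or more 'u' markers
                }
                if (j > i + 1 && j + 4 <= raw.length() && isHex(raw, j)) {
                    char decoded = (char) Integer.parseInt(raw.substring(j, j + 4), 16);
                    out.append(decoded);
                    run = (countDecoded && decoded == BS) ? run + 1 : 0;
                    i = j + 4;
                    continue;
                }
            }
            out.append(c);
            run = (c == BS) ? run + 1 : 0;
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Raw contents of the string literal from the example in this thread:
        // an escaped backslash, a raw backslash, then an escaped right bracket.
        String raw = BS + "u005C" + BS + BS + "u005D";
        System.out.println("decoded backslashes counted: " + translate(raw, true));
        System.out.println("raw backslashes only:        " + translate(raw, false));
    }
}
```

The only difference between the two modes is the single assignment to run after an escape is decoded, which is exactly the point the question about "how many other \ characters contiguously precede it" turns on.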