RFR: 8354908: javac mishandles supplementary character in character literal [v2]

Mon May 12 22:04:51 UTC 2025

On Mon, 12 May 2025 11:16:09 GMT, Jan Lahoda <jlahoda at openjdk.org> wrote:

>> Some Unicode characters are encoded in UTF-16 as a surrogate pair, i.e. two `char`s. Such characters cannot appear in a character literal, because a single `char` cannot represent them. However, javac currently accepts code with such characters and puts only the first `char`, the high surrogate, into the literal, silently ignoring the low surrogate.
>> 
>> For example, the JDK 24 behavior is:
>> 
>> $ cat /tmp/T.java 
>> public class T {
>>     public static void main(String... args) {
>>        char c = '😊';
>>        System.err.println(Integer.toHexString((int) c));
>>        System.err.println(Character.isHighSurrogate(c));
>>     }
>> }
>> $ java /tmp/T.java
>> d83d
>> true
>> 
>> 
>> In JDK 11, however, such literals were rejected:
>> 
>> $ java /tmp/T.java
>> /tmp/T.java:3: error: unclosed character literal
>>        char c = '😊';
>>                 ^
>> /tmp/T.java:3: error: illegal character: '\ude0a'
>>        char c = '😊';
>>                   ^
>> /tmp/T.java:3: error: unclosed character literal
>>        char c = '😊';
>>                    ^
>> 3 errors
>> error: compilation failed
>> 
>> 
>> The proposal in this PR is to explicitly check for this case when scanning a character literal, and to report a dedicated error when a character that requires a surrogate pair is used. javac will produce an error like:
>> 
>> $ java /tmp/T.java
>> /tmp/T.java:3: error: character literal contains more than one UTF-16 code point
>>        char c = '😊';
>>                 ^
>> 1 error
>> error: compilation failed
>
> Jan Lahoda has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Reflecting review comment: using UTF-16 code unit

Marked as reviewed by naoto (Reviewer).
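
For readers following along, here is a minimal sketch of why such a literal cannot fit in a `char`. It is an illustration only, not the javac scanner code from this PR; the class name `SurrogateDemo` and the helper `fitsInChar` are hypothetical, and the emoji from the example above is written with escapes so the snippet does not depend on source encoding.

```java
public class SurrogateDemo {
    // Hypothetical helper (not part of javac): a code point fits in a
    // char literal only if it needs exactly one UTF-16 code unit.
    static boolean fitsInChar(int codePoint) {
        return Character.charCount(codePoint) == 1;
    }

    public static void main(String[] args) {
        // U+1F60A, the emoji from the example, as a surrogate pair.
        String s = "\uD83D\uDE0A";
        int cp = s.codePointAt(0);

        System.out.println(Integer.toHexString(cp));          // 1f60a
        System.out.println(s.length());                       // 2
        System.out.println(Integer.toHexString(s.charAt(0))); // d83d (high surrogate)
        System.out.println(Integer.toHexString(s.charAt(1))); // de0a (low surrogate)
        System.out.println(fitsInChar(cp));                   // false
        System.out.println(fitsInChar('A'));                  // true
    }
}
```

The `d83d` printed here matches the value that JDK 24 stores in `c` in the example above, and `de0a` is the code unit that gets dropped.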

-------------

PR Review: https://git.openjdk.org/jdk/pull/24964#pullrequestreview-2834731042