<i18n dev> RFR: 8364365: HKSCS encoder does not properly set the replacement character

Xueming Shen sherman at openjdk.org
Wed Aug 6 18:04:28 UTC 2025


On Wed, 6 Aug 2025 10:52:00 GMT, Volkan Yazici <vyazici at openjdk.org> wrote:

>> I would assume your "double char" actually means the "surrogate pair"?
>> 
>> I believe that for the first pass of scanning you might want to skip the surrogates, as a single dangling surrogate char should trigger a "malformed" error instead of "unmappable" if the charset is implemented to handle supplementary characters.
>> 
>>         for (char c = 0xFF; c < 0xFFFF; c++) {
>>             if (Character.isSurrogate(c))
>>                 continue;
>>             if (!encoder.canEncode(c))
>>                 return new char[]{c};
>>         }
>> 
>> And for the second pass, for the surrogates, I think we can just pick any non-BMP plane, whose characters are always represented as surrogate pairs, and check whether the charset can map/encode each one; if not, it's our candidate.
>> 
>>         for (int i = 0x10000; i < 0x1FFFF; i++) {
>>             char[] cc = Character.toChars(i);
>>             if (!encoder.canEncode(new String(cc)))
>>                 return cc;
>>         }
>
>> for (char c = 0xFF; c < 0xFFFF; c++)
> 
> Doesn't this exclude `0xFFFF`, which is a valid (single-`char`, non-surrogate) BMP character?
> 
>> ... we can just pick any non-BMP plane ...
>> ```
>> for (int i = 0x10000; i < 0x1FFFF; i++) { ...
>> ```
> 
> Doesn't the non-BMP range rather end with 0x10FFFF?

(1) We might want to include 0xFFFF in the first pass.
(2) We just need to pick any unmappable non-BMP character. I would assume it is pretty safe that, for a specific charset, we will find one it does not encode within the first non-BMP plane (U+10000 to U+1FFFF), so scanning up to 0x10FFFF should not be necessary.
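
For illustration only, the two passes with both points applied might look like the sketch below. The class and method names (`UnmappableFinder`, `findUnmappable`) are hypothetical and not taken from the PR. One caveat when including 0xFFFF: a `char` loop variable with `c <= 0xFFFF` never terminates, because `c++` wraps back to 0, so the sketch uses an `int` counter and casts.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper, not the code from the PR under review.
final class UnmappableFinder {

    // Returns a char sequence the charset cannot encode, or null if none
    // is found in the BMP or in the first supplementary plane.
    static char[] findUnmappable(Charset cs) {
        CharsetEncoder encoder = cs.newEncoder();

        // First pass: BMP, 0xFFFF included. An int counter is used because
        // a char counter would overflow to 0 and loop forever.
        for (int i = 0xFF; i <= 0xFFFF; i++) {
            char c = (char) i;
            // A lone surrogate is malformed input rather than unmappable.
            if (Character.isSurrogate(c))
                continue;
            if (!encoder.canEncode(c))
                return new char[]{c};
        }

        // Second pass: first supplementary plane only (U+10000..U+1FFFF),
        // on the assumption that an unmappable character turns up there.
        for (int cp = 0x10000; cp <= 0x1FFFF; cp++) {
            char[] cc = Character.toChars(cp); // always a surrogate pair here
            if (!encoder.canEncode(new String(cc)))
                return cc;
        }
        return null;
    }
}
```

For a charset such as ISO-8859-1 the first pass already returns a single char; the second pass only matters for charsets that map every non-surrogate BMP char at or above 0xFF.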

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26635#discussion_r2257902674

