<i18n dev> RFR: 8364365: HKSCS encoder does not properly set the replacement character
Xueming Shen
sherman at openjdk.org
Wed Aug 6 18:04:28 UTC 2025
On Wed, 6 Aug 2025 10:52:00 GMT, Volkan Yazici <vyazici at openjdk.org> wrote:
>> I would assume your "double char" actually means the "surrogate pair"?
>>
>> I believe for the first pass of scanning you might want to skip the surrogates, as a single dangling surrogate char should trigger a "malformed" error instead of "unmappable" if the charset is implemented to handle supplementary characters.
>>
>> for (char c = 0xFF; c < 0xFFFF; c++) {
>> if (Character.isSurrogate(c))
>> continue;
>> if (!encoder.canEncode(c))
>> return new char[]{c};
>> }
>>
>> And for the second pass, for the "surrogates", I think we can just pick any non-bmp panel, which should always be translated into a surrogate pair, and check whether the charset can map/encode it; if not, it's our candidate.
>>
>> for (int i = 0x10000; i < 0x1FFFF; i++) {
>> char[] cc = Character.toChars(i);
>> if (!encoder.canEncode(new String(cc)))
>> return cc;
>> }
>
>> for (char c = 0xFF; c < 0xFFFF; c++)
>
> Doesn't this exclude `0xFFFF`, which is a valid (single-`char`, non-surrogate) BMP character?
>
>> ... we can just pick any non-bmp panel ...
>> ```
>> for (int i = 0x10000; i < 0x1FFFF; i++) { ...
>> ```
>
> Doesn't the non-BMP range rather end with 0x10FFFF?
(1) We might want to include 0xFFFF in the first pass.
(2) We just need to pick any unmappable non-BMP character. I would assume it is pretty safe that we will find one in the first non-BMP plane that is not encoded by a specific charset.
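
For reference, a minimal sketch of the two-pass search with point (1) applied. The class name `UnmappableFinder`, the method name `findUnmappable`, and the `null` return when nothing is found are hypothetical choices for illustration, not code from the PR. The first loop uses an `int` counter because a `char` counter with `c <= 0xFFFF` would wrap around and never terminate.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper, not part of the PR: looks for a character sequence
// that the given charset cannot encode.
final class UnmappableFinder {

    static char[] findUnmappable(Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();

        // First pass: BMP chars from 0xFF up to and including 0xFFFF.
        // Surrogates are skipped, since a lone surrogate is expected to be
        // reported as malformed rather than unmappable.
        for (int i = 0xFF; i <= 0xFFFF; i++) {
            char c = (char) i;
            if (Character.isSurrogate(c))
                continue;
            if (!encoder.canEncode(c))
                return new char[] {c};
        }

        // Second pass: the first supplementary plane only, on the assumption
        // that an unmappable code point will usually be found there.
        for (int cp = 0x10000; cp <= 0x1FFFF; cp++) {
            char[] cc = Character.toChars(cp); // always a surrogate pair here
            if (!encoder.canEncode(new String(cc)))
                return cc;
        }

        // Nothing unmappable in the scanned ranges (e.g. a UTF charset).
        return null;
    }
}
```

A caller would pass the charset under test, for example `UnmappableFinder.findUnmappable(Charset.forName("Big5-HKSCS"))`, and needs to handle the `null` case for charsets that cover both scanned ranges.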
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/26635#discussion_r2257902674