<i18n dev> RFR: 8364365: HKSCS encoder does not properly set the replacement character [v2]

Volkan Yazici vyazici at openjdk.org
Thu Aug 7 15:07:11 UTC 2025


On Wed, 6 Aug 2025 18:00:00 GMT, Xueming Shen <sherman at openjdk.org> wrote:

>>> for (char c = 0xFF; c < 0xFFFF; c++)
>> 
>> Doesn't this exclude `0xFFFF`, which is a valid (single-`char`, non-surrogate) BMP character?
>> 
>>> ... we can just pick any non-bmp panel ...
>>> ```
>>> for (int i = 0x10000; i < 0x1FFFF; i++) { ...
>>> ```
>> 
>> Doesn't the non-BMP range rather end with 0x10FFFF?
>
> (1) we might want to include 0xffff in first pass
> (2) we just need to pick any unmappable non-bmp character, i would assume that it should be pretty safe we will find one in the first non-bmp panel that is not encoded by a specific charset.

In f567f2c81a3, I improved `findUnmappableNonLatin1()` as suggested:

Single-`char`:

    for (int i = 0xFF; i <= 0xFFFF; i++) {
        char c = (char) i;

Double-`char` (i.e., surrogate pair):

    int[] nonBmpRange = {0x10000, 0x10FFFF};
    for (int i = nonBmpRange[0]; i < nonBmpRange[1]; i++) {

Note that I took the initiative to use 0x10FFFF as the non-BMP range end, since that makes the exhaustive search easier to understand.
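
For illustration only, here is a minimal self-contained sketch of such a two-pass search. It is not the code from the commit: the class and method names (`UnmappableSearch`, `findUnmappableBmp`, `findUnmappableNonBmp`) are hypothetical, and using `CharsetEncoder.canEncode()` to detect unmappable input is an assumption about how the check could be done.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper, not the actual test code from the PR.
final class UnmappableSearch {

    // First pass: single-char (BMP) candidates at or above 0xFF, including 0xFFFF.
    // The counter is an int on purpose: a `char` counter with `c <= 0xFFFF`
    // would wrap around to 0 and never terminate.
    static int findUnmappableBmp(Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        for (int i = 0xFF; i <= 0xFFFF; i++) {
            char c = (char) i;
            if (Character.isSurrogate(c)) {
                continue; // a lone surrogate is malformed input, not an unmappable character
            }
            if (!encoder.canEncode(c)) {
                return i;
            }
        }
        return -1; // every candidate is encodable
    }

    // Second pass: double-char (surrogate pair) candidates, 0x10000 through 0x10FFFF.
    static int findUnmappableNonBmp(Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        for (int cp = Character.MIN_SUPPLEMENTARY_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            String pair = new String(Character.toChars(cp)); // high + low surrogate
            if (!encoder.canEncode(pair)) {
                return cp;
            }
        }
        return -1; // every candidate is encodable
    }
}
```

With ISO-8859-1, for example, the first pass stops at 0x100 (0xFF itself is encodable) and the second at 0x10000, which matches the expectation quoted above that an unmappable character shows up early in the first non-BMP plane for most charsets.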

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26635#discussion_r2260609569

