<i18n dev> RFR: 8364365: HKSCS encoder does not properly set the replacement character

Xueming Shen sherman at openjdk.org
Wed Aug 6 18:35:16 UTC 2025


On Wed, 6 Aug 2025 09:58:16 GMT, Volkan Yazici <vyazici at openjdk.org> wrote:

>> we definitely want to exclude 'some' charsets here. yes, all unicode variants probably should be excluded, as they are expected to have a 'mapping' for every unicode character. Additionally, many charsets have an "internal status", meaning they might shift in and shift out its status based on input. See https://www.rfc-editor.org/rfc/rfc1468.html for an example. The encoder might/should add the shift-in/out escape sequence characters on top of the 'replacement', if the replacement character's target sub-charset does not match the 'existing' sub-charset. i would assume this is really out of the scope of this pr though :-)
>
>> if the replacement character's target sub-charset does not match the 'existing' sub-charset
> 
> In such a `replacement`, does `CharsetEncoder::isLegalReplacement` still return `true`?

It's 'tricky' :-) Some charsets have a default initial state (ASCII, for example), so the check might return false if the replacement is set without the appropriate shift-in/out escape sequence and its target sub-charset does not match that initial one. I'm not confident that all our implementations really handle this correctly :-) It might be interesting to do a full scan to check (or not really :-), given that these charsets might not be that important these days and people rarely change the default replacement bytes), but again that is probably not in scope for this change.
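For anyone who wants to poke at this from the outside, here is a minimal sketch that uses only the public `java.nio.charset` API. The class name `ReplacementProbe` and its helpers are made up for illustration, and the sketch assumes the JDK at hand ships ISO-2022-JP (it is skipped otherwise). It only prints what the encoder reports; it makes no claim about what the stateful charsets ought to return.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the JDK or of this PR.
public class ReplacementProbe {

    /** Asks a fresh encoder of the given charset whether it accepts repl. */
    static boolean isLegal(Charset cs, byte[] repl) {
        return cs.newEncoder().isLegalReplacement(repl);
    }

    /** Encodes text, replacing unmappable characters with repl; returns hex. */
    static String encodeHex(Charset cs, byte[] repl, String text) {
        CharsetEncoder enc = cs.newEncoder()
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(repl); // IllegalArgumentException if repl is not legal
        try {
            ByteBuffer out = enc.encode(CharBuffer.wrap(text));
            StringBuilder sb = new StringBuilder();
            while (out.hasRemaining()) {
                sb.append(String.format("%02x ", out.get() & 0xff));
            }
            return sb.toString().trim();
        } catch (CharacterCodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String text = "a\u0E01b"; // U+0E01 (Thai) has no mapping in either charset below
        byte[] oneByte = { '?' };
        byte[] twoBytes = { 0x3f, 0x3f };

        System.out.println("US-ASCII legal(1 byte): "
                + isLegal(StandardCharsets.US_ASCII, oneByte));
        System.out.println("US-ASCII bytes: "
                + encodeHex(StandardCharsets.US_ASCII, oneByte, text));

        if (Charset.isSupported("ISO-2022-JP")) {
            Charset jp = Charset.forName("ISO-2022-JP");
            for (byte[] repl : new byte[][] { oneByte, twoBytes }) {
                boolean legal = isLegal(jp, repl);
                System.out.println("ISO-2022-JP legal(" + repl.length + " bytes): " + legal);
                if (legal) {
                    System.out.println("ISO-2022-JP bytes: " + encodeHex(jp, repl, text));
                }
            }
        }
    }
}
```

Comparing the hex dumps for the one-byte and two-byte replacements shows whether, and where, the encoder inserts escape sequences around the replacement.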

ISO-2022-JP is one such charset. We attempt to shift in to the correct sub-charset by recording the requested mode in `implReplaceWith`:

        protected void implReplaceWith(byte[] newReplacement) {
            /* It's almost impossible to decide which charset they belong
               to. The best thing we can do here is to "guess" based on
               the length of newReplacement.
             */
            if (newReplacement.length == 1) {
                replaceMode = ASCII;
            } else if (newReplacement.length == 2) {
                replaceMode = JISX0208_1983;
            }
        } 

Then, during encoding:

                            if (unmappableCharacterAction()
                                == CodingErrorAction.REPLACE
                                && currentMode != replaceMode) {
                                if (dl - dp < 3)
                                    return CoderResult.OVERFLOW;
                                if (replaceMode == ASCII) {
                                    da[dp++] = (byte)0x1b;
                                    da[dp++] = (byte)0x28;
                                    da[dp++] = (byte)0x42;
                                } else {
                                    da[dp++] = (byte)0x1b;
                                    da[dp++] = (byte)0x24;
                                    da[dp++] = (byte)0x42;
                                }
                                currentMode = replaceMode;
                            }

I believe this might not be really bulletproof.
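One way to see where the length-based guess can go wrong is a small standalone model of the two excerpts above. This is only a sketch: `ReplaceModeSketch`, `guessMode`, `escapeFor` and `emitReplacement` are hypothetical names, the model writes the replacement bytes itself (in the JDK the `CharsetEncoder` base class does that), and it is not the actual ISO2022_JP encoder code. The escape sequences are the ones listed in RFC 1468 (`ESC ( B` for ASCII, `ESC $ B` for JIS X 0208-1983).

```java
import java.io.ByteArrayOutputStream;

// Toy model of the excerpts above; all names here are made up for illustration.
public class ReplaceModeSketch {

    enum Mode { ASCII, JISX0208_1983 }

    /** Length-based guess: 1 byte means ASCII, 2 bytes means JIS X 0208. */
    static Mode guessMode(byte[] replacement, Mode previous) {
        if (replacement.length == 1) return Mode.ASCII;
        if (replacement.length == 2) return Mode.JISX0208_1983;
        return previous; // other lengths leave the mode unchanged, as in the excerpt
    }

    /** Escape sequence that designates the given mode (RFC 1468). */
    static byte[] escapeFor(Mode mode) {
        return mode == Mode.ASCII
                ? new byte[] { 0x1b, 0x28, 0x42 }   // ESC ( B
                : new byte[] { 0x1b, 0x24, 0x42 };  // ESC $ B
    }

    /** Emits the replacement, preceded by an escape only when the mode changes. */
    static byte[] emitReplacement(Mode current, Mode previousReplaceMode, byte[] replacement) {
        Mode target = guessMode(replacement, previousReplaceMode);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (current != target) {
            byte[] esc = escapeFor(target);
            out.write(esc, 0, esc.length);
        }
        out.write(replacement, 0, replacement.length);
        return out.toByteArray();
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // A caller who means two ASCII question marks gets them tagged as JIS X 0208.
        byte[] twoAscii = { 0x3f, 0x3f };
        System.out.println(hex(emitReplacement(Mode.ASCII, Mode.JISX0208_1983, twoAscii)));
        // prints "1b 24 42 3f 3f"

        // A single byte while in JIS X 0208 mode shifts back to ASCII first.
        byte[] oneAscii = { 0x3f };
        System.out.println(hex(emitReplacement(Mode.JISX0208_1983, Mode.JISX0208_1983, oneAscii)));
        // prints "1b 28 42 3f"
    }
}
```

In this model a two-byte replacement that the caller meant as two ASCII characters is preceded by `ESC $ B`, so a decoder would read it as a single double-byte JIS X 0208 code. Replacements of any other length leave the previously guessed mode in place, whatever sub-charset they actually belong to.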

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26635#discussion_r2257966394

