<i18n dev> RFR: 8364365: HKSCS encoder does not properly set the replacement character
Xueming Shen
sherman at openjdk.org
Wed Aug 6 18:35:16 UTC 2025
On Wed, 6 Aug 2025 09:58:16 GMT, Volkan Yazici <vyazici at openjdk.org> wrote:
>> we definitely want to exclude 'some' charsets here. yes, all unicode variants probably should be excluded, as they are expected to have a 'mapping' for every unicode character. Additionally, many charsets have an "internal status", meaning they might shift in and shift out its status based on input. See https://www.rfc-editor.org/rfc/rfc1468.html for an example. The encoder might/should add the shift-in/out escape sequence characters on top of the 'replacement', if the replacement character's target sub-charset does not match the 'existing' sub-charset. i would assume this is really out of the scope of this pr though :-)
>
>> if the replacement character's target sub-charset does not match the 'existing' sub-charset
>
> In such a `replacement`, does `CharsetEncoder::isLegalReplacement` still return `true`?
it's 'tricky' :-) some charsets have a default initial status, ascii-charset for example, this might trigger false return if the replacement is set without the appropriate shift-in/out esc-seq when target-sub-charset and existing charset are not matched. I'm not confident that all our implementations really handle it correctly :-) it might be interested (not really :-) given these charsets might not be that important these days and it's rare people try to change the default replacement bytes) to do full-scan-check, but again probably is not in-scope of this change.
iso2022-jp is one such charset. we attempt to shift-in to the correct sub-charset by keeping the requested mode in **_implReplaceWith_**
protected void implReplaceWith(byte[] newReplacement) {
/* It's almost impossible to decide which charset they belong
to. The best thing we can do here is to "guess" based on
the length of newReplacement.
*/
if (newReplacement.length == 1) {
replaceMode = ASCII;
} else if (newReplacement.length == 2) {
replaceMode = JISX0208_1983;
}
}
then during encoding
if (unmappableCharacterAction()
== CodingErrorAction.REPLACE
&& currentMode != replaceMode) {
if (dl - dp < 3)
return CoderResult.OVERFLOW;
if (replaceMode == ASCII) {
da[dp++] = (byte)0x1b;
da[dp++] = (byte)0x28;
da[dp++] = (byte)0x42;
} else {
da[dp++] = (byte)0x1b;
da[dp++] = (byte)0x24;
da[dp++] = (byte)0x42;
}
currentMode = replaceMode;
}
i believe this might not be really bulletproved.
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/26635#discussion_r2257966394
More information about the i18n-dev
mailing list