<i18n dev> RFR: 8290488: IBM864 character encoding implementation bug

Ichiroh Takiguchi itakiguchi at openjdk.org
Fri Jul 29 05:19:31 UTC 2022


On Thu, 28 Jul 2022 16:18:51 GMT, Naoto Sato <naoto at openjdk.org> wrote:

>> Many thanks @naotoj .
>> 
>> I checked the latest IBM-864 mapping table.
>> (I assume the current OpenJDK IBM864 is based on an older mapping table.)
>> https://raw.githubusercontent.com/unicode-org/icu/main/icu4c/source/data/mappings/ibm-864_X110-1999.ucm
>> The .ucm file format is described here:
>> https://unicode-org.github.io/icu/userguide/conversion/data.html#ucm-file-format
>> 
>> I compared the roundtrip mappings.
>> (Roundtrip entries have `|0` at the end of the line.)
>> | IBM864.map | ibm-864_X110-1999.ucm  |
>> | --- | --- |
>> | 0x1a    U+001a | 0x1a    U+001c |
>> | 0x1c    U+001c | 0x1c    U+007f |
>> | **0x25    U+066a** | **0x25    U+0025** |
>> | 0x7f    U+007f | 0x7f    U+001a |
>> | 0x9f    U+fffd | 0x9f    U+200b |
>> | 0xd7    U+fec1 | 0xd7    U+fec3 |
>> | 0xd8    U+fec5 | 0xd8    U+fec7 |
>> | 0xf1    U+0651 | 0xf1    U+fe7c |
>> 
>> **Note**: The 0x1a <-> U+001c, 0x1c <-> U+007f, and 0x7f <-> U+001a entries are the control character rotation for DOS.
>> I think they should be ignored.
>> 
>> I think the roundtrip side should be changed:
>> the 0x25 entry in IBM864.map should be U+0025, and
>> `0x25 U+066a` should be added to IBM864.c2b.
>> 
>> Modify test/jdk/sun/nio/cs/mapping/Cp864.b2c for `0025 0025`
>> Add `0025 066a` into test/jdk/sun/nio/cs/mapping/Cp864.c2b-irreversible
>> 
>> This issue is only about U+0025, but if possible, please also add the `0x9f, 0xd7, 0xd8, 0xf1` entries.
>
> Thanks for trying it out @takiguc. However, I am not planning to change any existing mappings because of the obvious compatibility issues. The fix I proposed is safe because it is purely additive: it maps a character that used to be unmappable (and was thus turned into a replacement '?').

Hello @naotoj .

I checked [JDK-8290488](https://bugs.openjdk.org/browse/JDK-8290488).
The issue was reported from a test on Windows 10.
I think we need to confirm the expected result for the b2c (byte-to-char) side with the reporter.

I checked Microsoft's code page 864 with the following test program on my Windows 10 machine.

>type b2c_1.ps1
param($code, $hex)
$h = [string]$hex
$enc_r = [Text.Encoding]::GetEncoding([int]$code)
[byte[]]$ba = @()
for($i = 0; $i -lt $h.length; $i+=2) {
  $ba += ([System.Convert]::ToInt32($h.SubString($i,2), 16))
}
$s = ""
$enc_r.GetChars($ba) | foreach {$s += [System.Convert]::ToInt32($_).ToString("X4")}
$s
>powershell -NoProfile -ExecutionPolicy Unrestricted .\b2c_1.ps1 864 25
0025
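
For comparison on the Java side, the same byte-to-char check could be sketched as below. This is only a sketch: the class name `B2cCheck` and its helper methods are made up for illustration, and it assumes the runtime ships the `IBM864` charset (I believe it lives in the `jdk.charsets` module). According to the IBM864.map column of the table above, 0x25 should decode to U+066A on current OpenJDK, but please verify on your own build.

```java
import java.nio.charset.Charset;

public class B2cCheck {
    // Parse a hex string such as "25" or "D7D8" into bytes (two hex digits per byte).
    static byte[] parseHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Decode the bytes with the given charset and print each char as 4 hex digits.
    static String decodeToHex(Charset cs, byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (char c : new String(bytes, cs).toCharArray()) {
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Defaults are assumptions chosen to mirror the PowerShell run above.
        String name = args.length > 0 ? args[0] : "IBM864";
        String hex = args.length > 1 ? args[1] : "25";
        if (!Charset.isSupported(name)) {
            System.out.println(name + " is not available in this runtime");
            return;
        }
        System.out.println(decodeToHex(Charset.forName(name), parseHex(hex)));
    }
}
```

It can be run as `java B2cCheck.java IBM864 25`, in the same way as the PowerShell script.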


Please disregard my comment about 0xD7, 0xD8, and 0xF1 if the target platform is Windows.

Note: Test results for the c2b (char-to-byte) side:

>type c2b_1.ps1
param($code, $hex)
$enc_r = [Text.Encoding]::GetEncoding([int]$code)
[char[]]$ca = @()
$ca += ([System.Convert]::ToInt32([string]$hex, 16))
$s = ""
$enc_r.GetBytes($ca) | foreach {$s += [System.Convert]::ToInt32($_).ToString("X2")}
$s
>powershell -NoProfile -ExecutionPolicy Unrestricted .\c2b_1.ps1 864 0025
25

>powershell -NoProfile -ExecutionPolicy Unrestricted .\c2b_1.ps1 864 066A
25
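
The Windows results above have the following shape: 0x25 decodes to U+0025 only, while both U+0025 and U+066A encode to 0x25. A minimal sketch of that shape, with one roundtrip entry plus one c2b-only entry, might look like the following. The class and method names are hypothetical, and this is not how the generated IBM864 charset in the JDK is implemented; it only illustrates the roundtrip versus c2b-only distinction.

```java
import java.util.HashMap;
import java.util.Map;

public class FallbackMappingSketch {
    // Decoding table (b2c): holds roundtrip entries only.
    static final Map<Integer, Character> B2C = new HashMap<>();
    // Encoding table (c2b): roundtrip entries plus one-way (c2b-only) entries.
    static final Map<Character, Integer> C2B = new HashMap<>();

    static {
        addRoundtrip(0x25, (char) 0x0025); // 0x25 <-> U+0025
        addC2bOnly(0x25, (char) 0x066A);   // U+066A -> 0x25, encoding direction only
    }

    static void addRoundtrip(int b, char c) {
        B2C.put(b, c);
        C2B.put(c, b);
    }

    static void addC2bOnly(int b, char c) {
        C2B.put(c, b);
    }

    // Decode one byte; unmapped bytes become U+FFFD in this sketch.
    static String decode(int b) {
        char c = B2C.getOrDefault(b, (char) 0xFFFD);
        return String.format("%04X", (int) c);
    }

    // Encode one char; unmappable chars become '?' (0x3F) in this sketch.
    static String encode(char c) {
        int b = C2B.getOrDefault(c, 0x3F);
        return String.format("%02X", b);
    }

    public static void main(String[] args) {
        System.out.println(decode(0x25));          // 0025
        System.out.println(encode((char) 0x0025)); // 25
        System.out.println(encode((char) 0x066A)); // 25
    }
}
```

With only these two entries, the sketch prints the same three values as the PowerShell runs above.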

-------------

PR: https://git.openjdk.org/jdk/pull/9661

