<i18n dev> Fwd: Some differences on Window UDC area
Charles Lee
littlee at linux.vnet.ibm.com
Wed May 23 01:03:21 PDT 2012
Hi guys,
We have a simple test case:
for (String cname : new String[] { "GBK", "MS936", "GB18030" }) {
    Charset charset = Charset.forName(cname);
    System.out.println("charset: " + charset.name());
    CharsetEncoder ce = charset.newEncoder();
    char[] chars = new char[] { 0xE585, 0xE586, 0xE592 };
    CharBuffer cb = CharBuffer.wrap(chars);
    ByteBuffer bb = ce.encode(cb);
    for (char c : chars) {
        System.out.printf("\\u%04x", (int) c);
    }
    System.out.print(" -> ");
    for (byte b : bb.array())
        if (b != 0x0) {
            System.out.printf("\\x%02x", (int) b & 0xFF);
        }
    System.out.println("");
}
The output is:
charset: GBK
\ue585\ue586\ue592 -> \xa2\xa0\xa2\xab\xa3\x40
charset: x-mswin-936
\ue585\ue586\ue592 -> \xa2\xa0\xa2\xab\xa3\x40
charset: GB18030
\ue585\ue586\ue592 -> \xa2\xa0\xa3\x40\xa3\x4c
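The test above skips zero bytes because bb.array() returns the whole backing array, which can be longer than the encoded output. A minimal sketch of an alternative that reads only the bytes up to the buffer's limit is below. The class name EncodeDump and the method encodeToHex are made up for illustration, and wrapping the checked exception is just a convenience for the sketch.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;

class EncodeDump {
    // Hypothetical helper: encode the chars with the named charset and
    // format every encoded byte as \xNN.
    static String encodeToHex(String charsetName, char[] chars) {
        Charset charset = Charset.forName(charsetName);
        ByteBuffer bb;
        try {
            bb = charset.newEncoder().encode(CharBuffer.wrap(chars));
        } catch (CharacterCodingException e) {
            // Unmappable or malformed input; rethrown unchecked for brevity.
            throw new UncheckedIOException(e);
        }
        StringBuilder sb = new StringBuilder();
        // Read only position..limit, so no zero-byte filter is needed.
        while (bb.hasRemaining()) {
            sb.append(String.format("\\x%02x", bb.get() & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char[] chars = { 0xE585, 0xE586, 0xE592 };
        for (String cname : new String[] { "GBK", "MS936", "GB18030" }) {
            System.out.println(cname + ": " + encodeToHex(cname, chars));
        }
    }
}
```

Unlike the zero-byte filter, this would also print a legitimate 0x00 byte if one ever appeared in the encoded output.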
According to MSDN [1], U+E000 through U+F8FF is the EUDC range, so U+E586
is in the EUDC range. But MS936/GBK maps it to 0xA2AB, which is not in
the EUDC range.
With another simple test case, you can see that more code points are
mapped incorrectly:
for (int i = 0xE000; i < 0xE000 + 1894; i++) {
    String s = new String(new char[] { (char) i });
    byte[] bs = s.getBytes("MS936");
    int b0 = (int) bs[0] & 0xFF;
    int b1 = (int) bs[1] & 0xFF;
    if ((b0 >= 0xAA && b0 <= 0xAF) && (b1 >= 0xA1 && b1 <= 0xFE))
        continue;
    if ((b0 >= 0xF8 && b0 <= 0xFE) && (b1 >= 0xA1 && b1 <= 0xFE))
        continue;
    if ((b0 >= 0xA1 && b0 <= 0xA7) && (b1 >= 0x40 && b1 <= 0xA0))
        continue;
    System.out.printf("\\u%04X -> \\x%02X\\x%02X%n", i, b0, b1);
}
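The three range checks in the loop above could also be pulled into a small helper, which makes it easy to check single byte pairs such as 0xA2AB by hand. This is only a sketch: the class and method names are invented here, and the byte ranges are copied unchanged from the loop.

```java
class EudcRanges {
    // True if the code point lies in U+E000..U+F8FF.
    static boolean isPrivateUse(int codePoint) {
        return codePoint >= 0xE000 && codePoint <= 0xF8FF;
    }

    // True if the lead/trail byte pair falls in one of the three
    // double-byte areas that the loop above skips.
    static boolean isInUserDefinedArea(int b0, int b1) {
        boolean first = b0 >= 0xAA && b0 <= 0xAF && b1 >= 0xA1 && b1 <= 0xFE;
        boolean second = b0 >= 0xF8 && b0 <= 0xFE && b1 >= 0xA1 && b1 <= 0xFE;
        boolean third = b0 >= 0xA1 && b0 <= 0xA7 && b1 >= 0x40 && b1 <= 0xA0;
        return first || second || third;
    }

    public static void main(String[] args) {
        System.out.println(isInUserDefinedArea(0xA2, 0xA0)); // true
        System.out.println(isInUserDefinedArea(0xA2, 0xAB)); // false
        System.out.println(isInUserDefinedArea(0xA3, 0x40)); // true
    }
}
```

With this helper, 0xA2AB (what MS936/GBK currently produce for U+E586) is rejected, while 0xA340 (what GB18030 produces) is accepted.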
I have written a generator in C# [2] that outputs the GB2312 [3] and
GB18030 [4] mappings for the range U+E000 to U+F8FF, and found that most
of the codes are the same. I therefore suggest we follow the mapping from
GB2312; the changed map files for OpenJDK can be found at [5][6].
Could anyone help take a look at this issue?
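For anyone who wants to repeat a similar comparison from the Java side, here is a rough sketch. It is not the C# generator from [2]; it only compares what the JDK's own charsets produce over U+E000 to U+F8FF. The class name PuaMappingDiff and the method countSame are invented for illustration, and it assumes the runtime ships the extended charsets (GBK, MS936).

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class PuaMappingDiff {
    static final int PUA_START = 0xE000;
    static final int PUA_END = 0xF8FF;

    // Counts the code points in U+E000..U+F8FF that both charsets encode
    // to identical byte sequences. Caveat: String.getBytes substitutes the
    // charset's replacement bytes for unmappable characters, so two
    // unmappable results can also compare equal.
    static int countSame(String nameA, String nameB) {
        Charset a = Charset.forName(nameA);
        Charset b = Charset.forName(nameB);
        int same = 0;
        for (int cp = PUA_START; cp <= PUA_END; cp++) {
            String s = String.valueOf((char) cp);
            if (Arrays.equals(s.getBytes(a), s.getBytes(b))) {
                same++;
            }
        }
        return same;
    }

    public static void main(String[] args) {
        int total = PUA_END - PUA_START + 1;
        System.out.println("GBK vs GB18030: "
                + countSame("GBK", "GB18030") + " of " + total + " identical");
        System.out.println("MS936 vs GB18030: "
                + countSame("MS936", "GB18030") + " of " + total + " identical");
    }
}
```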
[1] http://msdn.microsoft.com/en-us/library/windows/desktop/dd317837%28v=vs.85%29.aspx
[2] http://cr.openjdk.java.net/~littlee/OJDK-63/webrev.00/Program.cs
[3] http://cr.openjdk.java.net/~littlee/OJDK-63/webrev.00/gb2312Map.txt
[4] http://cr.openjdk.java.net/~littlee/OJDK-63/webrev.00/gb18030Map.txt
[5] http://cr.openjdk.java.net/~littlee/OJDK-63/webrev.00/GBK.map.new
[6] http://cr.openjdk.java.net/~littlee/OJDK-63/webrev.00/MS936.map.new
P.S: Sorry for the late notice.
On 03/29/2011 03:00 PM, Charles Lee wrote:
> On 03/28/2011 11:06 PM, Alan Bateman wrote:
>> Charles Lee wrote:
>>> :
>>>
>>> It looks similar. How can I find the patch quickly? I notice it says
>>> "the list is attached to this CR". Is it CR-6183404? Since cr has
>>> the pattern cr.openjdk.java.net/~username/id, how can I know who is
>>> the committer to this CR?
>> cr.openjdk.java.net is the place where we push webrevs when a patch
>> is out for review. I don't think this one is on anyone's list for
>> jdk7 and the list attached to the bug is likely the list of incorrect
>> mappings. If this is fixed then I assume the fix will update the
>> mappings in jdk/make/tools/CharsetMapping/MS936.map.
>>
>> -Alan
> I have output more bytes [1] to see whether the other bytes are encoded
> correctly. Unfortunately, they are not. It looks as though, on
> Windows, the PUA of ms936 uses the PUA mapping of gb18030. Wikipedia
> says gb18030 is compatible with gbk, which ms936 implements. Can we
> conclude that ms936 should follow gb18030's behavior?
>
>
> [1] 0xE585, 0xE586, 0xE587, 0xE588, 0xE589, 0xE58a, 0xE58b, 0xE58c,
> 0xE58d, 0xE58e, 0xE58f, 0xE590, 0xE591, 0xE592, 0xE593, 0xE594,
> 0xE595, 0xE596, 0xE597, 0xE598, 0xE599, 0xE59a, 0xE59b, 0xE59c,
> 0xE59d, 0xE59e, 0xe79f.
> Using MS936 charset, we expect:
> \xa2\xa0\xa3\x40\xa3\x41\xa3\x42\xa3\x43\xa3\x44\xa3\x45\xa3\x46\xa3\x47\xa3\x48\xa3\x49\xa3\x4a\xa3\x4b\xa3\x4c\xa3\x4d\xa3\x4e\xa3\x4f\xa3\x50\xa3\x51\xa3\x52\xa3\x53\xa3\x54\xa3\x55\xa3\x56\xa3\x57\xa3\x58\xa6\xfe
> but we got:
> \xa2\xa0\xa2\xab\xa2\xac\xa2\xad\xa2\xae\xa2\xaf\xa2\xb0\xa2\xe3\xa2\xe4\xa2\xef\xa2\xf0\xa2\xfd\xa2\xfe\xa3\x40\xa3\x41\xa3\x42\xa3\x43\xa3\x44\xa3\x45\xa3\x46\xa3\x47\xa3\x48\xa3\x49\xa3\x4a\xa3\x4b\xa3\x4c\xa7\xa0
--
Yours Charles