<i18n dev> Fwd: Some differences on Window UDC area
Charles Lee
littlee at linux.vnet.ibm.com
Wed May 30 01:42:23 PDT 2012
Hi Sherman,
Thank you for the info. MS936.map is really out of date. Updating it
will be very helpful.
On 05/30/2012 02:12 PM, Xueming Shen wrote:
> Hi Charles,
>
> The MS936 charset is long overdue for an update. See CR#6183404. The
> mapping needs to be regenerated from MS's latest 936 table (note: MS936
> should just follow MS's mapping table, not GB18030). As noted in
> MS936.map, the existing mapping table uses 1894 entries from the GBK UDC
> block for EUDC mapping, as suggested by an IBM engineer back in 1999,
> which was a reasonable approach back then.
>
> I will try to generate a new MS936 for JDK8.
>
> -Sherman
>
> On 5/23/2012 1:03 AM, Charles Lee wrote:
>> Hi guys,
>>
>> We have a simple test case:
>>
>> for (String cname : new String[] { "GBK", "MS936", "GB18030" }) {
>>     Charset charset = Charset.forName(cname);
>>     System.out.println("charset: " + charset.name());
>>     CharsetEncoder ce = charset.newEncoder();
>>     char[] chars = new char[] { 0xE585, 0xE586, 0xE592 };
>>     CharBuffer cb = CharBuffer.wrap(chars);
>>     ByteBuffer bb = ce.encode(cb);
>>
>>     for (char c : chars) {
>>         System.out.printf("\\u%04x", (int) c);
>>     }
>>     System.out.print(" -> ");
>>
>>     for (byte b : bb.array())
>>         if (b != 0x0) {
>>             System.out.printf("\\x%02x", (int) b & 0xFF);
>>         }
>>     System.out.println("");
>> }
>>
>> The output is
>> charset: GBK
>> \ue585\ue586\ue592 -> \xa2\xa0\xa2\xab\xa3\x40
>> charset: x-mswin-936
>> \ue585\ue586\ue592 -> \xa2\xa0\xa2\xab\xa3\x40
>> charset: GB18030
>> \ue585\ue586\ue592 -> \xa2\xa0\xa3\x40\xa3\x4c
>>
>> According to MSDN[1], U+E000 through U+F8FF is the EUDC range, so U+E586
>> falls inside it. But the code it maps to in MS936/GBK is 0xA2AB, which
>> is not in the EUDC range.
>> Another simple test case shows that more code points are not mapped
>> correctly:
>>
>> for (int i = 0xE000; i < 0xE000 + 1894; i++) {
>>     String s = new String(new char[] { (char) i });
>>     byte[] bs = s.getBytes("MS936");
>>     int b0 = (int) bs[0] & 0xFF;
>>     int b1 = (int) bs[1] & 0xFF;
>>     if ((b0 >= 0xAA && b0 <= 0xAF) && (b1 >= 0xA1 && b1 <= 0xFE))
>>         continue;
>>     if ((b0 >= 0xF8 && b0 <= 0xFE) && (b1 >= 0xA1 && b1 <= 0xFE))
>>         continue;
>>     if ((b0 >= 0xA1 && b0 <= 0xA7) && (b1 >= 0x40 && b1 <= 0xA0))
>>         continue;
>>     System.out.printf("\\u%04X -> \\x%02X\\x%02X%n", i, b0, b1);
>> }
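
The three range checks in the loop above can be pulled into one helper, which also makes it easy to see where the figure 1894 comes from. This is only a minimal sketch: the class and method names (GbkUdcRange, isUserDefined, countUserDefined) are made up for illustration, and excluding the trail byte 0x7F in the third range is an assumption added here (the loop above does not exclude it).

```java
// Hypothetical helper, not part of the JDK: classifies a GBK double-byte
// code (lead byte b0, trail byte b1) against the three user-defined areas
// tested in the loop above.
class GbkUdcRange {
    static boolean inRange(int v, int lo, int hi) {
        return v >= lo && v <= hi;
    }

    static boolean isUserDefined(int b0, int b1) {
        // Area 1: 0xAAA1-0xAFFE
        if (inRange(b0, 0xAA, 0xAF) && inRange(b1, 0xA1, 0xFE)) return true;
        // Area 2: 0xF8A1-0xFEFE
        if (inRange(b0, 0xF8, 0xFE) && inRange(b1, 0xA1, 0xFE)) return true;
        // Area 3: 0xA140-0xA7A0; 0x7F is skipped as a trail byte
        // (assumption added in this sketch, see the note above).
        return inRange(b0, 0xA1, 0xA7) && inRange(b1, 0x40, 0xA0) && b1 != 0x7F;
    }

    // Counts every (b0, b1) pair accepted by isUserDefined.
    static int countUserDefined() {
        int count = 0;
        for (int b0 = 0; b0 <= 0xFF; b0++) {
            for (int b1 = 0; b1 <= 0xFF; b1++) {
                if (isUserDefined(b0, b1)) count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 0xA2AB is the code reported for U+E586 in the output above.
        System.out.println("A2AB user-defined? " + isUserDefined(0xA2, 0xAB));
        System.out.println("A340 user-defined? " + isUserDefined(0xA3, 0x40));
        // 6*94 + 7*94 + 7*96 slots in total.
        System.out.println("user-defined slots: " + countUserDefined());
    }
}
```

With 0x7F excluded, the three areas hold 564 + 658 + 672 = 1894 slots, which matches the bound used in the loop.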
>>
>>
>> I have written a generator in C#[2] that outputs the mappings for
>> GB2312[3] and GB18030[4] in the range U+E000 to U+F8FF, and found that
>> most of the codes are the same. I therefore suggest that we follow the
>> GB2312 mapping; the changed map files for openjdk can be found at [5][6].
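
The same kind of comparison can be sketched on the Java side. The snippet below is not the C# generator from [2]. It is a rough illustration that encodes each code point in a range with two charsets and collects the ones whose bytes differ. The names PuaMappingDiff, differingCodePoints and hex are hypothetical, and the main method assumes the runtime ships the GBK and GB18030 charsets. Note that String.getBytes(Charset) substitutes the charset's replacement bytes for unmappable characters, so those compare as the replacement bytes.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only.
class PuaMappingDiff {
    // Returns the code points in [first, last] that the two charsets
    // encode to different byte sequences.
    static List<Integer> differingCodePoints(Charset a, Charset b, int first, int last) {
        List<Integer> diffs = new ArrayList<>();
        for (int cp = first; cp <= last; cp++) {
            String s = new String(Character.toChars(cp));
            if (!Arrays.equals(s.getBytes(a), s.getBytes(b))) {
                diffs.add(cp);
            }
        }
        return diffs;
    }

    // Formats bytes in the same \xNN style used in this thread.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("\\x%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: both charsets are installed in this runtime.
        if (!Charset.isSupported("GBK") || !Charset.isSupported("GB18030")) {
            System.out.println("GBK or GB18030 is not available in this runtime");
            return;
        }
        Charset gbk = Charset.forName("GBK");
        Charset gb18030 = Charset.forName("GB18030");
        for (int cp : differingCodePoints(gbk, gb18030, 0xE000, 0xF8FF)) {
            String s = new String(Character.toChars(cp));
            System.out.printf("U+%04X: GBK %s, GB18030 %s%n",
                    cp, hex(s.getBytes(gbk)), hex(s.getBytes(gb18030)));
        }
    }
}
```

Passing "MS936" instead of "GBK" to Charset.forName would compare the charset this thread is about, again assuming it is available.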
>>
>> Could anyone help take a look at this issue?
>>
>>
>>
>> [1]
>> http://msdn.microsoft.com/en-us/library/windows/desktop/dd317837%28v=vs.85%29.aspx
>> [2] http://cr.openjdk.java.net/~littlee/OJDK-63/webrev.00/Program.cs
>> [3] http://cr.openjdk.java.net/~littlee/OJDK-63/webrev.00/gb2312Map.txt
>> [4] http://cr.openjdk.java.net/~littlee/OJDK-63/webrev.00/gb18030Map.txt
>> [5] http://cr.openjdk.java.net/~littlee/OJDK-63/webrev.00/GBK.map.new
>> [6] http://cr.openjdk.java.net/~littlee/OJDK-63/webrev.00/MS936.map.new
>>
>> P.S: Sorry for the late notice.
>>
>>
>> On 03/29/2011 03:00 PM, Charles Lee wrote:
>>> On 03/28/2011 11:06 PM, Alan Bateman wrote:
>>>> Charles Lee wrote:
>>>>> :
>>>>>
>>>>> It looks similar. How can I find the patch quickly? I notice it
>>>>> says "the list is attached to this CR". Is it CR-6183404? Since cr
>>>>> has the pattern cr.openjdk.java.net/~username/id, how can I know
>>>>> who is the committer to this CR?
>>>> cr.openjdk.java.net is the place where we push webrevs when a patch
>>>> is out for review. I don't think this one is on anyone's list for
>>>> jdk7 and the list attached to the bug is likely the list of
>>>> incorrect mappings. If this is fixed then I assume the fix will
>>>> update the mappings in jdk/make/tools/CharsetMapping/MS936.map.
>>>>
>>>> -Alan
>>> I have output more bytes[1] to see whether other bytes are encoded
>>> correctly. Unfortunately, they are not. It looks as if, on Windows,
>>> ms936 maps the PUA the same way gb18030 does. Wikipedia says gb18030
>>> is compatible with gbk, which ms936 implements. Can we conclude that
>>> ms936 should follow gb18030's behavior?
>>>
>>>
>>> [1] 0xE585, 0xE586, 0xE587, 0xE588, 0xE589, 0xE58a, 0xE58b, 0xE58c,
>>> 0xE58d, 0xE58e, 0xE58f, 0xE590, 0xE591, 0xE592, 0xE593, 0xE594,
>>> 0xE595, 0xE596, 0xE597, 0xE598, 0xE599, 0xE59a, 0xE59b, 0xE59c,
>>> 0xE59d, 0xE59e, 0xe79f.
>>> Using MS936 charset, we expect:
>>> \xa2\xa0\xa3\x40\xa3\x41\xa3\x42\xa3\x43\xa3\x44\xa3\x45\xa3\x46\xa3\x47\xa3\x48\xa3\x49\xa3\x4a\xa3\x4b\xa3\x4c\xa3\x4d\xa3\x4e\xa3\x4f\xa3\x50\xa3\x51\xa3\x52\xa3\x53\xa3\x54\xa3\x55\xa3\x56\xa3\x57\xa3\x58\xa6\xfe
>>> but we got:
>>> \xa2\xa0\xa2\xab\xa2\xac\xa2\xad\xa2\xae\xa2\xaf\xa2\xb0\xa2\xe3\xa2\xe4\xa2\xef\xa2\xf0\xa2\xfd\xa2\xfe\xa3\x40\xa3\x41\xa3\x42\xa3\x43\xa3\x44\xa3\x45\xa3\x46\xa3\x47\xa3\x48\xa3\x49\xa3\x4a\xa3\x4b\xa3\x4c\xa7\xa0
>>
>>
>> --
>> Yours Charles
--
Yours Charles