<i18n dev> Fwd: Some differences on Window UDC area
Xueming Shen
xueming.shen at oracle.com
Tue May 29 23:12:03 PDT 2012
Hi Charles,
The MS936 charset is long overdue for an update; see CR#6183404. The
mapping needs
to be regenerated from MS's latest 936 table (note: MS936 should just
follow MS's mapping
table, not GB18030). As noted in MS936.map, the existing mapping table
uses 1894 entries
from the GBK UDC block for the EUDC mapping, as suggested by an IBM
engineer back in 1999, which
was a reasonable approach back then.
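
As a rough illustration of where the 1894 figure comes from, here is a
minimal sketch. It assumes the three GBK user-defined byte ranges that the
quoted test case below checks, and it assumes they are filled linearly from
U+E000 in the order area 1, area 2, area 3 with trail byte 0x7F skipped.
The class and method names (GbkUdcLayout, totalCells, encode) are
hypothetical, and this is not the generator used for MS936.map.

```java
// Sketch only: models an assumed linear PUA -> GBK UDC layout.
class GbkUdcLayout {
    // Each row: first lead byte, last lead byte, first trail byte, last trail byte.
    private static final int[][] AREAS = {
        {0xAA, 0xAF, 0xA1, 0xFE},  // user-defined area 1
        {0xF8, 0xFE, 0xA1, 0xFE},  // user-defined area 2
        {0xA1, 0xA7, 0x40, 0xA0},  // user-defined area 3 (trail byte 0x7F is skipped)
    };

    // Number of usable trail bytes per lead byte in the given area.
    static int cellsPerRow(int[] a) {
        int n = a[3] - a[2] + 1;
        if (a[2] <= 0x7F && 0x7F <= a[3]) {
            n--;  // 0x7F is not a valid trail byte
        }
        return n;
    }

    // Total number of code positions across the three areas.
    static int totalCells() {
        int total = 0;
        for (int[] a : AREAS) {
            total += (a[1] - a[0] + 1) * cellsPerRow(a);
        }
        return total;
    }

    // Returns (lead << 8 | trail) for a PUA code point under the assumed
    // linear layout, or -1 if the code point falls outside the three areas.
    static int encode(int codePoint) {
        int offset = codePoint - 0xE000;
        if (offset < 0) {
            return -1;
        }
        for (int[] a : AREAS) {
            int perRow = cellsPerRow(a);
            int size = (a[1] - a[0] + 1) * perRow;
            if (offset < size) {
                int lead = a[0] + offset / perRow;
                int trail = a[2] + offset % perRow;
                if (a[2] <= 0x7F && trail >= 0x7F) {
                    trail++;  // step over the unused 0x7F
                }
                return (lead << 8) | trail;
            }
            offset -= size;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("cells: " + totalCells());
        for (int cp : new int[] {0xE585, 0xE586, 0xE592}) {
            System.out.printf("U+%04X -> 0x%04X%n", cp, encode(cp));
        }
    }
}
```

Under these assumptions the sketch reports 1894 cells (6*94 + 7*94 + 7*96)
and maps U+E585, U+E586 and U+E592 to 0xA2A0, 0xA340 and 0xA34C, which is
what the GB18030 run in the quoted output shows.
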
I will try to generate a new MS936 for JDK8.
-Sherman
On 5/23/2012 1:03 AM, Charles Lee wrote:
> Hi guys,
>
> We have a simple test case:
>
> // Fragment: needs java.nio.*, java.nio.charset.*
> for (String cname : new String[] { "GBK", "MS936", "GB18030" }) {
>     Charset charset = Charset.forName(cname);
>     System.out.println("charset: " + charset.name());
>     CharsetEncoder ce = charset.newEncoder();
>     char[] chars = new char[] { 0xE585, 0xE586, 0xE592 };
>     CharBuffer cb = CharBuffer.wrap(chars);
>     ByteBuffer bb = ce.encode(cb);
>
>     for (char c : chars) {
>         System.out.printf("\\u%04x", (int) c);
>     }
>     System.out.print(" -> ");
>
>     // Print only the encoded bytes, not the unused tail of the backing array.
>     while (bb.hasRemaining()) {
>         System.out.printf("\\x%02x", bb.get() & 0xFF);
>     }
>     System.out.println("");
> }
>
> The output is
> charset: GBK
> \ue585\ue586\ue592 -> \xa2\xa0\xa2\xab\xa3\x40
> charset: x-mswin-936
> \ue585\ue586\ue592 -> \xa2\xa0\xa2\xab\xa3\x40
> charset: GB18030
> \ue585\ue586\ue592 -> \xa2\xa0\xa3\x40\xa3\x4c
>
> According to MSDN [1], U+E000 through U+F8FF is the EUDC range, so
> U+E586 is in the EUDC range. But the code it maps to in MS936/GBK,
> 0xA2AB, is not in the EUDC range.
> With another simple test case, you can find more code points that
> are not mapped correctly:
>
> for (int i = 0xE000; i < 0xE000 + 1894; i++) {
>     String s = new String(new char[] { (char) i });
>     byte[] bs = s.getBytes("MS936");
>     int b0 = (int) bs[0] & 0xFF;
>     int b1 = (int) bs[1] & 0xFF;
>     if ((b0 >= 0xAA && b0 <= 0xAF) && (b1 >= 0xA1 && b1 <= 0xFE))
>         continue;
>     if ((b0 >= 0xF8 && b0 <= 0xFE) && (b1 >= 0xA1 && b1 <= 0xFE))
>         continue;
>     if ((b0 >= 0xA1 && b0 <= 0xA7) && (b1 >= 0x40 && b1 <= 0xA0))
>         continue;
>     System.out.printf("\\u%04X -> \\x%02X\\x%02X%n", i, b0, b1);
> }
>
>
> I have written a generator in C# [2] that outputs the GB2312 [3] and
> GB18030 [4] mappings for the range U+E000 to U+F8FF, and found that
> most of the codes are the same. I therefore suggest we follow the
> GB2312 mapping; the changed map files for openjdk can be found at [5][6].
>
> Would anyone help take a look at this issue?
>
>
>
> [1]
> http://msdn.microsoft.com/en-us/library/windows/desktop/dd317837%28v=vs.85%29.aspx
> [2] http://cr.openjdk.java.net/~littlee/OJDK-63/webrev.00/Program.cs
> [3] http://cr.openjdk.java.net/~littlee/OJDK-63/webrev.00/gb2312Map.txt
> [4] http://cr.openjdk.java.net/~littlee/OJDK-63/webrev.00/gb18030Map.txt
> [5] http://cr.openjdk.java.net/~littlee/OJDK-63/webrev.00/GBK.map.new
> [6] http://cr.openjdk.java.net/~littlee/OJDK-63/webrev.00/MS936.map.new
>
> P.S: Sorry for the late notice.
>
>
> On 03/29/2011 03:00 PM, Charles Lee wrote:
>> On 03/28/2011 11:06 PM, Alan Bateman wrote:
>>> Charles Lee wrote:
>>>> :
>>>>
>>>> It looks similar. How can I find the patch quickly? I notice it
>>>> says "the list is attached to this CR". Is it CR-6183404? Since cr
>>>> URLs follow the pattern cr.openjdk.java.net/~username/id, how can I
>>>> find out who the committer for this CR is?
>>> cr.openjdk.java.net is the place where we push webrevs when a patch
>>> is out for review. I don't think this one is on anyone's list for
>>> jdk7, and the list attached to the bug is likely the list of
>>> incorrect mappings. If this is fixed then I assume the fix will
>>> update the mappings in jdk/make/tools/CharsetMapping/MS936.map.
>>>
>>> -Alan
>> I have output more bytes [1] to see whether other code points are
>> encoded correctly. Unfortunately, they are not. It looks as if, on
>> Windows, the PUA of ms936 uses the PUA mapping of gb18030.
>> Wikipedia says gb18030 is compatible with gbk, which ms936
>> implements. Can we conclude that ms936 should follow gb18030's
>> behavior?
>>
>>
>> [1] 0xE585, 0xE586, 0xE587, 0xE588, 0xE589, 0xE58a, 0xE58b, 0xE58c,
>> 0xE58d, 0xE58e, 0xE58f, 0xE590, 0xE591, 0xE592, 0xE593, 0xE594,
>> 0xE595, 0xE596, 0xE597, 0xE598, 0xE599, 0xE59a, 0xE59b, 0xE59c,
>> 0xE59d, 0xE59e, 0xe79f.
>> Using MS936 charset, we expect:
>> \xa2\xa0\xa3\x40\xa3\x41\xa3\x42\xa3\x43\xa3\x44\xa3\x45\xa3\x46\xa3\x47\xa3\x48\xa3\x49\xa3\x4a\xa3\x4b\xa3\x4c\xa3\x4d\xa3\x4e\xa3\x4f\xa3\x50\xa3\x51\xa3\x52\xa3\x53\xa3\x54\xa3\x55\xa3\x56\xa3\x57\xa3\x58\xa6\xfe
>> but we got:
>> \xa2\xa0\xa2\xab\xa2\xac\xa2\xad\xa2\xae\xa2\xaf\xa2\xb0\xa2\xe3\xa2\xe4\xa2\xef\xa2\xf0\xa2\xfd\xa2\xfe\xa3\x40\xa3\x41\xa3\x42\xa3\x43\xa3\x44\xa3\x45\xa3\x46\xa3\x47\xa3\x48\xa3\x49\xa3\x4a\xa3\x4b\xa3\x4c\xa7\xa0
>
>
> --
> Yours Charles