Maybe codec bug in MS1252, i.e., encoding Cp1252

Xueming Shen xueming.shen at oracle.com
Thu Sep 1 13:04:54 PDT 2011


Hi,

These five code points are undefined in Cp1252. (The first one should be
0x81, not 0x83, since the mapping 0x83 <-> U+0192 is defined and works
correctly in the Cp1252 charset.) The mapping table you referred to is a
"best fit" mapping table, which tries to provide a mapping between the
local encoding and the Unicode character set even for characters that do
not exist in the local encoding. Personally, I don't think that is a good
idea in most usage scenarios. All other mapping tables, official (from
Microsoft) or unofficial, clearly mark these code points as "undefined"
or "unused", for example:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
http://en.wikipedia.org/wiki/Windows-1252
http://msdn.microsoft.com/en-us/library/cc195054.aspx
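
As a rough illustration of what "undefined" means in practice, the sketch
below asks a strict Cp1252 decoder whether it accepts each of the five
byte values, and 0x83 for comparison. The class and method names are
made up for this example, and it assumes the runtime provides the
"windows-1252" charset (the canonical name for Cp1252).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper class, not part of the JDK.
public class Cp1252Undefined {
    // The five byte values that CP1252.TXT marks as undefined.
    static final int[] UNDEFINED = {0x81, 0x8d, 0x8f, 0x90, 0x9d};

    // Returns true if a strict (REPORT) decoder accepts the single byte b.
    static boolean isDefined(int b) {
        Charset cp1252 = Charset.forName("windows-1252");
        try {
            cp1252.newDecoder()
                  .onMalformedInput(CodingErrorAction.REPORT)
                  .onUnmappableCharacter(CodingErrorAction.REPORT)
                  .decode(ByteBuffer.wrap(new byte[] {(byte) b}));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown when the byte has no mapping in the charset.
            return false;
        }
    }

    // Lenient decoding, as done by the String constructor: bytes without
    // a mapping are replaced instead of raising an error.
    static String decodeLenient(int b) {
        return new String(new byte[] {(byte) b}, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        for (int b : UNDEFINED) {
            System.out.printf("0x%02x defined: %b%n", b, isDefined(b));
        }
        System.out.printf("0x83 defined: %b%n", isDefined(0x83));
    }
}
```

The String constructor does not throw for such bytes; it substitutes the
replacement character U+FFFD, which is why the problem only shows up
later when the string is encoded again.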

Btw, the code below is incorrect, or at least it does not work the way
you might expect.

String name1 = new String( new String("兆源").getBytes("UTF-8"), "Cp1252");
String name2 = new String( name1.getBytes("Cp1252"), "UTF-8");

new String("兆源").getBytes("UTF-8") encodes your two Chinese characters
from UTF-16 to UTF-8 bytes. It does not make sense to then decode these
UTF-8 bytes back to UTF-16 (which the String object uses) with the Cp1252
charset.

The same applies to the second statement.

What are you trying to achieve? Converting between UTF-8 bytes and Cp1252
bytes? That is not going to be a round-trip conversion for those
non-ASCII characters.
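
To make the loss visible, here is a small sketch that repeats the two
steps from the snippet and prints the bytes before and after the trip
through Cp1252. The class and method names are invented for this
example, and it again assumes the runtime provides "windows-1252". The
UTF-8 form of the second character ends in the byte 0x90, one of the
undefined Cp1252 values, so that byte cannot survive.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK.
public class RoundTripTrace {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Formats bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Step 1 of the snippet: UTF-8 bytes misread as Cp1252 text.
    static String misdecode(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    // Steps 1 and 2 of the snippet combined.
    static String roundTrip(String s) {
        return new String(misdecode(s).getBytes(CP1252), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The two characters from the report, written as escapes so the
        // example does not depend on the source file encoding.
        String original = "\u5146\u6e90";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        byte[] back = misdecode(original).getBytes(CP1252);
        System.out.println("UTF-8 bytes:        " + hex(utf8));
        System.out.println("after Cp1252 trip:  " + hex(back));
        System.out.println("round trip intact:  " + original.equals(roundTrip(original)));
    }
}
```

Pure ASCII input passes through unchanged, because its UTF-8 bytes are
all below 0x80 and are defined identically in both charsets.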

-Sherman


On 09/01/2011 12:12 PM, Eric Liang wrote:
> Hi all,
> I recently hit an encoding error while using Cp1252 with UTF-8. A
> string converted from UTF-8 to Cp1252 cannot be converted back:
>
>     String name1 = new String( new String("兆源").getBytes("UTF-8"),
>     "Cp1252");
>     String name2 = new String( name1.getBytes("Cp1252"), "UTF-8");
>
> It looks like there are some incorrect mappings in the JDK's Cp1252
> encoding. The affected code points are:
>
>     0x83    0x0192    ;Latin Small Letter F With Hook
>     0x8d    0x008d
>     0x8f    0x008f
>     0x90    0x0090
>     0x9d    0x009d
>
>     ( from the Cp1252->UTF-8 map in
>     http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt
>     )
>
> After I cloned the repository at http://hg.openjdk.java.net/jdk6/jdk6
> and fixed these mappings in MS1252.java, the encoding error went away.
>
> I guess this is the right place to discuss this problem. The patch is
> in the attachment. Any comments are appreciated.
>
> Regards,
> Eric
> -- 
> -----BEGIN GEEK CODE BLOCK-----
> Version: 3.1
> GCM/CS/E/MU/P d+(-) s: a- C++ UL$ P+>++ L++ E++ W++ N+ o+>++ K+++ w !O
> M-(+) V-- PS+ PE+ Y+ PGP++ t? 5? X? R+>* tv@ b++++ DI-- D G++ e++>+++@ h*
> r !y+
> ------END GEEK CODE BLOCK------


