Maybe codec bug in MS1252, i.e., encoding Cp1252
Xueming Shen
xueming.shen at oracle.com
Thu Sep 1 13:04:54 PDT 2011
Hi,
These 5 code points are "undefined" characters in Cp1252. (The first one
should be 0x81, not 0x83, since 0x83<->U+0192 is defined and works
correctly in the Cp1252 charset.) The mapping table you referred to is a
"bestfit" mapping table, which tries to provide mappings between the
local encoding and the Unicode character set even for characters that do
not exist in the local encoding. Personally I don't think that is a good
idea in most usage scenarios. All other mapping tables, official (from
Microsoft) or unofficial, clearly mark these code points as "undefined"
or "unused", for example:
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
http://en.wikipedia.org/wiki/Windows-1252
http://msdn.microsoft.com/en-us/library/cc195054.aspx
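A minimal sketch of how the effect of these undefined code points can be observed from Java. The class and method names are made up for illustration, and it assumes the runtime's Cp1252 decoder substitutes U+FFFD for the undefined bytes, which the encoder then cannot map back to the original byte value:

```java
import java.nio.charset.Charset;

// Hypothetical helper class, not part of the JDK or of the patch under discussion.
final class Cp1252UndefinedBytes {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Decodes a single byte with Cp1252 and returns the resulting char.
    static char decode(int b) {
        return new String(new byte[] { (byte) b }, CP1252).charAt(0);
    }

    // Decodes one byte, encodes the result again, and reports whether
    // the original byte value comes back.
    static boolean roundTrips(int b) {
        byte[] in = { (byte) b };
        byte[] out = new String(in, CP1252).getBytes(CP1252);
        return out.length == 1 && out[0] == in[0];
    }

    public static void main(String[] args) {
        // 0x83 is included as a control: it is defined (U+0192).
        int[] bytes = { 0x81, 0x83, 0x8d, 0x8f, 0x90, 0x9d };
        for (int b : bytes) {
            System.out.printf("0x%02x -> U+%04X, round trip: %b%n",
                    b, (int) decode(b), roundTrips(b));
        }
    }
}
```

Under that assumption, 0x83 should be the only byte in the list that survives the round trip.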
Btw, the code below is incorrect, or at least it does not work the way
you might expect:
String name1 = new String( new String("兆源").getBytes("UTF-8"), "Cp1252");
String name2 = new String( name1.getBytes("Cp1252"), "UTF-8");
new String("兆源").getBytes("UTF-8") encodes your 2 Chinese characters from
UTF-16 to UTF-8 bytes. It does not make sense to then decode these UTF-8
bytes back to UTF-16 (which the String object uses) using the Cp1252
charset. The same goes for the second statement.
What are you trying to achieve? Decoding/encoding between UTF-8 bytes and
Cp1252 bytes? That is not going to be a round-trip conversion for
non-ASCII characters.
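To make this concrete, here is a rough sketch that compares the two-step conversion through Cp1252 with decoding the UTF-8 bytes using the charset that produced them. The class and method names are hypothetical, and it uses java.nio.charset.StandardCharsets, which assumes JDK 7 or later (on JDK 6 the charset names would be passed as strings instead). The UTF-8 form of the second character, U+6E90, ends in byte 0x90, which is one of the undefined Cp1252 code points:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class Utf8ViaCp1252 {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // The conversion from the original report: UTF-8 bytes are decoded
    // as Cp1252, re-encoded as Cp1252, then decoded as UTF-8.
    static String viaCp1252(String s) {
        String name1 = new String(s.getBytes(StandardCharsets.UTF_8), CP1252);
        return new String(name1.getBytes(CP1252), StandardCharsets.UTF_8);
    }

    // Bytes are decoded with the same charset that encoded them.
    static String viaUtf8(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Formats bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = "\u5146\u6E90"; // the two characters from the report
        System.out.println("UTF-8 bytes: " + hex(name.getBytes(StandardCharsets.UTF_8)));
        System.out.println("via UTF-8 only, equal:  " + name.equals(viaUtf8(name)));
        System.out.println("via Cp1252, equal:      " + name.equals(viaCp1252(name)));
    }
}
```

If bytes need to be carried around unchanged, keeping them as a byte[] avoids the problem entirely, since no charset is involved until the final decode.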
-Sherman
On 09/01/2011 12:12 PM, Eric Liang wrote:
> Hi all,
> I've recently got an encoding error while using Cp1252 with UTF-8, the
> string converted from UTF-8 to Cp1252 can not be converted back:
>
> String name1 = new String( new String("兆源").getBytes("UTF-8"),
> "Cp1252");
> String name2 = new String( name1.getBytes("Cp1252"), "UTF-8");
>
> It looks like that there are some incorrect codes in jdk on encoding
> Cp1252, and the related codes are:
>
> 0x83 0x0192 ;Latin Small Letter F With Hook
> 0x8d 0x008d
> 0x8f 0x008f
> 0x90 0x0090
> 0x9d 0x009d
>
> ( from the Cp1252->UTF-8 map in
> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt
> )
>
> After I cloned the repository in http://hg.openjdk.java.net/jdk6/jdk6
> and fix these codes in MS1252.java, the encoding error has gone.
>
> I guess this is the right place to discuss this problem, and the patch
> is in the attachment. Anyone with any comment is appreciated.
>
> Regards,
> Eric
> --
> -----BEGIN GEEK CODE BLOCK-----
> Version: 3.1
> GCM/CS/E/MU/P d+(-) s: a- C++ UL$ P+>++ L++ E++ W++ N+ o+>++ K+++ w !O
> M-(+) V-- PS+ PE+ Y+ PGP++ t? 5? X? R+>* tv@ b++++ DI-- D G++ e++>+++@ h*
> r !y+
> ------END GEEK CODE BLOCK------