Maybe codec bug in MS1252, i.e., encoding Cp1252
Eric Liang
eric.l.2046 at gmail.com
Fri Sep 2 02:14:51 PDT 2011
On 09/02/2011 04:04 AM, Xueming Shen wrote:
> Hi,
>
> These 5 code points are "undefined" characters in Cp1252. (The first one
> should be 0x81, not 0x83, since 0x83<->u_0192 is defined and works
> correctly in the Cp1252 charset.) The mapping table you referred to is a
> "bestfit" type mapping table, which tries to provide a mapping
> between the local encoding and the Unicode character set even for
> characters that do not exist in the local encoding. Personally I don't think
> that is a good idea in most use scenarios. All other official (from Microsoft)
> or unofficial mapping tables clearly mark these code points "undefined"
> or "unused", for example
>
> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
> http://en.wikipedia.org/wiki/Windows-1252
> http://msdn.microsoft.com/en-us/library/cc195054.aspx
>
> btw, code below is incorrect, or it does not work the way you might
> expect.
>
> String name1 = new String( new String("兆源").getBytes("UTF-8"),
> "Cp1252");
> String name2 = new String( name1.getBytes("Cp1252"), "UTF-8");
>
> new String("兆源").getBytes("UTF-8") encodes your 2 Chinese characters from
> UTF-16 to UTF-8 bytes. It does not make sense to then decode these UTF-8
> bytes back to UTF-16 (which the String object uses) by using the Cp1252
> charset.
>
> same for the second attempt.
>
> What did you try to achieve? decode/encode between UTF-8 bytes and CP1252
> bytes? It's not going to be a round-trip conversion for those
> non-ASCII characters.
Thanks Sherman for your explanation.
The problem occurred when I was using JDBC with MySQL. An earlier
application stored UTF-8 data in a database with the default configuration
(encoding latin1). Reading that data back and decoding it in PHP works
fine, but reading it in Java fails. According to the documentation (
http://dev.mysql.com/doc/refman/5.5/en/connector-j-reference-charsets.html
), latin1 in MySQL corresponds to Cp1252 in Java, which is how I found the
cause. I believe the poster here ran into the same problem (
http://forums.mysql.com/read.php?39,228068,228068#msg-228068 ).
In Java, bytes decoded as latin1 (ISO-8859-1) can be converted to UTF-8
and back freely. Wikipedia describes Cp1252 as a superset of
ISO_8859-1, so expecting Cp1252 to behave the same way as latin1 seemed
natural to me, though it does not work now.
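To make the difference concrete, here is a minimal sketch of the round trip I have in mind. The class and method names (RoundTripDemo, roundTrip) are made up for illustration, and it assumes the JDK decodes the undefined Cp1252 bytes to U+FFFD, as described above. The UTF-8 form of the second character contains the byte 0x90, which is one of the five undefined code points, so I would expect only the ISO-8859-1 trip to survive.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of any library.
class RoundTripDemo {

    // Treat the UTF-8 bytes of `text` as if they were text in the
    // single-byte charset `via`, then undo that step and decode as UTF-8.
    public static String roundTrip(String text, Charset via) {
        Charset utf8 = Charset.forName("UTF-8");
        byte[] utf8Bytes = text.getBytes(utf8);
        String misdecoded = new String(utf8Bytes, via); // the "wrong" decode
        byte[] back = misdecoded.getBytes(via);         // try to recover the bytes
        return new String(back, utf8);
    }

    public static void main(String[] args) {
        // The two Chinese characters from the original example,
        // written as escapes so the source file encoding does not matter.
        String original = "\u5146\u6e90";

        Charset latin1 = Charset.forName("ISO-8859-1");
        Charset cp1252 = Charset.forName("windows-1252");

        // ISO-8859-1 maps every byte 0x00..0xFF to a character, so nothing is lost.
        System.out.println("ISO-8859-1 round trip ok: "
                + original.equals(roundTrip(original, latin1)));

        // Byte 0x90 has no mapping in Cp1252, so it cannot be recovered.
        System.out.println("Cp1252 round trip ok:     "
                + original.equals(roundTrip(original, cp1252)));
    }
}
```

If that assumption holds, the first line reports true and the second false, which matches what I see when reading the data through JDBC.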
That said, YMMV. Would you mind giving some suggestions on this? Thanks in
advance.
Eric