Maybe codec bug in MS1252, i.e., encoding Cp1252
Xueming Shen
xueming.shen at oracle.com
Fri Sep 2 12:50:27 PDT 2011
On 09/02/2011 02:14 AM, Eric Liang wrote:
> On 09/02/2011 04:04 AM, Xueming Shen wrote:
>> Hi,
>>
>> These 5 code points are "undefined" characters in Cp1252. (The first one
>> should be 0x81, not 0x83, since 0x83<->U+0192 is defined and works
>> correctly in the Cp1252 charset.) The mapping table you referred to is a
>> "bestfit" type mapping table, which tries to provide a mapping between
>> the local encoding and the Unicode character set even for characters
>> that do not exist in the local encoding. Personally I don't think that
>> is a good idea in most usage scenarios. All other official (from
>> Microsoft) or unofficial mapping tables clearly mark these code points
>> as "undefined" or "unused", for example
>>
>> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
>> http://en.wikipedia.org/wiki/Windows-1252
>> http://msdn.microsoft.com/en-us/library/cc195054.aspx
>>
>> Btw, the code below is incorrect, or at least it does not work the way
>> you might expect.
>>
>> String name1 = new String( new String("兆源").getBytes("UTF-8"), "Cp1252");
>> String name2 = new String( name1.getBytes("Cp1252"), "UTF-8");
>>
>> new String("兆源").getBytes("UTF-8") encodes your 2 Chinese characters
>> from UTF-16 to UTF-8 bytes. It does not make sense to then decode these
>> UTF-8 bytes back to UTF-16 (which the String object uses) with the
>> Cp1252 charset.
>>
>> The same goes for the second attempt.
>>
>> What are you trying to achieve? Decoding/encoding between UTF-8 bytes
>> and Cp1252 bytes? That is not going to be a round-trip conversion for
>> non-ASCII characters.
> Thanks Sherman for your explanation.
>
> The problem occurred when I was using JDBC with MySQL. An earlier
> application stored UTF-8 data in a database with the default
> configuration (encoding latin1), and reading and decoding that data in
> PHP works fine. But reading the data fails in Java. According to the
> documentation (
> http://dev.mysql.com/doc/refman/5.5/en/connector-j-reference-charsets.html
> ), latin1 in MySQL corresponds to Cp1252 in Java, which is how I found
> the cause. I believe the person here ran into the same problem (
> http://forums.mysql.com/read.php?39,228068,228068#msg-228068 ).
>
> Data in latin1 (in Java) can be converted to UTF-8 and back freely.
> According to Wikipedia, Cp1252 is a superset of ISO_8859-1, so expecting
> Cp1252 to behave the same way as latin1 seems natural to me, though it
> does not work at present.
>
> However, YMMV. Would you mind giving some suggestions on this? Thanks
> in advance.
>
> Eric
Windows-1252 (cp1252) is a superset of ISO 8859-1, which is what is normally
referred to as latin-1. What we have in the Java charset repository is
ISO-8859-1. The difference between ISO 8859-1 and ISO-8859-1 (without and
with the dash) is the C0 and C1 control character area: ISO-8859-1 has C0
and C1 defined, ISO 8859-1 does not. ISO-8859-1 therefore maps every byte
value from 0x00 to 0xFF to a character, so arbitrary bytes survive a
decode/encode round trip.
So in your workaround above, you'd better use ISO-8859-1 instead of cp1252.
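A minimal sketch of the difference, assuming the JDK's Cp1252 decoder
replaces its undefined bytes with U+FFFD (which then encodes back as '?').
The class and method names are made up for illustration. Note that the
UTF-8 encoding of "兆源" is E5 85 86 E6 BA 90, and 0x90 is one of the
undefined Cp1252 bytes:

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CharsetRoundTrip {

    // Decode raw bytes with the given charset, then encode the String back.
    public static byte[] roundTrip(byte[] raw, Charset cs) {
        return new String(raw, cs).getBytes(cs);
    }

    // Collect every single-byte value that does not survive roundTrip().
    public static List<Integer> lossyBytes(Charset cs) {
        List<Integer> lossy = new ArrayList<Integer>();
        for (int b = 0; b < 256; b++) {
            byte[] in = { (byte) b };
            if (!Arrays.equals(in, roundTrip(in, cs))) {
                lossy.add(b);
            }
        }
        return lossy;
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        Charset latin1 = Charset.forName("ISO-8859-1");
        Charset cp1252 = Charset.forName("windows-1252");

        // "兆源" written with escapes so the source file encoding does not matter.
        byte[] raw = "\u5146\u6e90".getBytes(utf8);

        System.out.println("ISO-8859-1 lossless: "
                + Arrays.equals(raw, roundTrip(raw, latin1)));
        System.out.println("Cp1252 lossless:     "
                + Arrays.equals(raw, roundTrip(raw, cp1252)));
        System.out.println("Cp1252 lossy bytes:  " + lossyBytes(cp1252));

        // Workaround: a String that was decoded as ISO-8859-1 still holds the
        // original UTF-8 bytes, so re-encoding and decoding as UTF-8 recovers it.
        String misdecoded = new String(raw, latin1);
        System.out.println(new String(misdecoded.getBytes(latin1), utf8));
    }
}
```

This only illustrates the charset behaviour. It says nothing about how a
particular JDBC driver picks the charset for a latin1 column.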
I know little about JDBC + MySQL, so I am probably not the one to give
suggestions on this topic.
Just from reading the description of the problem you are facing, I guess
you'd better set your client-side encoding/charset correctly to utf-8 or
gbk in order to receive the Chinese results correctly.
-Sherman
More information about the jdk6-dev
mailing list