Maybe codec bug in MS1252, i.e., encoding Cp1252

Xueming Shen xueming.shen at oracle.com
Fri Sep 2 12:50:27 PDT 2011


On 09/02/2011 02:14 AM, Eric Liang wrote:
> On 09/02/2011 04:04 AM, Xueming Shen wrote:
>> Hi,
>>
>> These 5 code points are "undefined" characters in Cp1252. (The first one
>> should be 0x81, not 0x83, since 0x83<->U+0192 is defined and works
>> correctly in the Cp1252 charset.) The mapping table you referred to is a
>> "bestfit" type mapping table, which tries to provide a mapping
>> between the local encoding and the Unicode character set even for
>> characters that do not exist in the local encoding. Personally I don't
>> think that's a good idea in most use scenarios. All other official (from
>> Microsoft) or unofficial mapping tables clearly mark these code points
>> "undefined" or "unused", for example:
>>
>> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
>> http://en.wikipedia.org/wiki/Windows-1252
>> http://msdn.microsoft.com/en-us/library/cc195054.aspx
>>
>> By the way, the code below is incorrect, or at least it does not work the
>> way you might expect.
>>
>> String name1 = new String(new String("兆源").getBytes("UTF-8"), "Cp1252");
>> String name2 = new String( name1.getBytes("Cp1252"), "UTF-8");
>>
>> new String("兆源").getBytes("UTF-8") encodes your 2 Chinese characters from
>> UTF-16 to UTF-8 bytes. It does not make sense to then decode these UTF-8
>> bytes back to UTF-16 (which the String object uses) using the Cp1252
>> charset.
>>
>> The same goes for the second attempt.
>>
>> What were you trying to achieve? Decoding/encoding between UTF-8 bytes and
>> Cp1252 bytes? That is not going to be a round-trip conversion for those
>> non-ASCII characters.
> Thanks Sherman for your explanation.
>
> The problem occurred when I was using JDBC with MySQL. An earlier
> application stored UTF-8 data in a database with the default
> configuration (encoding latin1), and reading and decoding the data in
> PHP works fine. But I failed to read the data in Java. According to the
> documentation (
> http://dev.mysql.com/doc/refman/5.5/en/connector-j-reference-charsets.html
> ), latin1 in MySQL corresponds to Cp1252 in Java, so I found the
> cause, and I believe the person here encountered the same problem (
> http://forums.mysql.com/read.php?39,228068,228068#msg-228068 ).
>
> Data in latin1 (in Java) can be converted to UTF-8 and back freely.
> According to Wikipedia, Cp1252 is treated as a superset of
> ISO_8859-1, so I think it is natural to expect the same of Cp1252 as
> of latin1, though it does not work now.
>
> However, YMMV. Would you mind giving some suggestions on this? Thanks
> in advance.
>
> Eric

Windows-1252 (cp1252) is a superset of ISO 8859-1. ISO 8859-1 is
normally referred to as latin-1. What we have in the Java charset
repository is ISO-8859-1. The difference between ISO 8859-1 and
ISO-8859-1 (without and with the dash) is the C0 and C1 control character
area: ISO-8859-1 has C0 and C1 defined, ISO 8859-1 does not.
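
To make the difference concrete, here is a minimal sketch (not part of the
JDK; the class and method names are made up for illustration) that decodes
every single byte value with a given charset and collects the ones that come
back as the replacement character U+FFFD. It assumes the "windows-1252"
charset is available in the running JDK.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
class UndefinedBytes {

    /** Returns the byte values (0x00..0xFF) that the charset cannot decode to a real character. */
    static List<Integer> undefinedBytes(Charset cs) {
        List<Integer> result = new ArrayList<Integer>();
        for (int b = 0; b <= 0xFF; b++) {
            // new String(byte[], Charset) substitutes U+FFFD for unmappable input
            String s = new String(new byte[] { (byte) b }, cs);
            if (s.length() != 1 || s.charAt(0) == '\uFFFD') {
                result.add(b);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints each undefined byte value in hex, one charset per line
        for (String name : new String[] { "windows-1252", "ISO-8859-1" }) {
            StringBuilder sb = new StringBuilder(name + ":");
            for (int b : undefinedBytes(Charset.forName(name))) {
                sb.append(String.format(" 0x%02X", b));
            }
            System.out.println(sb);
        }
    }
}
```

With the JDK's Cp1252 mapping this should list exactly the five code points
discussed in this thread (0x81, 0x8D, 0x8F, 0x90, 0x9D) for windows-1252 and
nothing for ISO-8859-1.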

So in your workaround above, you'd better use ISO-8859-1 instead of cp1252.
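
As a rough illustration of why, the sketch below repeats the
decode-then-re-encode trick from the quoted code with both charsets. The
class and method names are hypothetical, and the string literal is written
with Unicode escapes only to avoid source-encoding problems. The UTF-8 form
of the second character contains the byte 0x90, which is one of the
undefined Cp1252 code points, so it is lost on the way through Cp1252.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, for illustration only.
class RoundTrip {

    /**
     * Encodes text as UTF-8, decodes those bytes with the given single-byte
     * charset, encodes back with the same charset, and decodes as UTF-8 again.
     */
    static String roundTrip(String text, String singleByteCharset) {
        Charset utf8 = Charset.forName("UTF-8");
        Charset via = Charset.forName(singleByteCharset);
        String misdecoded = new String(text.getBytes(utf8), via);
        return new String(misdecoded.getBytes(via), utf8);
    }

    public static void main(String[] args) {
        String original = "\u5146\u6e90"; // the two characters from the quoted example

        // ISO-8859-1 maps every byte value to a character, so nothing is lost
        System.out.println(original.equals(roundTrip(original, "ISO-8859-1")));

        // Cp1252 has no character for byte 0x90, so the second character is damaged
        System.out.println(original.equals(roundTrip(original, "windows-1252")));
    }
}
```

Note that this only recovers bytes that were mis-decoded in the first place;
it is a workaround, not a substitute for configuring the connection charset
correctly.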

I know little about JDBC + MySQL, so I'm probably not the one to give
suggestions on this topic. From simply reading the description of the
problem you are facing, I guess you'd better set your client-side
encoding/charset correctly, to utf-8 or gbk, to receive results in
Chinese correctly.

-Sherman


More information about the jdk6-dev mailing list