Maybe codec bug in MS1252, i.e., encoding Cp1252

Mon Sep 5 00:56:35 PDT 2011

On 09/03/2011 03:50 AM, Xueming Shen wrote:
> On 09/02/2011 02:14 AM, Eric Liang wrote:
>> On 09/02/2011 04:04 AM, Xueming Shen wrote:
>>> Hi,
>>>
>>> These 5 code points are "undefined" characters in Cp1252. (The first
>>> one should be 0x81, not 0x83, since 0x83<->U+0192 is defined and works
>>> correctly in the Cp1252 charset.) The mapping table you referred to is
>>> a "bestfit" type mapping table, which tries to provide a mapping
>>> between the local encoding and the Unicode character set even for
>>> characters that do not exist in the local encoding. Personally, I
>>> don't think that is a good idea in most usage scenarios. All other
>>> official (from Microsoft) or unofficial mapping tables clearly mark
>>> these code points as "undefined" or "unused", for example:
>>>
>>> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
>>> http://en.wikipedia.org/wiki/Windows-1252
>>> http://msdn.microsoft.com/en-us/library/cc195054.aspx
>>>
>>> Btw, the code below is incorrect, or at least it does not work the
>>> way you might expect.
>>>
>>> String name1 = new String( new String("兆源").getBytes("UTF-8"),
>>> "Cp1252");
>>> String name2 = new String( name1.getBytes("Cp1252"), "UTF-8");
>>>
>>> new String("兆源").getBytes("UTF-8") encodes your 2 Chinese
>>> characters from UTF-16 to UTF-8 bytes. It does not make sense to then
>>> decode these UTF-8 bytes back to UTF-16 (which the String object
>>> uses) by using the Cp1252 charset.
>>>
>>> The same goes for the second attempt.
>>>
>>> What did you try to achieve? Decoding/encoding between UTF-8 bytes
>>> and CP1252 bytes? That is not going to be a round-trip conversion for
>>> non-ASCII characters.
>> Thanks Sherman for your explanation.
>>
>> The problem occurred when I was using JDBC with MySQL. An earlier
>> application stored UTF-8 data in a database with the default
>> configuration (encoding latin1), and reading the data back and
>> decoding it in PHP works fine. But reading the data fails in Java.
>> According to the documentation (
>> http://dev.mysql.com/doc/refman/5.5/en/connector-j-reference-charsets.html
>> ), latin1 in MySQL corresponds to Cp1252 in Java, which is how I found
>> the cause. I believe the poster here ran into the same problem (
>> http://forums.mysql.com/read.php?39,228068,228068#msg-228068 ).
>>
>> Since data in latin1 (in Java) can be converted to UTF-8 and back
>> freely, and Wikipedia treats Cp1252 as a superset of ISO_8859-1, I
>> think it is natural to expect the same of Cp1252 as of latin1, though
>> it does not work that way now.
>>
>> However, YMMV. Would you mind giving some suggestions on this? Thanks
>> in advance.
>>
>> Eric
>
> Windows-1252 (cp1252) is a superset of ISO 8859-1. ISO 8859-1 is
> normally referred to as latin-1. What we have in the Java charset
> repository is ISO-8859-1. The difference between ISO 8859-1 and
> ISO-8859-1 (without and with the dash) is the C0 and C1 control
> character area: ISO-8859-1 has C0 and C1 defined, ISO 8859-1 does not.
>
> So in your workaround above, you had better use ISO-8859-1 instead of
> cp1252.
Thank you for your patience.
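
For reference, here is a minimal sketch of the workaround being discussed,
taken out of JDBC entirely. The class and method names are made up for
illustration, and it only covers the in-memory String conversion, not what
Connector/J does with the connection encoding:

```java
import java.nio.charset.Charset;

final class RoundTripSketch {
    private static final Charset UTF8 = Charset.forName("UTF-8");

    // Encode text as UTF-8, misread those bytes through `via`,
    // then try to undo the misreading with the same charset.
    public static String roundTrip(String text, Charset via) {
        byte[] utf8 = text.getBytes(UTF8);
        String misread = new String(utf8, via);    // bytes interpreted in the wrong charset
        byte[] recovered = misread.getBytes(via);  // lossless only if every byte value is mapped
        return new String(recovered, UTF8);
    }

    public static void main(String[] args) {
        // The two characters from the quoted example, written as escapes
        // so the source file encoding does not matter.
        String original = "\u5146\u6e90";
        Charset latin1 = Charset.forName("ISO-8859-1");
        Charset cp1252 = Charset.forName("windows-1252");
        System.out.println("ISO-8859-1 round trip ok: "
                + original.equals(roundTrip(original, latin1)));
        System.out.println("Cp1252 round trip ok:     "
                + original.equals(roundTrip(original, cp1252)));
    }
}
```

The UTF-8 form of the second character contains the byte 0x90, one of the
code points Cp1252 leaves undefined, so I would expect only the ISO-8859-1
path to hand back the original string.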

Besides the doc mentioned above, this code indicates that Cp1252 is the
official choice:

    String encoding = CharsetMapping.getJavaEncodingForMysqlEncoding(
            "latin1", (com.mysql.jdbc.Connection) conn);
    System.out.println("Encoding for mysql: " + encoding);

I also noticed that the class StandardCharsets defines latin1 as an alias
of ISO-8859-1:

    ht[429] = new Object[] { "latin1", "iso-8859-1" };

However, I have tried ISO-8859-1, and it still does not work.
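
To check what those two names resolve to in a given JDK, and which byte
values each charset refuses to decode, something like the following sketch
can be used. The class and method names are hypothetical, and probing one
byte at a time only makes sense for single-byte charsets such as these two:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;

final class UnmappedBytesSketch {
    // Returns every byte value (0x00-0xFF) that the charset cannot decode.
    public static List<Integer> unmappedBytes(Charset cs) {
        // REPORT makes the decoder throw instead of substituting U+FFFD.
        CharsetDecoder decoder = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        List<Integer> unmapped = new ArrayList<Integer>();
        for (int b = 0; b < 256; b++) {
            try {
                decoder.decode(ByteBuffer.wrap(new byte[] { (byte) b }));
            } catch (CharacterCodingException e) {
                unmapped.add(b);
            }
        }
        return unmapped;
    }

    public static void main(String[] args) {
        for (String name : new String[] { "latin1", "Cp1252" }) {
            Charset cs = Charset.forName(name); // alias lookup
            StringBuilder hex = new StringBuilder();
            for (int b : unmappedBytes(cs)) {
                hex.append(String.format(" 0x%02X", b));
            }
            System.out.println(name + " -> " + cs.name()
                    + ", undecodable bytes:" + hex);
        }
    }
}
```

If the charset tables match the published CP1252.TXT mapping, the Cp1252
line should list the five undefined code points and the latin1 line should
list none.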

>
> I know little about JDBC + MySQL, so I am probably not the one to give
> suggestions on this topic. Simply from reading the description of the
> problem you are facing, I guess you had better set your client-side
> encoding/charset correctly to utf-8 or gbk to receive results in
> Chinese correctly.
>
Some of my colleagues and I have tested this; it is not the right
configuration in this case.

I'd be glad to post the test code if you are interested. Any other
suggestions are also appreciated.

Thanks,
Eric

