<i18n dev> Fwd: Some differences on Window UDC area

Sun Jun 3 23:52:03 PDT 2012

Hi

It has been confirmed (thanks, Jacky) that Solaris' iconv GBK actually
uses (shares) the GB18030 mappings, as shown at

http://src.opensolaris.org/source/xref/nv-g11n/g11n/src/lib/iconv/inc/unicode_gb18030.h

which is the same as MS936 for that particular region, with one
exception: the euro sign.
GB18030 maps U+20AC to 0xA2E3.

As a result, the Solaris GBK also follows GB18030's euro mapping,
0xA2E3 <-> U+20AC.
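
The GB18030 side of this euro mapping is easy to check. The sketch below uses Python's built-in "gb18030" codec purely as a stand-in. It is not the Solaris iconv table or the JDK charset, so it only illustrates the A2E3 <-> 20AC pair and says nothing about what either of those ships. The helper names are made up for this example.

```python
# Round-trip the euro sign through Python's built-in gb18030 codec.
# Assumption: this codec stands in for the GB18030 table discussed above;
# it is not the Solaris iconv table and not the JDK charset.
EURO = "\u20ac"


def euro_bytes(codec="gb18030"):
    """Encode U+20AC with the named codec and return the raw bytes."""
    return EURO.encode(codec)


def euro_roundtrips(codec="gb18030"):
    """True if encoding and then decoding U+20AC gives U+20AC back."""
    return euro_bytes(codec).decode(codec) == EURO


if __name__ == "__main__":
    print(euro_bytes().hex().upper(), euro_roundtrips())
```

The same two helpers can be pointed at another codec name to compare behaviour, keeping in mind that a codec which cannot encode the euro sign will raise UnicodeEncodeError.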

I think Java GBK should follow this as well. Here is the "final" webrev

http://cr.openjdk.java.net/~sherman/6183404/webrev

which includes the update for both MS936 and GBK.

-Sherman

On 5/31/2012 9:47 PM, Xueming Shen wrote:
>
>
> On 5/31/2012 8:04 PM, Charles Lee wrote:
>> Hi Sherman,
>>
>> Thank you for bringing these out. The change is great because MS936.map
>> is the same as mine :-)
>>
>> What about GBK.map?
>
> Given how those code points are mapped in GB18030, I would assume they
> probably should be updated as well. But I'm confirming with our Solaris
> people to get the mapping table used in their iconv.
>
> -Sherman
>
>>
>> On 05/31/2012 03:25 PM, Xueming Shen wrote:
>>> Hi,
>>>
>>> Here is the webrev for the updated MS936.map change, which updates
>>> the mapping entries for 500+ EUDC code points within the range
>>> 0xA140-0xA7A0. I'm using CR#6183404
>>>
>>> http://cr.openjdk.java.net/~sherman/6183404/webrev
>>>
>>> I re-generated the MS936.b2c and c2b mapping tables via
>>> MultiByteToWideChar and WideCharToMultiByte, as shown in ms936.c
>>> below.
>>>
>>> http://cr.openjdk.java.net/~sherman/6183404/ms936.c
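
The general idea of regenerating a b2c table by probing every double-byte code can be sketched as follows. This is not ms936.c: it uses Python's "gbk" codec as a convenient stand-in for the Windows API call, and that codec's table is not claimed to match Windows code page 936. The function name build_b2c is hypothetical.

```python
# Build a bytes-to-char (b2c) table by probing every double-byte code,
# in the spirit of calling a decoder once per candidate code point.
# Assumption: Python's "gbk" codec is only a stand-in here; its table is
# not claimed to match Windows code page 936 or MultiByteToWideChar.


def build_b2c(codec="gbk"):
    table = {}
    for lead in range(0x81, 0xFF):        # lead bytes 0x81..0xFE
        for trail in range(0x40, 0xFF):   # trail bytes 0x40..0xFE
            if trail == 0x7F:             # 0x7F is never a trail byte
                continue
            try:
                text = bytes((lead, trail)).decode(codec)
            except UnicodeDecodeError:
                continue                  # undefined in this codec
            if len(text) == 1:
                table[(lead << 8) | trail] = ord(text)
    return table


if __name__ == "__main__":
    b2c = build_b2c()
    print(len(b2c), hex(b2c[0xA1A1]))
```

Code points the codec does not define are simply absent from the resulting dictionary, which makes it straightforward to diff one generated table against another.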
>>>
>>> I went through the diff of the newly generated b2c table and the
>>> existing MS936.map. The two tables appear identical except for
>>> the 500+ EUDC (PUA) code points within the range 0xA140-0xA7A0.
>>>
>>> You can check the "defined" and "undefined" ms936 code points at
>>> http://msdn.microsoft.com/en-US/goglobal/cc305153
>>> (click the A1 - A7)
>>>
>>> The mapping from FUSE at jp.ibm.com (integrated into JDK1.3/1999 via
>>> CR#4202893) fills all "user-defined"/undefined code points in this
>>> range (0xA140 - 0xA7A0) with code points from the Unicode PUA,
>>> from U+E4C6 to U+E79F, one by one sequentially (in code
>>> point order). However, the newly generated mapping table from
>>> MultiByteToWideChar and WideCharToMultiByte suggests the actual
>>> mapping fills the large contiguous areas first, with code points
>>> starting from U+E4C6 (sequentially):
>>>
>>> 0xA140-A1A0    ->    U+E4C6 - U+E525
>>> 0xA240-A2A0    ->    U+E526 - U+E585
>>> 0xA340-A3A0    ->    U+E586 - U+E5E5
>>> ...
>>> 0xA740-A7A0    ->    U+E706 - U+E765
>>>
>>> then it goes back to fill the "small"/leftover areas/spots with PUA
>>> code points starting from U+E766. The first is
>>>
>>> 0xA2AB    -> U+E766
>>> ...
>>> 0xA6FE    -> U+E79F
>>>
>>> This pattern can be easily observed at
>>> http://cr.openjdk.java.net/~sherman/6183404/webrev/make/tools/CharsetMapping/MS936.map.sdiff.html 
>>>
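
The first phase of the fill order described above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming that trail byte 0x7F is skipped in each row (which is what makes each row 96 code points wide, matching U+E4C6 - U+E525). The second phase is not reproduced, because the list of leftover spots is not given here. The function names are hypothetical.

```python
# Reproduce the first phase of the PUA fill order described above:
# rows 0xA1..0xA7, trail bytes 0x40..0xA0, assigned sequentially
# from U+E4C6.
# Assumption: trail byte 0x7F is skipped, so each row holds 96 code points.
FIRST_PUA = 0xE4C6


def fill_contiguous_rows():
    """Map 0xA140-0xA7A0 (row by row) onto consecutive PUA code points."""
    mapping = {}
    pua = FIRST_PUA
    for lead in range(0xA1, 0xA8):        # lead bytes 0xA1..0xA7
        for trail in range(0x40, 0xA1):   # trail bytes 0x40..0xA0
            if trail == 0x7F:
                continue
            mapping[(lead << 8) | trail] = pua
            pua += 1
    return mapping


def next_free_pua(mapping):
    """First PUA code point left over for the second (leftover) phase."""
    return max(mapping.values()) + 1


if __name__ == "__main__":
    m = fill_contiguous_rows()
    for code in (0xA140, 0xA1A0, 0xA240, 0xA740, 0xA7A0):
        print("0x%04X -> U+%04X" % (code, m[code]))
    print("next free: U+%04X" % next_free_pua(m))
```

Under that assumption the row boundaries line up with the table above, and the next unused PUA code point is the one the leftover spots start from.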
>>>
>>> Now the new MS936.map is identical to the mapping used by
>>> wctomb and mbtowc. The only exception is 0xFF <-> U+F8F5,
>>> which is excluded for now; personally I don't feel comfortable
>>> putting it in.
>>>
>>> #6183404 also complains that some 412 non-UDC characters are missing
>>> from Java MS936;
>>> all these characters are listed at
>>> http://cr.openjdk.java.net/~sherman/6183404/CodePage936.pdf
>>> A careful check suggests these are the result of incorrect use of
>>> WideCharToMultiByte when generating the mapping: these entries
>>> appear to be the "best fit" results that WideCharToMultiByte returns
>>> when the WC_NO_BEST_FIT_CHARS flag is not specified.
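
One way to spot best-fit entries after the fact is a round-trip check: a best-fit c2b entry maps a character to bytes that decode to a different character. The sketch below illustrates that check on small made-up tables. The function names and all table values are hypothetical and are not taken from CodePage936.pdf or from the Windows API.

```python
# Separate exact entries from "best fit" candidates in a char-to-bytes
# (c2b) table by testing whether each entry round-trips through the
# bytes-to-char (b2c) table.
# All table contents below are made-up toy values for illustration only.


def strict_c2b(c2b, b2c):
    """Keep c2b entries whose byte value decodes back to the same char."""
    return {ch: code for ch, code in c2b.items() if b2c.get(code) == ch}


def best_fit_entries(c2b, b2c):
    """Return c2b entries that do not round-trip (best-fit candidates)."""
    return {ch: code for ch, code in c2b.items() if b2c.get(code) != ch}


TOY_B2C = {0x41: 0x0041, 0x61: 0x0061}
TOY_C2B = {
    0x0041: 0x41,   # exact mapping
    0x0061: 0x61,   # exact mapping
    0x00C0: 0x41,   # toy best-fit: an accented A folded onto plain A
}

if __name__ == "__main__":
    print(strict_c2b(TOY_C2B, TOY_B2C))
    print(best_fit_entries(TOY_C2B, TOY_B2C))
```

A table generated with best-fit disabled should leave the second function with nothing to report.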
>>>
>>> There might be a compatibility concern about changing these entries, but
>>> given that (1) they are EUDC/PUA characters/code points and (2) the
>>> change follows MS, and this is an MS charset, I don't think this should
>>> stop the update.
>>>
>>> OK, this is all I've got. Please help review (Masayoshi, Charles)
>>>
>>> Thanks,
>>> -Sherman
>>>
>>>
>>>
>>>
>>
>>