<i18n dev> Fwd: Some differences on Window UDC area

Charles Lee littlee at linux.vnet.ibm.com
Mon Jun 4 23:44:53 PDT 2012


Hi Sherman,

The patch looks great.

On 06/04/2012 02:52 PM, Xueming Shen wrote:
> Hi
>
> It has been confirmed (thanks Jacky) that Solaris' iconv GBK actually
> uses (shares) the GB18030 mappings, as shown at
>
> http://src.opensolaris.org/source/xref/nv-g11n/g11n/src/lib/iconv/inc/unicode_gb18030.h 
>
>
> These are the same as MS936 for that particular region, with one
> exception: the euro sign. GB18030 maps U+20AC to 0xA2E3.
>
> As a result, Solaris GBK also follows GB18030's euro mapping,
> A2E3 <-> 20AC.
>
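A quick way to see the GB18030 side of the euro mapping described above is to ask an independent implementation. The sketch below uses Python's bundled "gb18030" codec purely for illustration; it is not part of the webrev, and its tables are separate from both Solaris iconv and the JDK. The helper names are made up for this example.

```python
# Illustration only: Python's built-in "gb18030" codec, not the JDK tables.
EURO = "\u20ac"  # U+20AC EURO SIGN


def gb18030_bytes(ch: str) -> bytes:
    """Encode one character with Python's gb18030 codec."""
    return ch.encode("gb18030")


def gb18030_char(raw: bytes) -> str:
    """Decode a byte sequence with Python's gb18030 codec."""
    return raw.decode("gb18030")


if __name__ == "__main__":
    # Print the byte sequence for the euro sign as uppercase hex.
    print(gb18030_bytes(EURO).hex().upper())
```

The same round trip could be repeated against the Java GBK charset once the patch is in, to compare the two.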
> I think Java GBK should follow this as well. Here is the "final" webrev
>
> http://cr.openjdk.java.net/~sherman/6183404/webrev
>
> which includes the update for both MS936 and GBK.
>
> -Sherman
>
>
>
> On 5/31/2012 9:47 PM, Xueming Shen wrote:
>>
>>
>> On 5/31/2012 8:04 PM, Charles Lee wrote:
>>> Hi Sherman,
>>>
>>> Thank you for bringing these up. The change is great because
>>> MS936.map is the same as mine :-)
>>>
>>> What about GBK.map?
>>
>> Given how those code points are mapped in GB18030, I would assume
>> they probably should be updated as well. But I'm confirming with our
>> Solaris people to get the mapping table used in their iconv.
>>
>> -Sherman
>>
>>>
>>> On 05/31/2012 03:25 PM, Xueming Shen wrote:
>>>> Hi,
>>>>
>>>> Here is the webrev for the updated MS936.map change, which updates
>>>> the mapping entries for 500+ EUDC code points within the range
>>>> 0xA140-0xA7A0. I'm using CR#6183404.
>>>>
>>>> http://cr.openjdk.java.net/~sherman/6183404/webrev
>>>>
>>>> I re-generated the MS936 b2c and c2b mapping tables via
>>>> MultiByteToWideChar and WideCharToMultiByte, as shown in ms936.c
>>>> below.
>>>>
>>>> http://cr.openjdk.java.net/~sherman/6183404/ms936.c
>>>>
>>>> I went through the diff of the newly generated b2c table and the
>>>> existing MS936.map. The two tables appear to be identical except for
>>>> the 500+ EUDC (PUA) code points within the range 0xA140-0xA7A0.
>>>>
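For anyone who wants to repeat that table comparison, here is a minimal sketch. It assumes each mapping line holds two whitespace-separated hex fields (byte code, then Unicode code point) with optional "#" comments; that layout and the function names are assumptions made for this example, not a description of the actual MS936.map tooling.

```python
def parse_map(text):
    """Parse 'BYTES UNICODE' hex pairs into a dict; '#' starts a comment."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue  # skip blank and comment-only lines
        fields = line.split()
        # int(..., 16) accepts values with or without a leading "0x".
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


def diff_maps(old, new):
    """Return {byte_code: (old_cp, new_cp)} for every entry that differs.

    An entry present in only one table shows up with None on the other side.
    """
    diffs = {}
    for code in sorted(old.keys() | new.keys()):
        if old.get(code) != new.get(code):
            diffs[code] = (old.get(code), new.get(code))
    return diffs
```

Feeding it the text of the existing table and the regenerated one, then checking that every key in the result falls inside 0xA140-0xA7A0, would be one way to confirm the observation above.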
>>>> You can check the "defined" and "undefined" ms936 code points at
>>>> http://msdn.microsoft.com/en-US/goglobal/cc305153
>>>> (click the A1 - A7)
>>>>
>>>> The mapping from FUSE at jp.ibm.com (integrated into JDK1.3/1999 via
>>>> CR#4202893) fills all "user-defined"/undefined code points in this
>>>> range (0xA140 - 0xA7A0) with code points from the Unicode PUA,
>>>> starting from U+E4C6 and running to U+E79F one by one (in code
>>>> point order). However, the newly generated mapping table from
>>>> MultiByteToWideChar and WideCharToMultiByte suggests that the actual
>>>> mapping fills the large contiguous areas first, with code points
>>>> starting from U+E4C6 (sequentially):
>>>>
>>>> 0xA140-A1A0    ->    U+E4C6 - U+E525
>>>> 0xA240-A2A0    ->    U+E526 - U+E585
>>>> 0xA340-A3A0    ->    U+E586 - U+E5E5
>>>> ...
>>>> 0xA740-A7A0    ->    U+E706 - U+E765
>>>>
>>>> It then goes back to fill the "small" leftover spots with PUA
>>>> code points starting from U+E766. The first and last of these are
>>>>
>>>> 0xA2AB    -> U+E766
>>>> ...
>>>> 0xA6FE    -> U+E79F
>>>>
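The block-first fill order described above can be reproduced with a few lines of arithmetic. This is only a sketch of my reading of the pattern: it assumes trail byte 0x7F is skipped (it is not a valid GBK trail byte, and each listed row holds 96 code points while 0x40-0xA0 spans 97 byte values). The names are invented for the example; the leftover spots themselves are not enumerated here, only counted.

```python
PUA_START = 0xE4C6       # first PUA code point used for the large blocks
LEFTOVER_START = 0xE766  # first PUA code point used for the leftover spots
PUA_END = 0xE79F         # last PUA code point used in this region


def eudc_block_mapping():
    """Map 0xA140-0xA7A0 (trail bytes 0x40-0xA0, minus 0x7F) to the PUA."""
    mapping = {}
    cp = PUA_START
    for lead in range(0xA1, 0xA8):
        for trail in range(0x40, 0xA1):
            if trail == 0x7F:  # assumption: 0x7F is never a trail byte
                continue
            mapping[(lead << 8) | trail] = cp
            cp += 1
    return mapping


if __name__ == "__main__":
    blocks = eudc_block_mapping()
    # Print the first and last mapping of each row, one row per lead byte.
    for lead in range(0xA1, 0xA8):
        first, last = (lead << 8) | 0x40, (lead << 8) | 0xA0
        print(f"0x{first:04X}-{last & 0xFF:02X}{'':2}->  "
              f"U+{blocks[first]:04X} - U+{blocks[last]:04X}")
    # Number of PUA code points left over for the small spots.
    print(PUA_END - LEFTOVER_START + 1)
```

Under these assumptions the large blocks end exactly one code point before U+E766, which matches where the leftover spots are said to begin.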
>>>> This pattern can be easily observed at
>>>> http://cr.openjdk.java.net/~sherman/6183404/webrev/make/tools/CharsetMapping/MS936.map.sdiff.html 
>>>>
>>>>
>>>> Now the new MS936.map is identical to the mapping used by
>>>> wctomb and mbtowc. The only exception is 0xFF <-> U+F8F5,
>>>> which is excluded for now; personally I don't feel comfortable
>>>> putting it in.
>>>>
>>>> #6183404 also complains that some 412 non-UDC characters are
>>>> missing from Java MS936. All of these characters are listed at
>>>> http://cr.openjdk.java.net/~sherman/6183404/CodePage936.pdf
>>>> A careful check suggests they are the result of incorrect use of
>>>> WideCharToMultiByte when generating the mapping: these entries
>>>> appear to be the "best fit" results that WideCharToMultiByte returns
>>>> when the WC_NO_BEST_FIT_CHARS flag is not specified.
>>>>
>>>> There might be a compatibility concern in changing these entries, but
>>>> given that (1) they are EUDC/PUA characters/code points and (2) the
>>>> change follows MS, and this is an MS charset, I don't think this
>>>> should stop the update.
>>>>
>>>> OK, this is all I've got. Please help review (Masayoshi, Charles).
>>>>
>>>> Thanks,
>>>> -Sherman
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>


-- 
Yours Charles


