<i18n dev> Fwd: Some differences on Window UDC area

Xueming Shen xueming.shen at oracle.com
Thu May 31 21:47:26 PDT 2012



On 5/31/2012 8:04 PM, Charles Lee wrote:
> Hi Sherman,
>
> Thank you for bringing these out. The change is great because MS936.map
> is the same as mine :-)
>
> What about GBK.map?

Given how those code points are mapped in GB18030, I would assume they
probably should be updated as well. But I'm checking with our Solaris
people to get the mapping table used in their iconv.

-Sherman

>
> On 05/31/2012 03:25 PM, Xueming Shen wrote:
>> Hi,
>>
>> Here is the webrev for the updated MS936.map change, which updates
>> the mapping entries for 500+ EUDC code points within the range
>> 0xA140-0xA7A0. I'm using CR#6183404
>>
>> http://cr.openjdk.java.net/~sherman/6183404/webrev
>>
>> I re-generated the MS936 b2c and c2b mapping tables via
>> MultiByteToWideChar and WideCharToMultiByte, as shown in ms936.c
>> below.
>>
>> http://cr.openjdk.java.net/~sherman/6183404/ms936.c
>>
>> I went through the diff of the newly generated b2c table and the
>> existing MS936.map. It appears the two tables are identical except
>> for the 500+ EUDC (PUA) code points within the range 0xA140-0xA7A0.
>>
>> You can check the "defined" and "undefined" ms936 code points at
>> http://msdn.microsoft.com/en-US/goglobal/cc305153
>> (click the A1 - A7)
>>
>> The mapping from FUSE at jp.ibm.com (integrated into JDK1.3/1999 via
>> CR#4202893) fills all "user-defined"/undefined code points in this
>> range (0xA140 - 0xA7A0) with code points from the Unicode PUA,
>> from U+E4C6 to U+E79F, one by one sequentially (in code point
>> order). However, the newly generated mapping table from
>> MultiByteToWideChar and WideCharToMultiByte suggests the actual
>> mapping is to fill the big contiguous areas first with code points
>> starting from U+E4C6 (sequentially):
>>
>> 0xA140-A1A0    ->    U+E4C6 - U+E525
>> 0xA240-A2A0    ->    U+E526 - U+E585
>> 0xA340-A3A0    ->    U+E586 - U+E5E5
>> ...
>> 0xA740-A7A0    ->    U+E706 - U+E765
>>
>> Then it goes back to fill the "small" leftover areas/spots with PUA
>> code points starting from U+E766. The first and last are
>>
>> 0xA2AB    -> U+E766
>> ...
>> 0xA6FE    -> U+E79F
>>
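
The arithmetic behind the range table quoted above can be sketched as follows. This is only an illustration, not the code from ms936.c: the function name is made up, and it assumes that trail byte 0x7F is the one skipped in each row (0x40-0xA0 spans 97 byte values, while each row in the table gets 96 PUA code points). The leftover spots (0xA2AB ... 0xA6FE) are not covered.

```python
# Sketch of the "big contiguous area" rule from the quoted table.
# Assumptions (not from the original mail): the function name, and that
# trail byte 0x7F is skipped in each row (97 byte values, 96 entries).
PUA_START = 0xE4C6   # first PUA code point, assigned to 0xA140
ROW_SIZE = 96        # PUA code points per lead byte, per the quoted table


def udc_block_to_pua(lead: int, trail: int) -> int:
    """Map an MS936 byte pair in 0xA140-0xA7A0 to its PUA code point."""
    if not 0xA1 <= lead <= 0xA7:
        raise ValueError("lead byte outside 0xA1-0xA7")
    if not 0x40 <= trail <= 0xA0 or trail == 0x7F:
        raise ValueError("trail byte outside 0x40-0xA0, or equal to 0x7F")
    # Column within the row, closing the gap left by the skipped 0x7F.
    col = trail - 0x40 - (1 if trail > 0x7F else 0)
    return PUA_START + (lead - 0xA1) * ROW_SIZE + col


# Print the first and last entry of every row for comparison with the table.
for lead in range(0xA1, 0xA8):
    first = udc_block_to_pua(lead, 0x40)
    last = udc_block_to_pua(lead, 0xA0)
    print(f"0x{lead:02X}40-{lead:02X}A0 -> U+{first:04X} - U+{last:04X}")
```

The row boundaries printed by the loop can be compared against the quoted table and the sdiff page.
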
>> This pattern can be easily observed at
>> http://cr.openjdk.java.net/~sherman/6183404/webrev/make/tools/CharsetMapping/MS936.map.sdiff.html 
>>
>>
>> Now the new MS936.map is identical to the mapping used by
>> wctomb and mbtowc. The only exception is 0xFF <-> U+F8F5,
>> which is excluded for now; personally I don't feel comfortable
>> putting it in.
>>
>> #6183404 also complains that some 412 non-UDC characters are missing
>> from Java MS936. All these characters are listed at
>> http://cr.openjdk.java.net/~sherman/6183404/CodePage936.pdf
>> A careful check suggests these are the result of incorrect use of
>> WideCharToMultiByte when generating the mapping: it appears
>> these entries are "best fit" results from WideCharToMultiByte when
>> the WC_NO_BEST_FIT_CHARS flag is not specified.
>>
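
One way to spot such "best fit" entries, sketched under assumptions: a c2b entry that does not round-trip through b2c is a candidate. The tiny tables below are illustrative only and are not taken from MS936.map. In particular, the U+0100 -> 0x41 entry is a made-up stand-in for a best-fit result, and the variable names are hypothetical.

```python
# Round-trip check for "best fit" candidates in a c2b (char-to-byte) table.
# Toy data only: the U+0100 -> 0x41 entry is a made-up best-fit example.
b2c = {
    0x41: 0x0041,     # 'A'
    0xA1A1: 0x3000,   # ideographic space
    0xA3C1: 0xFF21,   # fullwidth 'A'
}
c2b = {
    0x0041: 0x41,
    0x3000: 0xA1A1,
    0xFF21: 0xA3C1,
    0x0100: 0x41,     # hypothetical best fit: decodes back to U+0041
}

# Keep only entries where decoding the bytes gives back the same character.
strict_c2b = {c: b for c, b in c2b.items() if b2c.get(b) == c}

# Everything else is a best-fit candidate that a strict table would drop.
best_fit = sorted(set(c2b) - set(strict_c2b))

for c in best_fit:
    print(f"U+{c:04X} -> 0x{c2b[c]:X} does not round-trip")
```
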
>> There might be a compatibility concern about changing these entries,
>> but given that (1) they are EUDC/PUA characters/code points and
>> (2) the change follows MS, and this is an MS charset, I don't think
>> this should stop the update.
>>
>> OK, this is all I've got. Please help review (Masayoshi, Charles)
>>
>> Thanks,
>> -Sherman
>>
>>
>>
>>
>
>

