<i18n dev> Fwd: Some differences on Window UDC area

Thu May 31 20:04:27 PDT 2012

Hi Sherman,

Thank you for bringing this up. The change is great because MS936.map is
the same as mine :-)

What about GBK.map?

On 05/31/2012 03:25 PM, Xueming Shen wrote:
> Hi,
>
> Here is the webrev for the updated MS936.map change, which updates
> the mapping entries for 500+ EUDC code points within the range
> 0xA140-0xA7A0. I'm using CR#6183404.
>
> http://cr.openjdk.java.net/~sherman/6183404/webrev
>
> I re-generated the MS936.b2c and c2b mapping tables via
> MultiByteToWideChar and WideCharToMultiByte as shown in ms936.c
> below.
>
> http://cr.openjdk.java.net/~sherman/6183404/ms936.c
>
> I went through the diff of the newly generated b2c table and the
> existing MS936.map. The two tables appear to be identical except for
> the 500+ EUDC (PUA) code points within the range 0xA140-0xA7A0.
>
> You can check the "defined" and "undefined" ms936 code points at
> http://msdn.microsoft.com/en-US/goglobal/cc305153
> (click the A1 - A7)
>
> The mapping from FUSE at jp.ibm.com (integrated into JDK1.3/1999 via
> CR#4202893) fills all "user-defined"/undefined code points in this
> range (0xA140 - 0xA7A0) with code points from the Unicode PUA,
> U+E4C6 through U+E79F, one by one sequentially (in code
> point order). However, the newly generated mapping table from
> MultiByteToWideChar and WideCharToMultiByte suggests that the actual
> mapping fills the big contiguous areas first, with code points
> starting from U+E4C6 (sequentially):
>
> 0xA140-A1A0    ->    U+E4C6 - U+E525
> 0xA240-A2A0    ->    U+E526 - U+E585
> 0xA340-A3A0    ->    U+E586 - U+E5E5
> ...
> 0xA740-A7A0    ->    U+E706 - U+E765
>
> It then goes back to fill the "small" leftover areas/spots with PUA
> code points starting from U+E766. The first and last entries are
>
> 0xA2AB    -> U+E766
> ...
> 0xA6FE    -> U+E79F
>
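
The row endpoints quoted above can be reproduced with simple arithmetic. The following is only a minimal sketch of that reading, not the actual table generator (the real tables came from ms936.c). It assumes each row holds 96 cells, i.e. trail bytes 0x40..0xA0 with 0x7F skipped, which is what the endpoints U+E4C6 - U+E525 imply. The class and method names are hypothetical, and the leftover spots (0xA2AB ... 0xA6FE) are not covered.

```java
public class Ms936EudcSketch {
    // First PUA code point used for the big contiguous areas (from the mail above).
    static final int FIRST_PUA = 0xE4C6;
    // Assumption: trail bytes 0x40..0xA0 without 0x7F, i.e. 96 cells per row.
    static final int CELLS_PER_ROW = 96;

    /**
     * Returns the PUA code point for lead byte 0xA1..0xA7 and trail byte
     * 0x40..0xA0 (excluding 0x7F), or -1 if the pair is outside that area.
     */
    static int toPua(int lead, int trail) {
        if (lead < 0xA1 || lead > 0xA7) {
            return -1;
        }
        if (trail < 0x40 || trail > 0xA0 || trail == 0x7F) {
            return -1;
        }
        int cell = trail - 0x40;
        if (trail > 0x7F) {
            cell--; // skip the unused 0x7F slot
        }
        return FIRST_PUA + (lead - 0xA1) * CELLS_PER_ROW + cell;
    }

    public static void main(String[] args) {
        // Print the first and last cell of each row, in the layout used above.
        for (int lead = 0xA1; lead <= 0xA7; lead++) {
            System.out.printf("0x%02X40-%02XA0 -> U+%04X - U+%04X%n",
                    lead, lead, toPua(lead, 0x40), toPua(lead, 0xA0));
        }
    }
}
```

Seven rows of 96 cells end at U+E765, so the next free PUA code point is U+E766, which matches where the leftover spots are said to start.
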
> This pattern can be easily observed at
> http://cr.openjdk.java.net/~sherman/6183404/webrev/make/tools/CharsetMapping/MS936.map.sdiff.html 
>
>
> Now the new MS936.map is identical to the mapping used by
> wctomb and mbtowc. The only exception is 0xFF <-> U+F8F5,
> which is excluded for now; personally I don't feel comfortable
> putting it in.
>
> #6183404 also complains about some 412 non-UDC characters missing from
> Java MS936. All of these characters are listed at
> http://cr.openjdk.java.net/~sherman/6183404/CodePage936.pdf
> A careful check suggests these are the result of incorrect use of
> WideCharToMultiByte when generating the mapping: these entries
> appear to be "best fit" results that WideCharToMultiByte returns when
> the WC_NO_BEST_FIT_CHARS flag is not specified.
>
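
One way to see why "best fit" entries do not belong in the table: they are one-way, so the bytes they produce decode back to a different character. The sketch below filters a candidate c2b table against a b2c table and keeps only true round-trip entries. It is an illustration only, not how MS936.map was generated. The class name, method names, and the toy tables (in particular the U+00C0 -> 0x41 entry) are made up and are not actual MS936 data.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RoundTripFilterSketch {

    /** Keeps only the c2b entries whose bytes decode back to the same code point. */
    static Map<Integer, Integer> roundTripOnly(Map<Integer, Integer> c2b,
                                               Map<Integer, Integer> b2c) {
        Map<Integer, Integer> kept = new LinkedHashMap<>();
        for (Map.Entry<Integer, Integer> e : c2b.entrySet()) {
            Integer back = b2c.get(e.getValue());
            if (e.getKey().equals(back)) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    /** Toy bytes-to-char table (hypothetical data). */
    static Map<Integer, Integer> toyB2c() {
        Map<Integer, Integer> m = new LinkedHashMap<>();
        m.put(0x41, 0x0041);
        m.put(0x61, 0x0061);
        return m;
    }

    /** Toy char-to-bytes table with one made-up best-fit style entry. */
    static Map<Integer, Integer> toyC2b() {
        Map<Integer, Integer> m = new LinkedHashMap<>();
        m.put(0x0041, 0x41);
        m.put(0x0061, 0x61);
        m.put(0x00C0, 0x41); // made-up best fit: decodes back to U+0041, not U+00C0
        return m;
    }

    public static void main(String[] args) {
        for (Map.Entry<Integer, Integer> e : roundTripOnly(toyC2b(), toyB2c()).entrySet()) {
            System.out.printf("U+%04X <-> 0x%02X%n", e.getKey(), e.getValue());
        }
    }
}
```
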
> There might be a compatibility concern with changing these entries, but
> given that (1) they are EUDC/PUA characters/code points and (2) the new
> mapping follows MS, and this is an MS charset, I don't think this should
> stop the update.
>
> OK, this is all I've got. Please help review (Masoyoshi, Charles)
>
> Thanks,
> -Sherman
>
>
>
>

-- 
Yours Charles