<i18n dev> Fwd: Some differences on Window UDC area
Charles Lee
littlee at linux.vnet.ibm.com
Thu May 31 20:04:27 PDT 2012
Hi Sherman,
Thank you for bringing this up. The change looks great, because the new
MS936.map is the same as mine :-)
What about GBK.map?
On 05/31/2012 03:25 PM, Xueming Shen wrote:
> Hi,
>
> Here is the webrev for the updated MS936.map change, which updates
> the mapping entries for 500+ EUDC code points within the range
> 0xA140-0xA7A0. I'm using CR#6183404.
>
> http://cr.openjdk.java.net/~sherman/6183404/webrev
>
> I re-generated the MS936.b2c and c2b mapping tables via
> MultiByteToWideChar and WideCharToMultiByte, as shown in ms936.c
> below.
>
> http://cr.openjdk.java.net/~sherman/6183404/ms936.c
>
> I went through the diff of the newly generated b2c table and the
> existing MS936.map. The two tables appear to be identical except for
> the 500+ EUDC (PUA) code points within the range 0xA140-0xA7A0.
>
> You can check the "defined" and "undefined" ms936 code points at
> http://msdn.microsoft.com/en-US/goglobal/cc305153
> (click the A1 - A7)
>
> The mapping from FUSE at jp.ibm.com (integrated into JDK1.3/1999 via
> CR#4202893) fills all "user-defined"/undefined code points in this
> range (0xA140-0xA7A0) with code points from the Unicode PUA, from
> U+E4C6 to U+E79F, one by one sequentially (in code point order).
> However, the newly generated mapping table from MultiByteToWideChar
> and WideCharToMultiByte suggests that the actual mapping first fills
> the big contiguous areas with code points starting from U+E4C6
> (sequentially):
>
> 0xA140-A1A0 -> U+E4C6 - U+E525
> 0xA240-A2A0 -> U+E526 - U+E585
> 0xA340-A3A0 -> U+E586 - U+E5E5
> ...
> 0xA740-A7A0 -> U+E706 - U+E765
>
> then it goes back to fill the "small" leftover spots with PUA code
> points starting from U+E766; the first is
>
> 0xA2AB -> U+E766
> ...
> 0xA6FE -> U+E79F
>
> This pattern can be easily observed at
> http://cr.openjdk.java.net/~sherman/6183404/webrev/make/tools/CharsetMapping/MS936.map.sdiff.html
>
>
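The block layout described above can be sketched in a few lines of Java.
This is only an illustration, not code from the webrev: the class and
method names are made up, and it assumes each lead byte A1-A7 contributes
96 trail bytes (0x40-0x7E and 0x80-0xA0, skipping 0x7F), which is what the
listed ranges imply (U+E4C6 to U+E525 is 96 code points). The leftover
spots (0xA2AB ... 0xA6FE) are not covered, since only their first and last
entries are given.

```java
// Hypothetical sketch of the "big block" part of the regenerated mapping.
final class Ms936UdcSketch {

    // Assumption: 96 usable trail bytes per lead byte (0x7F is skipped).
    static final int ROW_SIZE = 96;

    // Returns the PUA code point for a double-byte code inside the big
    // blocks 0xA140-0xA7A0, or -1 if the code is outside those blocks.
    static int bigBlockToUnicode(int db) {
        int lead = db >> 8;
        int trail = db & 0xFF;
        if (lead < 0xA1 || lead > 0xA7) {
            return -1;
        }
        if (trail < 0x40 || trail > 0xA0 || trail == 0x7F) {
            return -1;
        }
        int col = trail - 0x40;
        if (trail > 0x7F) {
            col--; // account for the skipped 0x7F
        }
        return 0xE4C6 + (lead - 0xA1) * ROW_SIZE + col;
    }

    public static void main(String[] args) {
        int[] samples = {0xA140, 0xA1A0, 0xA240, 0xA740, 0xA7A0};
        for (int db : samples) {
            System.out.printf("0x%04X -> U+%04X%n", db, bigBlockToUnicode(db));
        }
    }
}
```

The sample codes are the block boundaries quoted above, so the output can
be compared directly against the listed ranges.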
> Now the new MS936.map is identical to the mapping used by
> wctomb and mbtowc. The only exception is 0xFF <-> U+F8F5,
> which is excluded for now; personally, I don't feel comfortable
> putting it in.
>
> #6183404 also complains that some 412 non-UDC characters are missing
> from Java MS936; all of these characters are listed at
> http://cr.openjdk.java.net/~sherman/6183404/CodePage936.pdf
> A careful check suggests these are the result of incorrect use of
> WideCharToMultiByte when generating the mapping: the entries appear
> to be "best fit" results that WideCharToMultiByte returns when the
> WC_NO_BEST_FIT_CHARS flag is not specified.
>
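One way to spot such entries after the fact is a round-trip check: a c2b
entry whose bytes do not decode back to the same character in b2c is a
one-way mapping, which is how a best-fit entry would show up. A minimal
sketch in Java, assuming the two tables are loaded into plain maps; the
class name, the method names and the two sample entries are made up for
illustration and are not real MS936 data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical round-trip check over in-memory c2b/b2c tables.
final class BestFitCheckSketch {

    // Returns every char whose c2b bytes do not decode back to that char.
    static List<Character> findOneWayEntries(Map<Character, Integer> c2b,
                                             Map<Integer, Character> b2c) {
        List<Character> oneWay = new ArrayList<>();
        for (Map.Entry<Character, Integer> e : c2b.entrySet()) {
            Character back = b2c.get(e.getValue());
            if (back == null || !back.equals(e.getKey())) {
                oneWay.add(e.getKey());
            }
        }
        return oneWay;
    }

    // Made-up sample tables (NOT real MS936 data): 'A' round-trips,
    // while U+00C0 is encoded to the same byte as 'A' and does not.
    static List<Character> demo() {
        Map<Integer, Character> b2c = new LinkedHashMap<>();
        b2c.put(0x41, 'A');

        Map<Character, Integer> c2b = new LinkedHashMap<>();
        c2b.put('A', 0x41);
        c2b.put('\u00C0', 0x41);

        return findOneWayEntries(c2b, b2c);
    }

    public static void main(String[] args) {
        for (char c : demo()) {
            System.out.printf("one-way entry: U+%04X%n", (int) c);
        }
    }
}
```

The check only reports candidates. Whether a one-way entry is really a
best-fit artifact still has to be confirmed against the Windows tables.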
> There might be a compatibility concern about changing these entries,
> but given that (1) they are EUDC/PUA characters/code points and (2) the
> new mapping follows MS, and this is an MS charset, I don't think this
> should stop the update.
>
> OK, that is all I have. Please help review (Masoyoshi, Charles).
>
> Thanks,
> -Sherman
>
>
>
>
--
Yours Charles