<i18n dev> Fwd: Some differences on Window UDC area
Xueming Shen
xueming.shen at oracle.com
Thu May 31 00:25:05 PDT 2012
Hi,
Here is the webrev for the updated MS936.map change, which updates
the mapping entries for 500+ EUDC code points within the range
A140-A7A0. I'm using CR#6183404 for this change.
http://cr.openjdk.java.net/~sherman/6183404/webrev
I regenerated the MS936.b2c and c2b mapping tables via
MultiByteToWideChar and WideCharToMultiByte, as shown in ms936.c
below.
http://cr.openjdk.java.net/~sherman/6183404/ms936.c
I went through the diff of the newly generated b2c table and the
existing MS936.map. The two tables appear to be identical except for
the 500+ EUDC (PUA) code points within the range 0xA140-0xA7A0.
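A comparison of this kind can be sketched in a few lines. This is only an illustration, not the procedure actually used for the webrev: the helper names are hypothetical, and the line format ("0xA140  0xE4C6", with "#" starting a comment) is an assumption about how the two tables are dumped.

```python
import re

# Assumed line format: "0xA140  0xE4C6" (byte value, then code point),
# with "#" starting a comment. Adjust the pattern if the real dumps differ.
ENTRY = re.compile(r"^\s*0x([0-9A-Fa-f]+)\s+0x([0-9A-Fa-f]+)")

def load_b2c(lines):
    """Parse mapping lines into {byte_value: code_point}."""
    table = {}
    for line in lines:
        m = ENTRY.match(line.split("#", 1)[0])
        if m:
            table[int(m.group(1), 16)] = int(m.group(2), 16)
    return table

def diff_tables(old, new):
    """Return the sorted byte values whose mapping differs between two tables."""
    keys = set(old) | set(new)
    return sorted(b for b in keys if old.get(b) != new.get(b))

def outside_range(changed, lo=0xA140, hi=0xA7A0):
    """Return the changed byte values that fall outside [lo, hi]."""
    return [b for b in changed if not lo <= b <= hi]
```

If outside_range(diff_tables(old, new)) comes back empty, every difference between the two tables lies within 0xA140-0xA7A0.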
You can check the "defined" and "undefined" ms936 code points at
http://msdn.microsoft.com/en-US/goglobal/cc305153
(click the A1 - A7)
The mapping from FUSE at jp.ibm.com (integrated into JDK1.3/1999 via
CR#4202893) fills all "user-defined"/undefined code points in this
range (0xA140 - 0xA7A0) with code points from the Unicode PUA,
starting from U+E4C6 and running to U+E79F one by one sequentially
(in code point order). However, the newly generated mapping table from
MultiByteToWideChar and WideCharToMultiByte suggests that the actual
mapping first fills the large contiguous areas with code points
starting from U+E4C6 (sequentially):
0xA140-A1A0 -> U+E4C6 - U+E525
0xA240-A2A0 -> U+E526 - U+E585
0xA340-A3A0 -> U+E586 - U+E5E5
...
0xA740-A7A0 -> U+E706 - U+E765
It then goes back to fill the "small" leftover areas/spots with PUA
code points starting from U+E766. The first is
0xA2AB -> U+E766
...
0xA6FE -> U+E79F
This pattern can be easily observed at
http://cr.openjdk.java.net/~sherman/6183404/webrev/make/tools/CharsetMapping/MS936.map.sdiff.html
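The first phase of this pattern reduces to simple arithmetic, sketched below. The function name is hypothetical, and the sketch assumes that trail byte 0x7F is unused, which is what the 96 code points per row in the ranges listed above imply. The leftover spots filled from U+E766 onward are not modeled.

```python
FIRST_PUA = 0xE4C6       # first PUA code point, taken from the ranges above
ROW_LEN = 0xA0 - 0x40    # 96 trail bytes per row (0x40..0xA0, assuming 0x7F is skipped)

def main_block_pua(mb):
    """Map a double-byte value in the 0xA1..0xA7 rows, trail 0x40..0xA0,
    to its PUA code point under the sequential fill described above."""
    lead, trail = mb >> 8, mb & 0xFF
    if not (0xA1 <= lead <= 0xA7 and 0x40 <= trail <= 0xA0 and trail != 0x7F):
        raise ValueError("0x%04X is outside the large contiguous areas" % mb)
    # Column index within the row, closing the gap left by 0x7F.
    col = trail - 0x40 - (1 if trail > 0x7F else 0)
    return FIRST_PUA + (lead - 0xA1) * ROW_LEN + col
```

For example, main_block_pua(0xA240) gives 0xE526 and main_block_pua(0xA7A0) gives 0xE765, matching the row boundaries listed above, so the next free PUA code point is U+E766.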
Now the new MS936.map is identical to the mapping used by
wctomb and mbtowc. The only exception is 0xFF <-> U+F8F5,
which is excluded for now; personally I don't feel comfortable
putting it in.
#6183404 also complains that some 412 non-UDC characters are missing
from Java MS936. All these characters are listed at
http://cr.openjdk.java.net/~sherman/6183404/CodePage936.pdf
A careful check suggests these are the result of incorrect use of
WideCharToMultiByte when generating the mapping: these entries
appear to be "best fit" results that WideCharToMultiByte returns when
the WC_NO_BEST_FIT_CHARS flag is not specified.
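One way to spot such entries (not necessarily the check used here) is a round-trip test: a c2b entry whose bytes do not decode back to the same character through b2c is a candidate "best fit" mapping. A minimal sketch, assuming both tables are already loaded as dictionaries; the function name is hypothetical.

```python
def best_fit_candidates(b2c, c2b):
    """Return the c2b entries that do not round-trip through b2c.

    b2c maps byte values to code points, c2b maps code points to byte values.
    """
    return {cp: mb for cp, mb in c2b.items() if b2c.get(mb) != cp}
```

With a c2b table generated without WC_NO_BEST_FIT_CHARS, any entries returned here would be the ones to drop from the mapping.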
There might be a compatibility concern about changing these entries,
but given that (1) they are EUDC/PUA characters/code points and
(2) the change follows MS, and this is an MS charset, I don't think
this should stop the update.
OK, this is all I've got. Please help review (Masayoshi, Charles).
Thanks,
-Sherman