Rewrite of IBM doublebyte charsets
Ulf Zibis
Ulf.Zibis at gmx.de
Tue May 19 00:37:38 UTC 2009
On 14.05.2009 23:38, Xueming Shen wrote:
> Ulf,
>
> There are 3 goals of this rewrite:
> (1) shrink the storage size of EUC_TW to a reasonable number;
> (2) move away from hard-coding the mapping data in the source file to a
> mapping-based approach built at build time,
> for easy maintenance in the future;
> (3) no regression in decoding and encoding performance, decoder startup,
> or the resulting CoderResult when compared
> to the existing implementation, with the exception of encoder startup
> (we need to build it from the b2c).
>
> So far I'm happy to see that all of them are achieved. I'm not aiming to
> have a perfect one (actually, the purpose of
> goal (2) is to make future tuning easier).
Yes, the map files are a good starting point for future tuning.
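To make the "build it from the b2c" part of goal (3) concrete, here is a minimal sketch of inverting a byte-to-char table into a char-to-byte table at encoder startup. Everything in it is an assumption for illustration: the class name, the flat table layout, the use of U+FFFD as the "no mapping" marker, and the "first mapping wins" rule for duplicates are hypothetical and not taken from the actual rewrite.

```java
import java.util.Arrays;

// Hypothetical sketch, not the JDK implementation.
final class B2cToC2bSketch {
    // Assumed marker for "this byte code has no character".
    static final char UNMAPPABLE = '\uFFFD';

    // Invert a flat b2c table (index = byte code) into a c2b table
    // (index = UTF-16 code unit). An entry of -1 means "no mapping".
    static int[] buildC2b(char[] b2c) {
        int[] c2b = new int[0x10000];
        Arrays.fill(c2b, -1);
        for (int code = 0; code < b2c.length; code++) {
            char c = b2c[code];
            // Assumption: if two codes decode to the same char,
            // the lower code is used for encoding.
            if (c != UNMAPPABLE && c2b[c] == -1) {
                c2b[c] = code;
            }
        }
        return c2b;
    }

    public static void main(String[] args) {
        char[] b2c = { UNMAPPABLE, 'A', 'B', 'A' };
        int[] c2b = buildC2b(b2c);
        System.out.println("A -> " + c2b['A'] + ", B -> " + c2b['B']);
    }
}
```

The cost of this loop is paid once per charset, which is why only encoder startup is exempt from the no-regression goal.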
>
> I would not try to argue which cr is more appropriate, unmappable or
> malformed. It's hard to draw the line: some
> codepages/charsets leave some code points for future use, private use, or
> user-defined characters. You can't make
> the decision simply by looking at the mapping table; you need to
> have the standard on your desk and check
> segment by segment. In fact, personally I don't think it really
> makes much sense to distinguish these two. So
> I would like to follow the existing behavior, if possible.
>
I mainly agree with you, and I guess most users don't care about this
difference, so they wouldn't run into compatibility problems if they only
check CoderResult#isError(). But I think users who are
interested in this difference should get the most accurate results,
regardless of whether former implementations were faulty.
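To illustrate the two kinds of callers, here is a minimal sketch. The class and method names are made up for this mail, and UTF-8 and US-ASCII only stand in for the IBM double-byte charsets, since those are available in every JRE. A caller that only looks at CoderResult#isError() sees no difference between the two error kinds; a caller that asks isMalformed() or isUnmappable() does.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class CoderResultDemo {
    // Reduce a CoderResult to the distinction discussed in this thread.
    static String classify(CoderResult cr) {
        if (!cr.isError()) {
            return "ok";            // UNDERFLOW or OVERFLOW
        }
        return cr.isMalformed() ? "malformed" : "unmappable";
    }

    // Decode with UTF-8 (stand-in charset); new coders report errors by default.
    static String decodeAndClassify(byte[] input) {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
        CharBuffer out = CharBuffer.allocate(16);
        return classify(dec.decode(ByteBuffer.wrap(input), out, true));
    }

    // Encode with US-ASCII (stand-in charset).
    static String encodeAndClassify(String input) {
        CharsetEncoder enc = StandardCharsets.US_ASCII.newEncoder();
        ByteBuffer out = ByteBuffer.allocate(16);
        return classify(enc.encode(CharBuffer.wrap(input), out, true));
    }

    public static void main(String[] args) {
        System.out.println(decodeAndClassify(new byte[] { (byte) 0xFF }));
        System.out.println(encodeAndClassify("\u00E9"));
    }
}
```

Code that only tests isError() keeps working whichever of the two results a charset returns, so a more accurate classification would only be visible to callers that already distinguish the two.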
I hope you are inspired by my suggestions from yesterday ;-)
-Ulf