Rewrite of IBM doublebyte charsets

Tue May 19 00:37:38 UTC 2009

On 14.05.2009 23:38, Xueming Shen wrote:
> Ulf,
>
> There are 3 goals of this rewrite:
> (1) shrink the storage size of EUC_TW to a reasonable number
> (2) move away from hard-coding the mapping data in the source file to a
> mapping-based approach that is generated at build time,
> for easy maintenance in the future.
> (3) no regression in decoding and encoding performance, decoder startup,
> or the resulting CoderResult when compared
> to the existing implementation, with the exception of encoder startup
> (we need to build it from the b2c).
>
> So far I'm happy to see that all of them are achieved. I'm not aiming to
> have a perfect one (actually the purpose of
> goal (2) is to make future tuning easier).

Yes, the map files are a good starting point for future tuning.
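
To illustrate what "build it from the b2c" in goal (3) could look like, here is a minimal sketch. It is not the code from the patch: the class name, the flat `char[]` table layout, the sentinel values and the "first mapping wins" rule for duplicates are all assumptions made only for this example.

```java
import java.util.Arrays;

// Hypothetical sketch: derive an encoder table (c2b) by inverting a decoder table (b2c).
class B2cInvertDemo {
    static final char UNMAPPABLE = '\uFFFD'; // assumed marker for "no char for this code"
    static final int NO_BYTES = -1;          // assumed marker for "no code for this char"

    // b2c[code] is the char decoded from the double-byte value 'code', or UNMAPPABLE.
    static int[] buildC2b(char[] b2c) {
        int[] c2b = new int[0x10000];
        Arrays.fill(c2b, NO_BYTES);
        for (int code = 0; code < b2c.length; code++) {
            char c = b2c[code];
            // Assumption: if several codes decode to the same char, keep the first one.
            if (c != UNMAPPABLE && c2b[c] == NO_BYTES) {
                c2b[c] = code;
            }
        }
        return c2b;
    }
}
```

The inversion has to walk the whole decode table once, which is why encoder startup is the one place where some extra cost is expected.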

>
> I would not try to argue which cr is more appropriate, unmappable or
> malformed. It's hard to draw the line: some
> codepages/charsets leave some code points for future use, private use, or
> user-defined characters, so you can't make
> the decision simply by looking at the mapping table; you need to
> have the standard on your desk and check
> segment by segment. In fact, personally I don't think it really
> makes much sense to distinguish these two. So
> I would like to follow the existing behavior, if possible.
>

I mostly agree with you, and I guess most users don't care about this 
difference, so they wouldn't run into compatibility problems if they only 
check CoderResult#isError(). But I think that users who are interested 
in this difference should get the most accurate results, regardless of 
whether former implementations have been incorrect.
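
To make the difference concrete, here is a small sketch of the two ways a caller can look at a CoderResult. It uses the US-ASCII encoder only because it is always available; it is a stand-in, not one of the IBM double-byte charsets under discussion, and the class and method names are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; US-ASCII is used as a stand-in charset.
class CoderResultDemo {
    static String classify(String input) {
        CharsetEncoder enc = StandardCharsets.US_ASCII.newEncoder();
        // Enough room for the output, so overflow never hides an error.
        ByteBuffer out = ByteBuffer.allocate(Math.max(16, input.length() * 2));
        CoderResult cr = enc.encode(CharBuffer.wrap(input), out, true);
        if (!cr.isError()) {
            return "ok"; // callers that only check isError() stop here
        }
        // Callers that care about the difference look further.
        String kind = cr.isMalformed() ? "malformed" : "unmappable";
        return kind + "[" + cr.length() + "]";
    }

    public static void main(String[] args) {
        System.out.println(classify("abc"));     // plain ASCII
        System.out.println(classify("\u00e9"));  // valid char with no ASCII mapping
        System.out.println(classify("\uDC00"));  // lone low surrogate: not a valid char sequence
    }
}
```

A caller of the first kind sees no change if a charset switches between the two error types; only a caller of the second kind is affected.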

I hope you were inspired by my suggestions from yesterday ;-)

-Ulf