Rewrite of IBM doublebyte charsets

Ulf Zibis Ulf.Zibis at gmx.de
Thu May 14 20:14:30 UTC 2009


Now I have time to answer in more detail ...

On 12.05.2009 08:30, Xueming Shen wrote:
> For (2), I'm not convinced that this approach is an appropriate one 
> for a complicated charset like EUC_TW;
> given the number of arrays it carries, the recovery work (to trace back 
> to what went wrong and then return the
> appropriate CoderResult) will be complicated and redundant.

Well, checking the range twice is also redundant (the JVM additionally 
checks it behind the scenes on every array access).
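
To make the two variants under discussion concrete, here is a minimal sketch. The class, the table and the byte range are made up for illustration and are not taken from the webrev; whether variant 2 is actually faster is not measured here.

```java
import java.util.Arrays;

// Hypothetical single-segment decoder table; all names are illustrative only.
final class RangeCheckSketch {
    static final char UNMAPPABLE = '\uFFFD';
    static final int MIN = 0x40;                      // assumed lowest valid byte
    static final char[] B2C = new char[0xFF - MIN];   // covers MIN..0xFE

    static {
        Arrays.fill(B2C, UNMAPPABLE);
        B2C[0x41 - MIN] = 'A';                        // one sample mapping
    }

    // Variant 1: explicit range check; the array access then checks the index again.
    static char decodeChecked(int b) {
        if (b < MIN || b - MIN >= B2C.length) {
            return UNMAPPABLE;
        }
        return B2C[b - MIN];
    }

    // Variant 2: rely on the implicit bounds check and recover afterwards.
    static char decodeUnchecked(int b) {
        try {
            return B2C[b - MIN];
        } catch (ArrayIndexOutOfBoundsException e) {
            return UNMAPPABLE;
        }
    }
}
```

With a single array the recovery in variant 2 is trivial. With many arrays, as in EUC_TW, the catch block would have to work out which access failed, which is the complication Xueming points out.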

> This might have a benefit of saving the range
> check (I don't have any data to show how much we can gain from doing 
> this, only a guess), but given almost all
> segments are near "full", I don't see the benefit on the footprint 
> saving side. We need some hard data to support
> this approach, which I don't have for now. I would leave this one for 
> you for further optimization in your project.

Yes, that's a good idea. I would be happy if it were launched in the 
near future ...

>
> I have updated the webrev to address some of your other optimization 
> suggestions
>

Happy to see that. :-)

>
> (1) No, I don't think we want to save the supplementary characters as surrogate 
> pairs; this is what I'm trying to fix. We don't
> care about the performance of surrogates; those code points are RARELY used, 
> 99%+ of coding/decoding happens in the
> BMP, and we did not have the supplementary characters for the first couple 
> of years. (OK, I'm a native, I don't think
> I can even read those characters)

This is what I didn't know. My assumption was that those supplementary 
characters would be used regularly, as their count is 137 % of the BMP chars count.
But if they are so rarely used, wouldn't it be reasonable to split the 
mapping into 2 or even 3 chunks, with a base chunk of about 
10 % of the BMP? Your native status would help to discover those 10 %. ;-)
Well, such an optimization would ideally be placed in the mentioned project.
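
A rough sketch of what I mean by chunks, assuming a small base chunk that is always present and a rare chunk that is only built on first use. The data, the index layout and all names are invented for illustration; a real split would need usage statistics, and this version is not thread-safe.

```java
// Hypothetical two-chunk mapping; the data and the split are made up.
final class ChunkedMappingSketch {
    static int rareLoads = 0;   // counts how often the rare chunk was built

    // Base chunk: assumed frequently used BMP mappings, always available.
    static final char[] BASE = { '\u4E00', '\u4E8C', '\u4E09' };

    // Rare chunk: supplementary code points, built on first use only.
    private static int[] rare;

    private static int[] rare() {
        if (rare == null) {     // not thread-safe, kept simple for the sketch
            rare = new int[] { 0x20000, 0x20001 };
            rareLoads++;
        }
        return rare;
    }

    // Returns the code point for a table index, or -1 if unmapped.
    static int decode(int index) {
        if (index < 0) {
            return -1;
        }
        if (index < BASE.length) {
            return BASE[index];
        }
        int[] tab = rare();
        int r = index - BASE.length;
        return r < tab.length ? tab[r] : -1;
    }
}
```

As long as only base-chunk indices are decoded, the rare chunk costs no footprint at all.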

>
> (2) The initialization of the c2b data for the encoder has already been "lazied" 
> until the Encoder class gets loaded.

Oops, I overlooked this fact. ;-)
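
For the archive, the effect can be pictured roughly as below. This is only a sketch of the general Java idiom (static data of a nested class is set up when that class is first initialized), not the actual code from the webrev; all names and the tiny table are hypothetical.

```java
import java.util.Arrays;

// Hypothetical charset skeleton; only illustrates lazy c2b initialization.
final class LazyC2bSketch {
    static int c2bInits = 0;                       // counts c2b table builds
    static final char[] B2C = { 'x', 'y', 'z' };   // decoder data, always present

    static char decode(int b) {
        return B2C[b];
    }

    // The c2b table lives in the nested class, so it is built only when
    // Encoder is first used, not when LazyC2bSketch itself is initialized.
    static final class Encoder {
        static final int[] C2B = invert();

        private static int[] invert() {
            c2bInits++;
            int[] table = new int[128];
            Arrays.fill(table, -1);                // -1 marks unmappable chars
            for (int b = 0; b < B2C.length; b++) {
                table[B2C[b]] = b;
            }
            return table;
        }

        static int encode(char c) {
            return c < C2B.length ? C2B[c] : -1;
        }
    }
}
```

A program that only decodes never pays for the c2b table.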


-Ulf





More information about the core-libs-dev mailing list