Rewrite of IBM doublebyte charsets
Ulf Zibis
Ulf.Zibis at gmx.de
Thu May 14 20:14:30 UTC 2009
Now I have time to answer in more detail ...
On 12.05.2009 08:30, Xueming Shen wrote:
> For (2), I'm not convinced that this approach is an appropriate one
> for a complicated charset like EUC_TW,
> given the number of array it carries, the recovery work (to trace back
> to what goes wrong and then return the
> appropriate CoderResult) will be complicated and redundant...).
Well, checking the range twice is also redundant (the JVM checks it
again behind the scenes on every array access).
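
To illustrate the point about the double check, here is a minimal sketch
of the two lookup variants as I read the discussion. It is not code from
the webrev; the class name, the tiny one-segment table and the
UNMAPPABLE marker are all made up for illustration.

```java
// Hypothetical sketch, not taken from the webrev or the JDK sources.
final class RangeCheckSketch {
    // Made-up one-segment table: byte values 0x21..0x23 map to 'A'..'C'.
    static final int LOW = 0x21;
    static final char[] B2C = { 'A', 'B', 'C' };
    static final char UNMAPPABLE = '\uFFFD';

    // Variant 1: explicit range check. The JVM still performs its own
    // bounds check on the array access, so the range is tested twice.
    static char decodeChecked(int b) {
        if (b < LOW || b >= LOW + B2C.length)
            return UNMAPPABLE;
        return B2C[b - LOW];
    }

    // Variant 2: rely on the JVM's bounds check alone and recover from
    // the exception. The recovery is trivial here; with many segment
    // arrays, finding out what went wrong is the complicated part.
    static char decodeUnchecked(int b) {
        try {
            return B2C[b - LOW];
        } catch (ArrayIndexOutOfBoundsException e) {
            return UNMAPPABLE;
        }
    }
}
```

Both variants return the same results; whether the second one is
actually faster would have to be measured.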
> This might have a benefit of saving the range
> check (I don't have any data to show how much we can gain from doing
> this, only a guess), but given almost all
> segments are near "full", I don't see the benefit on the footprint
> saving side. We need some hard data to support
> this approach, which I don't have for now. I would leave this one for
> you for further optimization in your project.
Yes, that's a good idea. I would be happy if it were launched in the
near future ...
>
> I have updated the webrev to address some of your other optimization
> suggestions
>
Happy to see that. :-)
>
> (1)No I don't think we want to save the supplementary into surrogate
> pair, this is what I'm trying to fix. We don't
> care the performance of surrogates, those codepoints are RARE used,
> 99%+ coding/decoding happens in
> BMP, we did not have the supplementary characters for the first couple
> years. (OK, I'm a native, I don't think
> I can even read those characters)
This is what I didn't know. My assumption was that those supplementary
characters would be regularly used, as their count is 137 % of the BMP
character count. But if they are so rarely used, wouldn't it be
reasonable to split the mapping into 2 or even 3 chunks, with a base
chunk of about 10 % of the BMP? Your native status would help to
discover those 10 %. ;-)
Well, such an optimization would ideally be placed in the mentioned project.
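
A rough sketch of what such a split could look like, assuming a small
base chunk that is always present and a second chunk that is only built
on first use. The chunk sizes, contents and names are invented for
illustration; real tables would of course be much larger.

```java
// Hypothetical sketch of a two-chunk mapping; all data is made up.
final class ChunkedMappingSketch {
    static final char UNMAPPABLE = '\uFFFD';

    // Base chunk: the (assumed) frequently used characters, always loaded.
    static final char[] BASE = { 'a', 'b', 'c', 'd' };
    static final int RARE_SIZE = 4;

    // Rare chunk: built lazily; volatile so the filled array is safely
    // published to other threads.
    private static volatile char[] rare;

    static boolean rareLoaded() {
        return rare != null;
    }

    private static char[] rare() {
        char[] r = rare;
        if (r == null) {
            // Stand-in for loading the rarely used part of the mapping.
            r = new char[] { 'w', 'x', 'y', 'z' };
            rare = r;
        }
        return r;
    }

    static char decode(int index) {
        if (index < 0 || index >= BASE.length + RARE_SIZE)
            return UNMAPPABLE;
        if (index < BASE.length)
            return BASE[index];           // common case, no extra loading
        return rare()[index - BASE.length];
    }
}
```

As long as only indexes from the base chunk are decoded, the rare chunk
never occupies memory.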
>
> (2)The initialization c2b data for encoder has already been "lazied"
> until Encoder class gets loaded.
Oops, I overlooked this fact. ;-)
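
For readers following along, a minimal sketch of how such laziness falls
out of Java class initialization: a static table in a nested class is
only built when that class is first used. This is not the actual code
from the webrev; the class names, the tiny table and the init counter
are assumptions for illustration.

```java
// Hypothetical sketch; names and data are not from the JDK sources.
final class LazyC2bSketch {
    // Counts how often the c2b table has been built (demo only).
    static int c2bInitCount = 0;

    // b2c data is needed by the decoder and lives in the outer class.
    static final char[] B2C = { 'A', 'B', 'C' };

    static char decode(int b) {
        return B2C[b - 1];                // bytes 1..3 map to 'A'..'C'
    }

    static final class Encoder {
        // Built in the static initializer of Encoder, which the JVM runs
        // only when Encoder is first used, not when the outer class is.
        static final byte[] C2B = buildC2b();

        private static byte[] buildC2b() {
            c2bInitCount++;
            byte[] table = new byte[128];
            for (int i = 0; i < B2C.length; i++)
                table[B2C[i]] = (byte) (i + 1);
            return table;
        }

        // Returns 0 for characters without a mapping.
        static byte encode(char c) {
            return c < C2B.length ? C2B[c] : 0;
        }
    }
}
```

Decoding alone never triggers buildC2b(); the table is created on the
first call into Encoder.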
-Ulf