Rewrite of IBM doublebyte charsets

Tue May 12 06:30:53 UTC 2009

Ulf, thanks for taking time reviewing the changes.

It seems most of the suggestions are based on two approaches you adopted 
in your charset-improvement project

(1)extract the mapping data into a binary data file and load in the data 
as the resource file during initialization of
charset decoder/encoder

(2)totally depends on catching the IndexOutOfBoundsException to avoid 
sanity check on any array index when access
the array data

For (1), as I explained last time,  the 30% more size saving is 
attractive,  but the magnitude of the charset startup/
initialization regression (reading data from the resource inputstream is 
much slower than loading in the data as a constant
String object of a class file) is not acceptable. I do have a version of 
implementation that goes that approach, I even
sent the webrev to Alan and Masayoshi for review, I "recalled" the 
review request after realized the startup regression.
That was 6 months ago (I uploaded the webrev at 
http://cr.openjdk.java.net/~sherman/Old_newEUC_TW.)
So for now it's not a choice for me to extract the data out.

For (2), I'm not convinced that this approach is an appropriate one for 
a complicated charset like EUC_TW,
given the number of array it carries, the recovery work (to trace back 
to what goes wrong and then return the
appropriate CoderResult) will be complicated and redundant...). This 
might have a benefit of saving the range
check (I don't have any data to show how much we can gain from doing 
this, only a guess), but given almost all
segments are near "full", I don't see the benefit on the footprint 
saving side. We need some hard data to support
this approach, which I don't have for now. I would leave this one for 
you for further optimization in your project.

I have updated the webrev to address some of your other optimization 
suggestions

(1)simplify the "plane number" byte check by adding a new static array 
of cnspToIndex[16] for decoder
(2)only do the isLeagalDB check when lookup failed (assume most bytes 
fed are legal...) for decoder
(3)save plane no as 0x0, 0x2...0x7 and 0xf  for encoder (this helps save 
the plane number check during
encoding).

Something I don't think I'm going to do

(1)No I don't think we want to save the supplementary into surrogate 
pair, this is what I'm trying to fix. We don't
care the performance of surrogates, those codepoints are RARE used, 99%+ 
coding/decoding happens in
BMP, we did not have the supplementary characters for the first couple 
years. (OK, I'm a native, I don't think
I can even read those characters)

(2)The initialization c2b data for encoder has already been "lazied" 
until Encoder class gets loaded.

(3)I doubt use BitSet for b2cIsSupp will save any memory, or faster. 
b2cIsSupp has been fully utilized to
use each/every bit of its storage.

I gave the latest version a "quick" run, I'm happy with the performance 
gain, especially the encoding.

-client
    Encoding TimeRatio EUC_TW_OLD/x-EUC-TW: 84927,10151 :8.366368d
    Decoding TimeRatio EUC_TW_OLD/x-EUC-TW: 13162,9237 :1.424922d
    Encoding(dir) TimeRatio EUC_TW_OLD/x-EUC-TW: 94829,19950 :4.753333d
    Decoding(dir) TimeRatio EUC_TW_OLD/x-EUC-TW: 21884,17992 :1.216318d

-server

    Encoding TimeRatio EUC_TW_OLD/x-EUC-TW: 74651,12218 :6.109920d
    Decoding TimeRatio EUC_TW_OLD/x-EUC-TW: 11686,10364 :1.127557d
    Encoding(dir) TimeRatio EUC_TW_OLD/x-EUC-TW: 79058,12187 :6.487076d
    Decoding(dir) TimeRatio EUC_TW_OLD/x-EUC-TW: 13518,10198 :1.325554d

The latest webrev is at

http://cr.openjdk.java.net/~sherman/6831794_6229811/webrev/

Thanks,
Sherman

Ulf Zibis wrote:
> Completed ...:
>
>
> *** Decoder-Suggestions:
>
> (10) Split map data files into chunks and load lazy.
>    TW native speakers must be consulted, to define reasonable chunks!
>    Benefit[17]: save startup time
>    Benefit[18]: save memory
>
> (11) Use java.util.BitSet for b2cIsSupp
>    Benefit[19]: save memory, maybe faster
>
>
> *** Encoder-Suggestions:
>
> (21) Initialize encoder mappings lazy, maybe split into reasonable 
> chunks:
>    Benefit[21]: increase startup performance for de/encoder
>
> (21) Save c2b and c2bPlane in 2-dimensional array:
>      char[][] c2b = new char[0x100][]
>      only instantiate actually used segments:
>      c2b[x] = new char[0x100]
>    Benefit[22]: save lookup and calculation of index, but add 1 
> indirection
>    Benefit[23]: save range-check for segment index (catch malformed 
> segment index by NPE)
>    Benefit[24]: save c2bIndex
>
> (22) In case of surrogate code points, use high surrogate (8 lower 
> bits) as segment index:
>      char[][] c2bSupp = new char[0x100][]
>      only instantiate actually used segments:
>      c2bSupp[x] = new char[0x400]
>    Benefit[25]: save encoding to UC4 from surrogate pairs (I guess, 
> this would significantly increase performance)
>    Benefit[26]: save lookup and calculation of index, but add 1 
> indirection
>    Benefit[27]: save range-check for segment index (catch malformed 
> segment index by NPE)
>    Benefit[28]: save c2bSuppIndex
>
> (23) Truncate c2b segments:
>      c2b[x] = new char[usedLength]
>      (usedLength values could be generated and saved in EUC_TWMapping 
> or data file)
>    Benefit[29]: avoid superfluous memory and disk-footprint (I guess 
> ~30 %)
>    Benefit[30]: don't range-check in-segment index, catch unmappable 
> index by IndexOutOfBoundsException
>
> (24) Additinally truncate leading unmappables in c2b segments, and 
> host offsets:
>    Benefit[31]: avoid another superfluous memory and disk-footprint (I 
> guess ~10 %)
>    Disadvantage[21]: needs hosting of offsets: 256 bytes
>
> (25) Concerning (23),(24): Check out best segment size (maybe 256 is 
> not optimal):
>    Benefit[32]: avoid another superfluous memory and disk-footprint (I 
> guess 10-20 %)
>
> (26) Concerning (22),(23),(24): maybe use 3-dim array and check out 
> best segment size (maybe 10 bit is not optimal):
>    Benefit[33]: avoid another superfluous memory and disk-footprint (I 
> guess 10-20 %)
>
> (27) Save Plane no. as 0x0, 0x2 .. 0x7 and 0xf:
>    Benefit[34]: simplify calculation of 2nd byte, increases performance
>
> (28) Save 2nd byte in c2bPlane directly (0xa2 .. 0xa7 and 0xaf) 
> instead of Plane no.:
>    Benefit[35]: save calculation of 2nd byte, increases performance
>    Disadvantage[22]: increases c2bPlane by ~73%
>
> -Ulf
>
>
> EUC_TW statistics (updated):
>
> Plane   range   length  segments  segments-usage-ratio
>
> 0    a1a1-fdcb   5868   5d = 93   66 %
> _0   a1a1-a744    434    7 = 7    65 %
> _1   c2a1-fdcb   5434   3c = 60   95 %
>
> 1:8ea2   -f2c4   7650   52 = 82   98 %
> 2:8ea3   -e7aa   6394   47 = 71   95 %
> 3:8ea4   -eedc   7286   4e = 78   98 %
> 4:8ea5   -fcd1   8601   5c = 92   98 %
> 5:8ea6   -e4fa   6385   44 = 68   99 %
> 6:8ea7   -ebd5   6532   4b = 75   98 %
> 7:8eaf   -edb9   8721   4d = 77   92 %
>
> Sum:             55446  262 = 610
>
> max b1 range: 5d = 93
> max b2 range: 5e = 94
>
> memory amount for all segments (not truncated):
> 610 * 94 = 57,340 code points
> truncated -4 % :  ~55,000 code points
> decoder surrogate mapping (*3):  ~165,000 bytes
>
> disk-footprint of EUC_TWMapping (1. Approach from Sherman):
> b2c           : 8 * 94 * 94 * 2.97 = 209,943 Bytes
> b2cIsSuppStr  : 94 * 94 * 1.48     =  13,077
> c2bIndex      : 256 * 7            =   1,792
> c2bSuppIndex  : 256 * 7            =   1,792
> Sum                                 ~227,000 Bytes
>
> memory of EUC_TW (1. Approach from Sherman):
> b2c           : 8 * 94 * 94 * 2 = 141,376 Bytes
> b2cIsSupp     : 94 * 94         =   8,836
> decoder sum   :                   150,212
> c2b           : 31744 * 2       =  63,488
> c2bIndex      : 256 * 2         =     512
> c2bSupp       : 43520 * 2       =  87,040
> c2bSuppIndex  : 256 * 2         =     512
> c2bPlane      : 43520 * 1       =  43,520
> encoder sum   :                   195,072
>
>