Rewrite of IBM doublebyte charsets

Sun May 10 22:57:26 UTC 2009

Completed ...:

*** Decoder-Suggestions:

(10) Split map data files into chunks and load them lazily.
    Native TW speakers must be consulted to define reasonable chunks!
    Benefit[17]: save startup time
    Benefit[18]: save memory
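
A minimal sketch of the lazy chunk loading in (10). The class LazyChunks, the Loader interface and the chunk layout are hypothetical; how the map data would really be split is left open above.

```java
// Hypothetical sketch: map data is split into chunks that are loaded on first use.
// Not thread-safe as written; a real charset would need safe publication.
final class LazyChunks {
    // Assumed callback that reads one chunk, e.g. from the data file.
    interface Loader {
        char[] load(int chunk);
    }

    private final char[][] chunks;
    private final Loader loader;
    private int loads;

    LazyChunks(int chunkCount, Loader loader) {
        this.chunks = new char[chunkCount][];
        this.loader = loader;
    }

    // Returns the chunk, loading it only the first time it is requested.
    char[] chunk(int i) {
        char[] c = chunks[i];
        if (c == null) {
            c = loader.load(i);
            chunks[i] = c;
            loads++;
        }
        return c;
    }

    // Number of chunks loaded so far.
    int loadCount() {
        return loads;
    }
}
```

Chunks that are never requested cost neither load time nor memory, which is the point of Benefit[17] and Benefit[18].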

(11) Use java.util.BitSet for b2cIsSupp
    Benefit[19]: save memory, maybe faster
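
A minimal sketch of (11), assuming one flag per b2c entry and a flat index over 8 planes of 94 x 94 cells. The class name and the index layout are hypothetical; whether this really saves memory depends on how b2cIsSupp is packed today.

```java
import java.util.BitSet;

// Hypothetical sketch: one bit per b2c entry, flagging supplementary code points.
final class SuppFlags {
    static final int PLANES = 8;   // matches the 8 * 94 * 94 layout of b2c
    static final int ROW = 94;     // valid byte values 0xa1..0xfe

    private final BitSet bits = new BitSet(PLANES * ROW * ROW);

    // plane is the plane index 0..7; b1 and b2 are raw bytes in 0xa1..0xfe.
    private static int index(int plane, int b1, int b2) {
        return (plane * ROW + (b1 - 0xa1)) * ROW + (b2 - 0xa1);
    }

    void markSupplementary(int plane, int b1, int b2) {
        bits.set(index(plane, b1, b2));
    }

    boolean isSupplementary(int plane, int b1, int b2) {
        return bits.get(index(plane, b1, b2));
    }
}
```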

*** Encoder-Suggestions:

(20) Initialize encoder mappings lazily, maybe split into reasonable chunks:
    Benefit[21]: increase startup performance for de/encoder

(21) Save c2b and c2bPlane in 2-dimensional array:
      char[][] c2b = new char[0x100][]
      only instantiate actually used segments:
      c2b[x] = new char[0x100]
    Benefit[22]: save lookup and calculation of index, but add 1 indirection
    Benefit[23]: save range-check for segment index (catch malformed segment index by NPE)
    Benefit[24]: save c2bIndex
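
A minimal sketch of the 2-dimensional c2b table. The class name, the UNMAPPABLE marker value and the put() helper are assumptions for illustration, not the existing EUC_TW code.

```java
import java.util.Arrays;

// Hypothetical sketch: the high byte of the char selects a segment,
// the low byte selects the entry inside that segment.
final class TwoLevelC2B {
    static final char UNMAPPABLE = '\uFFFD';   // assumed marker value

    private final char[][] c2b = new char[0x100][];

    void put(char c, char bytes) {
        int seg = c >> 8;
        if (c2b[seg] == null) {
            // Only instantiate segments that are actually used.
            c2b[seg] = new char[0x100];
            Arrays.fill(c2b[seg], UNMAPPABLE);
        }
        c2b[seg][c & 0xff] = bytes;
    }

    char encode(char c) {
        try {
            // No index table and no range check on the segment index.
            return c2b[c >> 8][c & 0xff];
        } catch (NullPointerException e) {
            // Unused segment: the char has no mapping.
            return UNMAPPABLE;
        }
    }
}
```

The exception path is only cheap if lookups into unused segments are rare; that would need to be measured.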

(22) In case of surrogate code points, use high surrogate (8 lower bits) as segment index:
      char[][] c2bSupp = new char[0x100][]
      only instantiate actually used segments:
      c2bSupp[x] = new char[0x400]
    Benefit[25]: save conversion of surrogate pairs to UCS-4 (I guess this would significantly
      increase performance)
    Benefit[26]: save lookup and calculation of index, but add 1 indirection
    Benefit[27]: save range-check for segment index (catch malformed segment index by NPE)
    Benefit[28]: save c2bSuppIndex
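
A minimal sketch of the surrogate lookup in (22). It assumes the caller has already verified that hi and lo form a valid surrogate pair and that all mapped high surrogates lie in 0xd800..0xd8ff, so that the 8 lower bits are unambiguous. Class name, marker value and put() helper are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch: index c2bSupp directly by the surrogate pair,
// without first combining the pair into a UCS-4 code point.
final class SuppC2B {
    static final char UNMAPPABLE = '\uFFFD';   // assumed marker value

    private final char[][] c2bSupp = new char[0x100][];

    void put(char hi, char lo, char bytes) {
        int seg = hi & 0xff;                   // 8 lower bits of the high surrogate
        if (c2bSupp[seg] == null) {
            // One segment covers all 0x400 low surrogates.
            c2bSupp[seg] = new char[0x400];
            Arrays.fill(c2bSupp[seg], UNMAPPABLE);
        }
        c2bSupp[seg][lo & 0x3ff] = bytes;
    }

    char encode(char hi, char lo) {
        try {
            return c2bSupp[hi & 0xff][lo & 0x3ff];
        } catch (NullPointerException e) {
            // Unused segment: the pair has no mapping.
            return UNMAPPABLE;
        }
    }
}
```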

(23) Truncate c2b segments:
      c2b[x] = new char[usedLength]
      (usedLength values could be generated and saved in EUC_TWMapping or data file)
    Benefit[29]: avoid superfluous memory and disk-footprint (I guess ~30 %)
    Benefit[30]: don't range-check in-segment index, catch unmappable index by IndexOutOfBoundsException
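
A minimal sketch of (23), assuming trailing unmappable entries are cut off when the segment is built. The helper names and the marker value are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch: segments keep only the entries up to the last mapped one.
final class TruncatedSegment {
    static final char UNMAPPABLE = '\uFFFD';   // assumed marker value

    // Drops the trailing run of unmappable entries.
    static char[] truncate(char[] full) {
        int usedLength = full.length;
        while (usedLength > 0 && full[usedLength - 1] == UNMAPPABLE) {
            usedLength--;
        }
        return Arrays.copyOf(full, usedLength);
    }

    // No explicit range check: an index past the end is caught instead.
    static char lookup(char[] segment, int index) {
        try {
            return segment[index];
        } catch (IndexOutOfBoundsException e) {
            return UNMAPPABLE;
        }
    }
}
```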

(24) Additionally truncate leading unmappables in c2b segments, and store the offsets:
    Benefit[31]: further reduces memory and disk footprint (I guess ~10 %)
    Disadvantage[21]: needs storage for the offsets: 256 bytes
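
A minimal sketch of (24), assuming one offset byte per segment (256 bytes in total). Class name, marker value and the putSegment() helper are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch: segments are trimmed at both ends; the number of
// dropped leading entries is kept in a 256-byte offset table.
final class OffsetSegments {
    static final char UNMAPPABLE = '\uFFFD';           // assumed marker value

    private final char[][] c2b = new char[0x100][];
    private final byte[] offsets = new byte[0x100];    // one offset per segment

    void putSegment(int seg, char[] full) {
        int first = 0;
        int last = full.length;
        while (first < last && full[first] == UNMAPPABLE) first++;
        while (last > first && full[last - 1] == UNMAPPABLE) last--;
        if (first == last) return;                     // empty segment stays null
        c2b[seg] = Arrays.copyOfRange(full, first, last);
        offsets[seg] = (byte) first;
    }

    char encode(char c) {
        int seg = c >> 8;
        try {
            // A negative or too large index means "unmappable".
            return c2b[seg][(c & 0xff) - (offsets[seg] & 0xff)];
        } catch (NullPointerException | IndexOutOfBoundsException e) {
            return UNMAPPABLE;
        }
    }
}
```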

(25) Concerning (23),(24): Find the best segment size (maybe 256 is not optimal):
    Benefit[32]: further reduces memory and disk footprint (I guess 10-20 %)

(26) Concerning (22),(23),(24): maybe use a 3-dim array and find the best segment size (maybe
    10 bits is not optimal):
    Benefit[33]: further reduces memory and disk footprint (I guess 10-20 %)

(27) Save Plane no. as 0x0, 0x2 .. 0x7 and 0xf:
    Benefit[34]: simplify calculation of 2nd byte, increases performance

(28) Save 2nd byte in c2bPlane directly (0xa2 .. 0xa7 and 0xaf) instead of Plane no.:
    Benefit[35]: save calculation of 2nd byte, increases performance
    Disadvantage[22]: increases c2bPlane by ~73%
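
A minimal sketch of (27), assuming the stored value 0x0 stands for the 2-byte plane and every other stored value is the low nibble of the 2nd byte, matching the 8ea2 .. 8eaf prefixes in the statistics. Class and method names are hypothetical.

```java
// Hypothetical sketch: derive the 2nd byte from the stored plane value with one OR.
final class PlaneBytes {
    static final int SS2 = 0x8e;   // lead byte of the 4-byte form

    // plane: 0x0 for the 2-byte plane, otherwise 0x2..0x7 or 0xf.
    // Writes the encoded bytes into dst and returns their count.
    static int encode(int plane, int b1, int b2, byte[] dst) {
        if (plane == 0) {
            dst[0] = (byte) b1;
            dst[1] = (byte) b2;
            return 2;
        }
        dst[0] = (byte) SS2;
        dst[1] = (byte) (0xa0 | plane);   // 0x2 -> 0xa2, ..., 0xf -> 0xaf
        dst[2] = (byte) b1;
        dst[3] = (byte) b2;
        return 4;
    }
}
```

With (28) the OR disappears as well, because c2bPlane would already hold 0xa2 .. 0xa7 or 0xaf.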

-Ulf

EUC_TW statistics (updated):

Plane range    length  segments   segments-usage-ratio
                       (hex = dec)

0    a1a1-fdcb   5868   5d = 93   66 %
_0   a1a1-a744    434    7 = 7    65 %
_1   c2a1-fdcb   5434   3c = 60   95 %

1:8ea2   -f2c4   7650   52 = 82   98 %
2:8ea3   -e7aa   6394   47 = 71   95 %
3:8ea4   -eedc   7286   4e = 78   98 %
4:8ea5   -fcd1   8601   5c = 92   98 %
5:8ea6   -e4fa   6385   44 = 68   99 %
6:8ea7   -ebd5   6532   4b = 75   98 %
7:8eaf   -edb9   8721   4d = 77   92 %

Sum:             55446  262 = 610

max b1 range: 5d = 93
max b2 range: 5e = 94

memory amount for all segments (not truncated):
610 * 94 = 57,340 code points
truncated -4 % :  ~55,000 code points
decoder surrogate mapping (*3):  ~165,000 bytes

disk-footprint of EUC_TWMapping (1st approach from Sherman):
b2c           : 8 * 94 * 94 * 2.97 = 209,943 Bytes
b2cIsSuppStr  : 94 * 94 * 1.48     =  13,077
c2bIndex      : 256 * 7            =   1,792
c2bSuppIndex  : 256 * 7            =   1,792
Sum                                 ~227,000 Bytes

memory of EUC_TW (1st approach from Sherman):
b2c           : 8 * 94 * 94 * 2 = 141,376 Bytes
b2cIsSupp     : 94 * 94         =   8,836
decoder sum   :                   150,212
c2b           : 31744 * 2       =  63,488
c2bIndex      : 256 * 2         =     512
c2bSupp       : 43520 * 2       =  87,040
c2bSuppIndex  : 256 * 2         =     512
c2bPlane      : 43520 * 1       =  43,520
encoder sum   :                   195,072
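
The two sums above can be re-checked with a few lines of Java. The class name is hypothetical; the element counts and sizes are exactly the ones listed.

```java
// Hypothetical helper: recomputes the decoder and encoder sums from the listed sizes.
final class MemoryEstimate {
    static int decoderBytes() {
        int b2c = 8 * 94 * 94 * 2;     // 8 planes, 94 x 94 cells, 2 bytes each
        int b2cIsSupp = 94 * 94;       // 1 byte per cell
        return b2c + b2cIsSupp;
    }

    static int encoderBytes() {
        int c2b = 31744 * 2;
        int c2bIndex = 256 * 2;
        int c2bSupp = 43520 * 2;
        int c2bSuppIndex = 256 * 2;
        int c2bPlane = 43520;          // 1 byte per entry
        return c2b + c2bIndex + c2bSupp + c2bSuppIndex + c2bPlane;
    }

    public static void main(String[] args) {
        System.out.println("decoder sum: " + decoderBytes());   // 150212
        System.out.println("encoder sum: " + encoderBytes());   // 195072
    }
}
```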