Rewrite of IBM doublebyte charsets
Ulf Zibis
Ulf.Zibis at gmx.de
Sun May 10 22:57:26 UTC 2009
Completed ...:
*** Decoder-Suggestions:
(10) Split map data files into chunks and load them lazily.
TW native speakers must be consulted, to define reasonable chunks!
Benefit[17]: save startup time
Benefit[18]: save memory
(11) Use java.util.BitSet for b2cIsSupp
Benefit[19]: save memory, maybe faster
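A rough sketch of what (11) could look like. I assume here one flag per (plane, b1, b2) position and a linear index over an 8 * 94 * 94 grid; the class name, the method names and that index layout are made up for illustration and are not the actual EUC_TW code.

```java
import java.util.BitSet;

// Hypothetical sketch of suggestion (11), not the actual EUC_TW decoder.
final class IsSuppSketch {
    static final int PLANES = 8;   // planes 0..7, as in the statistics
    static final int RANGE = 94;   // b1 and b2 each span 0xa1..0xfe

    // One bit per position instead of a wider array element.
    private final BitSet isSupp = new BitSet(PLANES * RANGE * RANGE);

    // Assumed linear index; the real table layout may differ.
    static int index(int plane, int b1, int b2) {
        return (plane * RANGE + (b1 - 0xa1)) * RANGE + (b2 - 0xa1);
    }

    void markSupplementary(int plane, int b1, int b2) {
        isSupp.set(index(plane, b1, b2));
    }

    boolean isSupplementary(int plane, int b1, int b2) {
        return isSupp.get(index(plane, b1, b2));
    }
}
```

Whether this is really faster than a plain array lookup would have to be measured.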
*** Encoder-Suggestions:
(20) Initialize encoder mappings lazily, maybe split into reasonable chunks:
Benefit[21]: increase startup performance for de/encoder
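One possible way to get the lazy initialization of the encoder mappings is the initialization-on-demand holder idiom. This is only a sketch: the class, the load() method and the counter are hypothetical, and load() just allocates an empty table where the real code would read the mapping data.

```java
// Hypothetical sketch: encoder tables are built on first use only.
final class LazyEncoderTables {
    static int loads = 0; // counts how often the tables were built (illustration only)

    // The JVM initializes Holder only when c2b() first touches Holder.C2B.
    private static final class Holder {
        static final char[][] C2B = load();
    }

    // Stand-in for reading the real mapping data.
    private static char[][] load() {
        loads++;
        return new char[0x100][];
    }

    static char[][] c2b() {
        return Holder.C2B;
    }
}
```

A decoder-only user would never trigger load(), so the encoder tables cost nothing until an encoder is actually used.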
(21) Save c2b and c2bPlane in 2-dimensional array:
char[][] c2b = new char[0x100][]
only instantiate actually used segments:
c2b[x] = new char[0x100]
Benefit[22]: save lookup and calculation of index, but add 1 indirection
Benefit[23]: save range-check for segment index (catch malformed segment index by NPE)
Benefit[24]: save c2bIndex
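The two-level c2b table could be sketched roughly as follows. The class name, the put() helper and the UNMAPPABLE sentinel are assumptions of mine; only the char[0x100][] layout and the NPE trick come from the suggestion.

```java
import java.util.Arrays;

// Hypothetical sketch of the 2-dimensional c2b table.
final class C2bSegments {
    static final char UNMAPPABLE = '\uFFFD'; // assumed sentinel for "no mapping"

    private final char[][] c2b = new char[0x100][];

    // Instantiate a segment only when a mapping actually lands in it.
    void put(char c, char bytes) {
        int hi = c >> 8;
        if (c2b[hi] == null) {
            c2b[hi] = new char[0x100];
            Arrays.fill(c2b[hi], UNMAPPABLE);
        }
        c2b[hi][c & 0xff] = bytes;
    }

    // No c2bIndex lookup and no range check: an unused segment is null,
    // so the access throws NullPointerException, which is caught here.
    char encode(char c) {
        try {
            return c2b[c >> 8][c & 0xff];
        } catch (NullPointerException e) {
            return UNMAPPABLE;
        }
    }
}
```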
(22) In case of surrogate code points, use high surrogate (8 lower bits) as segment index:
char[][] c2bSupp = new char[0x100][]
only instantiate actually used segments:
c2bSupp[x] = new char[0x400]
Benefit[25]: save the conversion of surrogate pairs to UCS-4 (I guess this would significantly
increase performance)
Benefit[26]: save lookup and calculation of index, but add 1 indirection
Benefit[27]: save range-check for segment index (catch malformed segment index by NPE)
Benefit[28]: save c2bSuppIndex
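A sketch of the surrogate variant. It assumes that all mapped supplementary characters have high surrogates that differ in their lower 8 bits (true if they all lie in one Unicode plane, which I assume for EUC_TW); otherwise different high surrogates would collide in the same segment. Names and the UNMAPPABLE sentinel are hypothetical, and a null check is used here instead of catching the NPE.

```java
import java.util.Arrays;

// Hypothetical sketch: index c2bSupp directly by the surrogate pair.
final class C2bSuppSegments {
    static final char UNMAPPABLE = '\uFFFD'; // assumed sentinel for "no mapping"

    private final char[][] c2bSupp = new char[0x100][];

    void put(char hi, char lo, char bytes) {
        int seg = hi & 0xff;                 // lower 8 bits of the high surrogate
        if (c2bSupp[seg] == null) {
            c2bSupp[seg] = new char[0x400];  // one slot per low surrogate
            Arrays.fill(c2bSupp[seg], UNMAPPABLE);
        }
        c2bSupp[seg][lo & 0x3ff] = bytes;    // 0xDC00..0xDFFF -> 0..0x3ff
    }

    // The code point is never assembled from the pair.
    char encode(char hi, char lo) {
        char[] seg = c2bSupp[hi & 0xff];
        return seg == null ? UNMAPPABLE : seg[lo & 0x3ff];
    }
}
```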
(23) Truncate c2b segments:
c2b[x] = new char[usedLength]
(usedLength values could be generated and saved in EUC_TWMapping or data file)
Benefit[29]: avoid superfluous memory and disk-footprint (I guess ~30 %)
Benefit[30]: don't range-check in-segment index, catch unmappable index by IndexOutOfBoundsException
(24) Additionally truncate leading unmappables in c2b segments, and store the offsets:
Benefit[31]: avoid another superfluous memory and disk-footprint (I guess ~10 %)
Disadvantage[21]: needs storage for the offsets: 256 bytes
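Truncating both ends of a segment, as in (23) and (24), might look roughly like this. The helper names and the UNMAPPABLE sentinel are my assumptions. The sketch uses an explicit bounds check where the suggestion would let IndexOutOfBoundsException signal the unmappable case.

```java
import java.util.Arrays;

// Hypothetical sketch of truncated c2b segments with per-segment offsets.
final class TruncatedSegments {
    static final char UNMAPPABLE = '\uFFFD'; // assumed sentinel for "no mapping"

    private final char[][] c2b = new char[0x100][];
    private final byte[] offsets = new byte[0x100]; // the extra 256 bytes

    // Takes a full 256-entry segment and keeps only first..last mapped entry.
    void putSegment(int segIndex, char[] full) {
        int first = 0;
        int last = full.length - 1;
        while (first <= last && full[first] == UNMAPPABLE) first++;
        while (last >= first && full[last] == UNMAPPABLE) last--;
        if (first > last) return;            // nothing mapped, leave segment null
        c2b[segIndex] = Arrays.copyOfRange(full, first, last + 1);
        offsets[segIndex] = (byte) first;
    }

    char encode(char c) {
        int hi = c >> 8;
        char[] seg = c2b[hi];
        if (seg == null) return UNMAPPABLE;
        int i = (c & 0xff) - (offsets[hi] & 0xff);
        return (i < 0 || i >= seg.length) ? UNMAPPABLE : seg[i];
    }

    int storedLength(int segIndex) {
        return c2b[segIndex] == null ? 0 : c2b[segIndex].length;
    }
}
```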
(25) Concerning (23),(24): Find out the best segment size (maybe 256 is not optimal):
Benefit[32]: avoid another superfluous memory and disk-footprint (I guess 10-20 %)
(26) Concerning (22),(23),(24): maybe use a 3-dimensional array and find out the best segment
size (maybe 10 bits is not optimal):
Benefit[33]: avoid another superfluous memory and disk-footprint (I guess 10-20 %)
(27) Save Plane no. as 0x0, 0x2 .. 0x7 and 0xf:
Benefit[34]: simplify calculation of 2nd byte, increases performance
(28) Save 2nd byte in c2bPlane directly (0xa2 .. 0xa7 and 0xaf) instead of Plane no.:
Benefit[35]: save calculation of 2nd byte, increases performance
Disadvantage[22]: increases c2bPlane by ~73%
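To illustrate (27) and (28): if the stored plane value is already 0x2 .. 0x7 or 0xf, the 2nd byte falls out of a single OR with 0xa0, and with (28) even that disappears. This is my reading of the suggestion; the class and method names are hypothetical, and the 0x8e lead byte is taken from the plane prefixes in the statistics (8ea2 .. 8eaf).

```java
// Hypothetical sketch of building the 4-byte form from a stored plane value.
final class PlaneByteSketch {
    // (27): stored value is 0x2..0x7 or 0xf, so 2nd byte = 0xa0 | value.
    static int secondByteFromPlaneNo(int storedPlane) {
        return 0xa0 | storedPlane;
    }

    // (28): c2bPlane already holds 0xa2..0xa7 or 0xaf, nothing to compute.
    static int secondByteStoredDirectly(int storedByte) {
        return storedByte;
    }

    // Assemble lead byte, plane byte and the two mapped bytes.
    static byte[] encode4(int storedPlane, char b1b2) {
        return new byte[] {
            (byte) 0x8e,
            (byte) secondByteFromPlaneNo(storedPlane),
            (byte) (b1b2 >> 8),
            (byte) b1b2
        };
    }
}
```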
-Ulf
EUC_TW statistics (updated):
Plane     range       length   segments    segments-usage-ratio
0         a1a1-fdcb     5868   5d = 93     66 %
_0        a1a1-a744      434    7 =  7     65 %
_1        c2a1-fdcb     5434   3c = 60     95 %
1:8ea2        -f2c4     7650   52 = 82     98 %
2:8ea3        -e7aa     6394   47 = 71     95 %
3:8ea4        -eedc     7286   4e = 78     98 %
4:8ea5        -fcd1     8601   5c = 92     98 %
5:8ea6        -e4fa     6385   44 = 68     99 %
6:8ea7        -ebd5     6532   4b = 75     98 %
7:8eaf        -edb9     8721   4d = 77     92 %
Sum:                   55446  262 = 610
max b1 range: 5d = 93
max b2 range: 5e = 94
memory amount for all segments (not truncated):
610 * 94 = 57,340 code points
truncated -4 % : ~55,000 code points
decoder surrogate mapping (*3): ~165,000 bytes
disk-footprint of EUC_TWMapping (1st approach from Sherman):
b2c          : 8 * 94 * 94 * 2.97 = 209,943 bytes
b2cIsSuppStr :     94 * 94 * 1.48 =  13,077
c2bIndex     :            256 * 7 =   1,792
c2bSuppIndex :            256 * 7 =   1,792
Sum                               ~ 227,000 bytes
memory of EUC_TW (1st approach from Sherman):
b2c          : 8 * 94 * 94 * 2 = 141,376 bytes
b2cIsSupp    :         94 * 94 =   8,836
decoder sum  :                   150,212
c2b          :       31744 * 2 =  63,488
c2bIndex     :         256 * 2 =     512
c2bSupp      :       43520 * 2 =  87,040
c2bSuppIndex :         256 * 2 =     512
c2bPlane     :       43520 * 1 =  43,520
encoder sum  :                   195,072
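
The sums above can be re-derived mechanically. A small sketch that uses only the figures from these tables; the class and method names are mine.

```java
// Recomputes the memory figures from the element counts listed above.
final class EucTwMemoryCheck {
    // b2c (chars, 2 bytes each) plus b2cIsSupp (1 byte each)
    static int decoderBytes() {
        return 8 * 94 * 94 * 2 + 94 * 94;
    }

    // c2b, c2bIndex, c2bSupp, c2bSuppIndex (chars) plus c2bPlane (bytes)
    static int encoderBytes() {
        return 31744 * 2 + 256 * 2 + 43520 * 2 + 256 * 2 + 43520 * 1;
    }

    // 610 segments with the maximum b2 range of 94 each, not truncated
    static int untruncatedSegmentEntries() {
        return 610 * 94;
    }

    public static void main(String[] args) {
        System.out.println("decoder sum : " + decoderBytes());
        System.out.println("encoder sum : " + encoderBytes());
        System.out.println("segments    : " + untruncatedSegmentEntries());
    }
}
```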