Rewrite of IBM doublebyte charsets
Ulf Zibis
Ulf.Zibis at gmx.de
Sat May 9 17:50:39 UTC 2009
On 01.05.2009 08:48, Xueming Shen wrote:
> Hi,
>
> While I'm waiting for Alan's code-review result for my rewriting of
> EUC_TW
> http://cr.openjdk.java.net/~sherman/6831794_6229811/webrev
> (much faster, much smaller, near 8% decrease of size of charsets.jar
> with one
> charset update. OK, it's a shame...I meant the old data structure)
EUC_TW statistics:
Plane    range       length  segments    segments-usage-ratio
0        a1a1-fdcb     5868   5d =  93   66 %
  _0     a1a1-a744      434    7 =   7   65 %
  _1     c2a1-fdcb     5434   3c =  60   95 %
1:8ea2       -f2c4     7650   52 =  82   98 %
2:8ea3       -e7aa     6394   47 =  71   95 %
3:8ea4       -eedc     7286   4e =  78   98 %
4:8ea5       -fcd1     8601   5c =  92   98 %
5:8ea6       -e4fa     6385   44 =  68   99 %
6:8ea7       -ebd5     6532   4b =  75   98 %
7:8eaf       -edb9     8721   4d =  77   92 %
Sum:                  55446  262 = 610
memory amount for all segments (not truncated):
610 * 95 = 57950 code points
*** Decoder-Suggestions:
(1) Increase dimension of b2c and decouple plane 0:
String[] b2c = new String[0x10]
String b2c_0 = ...
Benefit[1]: save calculation of plane no. to range 0..7 (but mask by
0xa0)
Benefit[2]: save range-check for plane (catch malformed plane by NPE)
sophisticated (additionally save masking of plane no.):
String[] b2c = new String[0xb0]
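A rough sketch of the "sophisticated" variant of (1), only to illustrate the idea. The class and method names are hypothetical, the mapping strings are tiny placeholders rather than real EUC_TW data, and the plane byte is assumed to be passed already unsigned (0..255):

```java
// Hypothetical sketch of suggestion (1); the mapping data is a placeholder.
final class PlaneLookup {
    // Plane 0 is decoupled into its own string.
    static final String B2C_0 = "\u4E00\u4E01";
    // Indexed directly by the unsigned plane byte; undefined planes stay null.
    static final String[] B2C = new String[0xb0];
    static {
        B2C[0xa2] = "\u4E42\u4E5C";
        B2C[0xaf] = "\u5159";
    }

    /** Returns the mapped char, or -1 if plane byte or index is malformed. */
    static int decode(int planeByte, int index) {
        try {
            return B2C[planeByte].charAt(index);
        } catch (RuntimeException e) {
            // NullPointerException: undefined plane (null slot);
            // IndexOutOfBoundsException: plane byte >= 0xb0 or bad string index.
            return -1;
        }
    }
}
```

No explicit range check and no plane-number calculation is left in the hot path; malformed input only costs something when it actually occurs.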
(2) Save Strings in 2-dimensional array:
String[][] b2c = new String[0x10][]
String[] b2c_0 = new String[0x5d]
b2c[0x2] = new String[0x52]
b2c[0x3] = new String[0x47]
b2c[0x4] = new String[0x4e]
b2c[0x5] = new String[0x5c]
b2c[0x6] = new String[0x44]
b2c[0x7] = new String[0x4b]
b2c[0xf] = new String[0x4d]
sophisticated (segments a8..c1 are unused in plane 0):
String[] b2c_0 = new String[0x07]
String[] b2c_1 = new String[0x3c]
Benefit[3]: save calculation of index (multiplying with dbSegSize),
but add 1 indirection
Benefit[4]: save range-check for segment index (catch malformed
segment index by NPE)
Benefit[5]: save range-check for String index (catch malformed
String indexes by IndexOutOfBoundsException)
Benefit[6]: avoid 22 % superfluous memory and disk-footprint
(3) Truncate Strings (catch unmappable String indexes by
IndexOutOfBoundsException):
Benefit[7]: save another 4 % superfluous memory and disk-footprint
Note: All exceptions can be caught at once, as they are all subclasses of
RuntimeException.
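Suggestions (2) and (3) together might look roughly like this. It is a sketch under assumptions: the names are hypothetical, only one plane with one truncated segment is filled with placeholder chars, and lead/trail bytes are assumed to start at 0xa1:

```java
// Hypothetical sketch of suggestions (2) and (3); data is a placeholder.
final class SegmentLookup {
    // First dimension: unsigned plane byte; second dimension: segment index.
    static final String[][] B2C = new String[0xb0][];
    static {
        B2C[0xa2] = new String[0x52];
        // Truncated segment: only the first 2 positions are mapped.
        B2C[0xa2][0] = "\u4E42\u4E5C";
    }

    /** Returns the mapped char, or -1 for malformed or unmappable input. */
    static int decode(int plane, int b1, int b2) {
        try {
            return B2C[plane][b1 - 0xa1].charAt(b2 - 0xa1);
        } catch (RuntimeException e) {
            // One handler covers NullPointerException (undefined plane or
            // segment) and IndexOutOfBoundsException (bad segment or index).
            return -1;
        }
    }
}
```

The index is no longer computed by multiplying with dbSegSize; the price is the one extra indirection mentioned in Benefit[3].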
(4) Save mappings in data file (modified UTF-8-saved chars need 2.97
bytes on average):
Benefit[8]: save modified UTF-8 decoding while loading class file
Benefit[9]: avoid another 48 % superfluous disk-footprint
Note: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795536
(I have just created a patch, but I'm waiting for launch of OpenJDK-7
project "charset-enhancement".)
Disadvantage[1]: loading data from jar-file may be slow, but ...
- host data file outside of jar, as loading by
nio.channel.FileChannel from direct buffer should be fast
- resolve Bugs:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6818736
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6818737
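To make (4) a bit more concrete, here is a minimal sketch that stores a mapping as raw 16-bit chars (exactly 2 bytes each) and reads it back through a FileChannel into a direct buffer. The class name, the file layout (plain big-endian chars, no header) and the temp-file round trip are my assumptions for illustration only:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch of suggestion (4); the file layout is an assumption.
final class MappingFile {
    /** Writes each char as 2 big-endian bytes. */
    static void store(Path file, String mapping) {
        ByteBuffer buf = ByteBuffer.allocate(mapping.length() * 2);
        buf.asCharBuffer().put(mapping);
        try {
            Files.write(file, buf.array());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the whole file into a direct buffer and views it as chars. */
    static String load(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocateDirect((int) ch.size());
            while (buf.hasRemaining() && ch.read(buf) >= 0) {
                // keep reading until the buffer is full or EOF is reached
            }
            buf.flip();
            return buf.asCharBuffer().toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Stores and reloads a mapping via a temporary file. */
    static String roundTrip(String mapping) {
        try {
            Path f = Files.createTempFile("b2c", ".dat");
            try {
                store(f, mapping);
                return load(f);
            } finally {
                Files.delete(f);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```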
(5) Generate mappings as surrogate pairs:
High surrogates could be saved as bytes and ORed with 0xd800, as they
won't exceed 0xd880
Benefit[10]: save decoding to surrogate pairs (I guess, this would
significantly increase performance)
Benefit[11]: save b2cIsSupp[] (saves another 4 % memory and
disk-footprint)
Disadvantage[2]: memory and disk-footprint would again increase by 50 %
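The byte trick from (5) could be sketched like this. High surrogates lie in 0xd800..0xdbff, so the sketch keeps only the low byte and restores the char by ORing with 0xd800; it therefore only works for high surrogates up to 0xd8ff. Class and method names are hypothetical:

```java
// Hypothetical sketch of the byte-packed high surrogate from suggestion (5).
final class SurrogateBytes {
    /** Compresses a high surrogate in 0xd800..0xd8ff to a single byte. */
    static byte pack(char high) {
        if ((high & 0xff00) != 0xd800) {
            throw new IllegalArgumentException("not in 0xd800..0xd8ff");
        }
        return (byte) high;
    }

    /** Restores the high surrogate from its packed byte. */
    static char unpack(byte b) {
        return (char) (0xd800 | (b & 0xff));
    }
}
```

For example, U+20000 has the high surrogate 0xd840, which packs to the byte 0x40.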
(6) Change parameters of decode() method:
static void decode(byte[] src, char[] dst, int sp, int sl, int dp,
int dl, int p) ("beta" approach)
speeds up buffer access + avoids c1, c1 buffering
Benefit[12]: increase performance
Disadvantage[3]: need different methods for direct buffers
(7) Provide 4-way fork from de/encodeLoop():
See:
https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/trunk/src/sun/nio/cs/SingleByteEncoder_new.java?rev=&view=markup
Benefit[13]: increase performance, if there is only 1 direct buffer
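For readers who don't want to follow the link, the 4-way fork in (7) boils down to something like the sketch below. This is not the code from SingleByteEncoder_new.java; the names are hypothetical and the mapping is a trivial ISO-8859-1 style placeholder, so only the dispatch on hasArray() is the point:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

// Hypothetical sketch of a 4-way fork on array-backed vs. other buffers.
final class FourWayFork {
    // Placeholder single-byte mapping, not a real double-byte table.
    static char map(byte b) { return (char) (b & 0xff); }

    /** Decodes as many bytes as fit and returns the name of the path taken. */
    static String decodeLoop(ByteBuffer src, CharBuffer dst) {
        int n = Math.min(src.remaining(), dst.remaining());
        if (src.hasArray() && dst.hasArray()) {
            byte[] sa = src.array();
            char[] da = dst.array();
            int sp = src.arrayOffset() + src.position();
            int dp = dst.arrayOffset() + dst.position();
            for (int i = 0; i < n; i++) da[dp + i] = map(sa[sp + i]);
            src.position(src.position() + n);
            dst.position(dst.position() + n);
            return "array->array";
        }
        if (src.hasArray()) {
            byte[] sa = src.array();
            int sp = src.arrayOffset() + src.position();
            for (int i = 0; i < n; i++) dst.put(map(sa[sp + i]));
            src.position(src.position() + n);
            return "array->buffer";
        }
        if (dst.hasArray()) {
            char[] da = dst.array();
            int dp = dst.arrayOffset() + dst.position();
            for (int i = 0; i < n; i++) da[dp + i] = map(src.get());
            dst.position(dst.position() + n);
            return "buffer->array";
        }
        for (int i = 0; i < n; i++) dst.put(map(src.get()));
        return "buffer->buffer";
    }
}
```

With only two paths (both arrays, or generic buffers for everything else), a single direct buffer forces both sides onto the slow path; the two mixed paths avoid that.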
(8) Exit the coders' xBufferLoop via an exception on xflow (overflow/underflow):
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6806227
Benefit[14]: increase performance
(9) Get rid of sun.io package dependency:
https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/tags/milestone2/src/sun/io/
Benefit[15]: avoid superfluous disk-footprint
Benefit[16]: save maintenance of sun.io converters
Disadvantage[4]: published under JRL (waiting for launch of
OpenJDK-7 project "charset-enhancement") ;-)
*** Encoder-Suggestions (not complete, just some thoughts):
(11) Initialize encoder mappings lazily:
Benefit[17]: increase startup performance for decoder
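One possible way to get the lazy initialization in (11) is the nested holder class idiom, sketched below. The names, the placeholder decoder string and the HashMap-based encoder table are assumptions; a real encoder would rather use compact char arrays:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of suggestion (11); data and table layout are placeholders.
final class LazyEncoderTable {
    static final String B2C = "\u4E00\u4E01\u4E03"; // placeholder decoder data
    static int initCount = 0; // only here to make the laziness observable

    // The JVM initializes this nested class on first access, i.e. on the
    // first encode() call, never for decode-only usage.
    private static final class C2B {
        static final Map<Character, Integer> TABLE = build();
    }

    private static Map<Character, Integer> build() {
        initCount++;
        Map<Character, Integer> m = new HashMap<>();
        for (int i = 0; i < B2C.length(); i++) m.put(B2C.charAt(i), i);
        return m;
    }

    static char decode(int index) { return B2C.charAt(index); }

    /** Returns the index of c in the decoder data, or -1 if unmappable. */
    static int encode(char c) {
        Integer i = C2B.TABLE.get(c);
        return i == null ? -1 : i;
    }
}
```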
(12) Generate mappings for surrogate pairs:
Benefit[18]: save encoding from surrogate pairs (I guess, this would
significantly increase performance)
(13) Introduce 16-bit intermediate mapping ("beta"-thoughts: overall
count of code points is < 65536):
Benefit[19]: avoid superfluous memory and disk-footprint
-Ulf