Rewrite of IBM doublebyte charsets
Ulf Zibis
Ulf.Zibis at gmx.de
Sun May 17 18:19:28 UTC 2009
On 14.05.2009 22:55, Xueming Shen wrote:
> Thanks again for taking time on this. Here is the IBM db charsets webrev
>
> http://cr.openjdk.java.net/~sherman/ibmdb/webrev
>
> This is a bigger fish than the EUC_TW:-)
>
*** Decoder-Suggestions:
(1) Unused imports in DoubleByte-X.java:
import java.util.Arrays;
import sun.nio.cs.StandardCharsets;
import static sun.nio.cs.CharsetMapping.*;
import sun.nio.cs.ext.DoubleByte; // or instead: import static sun.nio.cs.ext.DoubleByte.*;
(2) Please extract the de/encoder classes into separate java files:
In a tabbed editor it is much more comfortable to select a tab than
to scroll 760 lines up and down.
DoubleByteXcoder
EBCDICXcoder
DoubleByteOnlyXcoder
EUCSimpleXcoder
(3) Modify dimension of b2c:
char[][] b2c = new char[0x100][segSize];
so decoding becomes:
public char decodeDouble(int b1, int b2) {
if ((b2-=b2Min) < 0 || b2 >= segSize)
return UNMAPPABLE_DECODING;
return b2c[b1][b2];
}
Benefit[1]: increase performance of decoder
Benefit[2]: reduce memory of B2C_UNMAPPABLE from 8192 to 512 bytes
Benefit[3]: some of b2c pages could be saved (if only containing \uFFFD)
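A minimal sketch of suggestion (3), assuming one shared all-unmappable segment stands in for every unused lead byte. The class name B2cTableSketch, the put() helper and the chosen b2Min/b2Max values are hypothetical and are not taken from the webrev:

```java
import java.util.Arrays;

/** Hypothetical sketch of a two-dimensional b2c table; not the webrev's DoubleByte class. */
final class B2cTableSketch {
    static final char UNMAPPABLE_DECODING = '\uFFFD';

    final int b2Min;
    final int segSize;
    final char[][] b2c = new char[0x100][];
    private final char[] unmappable;

    B2cTableSketch(int b2Min, int b2Max) {
        this.b2Min = b2Min;
        this.segSize = b2Max - b2Min + 1;
        // One shared segment serves every lead byte that has no mappings.
        unmappable = new char[segSize];
        Arrays.fill(unmappable, UNMAPPABLE_DECODING);
        Arrays.fill(b2c, unmappable);
    }

    /** Registers a mapping; a private segment is allocated on first use of b1. */
    void put(int b1, int b2, char c) {
        if (b2c[b1] == unmappable)
            b2c[b1] = unmappable.clone();
        b2c[b1][b2 - b2Min] = c;
    }

    /** Same lookup as in the suggestion above. */
    char decodeDouble(int b1, int b2) {
        if ((b2 -= b2Min) < 0 || b2 >= segSize)
            return UNMAPPABLE_DECODING;
        return b2c[b1][b2];
    }
}
```

Only lead bytes that actually carry mappings pay for a segment of their own; all others point at the same array.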
(4) Don't check b2Max (it is always close to 0xff):
Benefit[4]: another performance increase of decoder (only check:
(b2-=b2Min) < 0)
(5) Truncate String segments (there are 65 % "\uFFFD" in IBM933):
(fill b2c segments first with "\uFFFD", then initialize)
Benefit[5]: save up to 180 % superfluous memory and disk-footprint
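The fill-then-initialize step of (5) could look roughly like this. It is a sketch under the assumption that only trailing "\uFFFD" runs are cut from each String segment; SegmentInitSketch and expand() are hypothetical names:

```java
import java.util.Arrays;

/** Hypothetical helper: rebuilds a full b2c segment from a truncated String. */
final class SegmentInitSketch {
    static final char UNMAPPABLE_DECODING = '\uFFFD';

    /** Assumes truncated.length() <= segSize. */
    static char[] expand(String truncated, int segSize) {
        char[] seg = new char[segSize];
        // Fill first, so that everything beyond the stored prefix is unmappable.
        Arrays.fill(seg, UNMAPPABLE_DECODING);
        truncated.getChars(0, truncated.length(), seg, 0);
        return seg;
    }
}
```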
(6) Unload b2cStr from memory after startup:
- outsource b2cStr to additional class file like EUC_TW approach
- set b2cStr = null after startup (remove final modifier)
Benefit[6]: avoid 100 % superfluous memory-footprint
(7) Avoid copying b2cStr to b2c:
(String#charAt() is as fast as char[] access)
Benefit[7]: increase startup performance for decoder
(8) Truncate b2c segments (catch unmappable indexes by RuntimeException):
Benefit[8]: save up to 180 % superfluous memory-footprint
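A rough sketch of (8), assuming every segment is non-null (unused lead bytes get an empty array) and that trailing unmappables are simply not stored. TruncatedB2cSketch is a hypothetical name:

```java
/** Hypothetical sketch: truncated segments, out-of-range indexes caught by exception. */
final class TruncatedB2cSketch {
    static final char UNMAPPABLE_DECODING = '\uFFFD';

    static char decodeDouble(char[][] b2c, int b2Min, int b1, int b2) {
        try {
            // No explicit range check: a negative index or one past the
            // truncated segment end raises ArrayIndexOutOfBoundsException.
            return b2c[b1][b2 - b2Min];
        } catch (ArrayIndexOutOfBoundsException e) {
            return UNMAPPABLE_DECODING;
        }
    }
}
```

Whether this pays off depends on how rare unmappable input is, since a thrown exception costs far more than a comparison.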
(9) Share mappings (IBM930 and IBM939 are 99 % identical):
Benefit[9]: save up to 99 % superfluous disk-footprint
Benefit[10]: save up to 99 % superfluous memory-footprint (if both
charsets are loaded)
(10) Provide 4-way fork from de/encodeLoop():
See:
https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/trunk/src/sun/nio/cs/SingleByteEncoder_new.java?rev=&view=markup
Benefit[11]: increase performance, if there is only 1 direct buffer
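One possible reading of the 4-way fork in (10), assuming the dispatch is done on hasArray() of each buffer. The class FourWayForkSketch and the returned labels are purely illustrative and are not copied from the linked file:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

/** Hypothetical sketch: pick one of four loops depending on which buffers expose an array. */
final class FourWayForkSketch {
    static String chooseLoop(ByteBuffer src, CharBuffer dst) {
        if (src.hasArray()) {
            // Source bytes can be read straight from the backing array.
            return dst.hasArray() ? "array->array" : "array->buffer";
        }
        // Source is e.g. a direct buffer: read via get(), but still write
        // to the destination array when there is one.
        return dst.hasArray() ? "buffer->array" : "buffer->buffer";
    }
}
```

With only a 2-way fork, a single direct buffer forces both sides onto the slower buffer loop; the two mixed cases avoid that.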
(11) Exit the coders' xBufferLoop by exception on xflow (overflow/underflow):
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6806227
Benefit[12]: increase performance
(12) Get rid of sun.io package dependency:
https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/tags/milestone2/src/sun/io/
Benefit[13]: avoid superfluous disk-footprint
Benefit[14]: save maintenance of sun.io converters
Disadvantage[1]: published under JRL (waiting for launch of OpenJDK-7
project "charset-enhancement") ;-)
(13) Take data files into account _once more_:
Following the suggestions above, the data files should be much smaller
than for EUC_TW,
so the loading time from the jar via getResourceAsStream() could be
acceptable. If some day
Bug ID 6818736, 6818736 were solved, we could profit once more,
without doing much.
Benefit[15]: avoid 50 % superfluous disk-footprint
Benefit[16]: sharing of map data for different charsets becomes more
simple
(14) Split map data files into chunks and load them lazily.
TW native speakers must be consulted, to define reasonable chunks!
Benefit[17]: save startup time
Benefit[18]: save memory
Benefit[19]: sharing of map data becomes much more simple
(15) Diff also IBM1381.java against IBM1383.java and see similarity
(16) decodeArrayLoop: shortcut calculation of limits:
int sl = sp + src.remaining();
int dl = dp + dst.remaining();
(17) Decoder#decodeArrayLoop: shortcut for single byte only:
int sr = src.remaining();
int sl = sp + sr;
int dr = dst.remaining();
int dl = dp + dr;
// single byte only loop
int slSB = sp + (sr < dr ? sr : dr);
while (sp < slSB) {
char c = b2cSB[sa[sp] & 0xff];
if (c == UNMAPPABLE_DECODING)
break;
da[dp++] = c;
sp++;
}
Same for Encoder#encodeArrayLoop
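The single-byte shortcut of (17) as a self-contained sketch. The method signature and the class name SingleByteFastPathSketch are hypothetical; the limit is computed as sp plus the smaller of the two remaining counts:

```java
/** Hypothetical sketch of the single-byte-only loop in front of the general loop. */
final class SingleByteFastPathSketch {
    static final char UNMAPPABLE_DECODING = '\uFFFD';

    /** Returns the number of bytes consumed before a non-single-byte input or a limit was hit. */
    static int decodeSingleBytes(byte[] sa, int sp, int sr,
                                 char[] da, int dp, int dr, char[] b2cSB) {
        int start = sp;
        // Stop at whichever runs out first: source bytes or destination space.
        int slSB = sp + Math.min(sr, dr);
        while (sp < slSB) {
            char c = b2cSB[sa[sp] & 0xff];   // mask, bytes are signed in Java
            if (c == UNMAPPABLE_DECODING)
                break;                       // leave the rest to the general loop
            da[dp++] = c;
            sp++;
        }
        return sp - start;
    }
}
```

The caller would advance sp and dp by the returned count and continue with the regular double-byte loop.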
(18) Decoder_EBCDIC: boolean singlebyteState:
if (singlebyteState)
...
(19) Decoder_EBCDIC: decode single byte first:
if (singlebyteState)
c = b2cSB[b1];
if (c == UNMAPPABLE_DECODING) {
...
}
Benefit[20]: should be faster
*** Encoder-Suggestions:
(21) Merge *.nr into *.c2b files (25->000a becomes 000a->fffd):
Benefit[21]: reduce no. of files
Benefit[22]: simplifies initC2B() (avoids 2 loops)
(22) Save c2b in 2-dimensional array:
char[][] c2b = new char[0x100][]
set unused segments to 256-size UNMAPPABLE_ENCODING[]
Benefit[23]: save calculation of index in encodeChar() --> little faster
Benefit[24]: initC2B() becomes faster
- the huge c2b[] is initialized twice, first with 0 (according to the JLS)
and then with UNMAPPABLE_ENCODING
- only fill 256 bytes with UNMAPPABLE_ENCODING, and get copies by
Arrays.copyOf()
Benefit[25]: save c2bIndex
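A sketch of the two-dimensional c2b from (22), assuming UNMAPPABLE_ENCODING is '\uFFFD' and that segments are indexed by the high byte of the char. C2bTableSketch and put() are hypothetical names:

```java
import java.util.Arrays;

/** Hypothetical sketch of a two-dimensional c2b table without a separate c2bIndex. */
final class C2bTableSketch {
    static final char UNMAPPABLE_ENCODING = '\uFFFD';   // assumed value

    private final char[] unmappable = new char[0x100];
    final char[][] c2b = new char[0x100][];

    C2bTableSketch() {
        // Only this one 256-entry segment is filled explicitly.
        Arrays.fill(unmappable, UNMAPPABLE_ENCODING);
        Arrays.fill(c2b, unmappable);
    }

    /** Registers char -> double-byte value; copies the shared segment on first use. */
    void put(char c, int bytes) {
        int hi = c >> 8;
        if (c2b[hi] == unmappable)
            c2b[hi] = Arrays.copyOf(unmappable, unmappable.length);
        c2b[hi][c & 0xff] = (char) bytes;
    }

    /** Two array accesses, no index arithmetic beyond shift and mask. */
    int encodeChar(char c) {
        return c2b[c >> 8][c & 0xff];
    }
}
```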
(23) Truncate c2b segments:
c2b[x] = new char[usedLength]
(usedLength values could be generated and saved in DoubleByte-X or
data file)
Benefit[26]: avoid superfluous memory and disk-footprint (I guess ~30 %)
Benefit[27]: don't range-check in-segment index, catch unmappable
index by IndexOutOfBoundsException
(24) Additionally truncate leading unmappables in c2b segments, and host
offsets:
Benefit[28]: avoid another superfluous memory and disk-footprint (I
guess ~10 %)
Disadvantage[21]: needs hosting of offsets: 256 bytes
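Suggestions (23) and (24) combined, as a sketch: segments hold only the used range, and a 256-byte offsets table records how many leading unmappables were cut. OffsetC2bSketch and its parameters are hypothetical, and every segment is assumed to be non-null:

```java
/** Hypothetical sketch: c2b segments truncated at both ends, with per-segment offsets. */
final class OffsetC2bSketch {
    static final char UNMAPPABLE_ENCODING = '\uFFFD';   // assumed value

    static int encodeChar(char[][] c2b, byte[] offsets, char c) {
        int hi = c >> 8;
        try {
            // offsets[hi] is the low byte of the first stored char, read as unsigned.
            return c2b[hi][(c & 0xff) - (offsets[hi] & 0xff)];
        } catch (ArrayIndexOutOfBoundsException e) {
            // Index before the first or after the last stored entry.
            return UNMAPPABLE_ENCODING;
        }
    }
}
```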
(25) Concerning (23),(24): Check out best segment size (maybe 256 is not
optimal):
Benefit[29]: avoid another superfluous memory and disk-footprint (I
guess 10-20 %)
-Ulf