Rewrite of IBM doublebyte charsets

Sun May 17 18:19:28 UTC 2009

On 14.05.2009 22:55, Xueming Shen wrote:
> Thanks again for taking time on this. Here is the IBM db charsets webrev
>
> http://cr.openjdk.java.net/~sherman/ibmdb/webrev
>
> This is a bigger fish than the EUC_TW:-)
>

*** Decoder-Suggestions:

(1) Unused imports in DoubleByte-X.java:
    import java.util.Arrays;
    import sun.nio.cs.StandardCharsets;
    import static sun.nio.cs.CharsetMapping.*;
    import sun.nio.cs.ext.DoubleByte;  // or instead: static sun.nio.cs.ext.DoubleByte.*;

(2) Please extract the de/encoder classes into separate Java files:
    In a tabbed editor it is much more comfortable to select a tab than to scroll 760 lines up and down.
      DoubleByteXcoder
      EBCDICXcoder
      DoubleByteOnlyXcoder
      EUCSimpleXcoder

(3) Modify dimension of b2c:
      char[][] b2c = new char[0x100][segSize];
    so decoding becomes:
      public char decodeDouble(int b1, int b2) {
          if ((b2-=b2Min) < 0 || b2 >= segSize)
              return UNMAPPABLE_DECODING;
          return b2c[b1][b2];
      }
   Benefit[1]: increase performance of decoder
   Benefit[2]: reduce memory of B2C_UNMAPPABLE from 8192 to 512 bytes
   Benefit[3]: some of b2c pages could be saved (if only containing \uFFFD)
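
A minimal sketch of (3), assuming one shared all-unmappable segment that serves every unused lead byte. The class name, the constructor parameters and the put() helper are only for illustration and are not taken from the webrev:

```java
import java.util.Arrays;

// Illustrative only: class name, constructor and put() are assumptions.
class TwoDimB2cSketch {
    static final char UNMAPPABLE_DECODING = '\uFFFD';

    final int b2Min;
    final int segSize;
    final char[] unmappableSegment;          // shared by all unused lead bytes
    final char[][] b2c = new char[0x100][];

    TwoDimB2cSketch(int b2Min, int b2Max) {
        this.b2Min = b2Min;
        this.segSize = b2Max - b2Min + 1;
        unmappableSegment = new char[segSize];
        Arrays.fill(unmappableSegment, UNMAPPABLE_DECODING);
        Arrays.fill(b2c, unmappableSegment);
    }

    // Registers one mapping; the shared segment is copied on first write.
    void put(int b1, int b2, char c) {
        if (b2c[b1] == unmappableSegment)
            b2c[b1] = unmappableSegment.clone();
        b2c[b1][b2 - b2Min] = c;
    }

    // Same shape as the decodeDouble() proposed above.
    char decodeDouble(int b1, int b2) {
        if ((b2 -= b2Min) < 0 || b2 >= segSize)
            return UNMAPPABLE_DECODING;
        return b2c[b1][b2];
    }
}
```

Because unused lead bytes all point at the same segment, only lead bytes that carry real mappings cost memory of their own.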

(4) Don't care about b2Max (it is always close to 0xff):
   Benefit[4]: another performance increase of decoder (only check: (b2-=b2Min) < 0)

(5) Truncate String segments (there are 65 % "\uFFFD" in IBM933):
    (fill b2c segments first with "\uFFFD", then initialize)
   Benefit[5]: save up to 180 % superfluous memory and disk-footprint
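
The "fill first, then initialize" step of (5) could look roughly like this. The class and method names are hypothetical, and the sketch assumes that only the trailing run of "\uFFFD" has been cut from each segment string:

```java
import java.util.Arrays;

// Illustrative only: names are assumptions, not code from the webrev.
class TruncatedStringSketch {
    static final char UNMAPPABLE_DECODING = '\uFFFD';

    // Expands a segment string whose trailing "\uFFFD" run was cut off.
    static char[] expand(String truncated, int segSize) {
        char[] seg = new char[segSize];
        Arrays.fill(seg, UNMAPPABLE_DECODING);               // fill first
        truncated.getChars(0, truncated.length(), seg, 0);   // then initialize
        return seg;
    }
}
```

For example, expand("AB", 4) yields 'A', 'B' followed by two unmappable entries.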

(6) Unload b2cStr from memory after startup:
    - outsource b2cStr to additional class file like EUC_TW approach
    - set b2cStr = null after startup (remove final modifier)
   Benefit[6]: avoid 100 % superfluous memory-footprint

(7) Avoid copying b2cStr to b2c:
    (String#charAt() is as fast as char[] access)
   Benefit[7]: increase startup performance for decoder

(8) Truncate b2c segments (catch unmappable indexes by RuntimeException):
   Benefit[8]: save up to 180 % superfluous memory-footprint
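
A rough sketch of (8), assuming segments are stored with their real length and unused lead bytes point at an empty array. All names are illustrative. Throwing is costly, so this would only pay off if unmappable input is rare; that would need measuring:

```java
import java.util.Arrays;

// Illustrative only: names are assumptions, not code from the webrev.
class TruncatedB2cSketch {
    static final char UNMAPPABLE_DECODING = '\uFFFD';
    static final char[] EMPTY = new char[0];

    final int b2Min;
    final char[][] b2c = new char[0x100][];

    TruncatedB2cSketch(int b2Min) {
        this.b2Min = b2Min;
        Arrays.fill(b2c, EMPTY);     // unused lead bytes: every index is out of range
    }

    char decodeDouble(int b1, int b2) {
        try {
            // No explicit range check: a negative or too large index throws.
            return b2c[b1][b2 - b2Min];
        } catch (ArrayIndexOutOfBoundsException e) {
            return UNMAPPABLE_DECODING;
        }
    }
}
```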

(9) Share mappings (IBM930 and IBM939 are 99 % identical):
   Benefit[9]: save up to 99 % superfluous disk-footprint
   Benefit[10]: save up to 99 % superfluous memory-footprint (if both 
charsets are loaded)

(10) Provide 4-way fork from de/encodeLoop():
    See:  
https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/trunk/src/sun/nio/cs/SingleByteEncoder_new.java?rev=&view=markup
   Benefit[11]: increase performance, if there is only 1 direct buffer
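
As I read it, the fork in (10) dispatches on both buffers separately, so a heap buffer on one side can still use array access when the other side is direct. A minimal sketch of the dispatch only; the enum and the method name are hypothetical, and the four loops themselves are left out:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

// Illustrative only: shows the dispatch, not the four loop bodies.
class FourWayForkSketch {
    enum Path { ARRAY_TO_ARRAY, ARRAY_TO_BUFFER, BUFFER_TO_ARRAY, BUFFER_TO_BUFFER }

    // A real decodeLoop() would call the matching loop instead of returning a tag.
    static Path choose(ByteBuffer src, CharBuffer dst) {
        if (src.hasArray())
            return dst.hasArray() ? Path.ARRAY_TO_ARRAY : Path.ARRAY_TO_BUFFER;
        return dst.hasArray() ? Path.BUFFER_TO_ARRAY : Path.BUFFER_TO_BUFFER;
    }
}
```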

(11) Exit the coders' xBufferLoop by exception on xflow (overflow/underflow):
   http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6806227
   Benefit[12]: increase performance

(12) Get rid of sun.io package dependency:

https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/tags/milestone2/src/sun/io/
   Benefit[13]: avoid superfluous disk-footprint
   Benefit[14]: save maintenance of sun.io converters
   Disadvantage[1]: published under JRL (waiting for launch of OpenJDK-7 
project "charset-enhancement") ;-)

(13) Take data files into account _once more_:
   Following the suggestions above, the data files should be much smaller than for EUC_TW,
   so loading time from jar by getResourceAsStream() could be acceptable. If some day
   Bug ID 6818736, 6818736 were solved, we could profit once more, without doing much.
   Benefit[15]: avoid 50 % superfluous disk-footprint
   Benefit[16]: sharing of map data for different charsets becomes simpler

(14) Split map data files into chunks and load them lazily.
   TW native speakers must be consulted, to define reasonable chunks!
   Benefit[17]: save startup time
   Benefit[18]: save memory
   Benefit[19]: sharing of map data becomes much simpler

(15) Diff also IBM1381.java against IBM1383.java and see similarity

(16) decodeArrayLoop: shortcut calculation of limits:
      int sl = sp + src.remaining();
      int dl = dp + dst.remaining();

(17) Decoder#decodeArrayLoop: shortcut for single byte only:
      int sr = src.remaining();
      int sl = sp + sr;
      int dr = dst.remaining();
      int dl = dp + dr;
      // single byte only loop
      int slSB = sp + (sr < dr ? sr : dr);
      while (sp < slSB) {
          char c = b2cSB[sa[sp] & 0xff];
          if (c == UNMAPPABLE_DECODING)
              break;
          da[dp++] = c;
          sp++;
      }
     Same for Encoder#encodeArrayLoop
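
The single-byte loop of (17) as a self-contained method, so it can be tried in isolation. The method signature and the demoTable() helper are hypothetical; the sketch assumes that lead bytes of double-byte sequences map to UNMAPPABLE_DECODING in b2cSB, so the loop stops at the first of them:

```java
import java.util.Arrays;

// Illustrative only: signature and demoTable() are assumptions.
class SingleByteFastPathSketch {
    static final char UNMAPPABLE_DECODING = '\uFFFD';

    // Decodes leading single-byte input; returns the number of bytes consumed.
    static int decodeSingleBytes(byte[] sa, int sp, int sr,
                                 char[] da, int dp, int dr, char[] b2cSB) {
        int start = sp;
        int slSB = sp + (sr < dr ? sr : dr);     // bounded by the smaller buffer
        while (sp < slSB) {
            char c = b2cSB[sa[sp] & 0xff];
            if (c == UNMAPPABLE_DECODING)
                break;                           // leave the rest to the general loop
            da[dp++] = c;
            sp++;
        }
        return sp - start;
    }

    // Tiny demo table: only 0x41 and 0x42 are mapped.
    static char[] demoTable() {
        char[] t = new char[0x100];
        Arrays.fill(t, UNMAPPABLE_DECODING);
        t[0x41] = 'A';
        t[0x42] = 'B';
        return t;
    }
}
```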

(18) Decoder_EBCDIC: boolean singlebyteState:
      if (singlebyteState)
          ...

(19) Decoder_EBCDIC: decode single byte first:
      if (singlebyteState)
          c = b2cSB[b1];
      if (c == UNMAPPABLE_DECODING) {
          ...
      }
   Benefit[20]: should be faster

*** Encoder-Suggestions:

(21) join *.nr to *.c2b files (25->000a becomes 000a->fffd):
   Benefit[21]: reduce no. of files
   Benefit[22]: simplifies initC2B() (avoids 2 loops)

(22) Save c2b in 2-dimensional array:
     char[][] c2b = new char[0x100][]
     set unused segments to 256-size UNMAPPABLE_ENCODING[]
   Benefit[23]: save calculation of index in encodeChar() --> little faster
   Benefit[24]: initC2B() becomes faster
   - huge c2b[] is initialized twice, 1st with 0 (according to the JLS) + 2nd with UNMAPPABLE_ENCODING
   - only fill 256 entries with UNMAPPABLE_ENCODING, and get copies by Arrays.copyOf()
   Benefit[25]: save c2bIndex
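
A minimal sketch of (22), assuming UNMAPPABLE_ENCODING is '\uFFFD' and that unused segments share one prefilled array. The class name and the put() helper are illustrative only:

```java
import java.util.Arrays;

// Illustrative only: names and the value of UNMAPPABLE_ENCODING are assumptions.
class TwoDimC2bSketch {
    static final char UNMAPPABLE_ENCODING = '\uFFFD';
    static final char[] UNMAPPABLE_SEGMENT = new char[0x100];
    static {
        Arrays.fill(UNMAPPABLE_SEGMENT, UNMAPPABLE_ENCODING);
    }

    final char[][] c2b = new char[0x100][];

    TwoDimC2bSketch() {
        Arrays.fill(c2b, UNMAPPABLE_SEGMENT);    // all segments start out shared
    }

    // Registers one mapping; a used segment starts as a copy of the prefilled one.
    void put(char c, int bytes) {
        int hi = c >> 8;
        if (c2b[hi] == UNMAPPABLE_SEGMENT)
            c2b[hi] = Arrays.copyOf(UNMAPPABLE_SEGMENT, 0x100);
        c2b[hi][c & 0xff] = (char) bytes;
    }

    // No c2bIndex lookup: the high byte selects the segment directly.
    int encodeChar(char c) {
        return c2b[c >> 8][c & 0xff];
    }
}
```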

(23) Truncate c2b segments:
     c2b[x] = new char[usedLength]
     (usedLength values could be generated and saved in DoubleByte-X or 
data file)
   Benefit[26]: avoid superfluous memory and disk-footprint (I guess ~30 %)
   Benefit[27]: don't range-check in-segment index, catch unmappable 
index by IndexOutOfBoundsException

(24) Additionally truncate leading unmappables in c2b segments, and host 
offsets:
   Benefit[28]: avoid another superfluous memory and disk-footprint (I 
guess ~10 %)
   Disadvantage[21]: needs hosting of offsets: 256 bytes
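
(23) and (24) combined could be sketched as follows. All names are hypothetical; offsets are kept in an int[] for simplicity although they would fit in 256 bytes. Holes inside a segment would still need explicit unmappable entries:

```java
import java.util.Arrays;

// Illustrative only: names are assumptions, not code from the webrev.
class TruncatedC2bSketch {
    static final int UNMAPPABLE_ENCODING = 0xFFFD;
    static final char[] EMPTY = new char[0];

    final char[][] c2b = new char[0x100][];
    final int[] offsets = new int[0x100];    // index of first mapped char per segment

    TruncatedC2bSketch() {
        Arrays.fill(c2b, EMPTY);
    }

    // Installs a segment with leading and trailing unmappables cut off.
    void setSegment(int hi, int offset, char[] mapped) {
        c2b[hi] = mapped;
        offsets[hi] = offset;
    }

    int encodeChar(char c) {
        int hi = c >> 8;
        try {
            // No in-segment range check: out-of-range indexes throw.
            return c2b[hi][(c & 0xff) - offsets[hi]];
        } catch (ArrayIndexOutOfBoundsException e) {
            return UNMAPPABLE_ENCODING;
        }
    }
}
```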

(25) Concerning (23),(24): Check out best segment size (maybe 256 is not 
optimal):
   Benefit[29]: avoid another superfluous memory and disk-footprint (I 
guess 10-20 %)

-Ulf