Rewrite of IBM doublebyte charsets

Sat May 9 17:50:39 UTC 2009

On 01.05.2009 08:48, Xueming Shen wrote:
> Hi,
>
> While I'm waiting for Alan's code-review result for my rewriting of EUC_TW
>    http://cr.openjdk.java.net/~sherman/6831794_6229811/webrev
> (much faster, much smaller, near 8% decrease of size of charsets.jar with one
> charset update. OK, it's a shame...I meant the old data structure)

EUC_TW statistics:

Plane   range   length  segments  segments-usage-ratio

 0    a1a1-fdcb   5868  5d = 93   66 %
 _0   a1a1-a744    434   7 = 7    65 %
 _1   c2a1-fdcb   5434  3c = 60   95 %

 1:8ea2   -f2c4   7650  52 = 82   98 %
 2:8ea3   -e7aa   6394  47 = 71   95 %
 3:8ea4   -eedc   7286  4e = 78   98 %
 4:8ea5   -fcd1   8601  5c = 92   98 %
 5:8ea6   -e4fa   6385  44 = 68   99 %
 6:8ea7   -ebd5   6532  4b = 75   92 %
 7:8eaf   -edb9   6730  4d = 77   92 %

Sum:             55446  262 = 610

memory amount for all segments (not truncated):
610 * 95 = 57950 code points
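
As a cross-check, the derived columns can be recomputed from the ranges. This is only a minimal sketch: it assumes every plane starts at lead byte 0xa1 and that a segment holds 95 positions, as in the line above. The class and method names are made up for illustration.

```java
final class EucTwStats {
    static final int SEG_SIZE = 95; // positions per segment, as assumed in the table

    // Number of lead-byte segments between first and last lead byte, inclusive.
    static int segments(int firstLead, int lastLead) {
        return lastLead - firstLead + 1;
    }

    // Share of allocated positions that carry a mapping, in percent (rounded).
    static long usagePercent(int mapped, int segments) {
        return Math.round(100.0 * mapped / (segments * SEG_SIZE));
    }

    public static void main(String[] args) {
        System.out.println(segments(0xa1, 0xf2));   // plane 1: 0x52 = 82
        System.out.println(usagePercent(7650, 82)); // plane 1: 98
        System.out.println(usagePercent(5868, 93)); // plane 0: 66
    }
}
```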

*** Decoder-Suggestions:

(1) Increase dimension of b2c and decouple plane 0:
      String[] b2c = new String[0x10]
      String b2c_0 = ...
    Benefit[1]: save calculation of plane no. to range 0..7 (but mask by 0x0f)
    Benefit[2]: save range-check for plane (catch malformed plane by NPE)
    sophisticated (additionally save masking of plane no.):
      String[] b2c = new String[0xb0]
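
A minimal sketch of the lookup in (1), assuming the mask keeps the low nibble so that plane bytes 0xa2..0xaf land on indexes 0x2..0xf. The class name, the replacement char and the sample string are placeholders, not real EUC_TW data.

```java
final class PlaneTable {
    // One mapping string per plane, indexed by the low nibble of the plane byte.
    static final String[] B2C = new String[0x10];
    static {
        B2C[0x2] = "\u4e42\u4e5c"; // placeholder data for the plane selected by 0xa2
    }

    static final char MALFORMED = '\ufffd';

    // No explicit range check: an unknown plane hits a null slot and raises an NPE.
    static char decode(int planeByte, int index) {
        try {
            return B2C[planeByte & 0x0f].charAt(index);
        } catch (NullPointerException e) {
            return MALFORMED;
        }
    }
}
```

Note that masking alone would also accept a byte such as 0xb2. The String[0xb0] variant indexes by the unmasked byte instead, at the price of a larger, mostly empty array.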

(2) Save Strings in 2-dimensional array:
      String[][] b2c = new String[0x10][]
      String[] b2c_0 = new String[0x5d]
      b2c[0x2] = new String[0x52]
      b2c[0x3] = new String[0x47]
      b2c[0x4] = new String[0x4e]
      b2c[0x5] = new String[0x5c]
      b2c[0x6] = new String[0x44]
      b2c[0x7] = new String[0x4b]
      b2c[0xf] = new String[0x4d]
    sophisticated (segments a8..c1 are unused in plane 0):
      String[] b2c_0 = new String[0x07]
      String[] b2c_1 = new String[0x3c]
    Benefit[3]: save calculation of index (multiplying with dbSegSize), but add 1 indirection
    Benefit[4]: save range-check for segment index (catch malformed segment index by NPE)
    Benefit[5]: save range-check for String index (catch malformed String indexes by IndexOutOfBoundsException)
    Benefit[6]: avoid 22 % superfluous memory and disk-footprint

(3) Truncate Strings (catch unmappable String indexes by IndexOutOfBoundsException):
    Benefit[7]: save another 4 % superfluous memory and disk-footprint

Note: All exceptions can be caught at once, as they are all subclasses of RuntimeException.
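
Suggestions (2) and (3) together with the note can be sketched as follows. This is an illustration under assumptions, not the webrev code: the names, the 0xa1 base and the sample data are placeholders, and the sketch does not distinguish malformed from unmappable input.

```java
final class SegmentedDecoder {
    static final int FIRST = 0xa1; // assumed first valid lead/trail byte

    // b2c[plane][segment] holds a possibly truncated string; unused slots stay null.
    static final String[][] B2C = new String[0x10][];
    static {
        B2C[0x2] = new String[0x52];
        B2C[0x2][0] = "\u4e42\u4e5c\u51f5"; // placeholder data, truncated after 3 entries
    }

    static final char UNMAPPABLE = '\ufffd';

    static char decode(int planeByte, int b1, int b2) {
        try {
            return B2C[planeByte & 0x0f][b1 - FIRST].charAt(b2 - FIRST);
        } catch (RuntimeException e) {
            // NullPointerException: unknown plane or unused segment
            // IndexOutOfBoundsException: segment or string index out of range
            return UNMAPPABLE;
        }
    }
}
```

Whether the exception path is cheap enough compared with explicit range checks would need measuring.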

(4) Save mappings in data file (modified UTF-8-saved chars need 2.97 bytes on average):
    Benefit[8]: save modified UTF-8 decoding while loading class file
    Benefit[9]: avoid another 48 % superfluous disk-footprint
    Note: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795536
    (I have just created a patch, but I'm waiting for the launch of the OpenJDK-7 project "charset-enhancement".)
    Disadvantage[1]: loading data from jar-file may be slow, but ...
    - host data file outside of jar, as loading via java.nio.channels.FileChannel into a direct buffer should be fast
    - resolve Bugs:
      http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6818736
      http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6818737
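
A minimal sketch of loading such a data file, assuming the mappings are stored as plain big-endian 16-bit chars. The file layout and the class name are assumptions, and the java.nio.file API used here only arrived in later JDKs.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappingFile {
    // Reads the whole file into a direct buffer and views it as chars, without UTF-8 decoding.
    static CharBuffer load(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocateDirect((int) ch.size());
            while (buf.hasRemaining() && ch.read(buf) >= 0) {
                // keep reading until the buffer is full or end of file is reached
            }
            buf.flip();
            return buf.asCharBuffer(); // ByteBuffer defaults to big-endian byte order
        }
    }
}
```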

(5) Generate mappings as surrogate pairs:
    High surrogates could be saved as bytes and ORed with 0xd800, as they won't exceed 0xd880
    Benefit[10]: save decoding to surrogate pairs (I guess this would significantly increase performance)
    Benefit[11]: save b2cIsSupp[] (saves another 4 % memory and disk-footprint)
    Disadvantage[2]: memory and disk-footprint would again increase by 50 %
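
High surrogates occupy 0xd800..0xdbff, and for code points in plane 2 (U+20000..U+2FFFF) they lie in 0xd840..0xd87f, so the low 8 bits are enough to restore them. A minimal sketch of that packing, with made-up names:

```java
final class SurrogateBytes {
    // Stores only the low 8 bits; usable while the high surrogate lies in 0xd800..0xd8ff.
    static byte pack(char high) {
        if (high < 0xd800 || high > 0xd8ff)
            throw new IllegalArgumentException("high surrogate out of byte range");
        return (byte) high;
    }

    // Restores the high surrogate by ORing the stored byte back onto 0xd800.
    static char unpack(byte stored) {
        return (char) (0xd800 | (stored & 0xff));
    }
}
```

For U+2A6D6 the pair is D869 DED6, so the stored byte would be 0x69.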

(6) Change parameters of decode() method:
    static void decode(byte[] src, char[] dst, int sp, int sl, int dp, int dl, int p) ("beta" approach)
    speeds up buffer access + avoids c1, c2 buffering
    Benefit[12]: increase performance
    Disadvantage[3]: need different methods for direct buffers

(7) Provide 4-way fork from de/encodeLoop():
    See: 
https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/trunk/src/sun/nio/cs/SingleByteEncoder_new.java?rev=&view=markup
    Benefit[13]: increase performance, if there is only 1 direct buffer
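
One reading of the 4-way fork is to pick a loop variant per buffer, so that a single array-backed side can still use array access. A minimal sketch of the dispatch only, with hypothetical names; it is not taken from the linked source.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

final class LoopFork {
    enum Loop { ARRAY_TO_ARRAY, ARRAY_TO_BUFFER, BUFFER_TO_ARRAY, BUFFER_TO_BUFFER }

    // Chooses one of four loop variants from the backing store of each buffer.
    static Loop select(ByteBuffer src, CharBuffer dst) {
        if (src.hasArray())
            return dst.hasArray() ? Loop.ARRAY_TO_ARRAY : Loop.ARRAY_TO_BUFFER;
        return dst.hasArray() ? Loop.BUFFER_TO_ARRAY : Loop.BUFFER_TO_BUFFER;
    }
}
```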

(8) Quit the coders' xBufferLoop by exception on over-/underflow:
    http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6806227
    Benefit[14]: increase performance

(9) Get rid of sun.io package dependency:

https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/tags/milestone2/src/sun/io/
    Benefit[15]: avoid superfluous disk-footprint
    Benefit[16]: save maintenance of sun.io converters
    Disadvantage[4]: published under JRL (waiting for launch of OpenJDK-7 project "charset-enhancement") ;-)

*** Encoder-Suggestions (not complete, just some thoughts):

(11) Initialize encoder mappings lazily:
    Benefit[17]: increase startup performance for decoder
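
A minimal sketch of lazy initialization via the holder-class idiom, assuming the encoder table is built by inverting the decoder data. The table layout, the names and the sample data are placeholders.

```java
import java.util.Arrays;

final class LazyC2B {
    static final String B2C = "\u4e42\u4e5c\u51f5"; // placeholder decoder data

    static boolean built = false; // only here to make the laziness observable

    // The JVM initializes Holder on first access, i.e. on the first encode() call.
    private static final class Holder {
        static final char[] C2B = invert(B2C);
    }

    static char[] invert(String b2c) {
        char[] c2b = new char[0x10000];
        Arrays.fill(c2b, '\uffff'); // marker for "no mapping"
        for (int i = 0; i < b2c.length(); i++)
            c2b[b2c.charAt(i)] = (char) i;
        built = true;
        return c2b;
    }

    static char decode(int index) {
        return B2C.charAt(index);
    }

    // Returns the index of c in the decoder data, or -1 if c has no mapping.
    static int encode(char c) {
        char i = Holder.C2B[c];
        return i == '\uffff' ? -1 : i;
    }
}
```

A pure decoder user never touches Holder, so the encoder table is never built for it.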

(12) Generate mappings for surrogate pairs:
    Benefit[18]: save encoding from surrogate pairs (I guess this would significantly increase performance)

(13) Introduce 16-bit intermediate mapping ("beta"-thoughts: overall count of code points is < 65536):
    Benefit[19]: avoid superfluous memory and disk-footprint

-Ulf